Machine Learning

Differences in Precision Representations in Deep Learning: Float32, Float16, Float8, and BFloat16

Clay
2024-09-252024-09-25
AI, Machine Learning

In the process of training and fine-tuning deep neural networks, the most important and scarce resource is undoubtedly the GPU’s VRAM. Therefore, making every bit perform at its best is a critical task.

Troubleshooting Accelerated Inference of Gemma-2 on V100 GPUs Using vLLM

Clay
2024-09-142024-09-14
AI, Machine Learning

Problem Description

Recently, I’ve achieved some good application results by fine-tuning Gemma-2. However, I encountered various errors when deploying it on the client’s equipment, which was quite frustrating. Currently, there isn’t a systematic troubleshooting guide online, so I’m documenting it here.

[PyTorch] Traversing Every Layer of a Neural Network in a Model

Clay
2024-09-102024-09-10
Machine Learning, PyTorch

Introduction

Recently, due to some serendipitous events, I had a chance to modify the architecture of a model slightly. I took this opportunity to explore how to iterate and print the layers of neural networks in PyTorch.

OpenAI Triton Note (2): Fused Softmax

Clay
2024-09-092024-09-09
Machine Learning, PyTorch

Introduction

Softmax is a commonly used activation function, and it is often employed as the last layer in multi-class classification.

OpenAI Triton Note (1): Vector Addition

Clay
2024-09-082024-09-08
Machine Learning, PyTorch

Introduction

Triton is an open-source GPU programming language compiler released by OpenAI in 2021. Over recent years, it has become increasingly popular among developers for writing and optimizing parallel programs on GPUs. Compared to traditional libraries such as CUDA or OpenCL, Triton offers a Python-like syntax, making it more readable and easier to learn.

[PyTorch] BERT Architecture Implementation Note

Clay
2024-09-072024-09-07
Machine Learning, PyTorch

Introduction

My advisor used to tell me, “Don’t just use other people’s libraries; you have to write your own to truly understand.” Back then, I didn’t have much time to implement various technologies I was interested in since I was fully occupied with my dissertation. However, I often recall his earnest advice even now, and it prompted me to finally attempt the implementation of BERT, a classic encoder-only transformer model.

Using the Integrated Outlines Tool for Decoding Constraints in the vLLM Inference Acceleration Framework

Clay
2024-09-062024-09-07
Machine Learning, Python

Recently, I integrated several applications of Outlines into my current workflow. Among them, the one I use most frequently is with vLLM. However, for some reason, its documentation has not been merged into the vLLM GitHub repository, so while designing the process, I had to constantly refer to the source code of a rejected PR for guidance XD

Implementation of Using Finite-State Machine to Constrain Large Language Model Decoding

Clay
2024-09-052024-09-05
Machine Learning, Python

This is a simple Python implementation, used to test Finite-State Machine (FSM) constraints for a Large Language Model (LLM) to decode responses in a specific format. It also serves as an introduction to the concept behind the Outlines tool. Of course, my implementation is far simpler compared to the actual Outlines tool.

Structuring Model Outputs Using the Outlines Tool

Clay
2024-09-032024-09-03
AI, Machine Learning

When applying Large Language Models (LLMs) in real-world scenarios, it’s often not just about letting the model generate text freely. We might want the model to return specific structures, such as multiple-choice questions or providing a rating. In such cases, transformers-based models can directly use the outlines tool.

Implementing Streamed Output Token Generation Using TextStreamer and TextIteratorStreamer in HuggingFace Transformers

Clay
2024-09-012024-09-01
Machine Learning

Introduction

Generative models are becoming increasingly powerful, and independent researchers are deploying one open-source large language model (LLMs) after another. However, when using LLMs for inference or generating responses, waiting for a longer output can be quite time-consuming.

« Previous
1
2
3
4
5
…
16
Next »

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31