
Troubleshooting Accelerated Inference of Gemma-2 on V100 GPUs Using vLLM

Last Updated on 2024-09-14 by Clay

Problem Description

Recently, I've achieved some good application results by fine-tuning Gemma-2. However, I encountered various errors when deploying it on the client's equipment, which was quite frustrating. Currently, there isn't a systematic troubleshooting guide online, so I'm documenting it here.

In short, my requirements are as follows:

  1. I need to run the model on V100 GPUs (the client's equipment is a V100 cluster)
  2. I need to use the Gemma-2 architecture (this is my best-performing model)
  3. I need to use the vLLM accelerated inference framework (this is the acceleration solution that best meets the client's needs)

The conflict is that Gemma-2's architecture relies on FlashInfer or FlashAttention for its attention implementation, but neither of these frameworks supports V100 GPUs; meanwhile, the only attention backend that does support V100, xFormers, does not support Gemma-2's logit soft capping.


Solutions

Note: The version I am using is vLLM v0.5.4.


Problem 1: V100 does not support bfloat16

First of all, since the GPUs I must deploy on are V100s, the first problem I ran into is that V100 does not support bfloat16. (bfloat16 is a 16-bit floating-point format designed to improve the computational efficiency of deep learning models. It was developed by Google and is widely used in hardware and software such as TPUs and TensorFlow. Compared to the standard float32 (32-bit floating point), bfloat16 halves storage and computational cost while keeping roughly the same dynamic range as float32, at the cost of some precision, which is usually acceptable for deep learning.)

Therefore, when starting vLLM, you need to add the parameter --dtype=float16 or --dtype=half.
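
For reference, here is a minimal sketch of the equivalent setting through vLLM's Python API (the model path is a placeholder for your own checkpoint); when serving via the CLI, --dtype=float16 does the same thing.

```python
# Minimal sketch: loading Gemma-2 with vLLM's Python API on a V100,
# forcing float16 because the GPU has no bfloat16 support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/your-gemma-2-checkpoint",  # placeholder path
    dtype="float16",                          # equivalent to --dtype=float16 / --dtype=half
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```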

However, in my tests, if Gemma-2 was originally saved in bfloat16 and is cast to float16 when vLLM starts, its behavior changes noticeably. In my real-world application scenario the problem was severe: the model began producing malformed output formats, and my use case cannot tolerate any formatting errors.
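
The likely culprit is dynamic range: float16 tops out at 65504, while bfloat16 covers roughly the same range as float32, so values that are fine in bfloat16 can overflow after the cast. A rough illustration:

```python
# Rough illustration of why casting bfloat16 weights to float16 can change behavior:
# float16 has more precision bits but a far smaller dynamic range than bfloat16.
import torch

print(torch.finfo(torch.bfloat16).max)  # ~3.39e38
print(torch.finfo(torch.float16).max)   # 65504.0

x = torch.tensor([70000.0], dtype=torch.bfloat16)
print(x.to(torch.float16))  # overflows to inf: the value exceeds float16's range
```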

A more reliable solution is to fine-tune Gemma-2 on my specific domain tasks using float16 and save the model weights in float16.
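
As a sketch of that last step, assuming a Hugging Face-format checkpoint (the paths below are placeholders), re-saving the weights in float16 looks roughly like this:

```python
# Sketch: re-saving a Gemma-2 checkpoint in float16 so vLLM does not
# have to cast bfloat16 weights at load time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/your-gemma-2-checkpoint"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("gemma-2-float16")
tokenizer.save_pretrained("gemma-2-float16")
```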


Problem 2: FlashInfer and FlashAttention do not support V100

For Gemma-2's attention implementation, FlashInfer and FlashAttention are the most suitable backends, but neither supports V100. I checked the GitHub issues for both projects, and there appears to be no follow-up or plan to add V100 support (as of 2024/09/10).

So when choosing the attention backend, we have to exclude these two, which leaves xFormers as the only attention backend that still supports V100.

Newer versions of vLLM automatically detect the GPU's Compute Capability (see https://developer.nvidia.com/cuda-gpus) and fall back to xFormers as the attention backend.
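
If you want to confirm what your GPU reports, a quick check with PyTorch looks roughly like this (a V100 reports Compute Capability 7.0):

```python
# Quick check of the GPU's Compute Capability with PyTorch.
# A V100 reports (7, 0) and does not support bfloat16.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute Capability: {major}.{minor}")
    print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")
```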

In older vLLM versions, however, we need to switch the attention backend manually by setting an environment variable before starting vLLM: export VLLM_ATTENTION_BACKEND=XFORMERS. (https://github.com/vllm-project/vllm/issues/6173#issuecomment-2214759644)
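
If you launch vLLM from Python instead of the shell, the same switch can be made by setting the variable before vLLM initializes; a minimal sketch (the model path is a placeholder):

```python
# Sketch: forcing the xFormers attention backend for V100.
# The variable must be set before vLLM is initialized.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(model="path/to/your-gemma-2-checkpoint", dtype="float16")  # placeholder path
```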

But then, we encounter the next error: xFormers does not support Gemma-2's soft capping.


Problem 3: xFormers does not support Gemma-2's soft capping

The only temporary solution to this problem is to go into Gemma-2's configuration file (config.json) and set all soft capping values to null.

The two values are:

  • "attn_logit_softcapping": null
  • "final_logit_softcapping": null


At this point, most of the problems have been resolved, and Gemma-2 should be able to start smoothly. However, I encountered one last small issue, which I will also record here.


Extra Problem: RAM Usage

When deploying on the client's devices using a K8s cluster, a reliable backend engineer colleague set up RAM limits for each Pod. However, since I hadn't measured the actual memory requirements beforehand, the limit was set too low, causing my vLLM + Gemma-2 service to hang and fail to start properly. (https://github.com/vllm-project/vllm/issues/7303#issuecomment-2348058866)

Finally, we identified the problem, relaxed the RAM limit to 10 GB, and successfully ran our LLM on the client's devices.

