tttoaster committed
Commit 57cc8c8 · verified · 1 Parent(s): f87a9b8

Update README.md

Files changed (1): README.md (+24, -0)
README.md CHANGED
@@ -199,6 +199,30 @@ pip uninstall vllm
  pip install git+https://github.com/geyuying/vllm.git@arc-qwen-video
  ```
 
+ #### An 'Ugly' Workaround for vLLM Installation
+ If you are unable to install our provided vLLM package, we offer an alternative "ugly" method:
+
+ 1. Install a vLLM version with Qwen2.5-VL support.
+
+ 2. Modify `config.json`. In your model weights directory, open `config.json` and change the `architectures` field to `"Qwen2_5_VLForConditionalGeneration"` (see the sketch after the patch snippet below).
+
+ 3. Patch the vLLM source code. Locate the file `vllm/model_executor/models/qwen2_5_vl.py` in your vLLM installation and add the following code inside the `__init__` method of the `Qwen2_5_VLForConditionalGeneration` class:
+
+ ```python
+ # Note: this assumes `from transformers import WhisperModel` is added to the file's imports;
+ # `nn` here is `torch.nn`, which the model file already uses.
+ # Load the Whisper audio encoder and a projection MLP into the LLM hidden size.
+ whisper_path = 'openai/whisper-large-v3'
+ speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder
+ self.speech_encoder = speech_encoder
+ speech_dim = speech_encoder.config.d_model
+ llm_hidden_size = config.vision_config.out_hidden_size
+ self.mlp_speech = nn.Sequential(
+     nn.LayerNorm(speech_dim),
+     nn.Linear(speech_dim, llm_hidden_size),
+     nn.GELU(),
+     nn.Linear(llm_hidden_size, llm_hidden_size)
+ )
+ ```
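For step 2, here is a minimal, hedged sketch of the `config.json` edit (not part of the original patch); the weights directory path is a placeholder you should replace with your own:

```python
import json

# Placeholder path: point this at the config.json in your local model weights directory.
config_path = "path/to/model_weights/config.json"

with open(config_path) as f:
    config = json.load(f)

# The `architectures` field is a list in Hugging Face configs; point it at the
# stock Qwen2.5-VL implementation so vLLM loads this checkpoint with it.
config["architectures"] = ["Qwen2_5_VLForConditionalGeneration"]

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```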
+ **Why this works**: Our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP. During vLLM inference, the multi-modal encoders process inputs sequentially, while the LLM performs batched inference. Since we only need to pass the final multi-modal embeddings to the LLM, we can reuse the existing code for Qwen2.5-VL.
+
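To make that data flow concrete, the following standalone sketch (illustrative, not part of the patch; the dummy waveform and the fixed `llm_hidden_size` value are assumptions) shows how the added modules project Whisper encoder features into the LLM embedding space:

```python
import torch
from torch import nn
from transformers import WhisperFeatureExtractor, WhisperModel

# Illustrative: the same Whisper encoder the patch instantiates inside __init__.
whisper_path = "openai/whisper-large-v3"
feature_extractor = WhisperFeatureExtractor.from_pretrained(whisper_path)
speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder

speech_dim = speech_encoder.config.d_model  # 1280 for whisper-large-v3
llm_hidden_size = 3584  # assumption: stands in for config.vision_config.out_hidden_size
mlp_speech = nn.Sequential(
    nn.LayerNorm(speech_dim),
    nn.Linear(speech_dim, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size),
)

# A dummy 1-second, 16 kHz mono waveform stands in for a real audio track.
audio = torch.zeros(16000).numpy()
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    speech_states = speech_encoder(inputs.input_features).last_hidden_state  # (1, frames, speech_dim)
    speech_embeds = mlp_speech(speech_states)                                # (1, frames, llm_hidden_size)
# In the patched model, embeddings like these are what get passed to the LLM
# alongside the vision and text embeddings for batched decoding.
```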
  ### Inference
 
  ```bash