🌐 I've just published Sentence Transformers v5.4 to make the project fully multimodal for embeddings and reranking. The release also includes a modular CrossEncoder, and automatic Flash Attention 2 input flattening. Details:
You can now use SentenceTransformer and CrossEncoder with text, images, audio, and video, with the same familiar API. That means you can compute embeddings for an image and a text query using model.encode(), compare them with model.similarity(), and it just works. Models like Qwen3-VL-Embedding-2B and jinaai/jina-reranker-m0 are supported out of the box.
Beyond multimodal, I also fully modularized the CrossEncoder class. It's now a torch.nn.Sequential of composable modules, just like SentenceTransformer has been. This unlocked support for generative rerankers (CausalLM-based models like mxbai-rerank-v2 and the Qwen3 rerankers) via a new LogitScore module, which wasn't possible before without custom code.
Also, Flash Attention 2 now automatically skips padding for text-only inputs. If your batch has a mix of short and long texts, this gives you a nice speedup and lower VRAM usage for free.
This release has set up the groundwork for more easily introducing late-interaction models (both text-only and multimodal) into Sentence Transformers in the next major release. I'm looking forward to it!
🚀 TRL v0.29.0 introduces trl-training: an agent-native training skill.
This makes the TRL CLI a structured, agent-readable capability, allowing AI agents to reliably execute training workflows such as: - Supervised Fine-Tuning (SFT) - Direct Preference Optimization (DPO) - Group Relative Policy Optimization (GRPO)
We’re excited to see what the community builds on top of this.
If you’re working on AI agents, alignment research, or scalable RL training infrastructure: give TRL v0.29.0 a try! 🤗