Enhance model card for SRUM
#1 by nielsr (HF Staff) - opened

README.md CHANGED

@@ -1,47 +1,75 @@
---
-
language:
- en
metrics:
- accuracy
-base_model:
-- ByteDance-Seed/BAGEL-7B-MoT
pipeline_tag: text-to-image
---

<p align="center">
<a href="https://waynejin0918.github.io/srum_web/">
<img
src="https://img.shields.io/badge/SRUM-Website-blue"
alt="SRUM Website"
/>
</a>
-
</p>

# SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
> [Weiyang Jin*](https://github.com/WayneJin0918), [Yuwei Niu*](https://purshow.github.io/), Jiaqi Liao, [Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=en), Aoxue Li, [Shenghua Gao](https://scholar.google.com/citations?user=fe-1v0MAAAAJ&hl=en), [Xihui Liu :email:](https://xh-liu.github.io/)
>
> contact: [email protected]
->
->
-The figure below showcases SRUM's qualitative performance compared with SFT and the base model.
-
-<!-- ## 🧠 Method
-BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model's capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
-
-BAGEL scales MoT's capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.

-

-
-<p align="center"><img src="assets/emerging_curves.png" width="95%"></p>

-

## 🔮 Notice
-<!-- **Call for Bad Cases:** If you have encountered any cases where the model performs poorly, we would greatly appreciate it if you could share them in the [issue#11](https://github.com/ByteDance-Seed/Bagel/issues/11) or [Discord](https://discord.gg/Z836xxzy). -->
Following Bagel's original settings, you should pay attention to the following:

**About Inference Hyperparameters:**

@@ -57,6 +85,62 @@ Following Bagel's original settings, you should pay attention to the following:
- `text_channel`: Like `channel`, but only applies to the text condition (good for editing, may cause blur).
- **If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min`, or decrease `cfg_scale`.**

## 📊 Benchmarks

### 1. Composition

@@ -91,5 +175,16 @@

*Performance comparison of Bagel models across four categories and their average scores. **Bold values** indicate the best performance in each column.*

## 📜 License
SRUM is licensed under the Apache License 2.0.

---
+base_model:
+- ByteDance-Seed/BAGEL-7B-MoT
language:
- en
+license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-to-image
+library_name: transformers
---

<p align="center">
+<img src="https://github.com/WayneJin0918/SRUM/raw/main/assets/srum_log_2.png" alt="SRUM" width="220"/>
+</p>
+
+<p align="center">
+<a href="https://huggingface.co/papers/2510.12784">
+<img
+src="https://img.shields.io/badge/SRUM-Paper-red"
+alt="SRUM Paper on Hugging Face"
+/>
+</a>
+<a href="https://github.com/WayneJin0918/SRUM">
+<img
+src="https://img.shields.io/badge/GitHub-Code-black"
+alt="GitHub Repository"
+/>
+</a>
<a href="https://waynejin0918.github.io/srum_web/">
<img
src="https://img.shields.io/badge/SRUM-Website-blue"
alt="SRUM Website"
/>
</a>
+<a href="https://huggingface.co/Wayne-King/SRUM_BAGEL_7B_MoT">
+<img
+src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue"
+alt="SRUM Model"
+/>
+</a>
+<a href="https://huggingface.co/datasets/Wayne-King/SRUM_6k_CompBench_Train">
+<img
+src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue"
+alt="SRUM Data"
+/>
+</a>
+<a href="https://huggingface.co/spaces/Wayne-King/SRUM_Bagel_MoT-7B">
+<img
+src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
+alt="SRUM Demo"
+/>
+</a>
</p>
# SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
> [Weiyang Jin*](https://github.com/WayneJin0918), [Yuwei Niu*](https://purshow.github.io/), Jiaqi Liao, [Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=en), Aoxue Li, [Shenghua Gao](https://scholar.google.com/citations?user=fe-1v0MAAAAJ&hl=en), [Xihui Liu :email:](https://xh-liu.github.io/)
>
> contact: [email protected]
+>
+> **Abstract:** Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model's strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal `evaluator`, providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a **global reward** ensures the correctness of the overall visual semantics and layout, while a **local reward** refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to **88.37** and on T2I-ReasonBench from 43.82 to **46.75**. Overall, our work establishes a powerful new paradigm for enabling a UMM's understanding module to guide and enhance its own generation via self-rewarding.

+We present **SRUM**, a post-training reward fine-tuning method for Unified Multimodal Models (UMMs) that leverages a UMM's inherent understanding capabilities to boost its generative abilities, bridging the performance gaps caused by conflicts during the earlier training phases. SRUM demonstrates exceptional generalization across both composition and world knowledge.
+The figure below showcases SRUM's qualitative performance compared with SFT and the base model.

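As a rough illustration of the global-local dual reward described in the abstract above (the function, weights, and data structure below are our own sketch, not the repository's actual API):

```python
# Hypothetical sketch of SRUM's global-local dual reward.
# All names and weights here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RegionScore:
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) of one object region
    score: float                    # understanding module's fidelity score in [0, 1]

def dual_reward(global_score: float, region_scores: list[RegionScore],
                w_global: float = 0.5, w_local: float = 0.5) -> float:
    """Combine a global semantics/layout score with object-level scores.

    Both kinds of scores come from the UMM's own understanding module,
    so no additional human-labeled data is needed.
    """
    local = sum(r.score for r in region_scores) / max(len(region_scores), 1)
    return w_global * global_score + w_local * local
```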
+## 📢 News

+We sincerely thank all contributors from the open community for their valuable support.

+- **Nov. 15, 2025:** We released the official [website](https://waynejin0918.github.io/srum_web/), [model](https://huggingface.co/Wayne-King/SRUM_BAGEL_7B_MoT), and [report](https://arxiv.org/abs/2510.12784) for SRUM. Please upvote our [Hugging Face daily paper](https://huggingface.co/papers/2510.12784) and try the [demo](https://huggingface.co/spaces/Wayne-King/SRUM_Bagel_MoT-7B).

## 🔮 Notice
Following Bagel's original settings, you should pay attention to the following:

**About Inference Hyperparameters:**

- `text_channel`: Like `channel`, but only applies to the text condition (good for editing, may cause blur).
- **If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min`, or decrease `cfg_scale`** (an illustrative settings sketch follows this list).

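As a rough illustration of how these knobs fit together (only the parameter names come from the notes above; the dict and its values are our assumptions, not official defaults):

```python
# Illustrative inference settings; values are assumptions, not official defaults.
inference_args = {
    "cfg_scale": 4.0,             # lower this if edited images look blurry
    "cfg_renorm_type": "global",  # one of "global", "channel", "text_channel"
    "cfg_renorm_min": 0.0,        # decreasing this can also reduce blur
}
```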
+## 🔥 Quick Start
+
+1️⃣ Set up environment
+```bash
+git clone https://github.com/WayneJin0918/SRUM
+cd SRUM
+conda env create -f environment.yaml
+conda activate SRUM
+pip install -r requirements.txt
+```
+If flash-attention is difficult to install via pip, use the prebuilt wheel below (it targets CUDA 12, PyTorch 2.5, and Python 3.10):
+
+```bash
+wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+```
+
+Alternatively, you can follow Bagel's original environment setup.
+
+2️⃣ Download the Bagel pretrained checkpoint or our SRUM checkpoint
+```python
+# Bagel
+from huggingface_hub import snapshot_download
+
+save_dir = "models/BAGEL-7B-MoT"
+repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
+cache_dir = save_dir + "/cache"
+
+snapshot_download(
+    cache_dir=cache_dir,
+    local_dir=save_dir,
+    repo_id=repo_id,
+    local_dir_use_symlinks=False,
+    resume_download=True,
+    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
+)
+```
+
+```python
+# SRUM
+from huggingface_hub import snapshot_download
+
+save_dir = "models/SRUM_BAGEL_7B_MoT"
+repo_id = "Wayne-King/SRUM_BAGEL_7B_MoT"
+cache_dir = save_dir + "/cache"
+
+snapshot_download(
+    cache_dir=cache_dir,
+    local_dir=save_dir,
+    repo_id=repo_id,
+    local_dir_use_symlinks=False,
+    resume_download=True,
+    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
+)
+```
+
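The two snippets above differ only in `repo_id` and `save_dir`; a minimal consolidated sketch (the helper name is ours, and note that `resume_download` and `local_dir_use_symlinks` are deprecated no-ops in recent huggingface_hub releases, so they are omitted here):

```python
from huggingface_hub import snapshot_download

def fetch_checkpoint(repo_id: str, save_dir: str) -> str:
    """Download one checkpoint into save_dir and return the local path."""
    return snapshot_download(
        repo_id=repo_id,
        local_dir=save_dir,
        cache_dir=save_dir + "/cache",
        allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
    )

fetch_checkpoint("ByteDance-Seed/BAGEL-7B-MoT", "models/BAGEL-7B-MoT")
fetch_checkpoint("Wayne-King/SRUM_BAGEL_7B_MoT", "models/SRUM_BAGEL_7B_MoT")
```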
## ๐ Benchmarks
|
| 145 |
|
| 146 |
### 1. Composition
|
|
|
|

*Performance comparison of Bagel models across four categories and their average scores. **Bold values** indicate the best performance in each column.*

+## ✒️ Citation
+
+```bibtex
+@article{jin2025srum,
+  title   = {SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models},
+  author  = {Jin, Weiyang and Niu, Yuwei and Liao, Jiaqi and Duan, Chengqi and Li, Aoxue and Gao, Shenghua and Liu, Xihui},
+  journal = {arXiv preprint arXiv:2510.12784},
+  year    = {2025}
+}
+```
+
## 📜 License
SRUM is licensed under the Apache License 2.0.