Testerpce's Collections: Multimodal
Qwen2.5-Omni Technical Report (arXiv:2503.20215, 166 upvotes)
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO (arXiv:2505.22453, 46 upvotes)
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning (arXiv:2505.23380, 22 upvotes)
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models (arXiv:2505.21523, 13 upvotes)
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces (arXiv:2506.00123, 35 upvotes)
Aligning Latent Spaces with Flow Priors (arXiv:2506.05240, 27 upvotes)
Discrete Diffusion in Large Language and Multimodal Models: A Survey (arXiv:2506.13759, 43 upvotes)
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation (arXiv:2506.14028, 93 upvotes)
Show-o2: Improved Native Unified Multimodal Models (arXiv:2506.15564, 28 upvotes)
OmniGen2: Exploration to Advanced Multimodal Generation (arXiv:2506.18871, 77 upvotes)
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv:2506.17218, 29 upvotes)
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation (arXiv:2506.17202, 10 upvotes)
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context (arXiv:2506.21277, 15 upvotes)
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning (arXiv:2507.01006, 236 upvotes)
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents (arXiv:2507.04590, 16 upvotes)
Robust Multimodal Large Language Models Against Modality Conflict (arXiv:2507.07151, 5 upvotes)
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models (arXiv:2507.07104, 45 upvotes)
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers (arXiv:2507.10787, 11 upvotes)
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning (arXiv:2507.22607, 46 upvotes)
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation (arXiv:2508.03320, 61 upvotes)
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation (arXiv:2508.18032, 41 upvotes)
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (arXiv:2508.18265, 201 upvotes)
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning (arXiv:2508.20751, 89 upvotes)
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning (arXiv:2509.01644, 33 upvotes)
Visual Representation Alignment for Multimodal Large Language Models (arXiv:2509.07979, 82 upvotes)
Reconstruction Alignment Improves Unified Multimodal Models (arXiv:2509.07295, 40 upvotes)
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer (arXiv:2509.16197, 52 upvotes)
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation (arXiv:2509.18824, 22 upvotes)
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training (arXiv:2509.26625, 42 upvotes)
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models (arXiv:2509.25848, 77 upvotes)
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance (arXiv:2509.26231, 17 upvotes)
Self-Improvement in Multimodal Large Language Models: A Survey (arXiv:2510.02665, 19 upvotes)