Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations Paper • 2508.09789 • Published Aug 13 • 5
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents Paper • 2508.13186 • Published Aug 14 • 18
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents Paper • 2508.04038 • Published Aug 6 • 1
MultiRef: Controllable Image Generation with Multiple Visual References Paper • 2508.06905 • Published Aug 9 • 21
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos Paper • 2508.14041 • Published Aug 19 • 59
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL Paper • 2508.13167 • Published Aug 6 • 127
Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward Paper • 2508.12800 • Published Aug 18 • 5
Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends Paper • 2508.11548 • Published Aug 15 • 5
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge Paper • 2508.08777 • Published Aug 12 • 15
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer Paper • 2508.09131 • Published Aug 12 • 16
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers Paper • 2508.14704 • Published Aug 20 • 42
From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery Paper • 2508.14111 • Published Aug 18 • 33
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Paper • 2505.04921 • Published May 8 • 185
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding Paper • 2501.05452 • Published Jan 9 • 15
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models Paper • 2504.15279 • Published Apr 21 • 77
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities Paper • 2406.14562 • Published Jun 20, 2024 • 28
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published Jan 10 • 65
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models Paper • 2505.13444 • Published May 19 • 16
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? Paper • 2407.01284 • Published Jul 1, 2024 • 81
ComposeAnything: Composite Object Priors for Text-to-Image Generation Paper • 2505.24086 • Published May 30 • 5
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers Paper • 2506.23918 • Published Jun 30 • 88
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model Paper • 2407.07053 • Published Jul 9, 2024 • 47
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning Paper • 2403.12884 • Published Mar 19, 2024 • 1
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography Paper • 2504.10090 • Published Apr 14
Visual Programming: Compositional visual reasoning without training Paper • 2211.11559 • Published Nov 18, 2022 • 1
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning Paper • 2408.02210 • Published Aug 5, 2024 • 9
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks Paper • 2412.18072 • Published Dec 24, 2024 • 20
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention Paper • 2504.06261 • Published Apr 8 • 110
Star Attention: Efficient LLM Inference over Long Sequences Paper • 2411.17116 • Published Nov 26, 2024 • 55
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters Paper • 2504.08791 • Published Apr 7 • 136
LLM Inference Unveiled: Survey and Roofline Model Insights Paper • 2402.16363 • Published Feb 26, 2024 • 4
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures Paper • 2504.11750 • Published Apr 16
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices Paper • 2410.11795 • Published Oct 15, 2024 • 18
Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions Paper • 2504.19056 • Published Apr 27 • 18
Personalized Image Generation with Deep Generative Models: A Decade Survey Paper • 2502.13081 • Published Feb 18
Diffusion Models: A Comprehensive Survey of Methods and Applications Paper • 2209.00796 • Published Sep 2, 2022
ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation Paper • 2502.09411 • Published Feb 13 • 22
Text-to-image Diffusion Models in Generative AI: A Survey Paper • 2303.07909 • Published Mar 14, 2023
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration Paper • 2412.04440 • Published Dec 5, 2024 • 22
AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving Paper • 2506.12508 • Published Jun 14 • 1
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence Paper • 2407.07061 • Published Jul 9, 2024 • 27
VideoTetris: Towards Compositional Text-to-Video Generation Paper • 2406.04277 • Published Jun 6, 2024 • 25
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation Paper • 2407.14505 • Published Jul 19, 2024 • 26
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation Paper • 2411.16657 • Published Nov 25, 2024 • 20
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations Paper • 2411.10818 • Published Nov 16, 2024 • 26
VideoPoet: A Large Language Model for Zero-Shot Video Generation Paper • 2312.14125 • Published Dec 21, 2023 • 47
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices Paper • 2504.03664 • Published Mar 15
FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference Paper • 2503.03777 • Published Mar 4
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs Paper • 2503.16163 • Published Mar 20
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading Paper • 2502.12574 • Published Feb 18 • 12
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints Paper • 2504.09345 • Published Apr 12
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published Apr 14 • 298
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation Paper • 2508.18032 • Published Aug 25 • 41
PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs Paper • 2508.17188 • Published Aug 24 • 17
Explain Before You Answer: A Survey on Compositional Visual Reasoning Paper • 2508.17298 • Published Aug 24 • 4
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs Paper • 2508.16153 • Published Aug 22 • 154
AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications Paper • 2508.16279 • Published Aug 22 • 52
CineScale: Free Lunch in High-Resolution Cinematic Visual Generation Paper • 2508.15774 • Published Aug 21 • 20
Self-Rewarding Vision-Language Model via Reasoning Decomposition Paper • 2508.19652 • Published Aug 27 • 84
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies Paper • 2508.20072 • Published Aug 27 • 31
AudioStory: Generating Long-Form Narrative Audio with Large Language Models Paper • 2508.20088 • Published Aug 27 • 20
MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment Paper • 2508.19527 • Published Aug 27 • 10
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference Paper • 2508.19559 • Published Aug 27 • 6
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning Paper • 2508.20751 • Published Aug 28 • 89
Dress&Dance: Dress up and Dance as You Like It - Technical Preview Paper • 2508.21070 • Published Aug 28 • 6
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control Paper • 2508.21112 • Published Aug 28 • 75
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code Paper • 2508.18106 • Published Aug 25 • 341
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning Paper • 2508.21113 • Published Aug 28 • 109
UItron: Foundational GUI Agent with Advanced Perception and Planning Paper • 2508.21767 • Published Aug 29 • 12
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training Paper • 2508.17677 • Published Aug 25 • 14
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers Paper • 2508.21148 • Published Aug 28 • 140
Continual Learning for Large Language Models: A Survey Paper • 2402.01364 • Published Feb 2, 2024 • 1
Continual Learning: Applications and the Road Forward Paper • 2311.11908 • Published Nov 20, 2023 • 1
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey Paper • 2509.02547 • Published Sep 2 • 219
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning Paper • 2509.02479 • Published Sep 2 • 83
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding Paper • 2508.21496 • Published Aug 29 • 54
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use Paper • 2509.01055 • Published Sep 1 • 72
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion Paper • 2509.01215 • Published Sep 1 • 50
GenCompositor: Generative Video Compositing with Diffusion Transformer Paper • 2509.02460 • Published Sep 2 • 25
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning Paper • 2509.01644 • Published Sep 1 • 33
Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation Paper • 2509.00428 • Published Aug 30 • 17
Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings Paper • 2508.18733 • Published Aug 26 • 9
Towards a Unified View of Large Language Model Post-Training Paper • 2509.04419 • Published Sep 4 • 73
RedStone: Curating General, Code, Math, and QA Data for Large Language Models Paper • 2412.03398 • Published Dec 4, 2024 • 2
RecAgent: A Novel Simulation Paradigm for Recommender Systems Paper • 2306.02552 • Published Jun 5, 2023 • 1
Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning Paper • 2503.11646 • Published Mar 14 • 35
How do language models learn facts? Dynamics, curricula and hallucinations Paper • 2503.21676 • Published Mar 27 • 1
Investigating Multi-source Active Learning for Natural Language Inference Paper • 2302.06976 • Published Feb 14, 2023
Targeted Data Acquisition for Evolving Negotiation Agents Paper • 2106.07728 • Published Jun 14, 2021
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts Paper • 2509.06155 • Published Sep 7 • 13
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models Paper • 2509.06949 • Published Sep 8 • 56
Reinforcement Learning Foundations for Deep Research Systems: A Survey Paper • 2509.06733 • Published Sep 8 • 31
Visual Representation Alignment for Multimodal Large Language Models Paper • 2509.07979 • Published Sep 9 • 83
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions Paper • 2509.06951 • Published Sep 8 • 31
A Survey of Reinforcement Learning for Large Reasoning Models Paper • 2509.08827 • Published Sep 10 • 184
HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants Paper • 2509.08494 • Published Sep 10
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model Paper • 2509.09372 • Published Sep 11 • 233
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning Paper • 2509.08519 • Published Sep 10 • 126
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning Paper • 2509.09674 • Published Sep 11 • 78
Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis Paper • 2509.09595 • Published Sep 11 • 48
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations Paper • 2509.09676 • Published Sep 11 • 31
Visual Programmability: A Guide for Code-as-Thought in Chart Understanding Paper • 2509.09286 • Published Sep 11 • 11
Agentic Software Engineering: Foundational Pillars and a Research Roadmap Paper • 2509.06216 • Published Sep 7 • 7
AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities Paper • 2508.11126 • Published Aug 15
Agentic AI Frameworks: Architectures, Protocols, and Design Challenges Paper • 2508.10146 • Published Aug 13
Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs Paper • 2509.15020 • Published Sep 18 • 4
Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality Paper • 2509.10402 • Published Sep 12 • 5
Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding Paper • 2509.15178 • Published Sep 18 • 6
RecoWorld: Building Simulated Environments for Agentic Recommender Systems Paper • 2509.10397 • Published Sep 12 • 7
MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks Paper • 2509.14638 • Published Sep 18 • 11
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning Paper • 2509.13160 • Published Sep 16 • 29
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation Paper • 2509.15185 • Published Sep 18 • 29
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation Paper • 2509.15194 • Published Sep 18 • 33
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data Paper • 2509.15221 • Published Sep 18 • 109
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation Paper • 2509.14760 • Published Sep 18 • 52
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19 • 54
Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification Paper • 2509.15591 • Published Sep 19 • 45
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models Paper • 2509.17627 • Published Sep 22 • 65
OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System Paper • 2509.18091 • Published Sep 22 • 33
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs Paper • 2509.18056 • Published Sep 22 • 27
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning Paper • 2509.17437 • Published Sep 22 • 17
EpiCache: Episodic KV Cache Management for Long Conversational Question Answering Paper • 2509.17396 • Published Sep 22 • 19
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Paper • 2509.16941 • Published Sep 21 • 20
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions Paper • 2509.17177 • Published Sep 21 • 13
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels Paper • 2509.16596 • Published Sep 20 • 14
Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning Paper • 2509.18083 • Published Sep 22 • 5
ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment Paper • 2509.17818 • Published Sep 22 • 8
AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing? Paper • 2509.17641 • Published Sep 22 • 4
DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context Paper • 2509.17399 • Published Sep 22 • 2
When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs Paper • 2509.16633 • Published Sep 20 • 1
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe Paper • 2509.18154 • Published Sep 16 • 49
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation Paper • 2509.18824 • Published Sep 23 • 22
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT Paper • 2509.19284 • Published Sep 23 • 22
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction Paper • 2509.19002 • Published Sep 23 • 2
EmbeddingGemma: Powerful and Lightweight Text Representations Paper • 2509.20354 • Published Sep 24 • 39
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning Paper • 2509.20360 • Published Sep 24 • 17
PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation Paper • 2509.20358 • Published Sep 24 • 14
Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation Paper • 2509.19244 • Published Sep 23 • 11
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say Paper • 2509.21164 • Published Sep 25 • 8
VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models Paper • 2509.19803 • Published Sep 24 • 117
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines Paper • 2509.21320 • Published Sep 25 • 99
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources Paper • 2509.21268 • Published Sep 25 • 101
Seedream 4.0: Toward Next-generation Multimodal Image Generation Paper • 2509.20427 • Published Sep 24 • 76
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them Paper • 2509.21117 • Published Sep 25 • 29
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution Paper • 2509.21072 • Published Sep 25 • 15
Does FLUX Already Know How to Perform Physically Plausible Image Composition? Paper • 2509.21278 • Published Sep 25 • 14
Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld's Episode Theory Paper • 2509.14662 • Published Sep 18 • 13
SD3.5-Flash: Distribution-Guided Distillation of Generative Flows Paper • 2509.21318 • Published Sep 25 • 10
UserRL: Training Interactive User-Centric Agent via Reinforcement Learning Paper • 2509.19736 • Published Sep 24 • 11
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning Paper • 2509.21113 • Published Sep 25 • 5
SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent Paper • 2509.20414 • Published Sep 24 • 9
Thinking While Listening: Simple Test Time Scaling For Audio Classification Paper • 2509.19676 • Published Sep 24 • 4
When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity Paper • 2509.20293 • Published Sep 24 • 7
Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving Paper • 2509.20109 • Published Sep 24 • 3
Blueprints of Trust: AI System Cards for End to End Transparency and Governance Paper • 2509.20394 • Published Sep 23 • 2
StyleBench: Evaluating thinking styles in Large Language Models Paper • 2509.20868 • Published Sep 25 • 3
OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps Paper • 2509.19282 • Published Sep 23 • 6
LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer Paper • 2509.22414 • Published Sep 26 • 21
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models Paper • 2509.21760 • Published Sep 26 • 14
VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing Paper • 2509.22651 • Published Sep 26 • 22
Evaluating Very Long-Term Conversational Memory of LLM Agents Paper • 2402.17753 • Published Feb 27, 2024 • 20
VBench: Comprehensive Benchmark Suite for Video Generative Models Paper • 2311.17982 • Published Nov 29, 2023 • 9
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness Paper • 2503.21755 • Published Mar 27 • 33
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models Paper • 2411.13503 • Published Nov 20, 2024 • 34
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation Paper • 2406.16855 • Published Jun 24, 2024 • 57
AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection Paper • 2504.20865 • Published Apr 29
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices Paper • 2506.17538 • Published Jun 21 • 7
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol Paper • 2503.05860 • Published Mar 7 • 11
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks Paper • 2507.12284 • Published Jul 16 • 1
SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation Paper • 2406.14991 • Published Jun 21, 2024 • 2
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation Paper • 2506.00482 • Published May 31 • 8
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions Paper • 2406.15877 • Published Jun 22, 2024 • 48
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains Paper • 2407.18961 • Published Jul 18, 2024 • 40
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation Paper • 2504.02782 • Published Apr 3 • 57
7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models Paper • 2508.12919 • Published Aug 18
Instruction-Following Evaluation in Function Calling for Large Language Models Paper • 2509.18420 • Published Sep 22 • 1
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Paper • 2509.22186 • Published Sep 26 • 127
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs Paper • 2509.22220 • Published Sep 26 • 64
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark Paper • 2509.24897 • Published Sep 29 • 46
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing Paper • 2509.24900 • Published Sep 29 • 53
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published May 14 • 98
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities Paper • 2505.02567 • Published May 5 • 80
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Paper • 2406.12644 • Published Jun 18, 2024 • 5
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies Paper • 2506.12830 • Published Jun 15
CompBench: Benchmarking Complex Instruction-guided Image Editing Paper • 2505.12200 • Published May 18
Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination Paper • 2509.01986 • Published Sep 2 • 4
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment Paper • 2310.11513 • Published Oct 17, 2023 • 1
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer Paper • 2509.24695 • Published Sep 29 • 43
EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering Paper • 2509.25175 • Published Sep 29 • 29
Towards Personalized Deep Research: Benchmarks and Evaluations Paper • 2509.25106 • Published Sep 29 • 27
VideoScore2: Think before You Score in Generative Video Evaluation Paper • 2509.22799 • Published Sep 26 • 24
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time Paper • 2509.25161 • Published Sep 29 • 23
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images Paper • 2509.25185 • Published Sep 29 • 4
Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification Paper • 2509.23061 • Published Sep 27 • 6
PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation Paper • 2509.23338 • Published Sep 27 • 4
BPMN Assistant: An LLM-Based Approach to Business Process Modeling Paper • 2509.24592 • Published Sep 29 • 1
Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models Paper • 2509.23233 • Published Sep 27 • 2
Advancing Reference-free Evaluation of Video Captions with Factual Analysis Paper • 2509.16538 • Published Sep 20
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28 • 170
OceanGym: A Benchmark Environment for Underwater Embodied Agents Paper • 2509.26536 • Published Sep 30 • 34
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder Paper • 2509.25182 • Published Sep 29 • 36
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training Paper • 2509.26625 • Published Sep 30 • 43
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications Paper • 2509.26490 • Published Sep 30 • 19
TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics Paper • 2509.26329 • Published Sep 30 • 2
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software Paper • 2509.25248 • Published Sep 27 • 2
Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation Paper • 2509.26555 • Published Sep 30
Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents Paper • 2509.26539 • Published Sep 30 • 8
Improving Editability in Image Generation with Layer-wise Memory Paper • 2505.01079 • Published May 2 • 29
Edit Transfer: Learning Image Editing via Vision In-Context Relations Paper • 2503.13327 • Published Mar 17 • 29
Text2Layer: Layered Image Generation using Latent Diffusion Model Paper • 2307.09781 • Published Jul 19, 2023 • 15
Code2Video: A Code-centric Paradigm for Educational Video Generation Paper • 2510.01174 • Published Oct 1 • 33
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses Paper • 2510.00232 • Published Sep 30 • 15
In-Place Feedback: A New Paradigm for Guiding LLMs in Multi-Turn Reasoning Paper • 2510.00777 • Published Oct 1 • 2
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications Paper • 2509.19185 • Published Sep 23 • 3
Can Large Multimodal Models Uncover Deep Semantics Behind Images? Paper • 2402.11281 • Published Feb 17, 2024 • 1
Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models Paper • 2509.25162 • Published Sep 29 • 3
BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration Paper • 2510.00438 • Published Oct 1 • 4
BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs Paper • 2509.26514 • Published Sep 30 • 3
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation Paper • 2510.02283 • Published about 1 month ago • 91
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? Paper • 2510.02209 • Published about 1 month ago • 51
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation Paper • 2510.01284 • Published Sep 30 • 31
A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports Paper • 2510.02190 • Published about 1 month ago • 18
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs Paper • 2504.17432 • Published Apr 24 • 39
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation Paper • 2411.04997 • Published Nov 7, 2024 • 39
Veagle: Advancements in Multimodal Representation Learning Paper • 2403.08773 • Published Jan 18, 2024 • 10
CoDA: Agentic Systems for Collaborative Data Visualization Paper • 2510.03194 • Published 29 days ago • 28
SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys? Paper • 2510.03120 • Published 29 days ago • 6
Paper2Video: Automatic Video Generation from Scientific Papers Paper • 2510.05096 • Published 26 days ago • 109
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation Paper • 2510.05094 • Published 26 days ago • 36
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models Paper • 2510.04618 • Published 27 days ago • 112
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights Paper • 2510.04800 • Published 26 days ago • 36
Cache-to-Cache: Direct Semantic Communication Between Large Language Models Paper • 2510.03215 • Published 29 days ago • 93
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer Paper • 2510.06590 • Published 25 days ago • 70
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding Paper • 2510.06308 • Published 25 days ago • 52
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models Paper • 2510.06917 • Published 24 days ago • 34
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization Paper • 2510.08540 • Published 23 days ago • 108
MATRIX: Mask Track Alignment for Interaction-aware Video Generation Paper • 2510.07310 • Published 24 days ago • 35
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training Paper • 2510.06710 • Published 25 days ago • 36
Vibe Checker: Aligning Code Evaluation with Human Preference Paper • 2510.07315 • Published 24 days ago • 30
U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking Paper • 2510.07041 • Published 24 days ago • 3
DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents Paper • 2509.21842 • Published Sep 26 • 2
UniVideo: Unified Understanding, Generation, and Editing for Videos Paper • 2510.08377 • Published 23 days ago • 68
UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution Paper • 2510.08143 • Published 23 days ago • 20
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG Paper • 2510.03663 • Published 29 days ago • 15
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents Paper • 2510.07172 • Published 24 days ago • 28
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning Paper • 2510.08555 • Published 23 days ago • 62
Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training Paper • 2510.08008 • Published 24 days ago • 5
Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs Paper • 2510.07429 • Published 24 days ago • 3
Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window Paper • 2510.08276 • Published 23 days ago • 9
SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models Paper • 2510.08559 • Published 23 days ago • 8
WithAnyone: Towards Controllable and ID Consistent Image Generation Paper • 2510.14975 • Published 16 days ago • 79
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published 16 days ago • 65
Attention Is All You Need for KV Cache in Diffusion LLMs Paper • 2510.14973 • Published 16 days ago • 37
Learning an Image Editing Model without Image Editing Pairs Paper • 2510.14978 • Published 16 days ago • 7
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation Paper • 2510.14974 • Published 16 days ago • 7
RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems Paper • 2510.13910 • Published 18 days ago • 1
DeepAgent: A General Reasoning Agent with Scalable Toolsets Paper • 2510.21618 • Published 8 days ago • 90
Video-As-Prompt: Unified Semantic Control for Video Generation Paper • 2510.20888 • Published 9 days ago • 41
UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning Paper • 2510.20286 • Published 10 days ago • 22
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model Paper • 2510.19871 • Published 11 days ago • 28
RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging Paper • 2510.20479 • Published 9 days ago • 11
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs Paper • 2510.13251 • Published 18 days ago • 12
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling Paper • 2510.20206 • Published 10 days ago • 11
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite Paper • 2510.21652 • Published 8 days ago • 3
ARC-Encoder: learning compressed text representations for large language models Paper • 2510.20535 • Published 9 days ago • 5
Taming Modality Entanglement in Continual Audio-Visual Segmentation Paper • 2510.17234 • Published 13 days ago • 3
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Paper • 2307.16789 • Published Jul 31, 2023 • 101
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs Paper • 2304.08244 • Published Apr 14, 2023 • 1
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use Paper • 2501.02506 • Published Jan 5 • 11
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents Paper • 2207.01206 • Published Jul 4, 2022 • 3
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents Paper • 2510.24563 • Published 4 days ago • 22
WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking Paper • 2510.24697 • Published 4 days ago • 20
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions Paper • 2510.10666 • Published 20 days ago • 27
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models Paper • 2506.01062 • Published Jun 1 • 5
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance Paper • 2510.24711 • Published 4 days ago • 18
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations Paper • 2510.22373 • Published 7 days ago • 7
PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding Paper • 2510.22264 • Published 7 days ago • 1
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark Paper • 2510.26802 • Published 2 days ago • 28
AMO-Bench: Large Language Models Still Struggle in High School Math Competitions Paper • 2510.26768 • Published 2 days ago • 30
The Era of Agentic Organization: Learning to Organize with Language Models Paper • 2510.26658 • Published 2 days ago • 20
OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation Paper • 2510.26213 • Published 3 days ago • 8
Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets Paper • 2510.25779 • Published 5 days ago • 8
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark Paper • 2510.26160 • Published 3 days ago • 3
The End of Manual Decoding: Towards Truly End-to-End Language Models Paper • 2510.26697 • Published 2 days ago • 90
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning Paper • 2510.23473 • Published 5 days ago • 79
JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence Paper • 2510.23538 • Published 5 days ago • 90
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution Paper • 2510.25726 • Published 3 days ago • 41
VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning Paper • 2510.25772 • Published 3 days ago • 32
RegionE: Adaptive Region-Aware Generation for Efficient Image Editing Paper • 2510.25590 • Published 3 days ago • 24
Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks Paper • 2510.25760 • Published 3 days ago • 16
SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs Paper • 2510.25092 • Published 4 days ago • 6
Reasoning Language Model Inference Serving Unveiled: An Empirical Study Paper • 2510.18672 • Published 11 days ago • 6
InteractComp: Evaluating Search Agents With Ambiguous Queries Paper • 2510.24668 • Published 4 days ago • 94