Abstract
A benchmark for code-diff understanding with tasks including apply, anti-apply, and diff generation, revealing optimal diff formats based on model size and use case.
Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code + diff → new code), anti-apply (new code − diff → old code), and diff generation (new code − old code → diff). Instances in the benchmark are triples ⟨old code, new code, diff⟩ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to conduct a focused empirical study of the unified diff format and to run a cross-format comparison of different diff representations. Our findings reveal that the best format depends on the use case and model size. For example, the search-replace format works well for larger models in the diff generation scenario, yet is less well suited to diff analysis and to smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and code-editing models. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
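To make the task setup concrete, here is a minimal sketch of how a single ⟨old code, new code, diff⟩ triple relates to the three tasks and to the diff representations compared in the paper. The code snippet, file names, prompt wording, and search/replace marker syntax below are illustrative assumptions, not the benchmark's actual data, prompts, or format definitions.

```python
import difflib

# A hypothetical benchmark instance: the triple <old code, new code, diff>.
# The contents and file names are made up for illustration.
old_code = "def greet(name):\n    print('Hello, ' + name)\n"
new_code = "def greet(name: str) -> None:\n    print(f'Hello, {name}')\n"

# Diff generation: new code - old code -> diff, rendered here in unified format.
unified = "".join(
    difflib.unified_diff(
        old_code.splitlines(keepends=True),
        new_code.splitlines(keepends=True),
        fromfile="a/greet.py",
        tofile="b/greet.py",
    )
)
print(unified)

# The same edit in a search/replace-style representation (the marker syntax is
# an assumption; concrete formats differ between tools and papers).
search_replace = (
    "<<<<<<< SEARCH\n"
    "def greet(name):\n"
    "    print('Hello, ' + name)\n"
    "=======\n"
    "def greet(name: str) -> None:\n"
    "    print(f'Hello, {name}')\n"
    ">>>>>>> REPLACE\n"
)

# Apply (old code + diff -> new code) and anti-apply (new code + diff -> old code)
# are text-to-text tasks for the model; a minimal apply prompt might look like:
apply_prompt = (
    "Apply the following unified diff to the code and return only the updated code.\n\n"
    f"Code:\n{old_code}\nDiff:\n{unified}"
)
```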
Community
We present a paper that explores LLMs' capabilities to understand and to generate diffs.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Generating High-Quality Datasets for Code Editing via Open-Source Language Models (2025)
- Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset (2025)
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? (2025)
- VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities (2025)
- A Benchmark for Localizing Code and Non-Code Issues in Software Projects (2025)
- GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging (2025)
- TENET: Leveraging Tests Beyond Validation for Code Generation (2025)