LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
Abstract
LMEnt is a suite for analyzing knowledge acquisition in language models during pretraining, providing annotated corpora, retrieval methods, and pretrained models to study knowledge representations and learning dynamics.
Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate the study of these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.
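To make the idea of entity-based retrieval concrete, the following is a minimal sketch, not the LMEnt implementation. It assumes a hypothetical chunk format in which each chunk carries a list of resolved entity IDs; the IDs, field names, and helper functions are all invented for illustration. It contrasts lookup by entity ID with plain string matching, and counts subject-answer co-occurrences as one possible proxy for fact frequency (an assumption, not a definition taken from the abstract).

```python
from collections import defaultdict

# Hypothetical annotated chunks. The "entities" field and the entity IDs
# are illustrative assumptions, not the actual LMEnt annotation schema.
CHUNKS = [
    {"id": 0, "text": "Paris is the capital of France.",
     "entities": ["ent:paris_city", "ent:france"]},
    {"id": 1, "text": "Paris Hilton is an American media personality.",
     "entities": ["ent:paris_hilton"]},
    {"id": 2, "text": "The City of Light hosts the Louvre.",
     "entities": ["ent:paris_city", "ent:louvre"]},
]


def build_entity_index(chunks):
    """Map each entity ID to the set of chunk IDs that mention it."""
    index = defaultdict(set)
    for chunk in chunks:
        for entity_id in chunk["entities"]:
            index[entity_id].add(chunk["id"])
    return index


def retrieve_by_entity(index, entity_id):
    """Return sorted IDs of chunks annotated with the given entity."""
    return sorted(index.get(entity_id, set()))


def retrieve_by_string(chunks, surface_form):
    """Baseline: return sorted IDs of chunks whose text contains the string."""
    needle = surface_form.lower()
    return sorted(c["id"] for c in chunks if needle in c["text"].lower())


def cooccurrence_count(index, subject_id, answer_id):
    """Count chunks that mention both entities (a rough fact-frequency proxy)."""
    return len(index.get(subject_id, set()) & index.get(answer_id, set()))


index = build_entity_index(CHUNKS)
print(retrieve_by_entity(index, "ent:paris_city"))
print(retrieve_by_string(CHUNKS, "Paris"))
print(cooccurrence_count(index, "ent:paris_city", "ent:france"))
```

In this toy data, string matching on "Paris" picks up the chunk about a different entity and misses the chunk that refers to the city by an alias, while the entity lookup returns exactly the chunks annotated with the city.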
Community
LMEnt is an open-sourced suite for analyzing knowledge acquisition in language models during pretraining. It contains:
- A knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia.
- An entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%!
- 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks.
We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.