# Model Card for Astro-HEP-BERT
**Astro-HEP-BERT** is a bidirectional transformer designed primarily to generate contextualized word embeddings for computational conceptual analysis in astrophysics and high-energy physics (HEP). Built on Google's `bert-base-uncased`, the model was trained for three additional epochs on the <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/datasets/arnosimons/astro-hep-corpus">Astro-HEP Corpus</a>, which contains 21.84 million paragraphs drawn from more than 600,000 scholarly articles on arXiv, all pertaining to astrophysics and/or high-energy physics. The sole training objective was **Masked Language Modeling (MLM)**.
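
Since Astro-HEP-BERT is a standard BERT checkpoint, it can be used directly with the Hugging Face `transformers` library. Below is a minimal sketch of extracting a contextualized embedding for a single target word; the model ID `arnosimons/astro-hep-bert` (inferred from the corpus namespace), the example sentence, and the use of the last hidden layer are illustrative assumptions, not a prescribed workflow.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "arnosimons/astro-hep-bert"  # assumed ID; adjust if it differs

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

sentence = "The event horizon hides the interior of the black hole."
target = "horizon"

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)

# Find the target's subword span by matching its token ids in context.
# This simple search assumes the word tokenizes the same way in isolation.
target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
ids = inputs["input_ids"][0].tolist()
start = next(
    i for i in range(len(ids) - len(target_ids) + 1)
    if ids[i : i + len(target_ids)] == target_ids
)

# Average the subword vectors into one contextualized word embedding.
embedding = hidden[start : start + len(target_ids)].mean(dim=0)
print(embedding.shape)  # torch.Size([768])
```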
To optimize the model's ability to embed domain-specific language, **training was conducted exclusively on entire paragraphs**, rather than on sequences packed with as many sentences as possible, as BERT tutorials often suggest. This "full-paragraphs format" preserves sentences within their original context, which matters especially in academic writing, where a paragraph typically develops a single idea.
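
To make the contrast with the packed-sequence recipe concrete, here is a sketch of paragraph-level tokenization. It assumes the corpus exposes one paragraph per row in a `text` column; the column name and the truncation length are assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
corpus = load_dataset("arnosimons/astro-hep-corpus", split="train")

def tokenize_paragraph(batch):
    # One example per paragraph: truncate at BERT's 512-token limit,
    # but never concatenate or split paragraphs to pack sequences,
    # so each training example keeps its original discursive context.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(
    tokenize_paragraph, batched=True, remove_columns=corpus.column_names
)
```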
The Astro-HEP-BERT project demonstrates the general feasibility of training a customized bidirectional transformer for computational conceptual analysis in the history, philosophy, and sociology of science as an open-source endeavor that does not require a substantial budget. Leveraging only freely available code, weights, and text inputs, the entire training process was conducted on a single MacBook Pro (M2, 96 GB).
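
For orientation, a continued-pretraining run along these lines can be assembled with the standard `Trainer` API, reusing the `tokenized` dataset from the sketch above. The hyperparameters below are illustrative placeholders, not the settings actually used for Astro-HEP-BERT.

```python
from transformers import (
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Standard BERT MLM objective: mask 15% of tokens at random.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="astro-hep-bert",
    num_train_epochs=3,              # three epochs, as described above
    per_device_train_batch_size=16,  # illustrative, not the authors' setting
    learning_rate=5e-5,              # illustrative, not the authors' setting
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()  # on Apple silicon, recent PyTorch uses the MPS backend automatically
```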
For further insights into the model, the corpus, and the underlying research project (<a target="_blank" rel="noopener noreferrer" href="https://doi.org/10.3030/101044932">Network Epistemology in Practice</a>), please refer to the following three papers:
1) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2411.14877">Simons, A (2024). Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics. arXiv:2411.14877.</a>
2) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2411.14073">Simons, A (2024). Meaning at the Planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science. arXiv:2411.14073.</a>
3) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2506.12242">Simons, A; Zichert, M; and Wüthrich, A (2025). Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives. arXiv:2506.12242.</a>
## Model Details