arnosimons committed
Commit a9fdfcd · verified · 1 Parent(s): ad78659

Update README.md

Files changed (1):
  1. README.md
README.md CHANGED
@@ -63,15 +63,19 @@ tags:

# Model Card for Astro-HEP-BERT

- **Astro-HEP-BERT** is a bidirectional transformer designed primarily to generate contextualized word embeddings for computational conceptual analysis in astrophysics and high-energy physics (HEP). Built upon Google's `bert-base-uncased`, the model underwent additional training for three epochs using the <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/datasets/arnosimons/astro-hep-corpus">Astro-HEP Corpus</a>, containing 21.84 million paragraphs found in more than 600,000 scholarly articles sourced from arXiv, all pertaining to astrophysics and/or high-energy physics (HEP). The sole training objective was masked language modeling.
+ **Astro-HEP-BERT** is a bidirectional transformer designed primarily to generate contextualized word embeddings for computational conceptual analysis in astrophysics and high-energy physics (HEP). Built upon Google's `bert-base-uncased`, the model underwent additional training for three epochs on the <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/datasets/arnosimons/astro-hep-corpus">Astro-HEP Corpus</a>, which contains 21.84 million paragraphs from more than 600,000 scholarly articles sourced from arXiv, all pertaining to astrophysics and/or high-energy physics. The sole training objective was **masked language modeling (MLM)**.
+
+ To optimize the model's ability to embed domain-specific language, **training was conducted exclusively on entire paragraphs** rather than on sequences packed with as many sentences as possible, as is often suggested in BERT tutorials. This "full-paragraphs format" preserves sentences within their original context, which is especially meaningful in academic writing, where a paragraph typically develops a single idea.

The Astro-HEP-BERT project demonstrates the general feasibility of training a customized bidirectional transformer for computational conceptual analysis in the history, philosophy, and sociology of science as an open-source endeavor that does not require a substantial budget. Leveraging only freely available code, weights, and text inputs, the entire training process was conducted on a single MacBook Pro laptop (M2/96GB).

- For further insights into the model, the corpus, and the underlying research project (<a target="_blank" rel="noopener noreferrer" href="https://doi.org/10.3030/101044932">Network Epistemology in Practice</a>), please refer to the following two papers:
+ For further insights into the model, the corpus, and the underlying research project (<a target="_blank" rel="noopener noreferrer" href="https://doi.org/10.3030/101044932">Network Epistemology in Practice</a>), please refer to the following three papers:

1) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2411.14877">Simons, A. (2024). Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics. arXiv:2411.14877.</a>

- 2) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2411.14073">Simons, A. (2024). Meaning at the planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science. arXiv:2411.14073.</a>
+ 2) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2411.14073">Simons, A. (2024). Meaning at the Planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science. arXiv:2411.14073.</a>
+
+ 3) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2506.12242">Simons, A., Zichert, M., & Wüthrich, A. (2025). Large language models for history, philosophy, and sociology of science: Interpretive uses, methodological challenges, and critical perspectives. arXiv:2506.12242.</a>


## Model Details
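As a quick illustration of the use case the updated card describes (contextualized word embeddings for conceptual analysis), here is a minimal sketch using the Hugging Face `transformers` library. The repo ID `arnosimons/astro-hep-bert`, the example sentence, and the choice to mean-pool the target term's subword vectors from the last hidden layer are assumptions for illustration, not details taken from the commit.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "arnosimons/astro-hep-bert"  # assumed repo ID, not stated in this commit

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

text = "The dark energy equation of state remains poorly constrained."
target = "dark energy"

# Encode the full passage so the target term is embedded in context,
# keeping character offsets to locate its subword tokens afterwards.
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True,
                truncation=True, max_length=512)
offsets = enc.pop("offset_mapping")[0].tolist()  # model() does not accept offsets

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)

# Indices of subword tokens overlapping the target span;
# special tokens carry empty (0, 0) offsets and are skipped.
start = text.index(target)
end = start + len(target)
token_ids = [i for i, (s, e) in enumerate(offsets) if s < end and e > start and e > s]

# Mean-pool the subword vectors into one contextualized embedding.
embedding = hidden[token_ids].mean(dim=0)
print(embedding.shape)  # torch.Size([768])
```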
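Likewise, a hedged sketch of what the "full-paragraphs format" added in this commit could look like as an MLM data-preparation step: each paragraph becomes exactly one training example (truncated at BERT's 512-token limit) rather than being split up and packed with other sentences. The paragraphs below are illustrative stand-ins, and the 15% masking rate is the standard BERT default; the commit itself contains no training code.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Illustrative stand-ins for paragraphs from the Astro-HEP Corpus.
paragraphs = [
    "Inflationary models predict a nearly scale-invariant spectrum of primordial perturbations.",
    "The Higgs mechanism endows the weak gauge bosons with mass while leaving the photon massless.",
]

# One paragraph = one training example; no sentence packing across examples.
encodings = tokenizer(paragraphs, truncation=True, max_length=512)

# Dynamic masking for the MLM objective (BERT's usual 15% rate).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
batch = collator([{"input_ids": ids} for ids in encodings["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)  # padded to longest paragraph
```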