Update README.md
README.md
---
language: en
license: cc-by-4.0
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
  - bert
  - accelerator-physics
  - physics
  - scientific-literature
  - embeddings
  - domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
  - name: AccPhysBERT
    results:
      - task:
          type: feature-extraction
          name: Feature Extraction
        dataset:
          name: Accelerator Physics Publications
          type: accelerator-physics
        metrics:
          - type: cosine_accuracy
            value: 0.91
            name: Citation Classification
          - type: v_measure
            value: 0.637
            name: Category Clustering (main)
          - type: ndcg_at_10
            value: 0.663
            name: Information Retrieval
datasets:
  - inspire-hep
---

# AccPhysBERT

**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.

---

## Model Description

- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased/tree/main) using supervised contrastive learning (SimCSE).
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- **Notable Features**:
  - Trained on 109k accelerator-physics publications from INSPIRE HEP
  - Leverages 690k citation pairs and 2M synthetic query–source pairs
  - Trained via SentenceTransformers to produce dense, semantically rich embeddings

**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro  
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory  
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned)  
**Language**: English  
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025)  
[https://doi.org/10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601)

---

## Training Data

- **Core Corpus**:
  - 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")
  - Over 1 GB of full-text markdown-style text (via OCR/Nougat)

- **Annotation Sources** (see the sketch below):
  - 690,000 citation pairs
  - 49 semantic categories labeled via ChatGPT-4o
  - 2,000,000 synthetic query–source pairs generated with LLaMA3-70B
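
In SentenceTransformers terms, each citation pair or query–source pair supplies one positive example, with the other texts in a batch serving as negatives. A minimal sketch of how such pairs are typically wrapped for training; the pair texts are illustrative, and this is not the authors' actual data pipeline:

```python
from sentence_transformers import InputExample
from torch.utils.data import DataLoader

# Each (anchor, related) pair is one positive example; the remaining
# texts in a batch act as in-batch negatives during training.
pairs = [
    ("Beam lifetime studies at the Advanced Light Source.",   # illustrative
     "Touschek scattering dominates the measured lifetime."),
]
examples = [InputExample(texts=[a, b]) for a, b in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=512)
```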

---

## Training Procedure

- **Fine-tuning Method**: SimCSE (contrastive loss; see the sketch below)
- **Hyperparameters**:
  - Batch size: 512
  - Learning rate: 2e-4
  - Temperature: 0.05
  - Weight decay: 0.01
  - Optimizer: Adam
  - Epochs: 2
- **Infrastructure**: 32 × NVIDIA A100 GPUs @ NERSC
- **Framework**: SentenceTransformers
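
For intuition, the temperature-scaled contrastive (InfoNCE) objective behind SimCSE-style training fits in a few lines of PyTorch. This is an illustration of the loss, not the actual training script:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """InfoNCE over in-batch negatives: each row's true pair sits on
    the diagonal of the cosine-similarity matrix."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature   # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```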

---

## Evaluation Results

| Task                     | Metric                | Score         |
|--------------------------|-----------------------|---------------|
| Citation Classification  | Cosine accuracy       | 91.0%         |
| Category Clustering      | V-measure (main/sub)  | 63.7% / 77.2% |
| Information Retrieval    | nDCG@10               | 66.3%         |

AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models on all accelerator-specific benchmarks.
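
The retrieval task reduces to ranking candidate documents by cosine similarity to the query embedding; nDCG@10 then scores the ten highest-ranked results. A minimal sketch of the ranking step (the function name is illustrative):

```python
import torch.nn.functional as F

def top_k_by_cosine(query_emb, doc_embs, k=10):
    """Indices of the k documents most similar to the query.
    query_emb: (dim,); doc_embs: (n_docs, dim)."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs, dim=-1)
    return sims.topk(min(k, doc_embs.size(0))).indices
```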

---

## Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only; no gradients needed
    outputs = model(**inputs)

# Mean-pool the token embeddings, excluding [CLS] and [SEP]
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
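
The slice above assumes a single, unpadded sequence. For batched inputs, mean pooling should respect the attention mask. The sketch below (continuing from the `tokenizer` and `model` loaded above) averages over all non-padding tokens, special tokens included; treat it as one reasonable convention rather than the authors' prescribed recipe:

```python
# Batched, attention-mask-aware mean pooling (example sentences are
# illustrative only).
texts = [
    "Emittance growth in the booster-to-storage-ring transfer line.",
    "RF cavity detuning compensation under heavy beam loading.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, tokens, 1)
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```

If the checkpoint ships a SentenceTransformers configuration (the `library_name` in the front matter suggests it does), `SentenceTransformer("thellert/accphysbert").encode(texts)` handles tokenization and pooling in one call.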

---

## Citation

If you use AccPhysBERT, please cite:

```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```

---

## Contact

Thorsten Hellert  
Lawrence Berkeley National Laboratory

---

## Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.