Commit b130812 by BioGeek · verified · 1 parent: c00adce

Add README for instanovoplus-v1.1.0

Files changed (1): README.md (+202 -6)
@@ -1,10 +1,206 @@
  ---
+ license: cc-by-nc-sa-4.0
+ library_name: pytorch
  tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin
+ - proteomics
+ - mass-spectrometry
+ - peptide-sequencing
+ - de-novo-sequencing
+ - diffusion
+ - multinomial-diffusion
+ - biology
+ - computational-biology
+ pipeline_tag: text-generation
+ datasets:
+ - InstaDeepAI/ms_ninespecies_benchmark
+ - InstaDeepAI/ms_proteometools
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Code: [More Information Needed]
- - Paper: [More Information Needed]
- - Docs: [More Information Needed]
+ # InstaNovoPlus: Diffusion-Powered De novo Peptide Sequencing Model
+
+
+
+ ## Model Description
+
+ InstaNovoPlus is a diffusion-based model for de novo peptide sequencing from mass spectrometry data. It uses multinomial diffusion to identify peptides accurately without a database search, which makes it suited to large-scale proteomics experiments.
+
+
+ ## Usage
+
+ ```python
+ import torch
+ import numpy as np
+ import pandas as pd
+ from instanovo.diffusion.multinomial_diffusion import InstaNovoPlus
+ from instanovo.utils import SpectrumDataFrame
+ from instanovo.transformer.dataset import SpectrumDataset, collate_batch
+ from torch.utils.data import DataLoader
+ from instanovo.inference.diffusion import DiffusionDecoder
+ from instanovo.utils.metrics import Metrics
+ from tqdm.auto import tqdm  # works in notebooks and in plain terminals
+
+ # Load the model from the Hugging Face Hub
+ model, config = InstaNovoPlus.from_pretrained("InstaDeepAI/instanovoplus-v1.1.0")
+
+ # Move the model to the GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device).eval()
+
+ # Update the residue set with custom modifications
+ model.residue_set.update_remapping(
+     {
+         "M(ox)": "M[UNIMOD:35]",  # Oxidation
+         "M(+15.99)": "M[UNIMOD:35]",
+         "S(p)": "S[UNIMOD:21]",  # Phosphorylation
+         "T(p)": "T[UNIMOD:21]",
+         "Y(p)": "Y[UNIMOD:21]",
+         "S(+79.97)": "S[UNIMOD:21]",
+         "T(+79.97)": "T[UNIMOD:21]",
+         "Y(+79.97)": "Y[UNIMOD:21]",
+         "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation
+         "N(+0.98)": "N[UNIMOD:7]",
+         "Q(+.98)": "Q[UNIMOD:7]",
+         "N(+.98)": "N[UNIMOD:7]",
+         "C(+57.02)": "C[UNIMOD:4]",  # Carboxyamidomethylation
+         "(+42.01)": "[UNIMOD:1]",  # Acetylation
+         "(+43.01)": "[UNIMOD:5]",  # Carbamylation
+         "(-17.03)": "[UNIMOD:385]",  # Ammonia loss
+     }
+ )
+
+ # Load the test data
+ sdf = SpectrumDataFrame.from_huggingface(
+     "InstaDeepAI/ms_ninespecies_benchmark",
+     is_annotated=True,
+     shuffle=False,
+     split="test[:10%]",  # Let's only use a subset of the test data for faster inference
+ )
+
+ # Create the dataset
+ ds = SpectrumDataset(
+     sdf,
+     model.residue_set,
+     config.get("n_peaks", 200),
+     return_str=False,
+     annotated=True,
+     peptide_pad_length=model.config.get("max_length", 30),
+     reverse_peptide=False,  # we do not reverse peptide for diffusion
+     add_eos=False,
+     tokenize_peptide=True,
+ )
+
+ # Create the data loader
+ dl = DataLoader(
+     ds,
+     batch_size=64,
+     num_workers=0,  # sdf requirement, handled internally
+     shuffle=False,  # sdf requirement, handled internally
+     collate_fn=collate_batch,
+ )
+
+ # Create the decoder
+ diffusion_decoder = DiffusionDecoder(model=model)
+
+ predictions = []
+ log_probs = []
+ targets = []
+
+ # Iterate over the data loader
+ for batch in tqdm(dl, total=len(dl)):
+     spectra, precursors, spectra_padding_mask, peptides, _ = batch
+     spectra = spectra.to(device)
+     precursors = precursors.to(device)
+     spectra_padding_mask = spectra_padding_mask.to(device)
+     peptides = peptides.to(device)
+
+     # Perform inference
+     with torch.no_grad():
+         batch_predictions, batch_log_probs = diffusion_decoder.decode(
+             spectra=spectra,
+             spectra_padding_mask=spectra_padding_mask,
+             precursors=precursors,
+             initial_sequence=peptides,
+         )
+     predictions.extend(batch_predictions)
+     log_probs.extend(batch_log_probs)
+     # Keep the ground-truth peptides for evaluation.
+     # Assumption: residue_set.decode maps token indices back to residue strings.
+     targets.extend(model.residue_set.decode(peptide) for peptide in peptides)
+
+ # Initialize metrics
+ metrics = Metrics(model.residue_set, config["isotope_error_range"])
+
+ # Compute precision and recall
+ aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
+     targets, predictions
+ )
+
+ # Compute amino acid error rate and AUC
+ aa_error_rate = metrics.compute_aa_er(targets, predictions)
+ auc = metrics.calc_auc(targets, predictions, np.exp(pd.Series(log_probs)))
+
+ print(f"amino acid error rate: {aa_error_rate:.5f}")
+ print(f"amino acid precision: {aa_precision:.5f}")
+ print(f"amino acid recall: {aa_recall:.5f}")
+ print(f"peptide precision: {peptide_precision:.5f}")
+ print(f"peptide recall: {peptide_recall:.5f}")
+ print(f"area under the PR curve: {auc:.5f}")
+ ```
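
The `update_remapping` call in the snippet above maps legacy modification notations (for example `M(ox)` or `C(+57.02)`) onto UNIMOD identifiers. As a rough, library-independent illustration of that idea, the sketch below rewrites a peptide string using a small subset of the same table. `LEGACY_TO_UNIMOD` and `normalize_modifications` are hypothetical names introduced here, and the plain string replacement is an assumption made for illustration, not InstaNovo's actual implementation.

```python
# Subset of the remapping table from the usage snippet (legacy notation -> UNIMOD).
LEGACY_TO_UNIMOD = {
    "M(ox)": "M[UNIMOD:35]",
    "M(+15.99)": "M[UNIMOD:35]",
    "S(p)": "S[UNIMOD:21]",
    "C(+57.02)": "C[UNIMOD:4]",
    "(+42.01)": "[UNIMOD:1]",
}


def normalize_modifications(peptide: str, table: dict[str, str] = LEGACY_TO_UNIMOD) -> str:
    """Replace every legacy modification token in `peptide` with its UNIMOD form."""
    # Apply longer keys first so residue-specific keys win over bare mass shifts.
    for legacy in sorted(table, key=len, reverse=True):
        peptide = peptide.replace(legacy, table[legacy])
    return peptide


print(normalize_modifications("AM(ox)C(+57.02)K"))
print(normalize_modifications("(+42.01)PEPS(p)K"))
```

Keeping a single canonical notation means that predictions and annotations written in different styles can be compared token by token.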
+
+ For a more detailed walkthrough, see the [Getting Started notebook](https://github.com/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb) in the repository.
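
The evaluation code in the usage snippet exponentiates the decoder's log-probabilities before computing the AUC, which turns them into confidence scores between 0 and 1. A minimal sketch of using such scores to keep only confident predictions follows. `filter_by_confidence`, the 0.9 threshold, and the example values are assumptions made for illustration, and each prediction is assumed to be a list of residue strings.

```python
import math


def filter_by_confidence(predictions, log_probs, min_confidence=0.9):
    """Return (peptide, confidence) pairs whose confidence reaches the threshold."""
    kept = []
    for residues, log_prob in zip(predictions, log_probs):
        confidence = math.exp(log_prob)  # log-probability -> probability
        if confidence >= min_confidence:
            kept.append(("".join(residues), confidence))
    return kept


# Made-up example values, not model output.
example_predictions = [
    ["P", "E", "P", "T", "I", "D", "E"],
    ["A", "M[UNIMOD:35]", "K"],
]
example_log_probs = [-0.05, -1.2]

confident = filter_by_confidence(example_predictions, example_log_probs)
for peptide, confidence in confident:
    print(f"{peptide}\t{confidence:.3f}")
```

A stricter threshold trades recall for precision, which is the trade-off the precision-recall AUC summarizes.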
+
+
+ ## Citation
+
+ If you use InstaNovoPlus in your research, please cite:
+
+ ```bibtex
+ @article{eloff_kalogeropoulos_2025_instanovo,
+   title   = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
+              proteomics experiments},
+   author  = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
+              Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
+              Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
+              and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
+              Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
+              Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
+   year    = {2025},
+   month   = {Mar},
+   day     = {31},
+   journal = {Nature Machine Intelligence},
+   doi     = {10.1038/s42256-025-01019-5},
+   issn    = {2522-5839},
+   url     = {https://doi.org/10.1038/s42256-025-01019-5}
+ }
+ ```
+
+
+ ## Resources
+
+ - **Code Repository**: [https://github.com/instadeepai/InstaNovo](https://github.com/instadeepai/InstaNovo)
+ - **Documentation**: [https://instadeepai.github.io/InstaNovo/](https://instadeepai.github.io/InstaNovo/)
+ - **Publication**: [https://www.nature.com/articles/s42256-025-01019-5](https://www.nature.com/articles/s42256-025-01019-5)
+
+ ## License
+
+ - **Code**: Licensed under Apache License 2.0
+ - **Model Checkpoints**: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
+
+ ## Installation
+
+ ```bash
+ pip install instanovo
+ ```
+
+ For GPU support, install with the CUDA dependencies (the quotes keep shells such as zsh from expanding the brackets):
+ ```bash
+ pip install "instanovo[cu126]"
+ ```
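
After installing, it can help to confirm which version was picked up and whether PyTorch can see a GPU. The sketch below is one way to do that with the standard library plus an optional `torch` import. `describe_install` is a hypothetical helper name introduced here, and the distribution name `instanovo` is taken from the `pip install` command above.

```python
from importlib import metadata, util


def describe_install(package: str = "instanovo") -> dict:
    """Report the installed version of `package` and whether CUDA is visible."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # package is not installed in this environment

    cuda_available = None  # stays None when PyTorch cannot be imported
    if util.find_spec("torch") is not None:
        try:
            import torch

            cuda_available = torch.cuda.is_available()
        except ImportError:
            pass

    return {"package": package, "version": version, "cuda_available": cuda_available}


print(describe_install())
```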
+
+ ## Requirements
+
+ - Python >= 3.10, < 3.13
+ - PyTorch >= 1.13.0
+ - CUDA (optional, for GPU acceleration)
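
The Python bound above is a half-open range: 3.10 is allowed, 3.13 is not. A small sketch of checking that range at runtime is shown below. `python_supported` is a hypothetical helper written for this illustration, and its default bounds simply restate the requirement listed above.

```python
import sys


def python_supported(version_info=sys.version_info,
                     minimum=(3, 10),
                     maximum_exclusive=(3, 13)) -> bool:
    """Return True when the (major, minor) version lies in [minimum, maximum_exclusive)."""
    current = tuple(version_info[:2])
    return minimum <= current < maximum_exclusive


if not python_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is outside the supported range")
```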
+
+
+ ## Support
+
+ For questions, issues, or contributions, please visit the [GitHub repository](https://github.com/instadeepai/InstaNovo) or check the [documentation](https://instadeepai.github.io/InstaNovo/).