---
license: cc-by-nc-sa-4.0
library_name: pytorch
tags:
- proteomics
- mass-spectrometry
- peptide-sequencing
- de-novo-sequencing
- diffusion
- multinomial-diffusion
- biology
- computational-biology
pipeline_tag: text-generation
datasets:
- InstaDeepAI/ms_ninespecies_benchmark
- InstaDeepAI/ms_proteometools
---

# InstaNovoPlus: Diffusion-Powered De novo Peptide Sequencing Model

## Model Description

InstaNovoPlus is a diffusion-based model for de novo peptide sequencing from mass spectrometry data. It uses multinomial diffusion to provide accurate, database-free peptide identification in large-scale proteomics experiments.

## Usage

```python
import torch
import numpy as np
import pandas as pd
from instanovo.diffusion.multinomial_diffusion import InstaNovoPlus
from instanovo.utils import SpectrumDataFrame
from instanovo.transformer.dataset import SpectrumDataset, collate_batch
from torch.utils.data import DataLoader
from instanovo.inference.diffusion import DiffusionDecoder
from instanovo.utils.metrics import Metrics
from tqdm.notebook import tqdm

# Load the model from the Hugging Face Hub
model, config = InstaNovoPlus.from_pretrained("InstaDeepAI/instanovoplus-v1.1.0")

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Update the residue set with custom modifications
model.residue_set.update_remapping(
    {
        "M(ox)": "M[UNIMOD:35]",  # Oxidation
        "M(+15.99)": "M[UNIMOD:35]",
        "S(p)": "S[UNIMOD:21]",  # Phosphorylation
        "T(p)": "T[UNIMOD:21]",
        "Y(p)": "Y[UNIMOD:21]",
        "S(+79.97)": "S[UNIMOD:21]",
        "T(+79.97)": "T[UNIMOD:21]",
        "Y(+79.97)": "Y[UNIMOD:21]",
        "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation
        "N(+0.98)": "N[UNIMOD:7]",
        "Q(+.98)": "Q[UNIMOD:7]",
        "N(+.98)": "N[UNIMOD:7]",
        "C(+57.02)": "C[UNIMOD:4]",  # Carbamidomethylation
        "(+42.01)": "[UNIMOD:1]",  # Acetylation
        "(+43.01)": "[UNIMOD:5]",  # Carbamylation
        "(-17.03)": "[UNIMOD:385]",  # Ammonia loss
    }
)

# Load the test data
sdf = SpectrumDataFrame.from_huggingface(
    "InstaDeepAI/ms_ninespecies_benchmark",
    is_annotated=True,
    shuffle=False,
    split="test[:10%]",  # Use only a subset of the test data for faster inference
)

# Create the dataset
ds = SpectrumDataset(
    sdf,
    model.residue_set,
    config.get("n_peaks", 200),
    return_str=False,
    annotated=True,
    peptide_pad_length=model.config.get("max_length", 30),
    reverse_peptide=False,  # Peptides are not reversed for diffusion
    add_eos=False,
    tokenize_peptide=True,
)

# Create the data loader
dl = DataLoader(
    ds,
    batch_size=64,
    num_workers=0,  # sdf requirement, handled internally
    shuffle=False,  # sdf requirement, handled internally
    collate_fn=collate_batch,
)

# Create the decoder
diffusion_decoder = DiffusionDecoder(model=model)

predictions = []
log_probs = []
targets = []

# Iterate over the data loader
for batch in tqdm(dl, total=len(dl)):
    spectra, precursors, spectra_padding_mask, peptides, _ = batch

    # Keep the ground-truth peptides of every batch for evaluation.
    # Assumption: `ResidueSet.decode` maps a tensor of token IDs back to residue
    # strings and drops padding; adjust if your InstaNovo version differs.
    targets.extend(model.residue_set.decode(seq, reverse=False) for seq in peptides)

    spectra = spectra.to(device)
    precursors = precursors.to(device)
    spectra_padding_mask = spectra_padding_mask.to(device)
    peptides = peptides.to(device)

    # Perform inference
    with torch.no_grad():
        batch_predictions, batch_log_probs = diffusion_decoder.decode(
            spectra=spectra,
            spectra_padding_mask=spectra_padding_mask,
            precursors=precursors,
            initial_sequence=peptides,
        )

    predictions.extend(batch_predictions)
    log_probs.extend(batch_log_probs)

# Initialize metrics
metrics = Metrics(model.residue_set, config["isotope_error_range"])

# Compute precision and recall
aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
    targets, predictions
)

# Compute amino acid error rate and AUC (log-probabilities are converted to probabilities)
aa_error_rate = metrics.compute_aa_er(targets, predictions)
auc = metrics.calc_auc(targets, predictions, np.exp(pd.Series(log_probs)))

print(f"amino acid error rate:    {aa_error_rate:.5f}")
print(f"amino acid precision:     {aa_precision:.5f}")
print(f"amino acid recall:        {aa_recall:.5f}")
print(f"peptide precision:        {peptide_precision:.5f}")
print(f"peptide recall:           {peptide_recall:.5f}")
print(f"area under the PR curve:  {auc:.5f}")
```

For more explanation, see the [Getting Started notebook](https://github.com/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb) in the repository.

## Citation

If you use InstaNovoPlus in your research, please cite:

```bibtex
@article{eloff_kalogeropoulos_2025_instanovo,
  title   = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments},
  author  = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell, Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen, Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J. and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars, Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
  year    = {2025},
  month   = {Mar},
  day     = {31},
  journal = {Nature Machine Intelligence},
  doi     = {10.1038/s42256-025-01019-5},
  issn    = {2522-5839},
  url     = {https://doi.org/10.1038/s42256-025-01019-5}
}
```

## Resources

- **Code Repository**: [https://github.com/instadeepai/InstaNovo](https://github.com/instadeepai/InstaNovo)
- **Documentation**: [https://instadeepai.github.io/InstaNovo/](https://instadeepai.github.io/InstaNovo/)
- **Publication**: [https://www.nature.com/articles/s42256-025-01019-5](https://www.nature.com/articles/s42256-025-01019-5)

## License

- **Code**: Licensed under Apache License 2.0
- **Model Checkpoints**: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)

## Installation

```bash
pip install instanovo
```

For GPU support, install with CUDA dependencies:

```bash
pip install "instanovo[cu126]"
```

## Requirements

- Python >= 3.10, < 3.13
- PyTorch >= 1.13.0
- CUDA (optional, for GPU acceleration)

## Support

For questions, issues, or contributions, please visit the [GitHub repository](https://github.com/instadeepai/InstaNovo) or check the [documentation](https://instadeepai.github.io/InstaNovo/).
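
## Example: Remapping Modification Notation

The Usage example passes a dictionary to `update_remapping` so that legacy modification notations such as `M(ox)` are read as UNIMOD notation. The sketch below illustrates the idea on plain strings only. It is not InstaNovo's implementation: `remap_peptide` and `LEGACY_TO_UNIMOD` are hypothetical names introduced for this illustration, and the dictionary is a subset of the one shown in Usage.

```python
# Hypothetical illustration; not part of the instanovo API.
LEGACY_TO_UNIMOD = {
    "M(ox)": "M[UNIMOD:35]",      # Oxidation
    "M(+15.99)": "M[UNIMOD:35]",
    "S(p)": "S[UNIMOD:21]",       # Phosphorylation
    "N(+.98)": "N[UNIMOD:7]",     # Deamidation
    "C(+57.02)": "C[UNIMOD:4]",   # Carbamidomethylation
    "(+42.01)": "[UNIMOD:1]",     # N-terminal acetylation
}


def remap_peptide(peptide: str, remapping: dict[str, str]) -> str:
    """Rewrite legacy modification notation in a peptide string."""
    # Apply the longest keys first, so a residue-specific key is replaced
    # before any shorter key that could match a substring of it.
    for legacy in sorted(remapping, key=len, reverse=True):
        peptide = peptide.replace(legacy, remapping[legacy])
    return peptide


if __name__ == "__main__":
    print(remap_peptide("(+42.01)AM(ox)C(+57.02)K", LEGACY_TO_UNIMOD))
```

Peptides that already use UNIMOD notation pass through unchanged, because none of the legacy keys occur inside a `[UNIMOD:n]` tag.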