Peptide Sequencing Via Protein Language Models
This model is a fine-tuned version of ProtBERT specifically optimized for unmasking protein sequences. It can predict masked amino acids in protein sequences based on the surrounding context.
For detailed information about the training methodology and approach, please refer to our paper: https://arxiv.org/abs/2408.00892
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("your-username/protbert-sequence-unmasking")
tokenizer = AutoTokenizer.from_pretrained("your-username/protbert-sequence-unmasking")

# Example usage for an E. coli sequence with known amino acids (K, C, Y, H, S, M).
# Note: the ProtBERT tokenizer expects amino acids separated by spaces.
sequence = "M A L N [MASK] K F G P [MASK] L V R K"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits
```
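Continuing from the example above, one way to turn the raw logits into amino-acid predictions is to locate the `[MASK]` positions and take the highest-scoring token at each one. This is a minimal sketch using standard `transformers` tensor operations and the variables (`inputs`, `tokenizer`, `predictions`) defined above; it is not part of the model card's original example.

```python
# Locate the [MASK] positions in the tokenized input
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Take the highest-scoring amino acid at each masked position
for pos in mask_positions:
    top_id = predictions[0, pos].argmax(dim=-1).item()
    print(f"Position {pos.item()}: {tokenizer.convert_ids_to_tokens(top_id)}")
```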
The model is optimized for completing sequences in which only a subset of amino acids (e.g., K, C, Y, H, S, M in the examples shown here) is known.
Example API usage:
```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='your-username/protbert-sequence-unmasking')

# Example with known amino acids K, Y, H, S (ProtBERT expects space-separated residues)
sequence = "K [MASK] Y H S [MASK]"
results = unmasker(sequence)

# With more than one [MASK], the pipeline returns one list of candidates per mask
for mask_results in results:
    for result in mask_results:
        print(f"Predicted amino acid: {result['token_str']}, Score: {result['score']:.3f}")
```
The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper: https://arxiv.org/abs/2408.00892