---
tags:
- roberta
- sentiment-analysis
- slang
- nlp
- fine-tuning
- transformers
- huggingface
model-index:
- name: roberta-finetune-slangs
  results:
  - task:
      name: Sentiment Analysis
      type: text-classification
    dataset:
      name: TeenSenti Slang Dataset
      type: custom
      split: test
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.93
    - name: Precision
      type: precision
      value: 0.92
    - name: Recall
      type: recall
      value: 0.93
    - name: F1
      type: f1
      value: 0.925
license: apache-2.0
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
---

# roberta-finetune-slangs

Fine-tuned RoBERTa model for **sentiment analysis of internet slang, abbreviations, and short words**, based on the research paper:

> Sahil Kamath, Vaishnavi Padiya, Sonia D’Silva, Nilesh Patil, Meera Narvekar. *TeenSenti – A novel approach for sentiment analysis of short words and slangs*.

## Model description

This model is fine-tuned from a pre-trained RoBERTa transformer to classify the sentiment of sentences containing **informal internet expressions** such as slang, abbreviations, and short forms. It addresses a gap in existing sentiment analysis models, which often fail to interpret modern linguistic nuances.

### Key features

- **Handles slang and short words** with contextual understanding.
- Trained using a **custom slang dictionary** integrated into the dataset.
- Outperforms the base `twitter-roberta-base-sentiment` model on slang-heavy datasets.
- Designed for **social media, product reviews, and informal text analysis**.

## Intended uses & limitations

**Intended uses**

- Sentiment classification for texts containing slang or abbreviations (a usage sketch appears at the end of this card).
- Social media monitoring, brand sentiment analysis, or content moderation where informal language is common.

**Limitations**

- Optimized for slang- and abbreviation-heavy English text; performance may degrade on formal or domain-specific corpora.
- Slang evolves rapidly, so periodic retraining is recommended for sustained accuracy.

## Training and evaluation data

- **Dataset**: custom-curated `TeenSenti` dataset of ~20,000 sentences.
- Each slang term has both positive and negative example sentences, generated and then verified.
- Dataset split: 80% training, 20% testing (split per slang term to avoid overlap between sets).
- Examples include terms like `"ftw"` ("for the win") and `"h8"` ("hate").

## Training procedure

### Preprocessing

- **Custom tokenizer** that preserves slang and short words from the slang dictionary (see the sketch below).
- Tokenization and text processing with Hugging Face `transformers`.
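The paper's tokenizer implementation is not published with this card; as an illustration only, one common way to keep dictionary slang intact is to register each term as a whole token on the base tokenizer and resize the model's embedding matrix. A minimal sketch, assuming the `cardiffnlp/twitter-roberta-base-sentiment` base checkpoint and a hypothetical `SLANG_TERMS` excerpt of the dictionary:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical excerpt of the slang dictionary (the full dictionary from
# the paper is not reproduced here).
SLANG_TERMS = ["ftw", "h8", "smh", "ngl"]

BASE = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE)

# Register slang terms as whole tokens so byte-pair encoding cannot split
# them (e.g. "h8" -> "h" + "8"); terms already in the vocab are skipped.
num_added = tokenizer.add_tokens(SLANG_TERMS)

# Grow the embedding matrix to cover the newly added tokens before fine-tuning.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("Team India ftw"))
```

Keeping each slang term as a single token lets fine-tuning learn a dedicated embedding for it, rather than relying on subword pieces that carry no sentiment signal.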
### Training hyperparameters

- Optimizer: AdamW
- Learning rate schedule: triangular policy
- Batch size: 16
- Epochs: 4
- Max sequence length: 128
- Precision: float32 (mixed precision not used)

### Evaluation

Compared to the base `twitter-roberta-base-sentiment` model:

| Example sentence   | Base model prediction     | Fine-tuned prediction      |
|--------------------|---------------------------|----------------------------|
| "Team India ftw"   | Neutral                   | Positive                   |
| "I h8 that person" | Negative (low confidence) | Negative (high confidence) |

The fine-tuned model achieves:

- **Accuracy**: 0.93
- **F1-score**: 0.925
- **Precision**: 0.92
- **Recall**: 0.93

## Framework versions

- `transformers`: 4.35.2
- `torch`: 2.x
- `tokenizers`: 0.15.0
- `datasets`: 2.x

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{10582077,
  author    = {Kamath, Sahil and Padiya, Vaishnavi and D'Silva, Sonia and Patil, Nilesh and Narvekar, Meera},
  booktitle = {2024 International Conference on Advances in Modern Age Technologies for Health and Engineering Science (AMATHE)},
  title     = {TeenSenti - A novel approach for sentiment analysis of short words and slangs},
  year      = {2024},
  pages     = {1-8},
  doi       = {10.1109/AMATHE61652.2024.10582077}
}
```
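## How to use

A minimal inference sketch using the `transformers` pipeline API. The repository id below is a placeholder; substitute this model's actual Hub path:

```python
from transformers import pipeline

# Placeholder repo id for illustration; replace with the model's actual Hub path.
classifier = pipeline(
    "text-classification",
    model="<username>/roberta-finetune-slangs",
)

# Slang-heavy inputs from the evaluation table above.
print(classifier("Team India ftw"))    # expected: Positive
print(classifier("I h8 that person"))  # expected: Negative
```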