---
tags:
- roberta
- sentiment-analysis
- slang
- nlp
- fine-tuning
- transformers
- huggingface
model-index:
- name: roberta-finetune-slangs
  results:
  - task:
      name: Sentiment Analysis
      type: text-classification
    dataset:
      name: TeenSenti Slang Dataset
      type: custom
      split: test
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.93
    - name: Precision
      type: precision
      value: 0.92
    - name: Recall
      type: recall
      value: 0.93
    - name: F1
      type: f1
      value: 0.925
license: apache-2.0
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
---

# roberta-finetune-slangs

Fine-tuned RoBERTa model for **sentiment analysis of internet slang, abbreviations, and short words**, based on the research paper:

> Sahil Kamath, Vaishnavi Padiya, Sonia D’Silva, Nilesh Patil, Meera Narvekar. *TeenSenti – A novel approach for sentiment analysis of short words and slangs*.

## Model description

This model is fine-tuned from a pre-trained RoBERTa transformer to classify the sentiment of sentences containing **informal internet expressions** such as slang, abbreviations, and short forms. It addresses a gap in existing sentiment analysis models, which often fail to interpret modern linguistic nuances.

### Key features

- **Handles slang and short words** with contextual understanding.
- Trained using a **custom slang dictionary** integrated into the dataset.
- Outperforms the base `twitter-roberta-base-sentiment` model on slang-heavy datasets.
- Designed for **social media, product reviews, and informal text analysis**.

## Intended uses & limitations

**Intended uses**

- Sentiment classification for texts containing slang or abbreviations (a usage sketch appears at the end of this card).
- Social media monitoring, brand sentiment analysis, or content moderation where informal language is common.

**Limitations**

- Optimized for slang- and abbreviation-heavy English text; performance may degrade on formal or domain-specific corpora.
- Slang evolves rapidly, so periodic retraining is recommended for sustained accuracy.

## Training and evaluation data

- **Dataset**: custom-curated `TeenSenti` dataset of ~20,000 sentences.
- Each slang term has both positive and negative example sentences, generated and then verified.
- Dataset split: 80% training, 20% testing (split per slang term to avoid overlap between sets).
- Examples include terms like `"ftw"` ("for the win") and `"h8"` ("hate").

## Training procedure

### Preprocessing

- **Custom tokenizer** that preserves slang and short words from the slang dictionary (see the sketch below).
- Tokenization and text processing with Hugging Face `transformers`.
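The paper's tokenizer implementation is not published with this card; as an illustration only, one common way to keep dictionary slang intact is to register each term as a whole token on the base tokenizer and resize the model's embedding matrix. A minimal sketch, assuming the `cardiffnlp/twitter-roberta-base-sentiment` base checkpoint and a hypothetical `SLANG_TERMS` excerpt of the dictionary:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical excerpt of the slang dictionary (the full dictionary from
# the paper is not reproduced here).
SLANG_TERMS = ["ftw", "h8", "smh", "ngl"]

BASE = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE)

# Register slang terms as whole tokens so byte-pair encoding cannot split
# them (e.g. "h8" -> "h" + "8"); terms already in the vocab are skipped.
num_added = tokenizer.add_tokens(SLANG_TERMS)

# Grow the embedding matrix to cover the newly added tokens before fine-tuning.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("Team India ftw"))
```

Keeping each slang term as a single token lets fine-tuning learn a dedicated embedding for it, rather than relying on subword pieces that carry no sentiment signal.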
### Training hyperparameters

- Optimizer: AdamW
- Learning rate schedule: triangular policy
- Batch size: 16
- Epochs: 4
- Max sequence length: 128
- Precision: float32 (mixed precision not used)

### Evaluation

Compared to the base `twitter-roberta-base-sentiment` model:

| Example sentence   | Base model prediction     | Fine-tuned prediction      |
|--------------------|---------------------------|----------------------------|
| "Team India ftw"   | Neutral                   | Positive                   |
| "I h8 that person" | Negative (low confidence) | Negative (high confidence) |

The fine-tuned model achieves:

- **Accuracy**: 0.93
- **F1-score**: 0.925
- **Precision**: 0.92
- **Recall**: 0.93

## Framework versions

- `transformers`: 4.35.2
- `torch`: 2.x
- `tokenizers`: 0.15.0
- `datasets`: 2.x

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{10582077,
  author    = {Kamath, Sahil and Padiya, Vaishnavi and D'Silva, Sonia and Patil, Nilesh and Narvekar, Meera},
  booktitle = {2024 International Conference on Advances in Modern Age Technologies for Health and Engineering Science (AMATHE)},
  title     = {TeenSenti - A novel approach for sentiment analysis of short words and slangs},
  year      = {2024},
  pages     = {1-8},
  doi       = {10.1109/AMATHE61652.2024.10582077}
}
```
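## How to use

A minimal inference sketch using the `transformers` pipeline API. The repository id below is a placeholder; substitute this model's actual Hub path:

```python
from transformers import pipeline

# Placeholder repo id for illustration; replace with the model's actual Hub path.
classifier = pipeline(
    "text-classification",
    model="<username>/roberta-finetune-slangs",
)

# Slang-heavy inputs from the evaluation table above.
print(classifier("Team India ftw"))    # expected: Positive
print(classifier("I h8 that person"))  # expected: Negative
```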