---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---
<p align="left">
<a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
<img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
</a>
<a href="https://doi.org/10.5281/zenodo.17510735">
<img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
</a>
<a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
<img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
</a>
<a href="#">
<img src="https://img.shields.io/badge/arXiv-Active-blue?logo=arxiv" />
</a>
<a href="#">
<img src="https://img.shields.io/badge/License-MIT-green" />
</a>
</p>
# 📰 NLP Stock Sentiment Analysis: Embedding-Based Models for Market Signals
![Model3_Thumbnail](https://cdn-uploads.huggingface.co/production/uploads/68faf2c58b9b8d06b47b769c/uuw0OJrmZ3No5S17TJ5ti.png)
This model card documents the resources and workflow described in the paper:
**“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach”
(Preprint submitted to arXiv, 2025)**
Author: **Joyjit Roy**
Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis
This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.
---
## 🔍 Project Overview
The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:
### **1. Word2Vec (300D)**
- Trained locally on the news corpus
- Mean pooled per headline
### **2. GloVe (100D)**
- Pretrained vectors
- Mean pooled per headline
### **3. SentenceTransformer Embeddings (384D)**
- Model: `all-MiniLM-L6-v2`
- Direct sentence embeddings without fine-tuning
For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
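Mean pooling, as used for the Word2Vec and GloVe variants, can be sketched as follows. This is a minimal illustration rather than the project's actual code: the whitespace tokenizer, the zero-vector fallback for headlines with no known tokens, the function name `mean_pool`, and the toy 3-dimensional vectors are all assumptions made for the example.

```python
import numpy as np

def mean_pool(headline, vectors, dim):
    """Average the vectors of the in-vocabulary tokens of one headline."""
    tokens = headline.lower().split()  # naive whitespace tokenizer (assumption)
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known token: fall back to a zero vector so every headline
        # still maps to a fixed-length feature row (assumption).
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 3-dimensional vectors standing in for real Word2Vec/GloVe embeddings.
toy_vectors = {
    "shares": np.array([1.0, 0.0, 2.0]),
    "rally": np.array([3.0, 2.0, 0.0]),
}

# "sharply" is out of vocabulary and is simply skipped.
features = mean_pool("Shares rally sharply", toy_vectors, dim=3)
print(features)  # [2. 1. 1.]
```

With real embeddings, `dim` would be 300 for the Word2Vec variant and 100 for the GloVe variant, and each headline becomes one fixed-length row of the classifier's feature matrix.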
> Note: These are scikit-learn models and are not directly deployed as Hugging Face inference endpoints.
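The training step can be sketched with scikit-learn as shown below. The features here are synthetic stand-ins for headline embeddings, and the 80/20 stratified split, the random seeds, and the default hyperparameters are assumptions for illustration; the tuned settings used in the paper are not listed in this card.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for headline embeddings: 90 rows of 100-D features,
# 30 per class, with class-dependent means so the task is learnable.
labels = np.repeat([-1, 0, 1], 30)
X = rng.normal(loc=labels[:, None] * 1.5, scale=1.0, size=(90, 100))

# Hold out a stratified validation split (the ratio is an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0
)

# Default hyperparameters; not the tuned configuration from the paper.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
val_acc = accuracy_score(y_val, clf.predict(X_val))
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

Swapping `X` for a matrix of Word2Vec, GloVe, or SentenceTransformer embeddings keeps the rest of the sketch unchanged, which is what makes the three variants directly comparable.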
---
## 💽 Dataset Summary
The dataset contains **349 financial news headlines**, each paired with OHLCV market indicators and a sentiment label:
`1` = positive, `0` = neutral, `-1` = negative.
The dataset is available via Zenodo:
👉 https://doi.org/10.5281/zenodo.17510735
### 📊 Data Dictionary
| Column | Description |
|----------|---------------------------------------------------------------|
| `Date` | Date the news item was released |
| `News` | Headline or snippet text |
| `Open` | Opening price (USD) |
| `High` | Highest price (USD) of the day |
| `Low` | Lowest price (USD) of the day |
| `Close` | Adjusted closing price (USD) |
| `Volume` | Total shares traded |
| `Label` | Sentiment (`1`=positive, `0`=neutral, `-1`=negative) |
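As a rough sketch of how rows with these columns might be read into typed values, the example below parses CSV text with the standard library. The two rows are placeholders, not records from the dataset, and the CSV layout, the ISO date format, and the helper name `parse_row` are assumptions; check the actual file on Zenodo before relying on them.

```python
import csv
import io
from datetime import date

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}

# Two placeholder rows (NOT taken from the dataset) in the documented column order.
sample_csv = """Date,News,Open,High,Low,Close,Volume,Label
2019-01-02,Placeholder headline A,10.0,11.0,9.5,10.5,1000,1
2019-01-03,Placeholder headline B,10.5,10.8,9.9,10.0,1200,-1
"""

def parse_row(row):
    """Convert one CSV row (all strings) into typed Python values."""
    label = int(row["Label"])
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected sentiment label: {label}")
    return {
        "Date": date.fromisoformat(row["Date"]),  # ISO format is an assumption
        "News": row["News"],
        "Open": float(row["Open"]),
        "High": float(row["High"]),
        "Low": float(row["Low"]),
        "Close": float(row["Close"]),
        "Volume": int(row["Volume"]),
        "Label": label,
    }

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(sample_csv))]
for r in rows:
    print(r["Date"], LABEL_NAMES[r["Label"]], r["News"])
```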
---
## 📊 Model Performance (Validation)
Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:
- **Accuracy:** 0.714
- **Precision:** 0.758
- **Recall:** 0.714
- **F1 Score:** 0.694
- **Error Rate:** 0.286
Every model reaches perfect training accuracy, which on a dataset this small most likely reflects overfitting rather than genuine fit, so validation metrics are used for a fair comparison. Full results appear in Table 3 of the associated paper.
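The reported recall equals the reported accuracy, which is consistent with support-weighted averaging across the three classes, since weighted recall always equals accuracy. The card does not state the averaging scheme, so the sketch below assumes it; the labels and predictions are made up for illustration and do not reproduce the paper's numbers.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy, error rate, and support-weighted precision/recall/F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / support[label]
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = support[label] / n  # class share of the true labels
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1.0 - accuracy,
    }

# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, -1, -1, -1]
y_pred = [1, 1, 0, 0, 0, -1, 1, -1]

for name, value in weighted_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

In scikit-learn, `precision_recall_fscore_support(y_true, y_pred, average="weighted")` computes the same weighted precision, recall, and F1.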
---
## 🧠 Weekly Sentiment Summaries
To show how sentiment predictions can support market interpretation:
1. Daily predictions are aggregated by week
2. Sentiment ratios (positive / neutral / negative) are computed
3. Weekly summaries are generated using **Mistral-7B-Instruct**
4. These summaries provide narrative insight into market mood
This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
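Steps 1 and 2 above can be sketched with the standard library. Grouping by ISO calendar week is an assumption, as the card does not say how week boundaries are defined, and the function name `weekly_sentiment_ratios` and the predictions are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import date

def weekly_sentiment_ratios(predictions):
    """Group (date, label) pairs by ISO week and return per-week label shares."""
    counts = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])][label] += 1  # key: (ISO year, ISO week)
    ratios = {}
    for week, counter in sorted(counts.items()):
        total = sum(counter.values())
        ratios[week] = {
            "positive": counter[1] / total,
            "neutral": counter[0] / total,
            "negative": counter[-1] / total,
        }
    return ratios

# Made-up daily predictions, for illustration only.
predictions = [
    (date(2024, 1, 1), 1),   # Monday, ISO week 1
    (date(2024, 1, 2), -1),
    (date(2024, 1, 3), 1),
    (date(2024, 1, 4), 0),
    (date(2024, 1, 8), -1),  # Monday, ISO week 2
    (date(2024, 1, 9), -1),
]

for week, shares in weekly_sentiment_ratios(predictions).items():
    print(week, shares)
```

The per-week ratios, together with that week's headlines, are the kind of structured input that could then be passed to an instruction-tuned LLM for the narrative summary in step 3.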
---
## 🌟 Intended Use
This repository is intended for:
- Research on embedding-based sentiment classification
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
- Demonstrating weekly sentiment aggregation
- Benchmarking classical ML approaches for small financial text datasets
---
## ⚠️ Limitations
- Small dataset (349 samples)
- Potential overfitting of classical models
- Not designed for automated trading or real-time systems
- Weekly summaries rely on LLM outputs and may include stylistic bias
- No direct price prediction or financial forecasting
This model is best used for experimentation and learning.
---
## 📘 Citation
If you use this work, please cite:
> Roy, Joyjit. “NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach.” arXiv preprint, 2025.
> Dataset: Joyjit Roy. “NLP Stock Sentiment Analysis: AI for Market Signals.” Zenodo. https://doi.org/10.5281/zenodo.17510735
---