---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---

<p align="left">
  <a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
    <img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
  </a>
  <a href="https://doi.org/10.5281/zenodo.17510735">
    <img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
  </a>
  <a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
  </a>
  <a href="#">
    <img src="https://img.shields.io/badge/arXiv-Active-blue?logo=arxiv" />
  </a>
  <a href="#">
    <img src="https://img.shields.io/badge/License-MIT-green" />
  </a>
</p>

# NLP Stock Sentiment Analysis: Embedding-Based Models for Market Signals

This model card documents the resources and workflow described in the paper:

**“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach”
(Preprint submitted to arXiv, 2025)**
Author: **Joyjit Roy**
Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis

This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

---

## Project Overview

The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:

### **1. Word2Vec (300D)**
- Trained locally on the news corpus
- Mean pooled per headline

### **2. GloVe (100D)**
- Pretrained vectors
- Mean pooled per headline
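
Mean pooling, as used for the Word2Vec and GloVe variants, averages the word vectors of a headline into one fixed-length feature vector. The snippet below is a minimal sketch of that idea, not the project's actual code: the `mean_pool` helper, the toy 3-dimensional vocabulary, and the zero-vector fallback for headlines with no known tokens are all assumptions made for illustration.

```python
import numpy as np


def mean_pool(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens.

    Returns a zero vector when no token is known (an assumed fallback).
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy 3-dimensional vocabulary with made-up values, for illustration only.
toy_vectors = {
    "stocks": np.array([1.0, 0.0, 2.0]),
    "rally": np.array([3.0, 2.0, 0.0]),
}

headline = "stocks rally today".split()  # "today" is out of vocabulary
pooled = mean_pool(headline, toy_vectors, dim=3)
print(pooled)  # [2. 1. 1.]
```

With real embeddings, `dim` would be 300 for the Word2Vec variant and 100 for the GloVe variant, matching the dimensions listed above.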

### **3. SentenceTransformer Embeddings (384D)**
- Model: `all-MiniLM-L6-v2`
- Direct sentence embeddings without fine-tuning

For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.

> Note: These are scikit-learn models and are not deployed as Hugging Face inference endpoints.
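
The training step described above can be sketched with scikit-learn's `GradientBoostingClassifier`. This is an illustrative outline under stated assumptions, not the paper's configuration: the features here are random stand-ins for headline embeddings, and the split ratio, `n_estimators`, and random seeds are placeholder choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: random 100-dimensional "embeddings" with random labels.
# Real features would come from Word2Vec, GloVe, or all-MiniLM-L6-v2.
X = rng.normal(size=(120, 100))
y = rng.choice([-1, 0, 1], size=120)

# Hold out a stratified validation split (ratio is an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are placeholders, not the tuned values from the paper.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

val_pred = clf.predict(X_val)
print("validation accuracy:", (val_pred == y_val).mean())
```

Because the stand-in labels are random, the printed accuracy is meaningless; the point is only the shape of the workflow, which stays the same whichever embedding produces `X`.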

---

## Dataset Summary

The dataset contains **349 financial news headlines**, each paired with OHLCV market indicators and a sentiment label:
`1` = positive, `0` = neutral, `-1` = negative.

The dataset is available via Zenodo: https://doi.org/10.5281/zenodo.17510735

### Data Dictionary

| Column   | Description                                            |
|----------|--------------------------------------------------------|
| `Date`   | Date the news item was released                        |
| `News`   | Headline or snippet text                               |
| `Open`   | Opening price (USD)                                    |
| `High`   | Highest price (USD) of the day                         |
| `Low`    | Lowest price (USD) of the day                          |
| `Close`  | Adjusted closing price (USD)                           |
| `Volume` | Total shares traded                                    |
| `Label`  | Sentiment (`1`=positive, `0`=neutral, `-1`=negative)   |
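
A small loader can check that a file follows the schema in the data dictionary before any modeling. The sketch below uses only the column names from the table; the two sample rows are made-up placeholders rather than real records, and the `%Y-%m-%d` date format and the `load_rows` helper are assumptions.

```python
import csv
import io
from datetime import datetime

VALID_LABELS = {-1, 0, 1}

# Two made-up rows that only mirror the column layout; not real data.
sample_csv = """Date,News,Open,High,Low,Close,Volume,Label
2019-01-02,Placeholder headline A,10.0,11.0,9.5,10.5,1000,1
2019-01-03,Placeholder headline B,10.5,10.8,9.9,10.0,1200,-1
"""


def load_rows(handle):
    """Parse rows and reject any label outside {-1, 0, 1}."""
    rows = []
    for raw in csv.DictReader(handle):
        label = int(raw["Label"])
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label}")
        rows.append({
            # Date format is an assumption about the file.
            "Date": datetime.strptime(raw["Date"], "%Y-%m-%d").date(),
            "News": raw["News"],
            "Open": float(raw["Open"]),
            "High": float(raw["High"]),
            "Low": float(raw["Low"]),
            "Close": float(raw["Close"]),
            "Volume": int(raw["Volume"]),
            "Label": label,
        })
    return rows


rows = load_rows(io.StringIO(sample_csv))
print(len(rows), "rows parsed")
```

To read an actual file, pass an open file handle to `load_rows` in place of the `io.StringIO` wrapper.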

---

## Model Performance (Validation)

Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:

- **Accuracy:** 0.714
- **Precision:** 0.758
- **Recall:** 0.714
- **F1 Score:** 0.694
- **Error Rate:** 0.286 (1 - accuracy)

All models reach perfect training accuracy, a likely sign of overfitting on a dataset this small; validation metrics are therefore used for comparison. Full results appear in the associated paper (Table 3).
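
The card does not state how the multiclass precision, recall, and F1 are averaged. The reported recall equals the accuracy, which is what support-weighted averaging produces, so the sketch below assumes that scheme (the equivalent of `average="weighted"` in scikit-learn). The `weighted_metrics` helper and the toy labels are illustrative only and do not reproduce the reported numbers.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n  # each class counts in proportion to its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1.0 - accuracy,
    }


# Toy labels for illustration; not taken from the dataset.
y_true = [1, 1, 1, 0, 0, -1, -1, -1]
y_pred = [1, 1, 0, 0, 0, -1, 1, -1]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Under weighted averaging each class's recall is multiplied by its share of the samples, so the sum collapses to the overall fraction of correct predictions. That is why recall and accuracy coincide in this sketch, as they do in the reported results.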

---

## Weekly Sentiment Summaries

To show how sentiment predictions can support market interpretation:

1. Daily predictions are aggregated by week
2. Sentiment ratios (positive / neutral / negative) are computed
3. Weekly summaries are generated using **Mistral-7B-Instruct**
4. These summaries provide narrative insight into market mood

This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
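
Steps 1 and 2 above can be sketched with the standard library alone. The card does not say how weeks are delimited, so grouping by ISO year and week number is an assumption here, as are the `weekly_sentiment_ratios` helper and the toy predictions.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_sentiment_ratios(predictions):
    """Group (date, label) pairs by ISO week and return label ratios."""
    counts = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])][label] += 1  # key: (ISO year, ISO week)

    ratios = {}
    for week, c in sorted(counts.items()):
        total = sum(c.values())
        ratios[week] = {
            "positive": c[1] / total,
            "neutral": c[0] / total,
            "negative": c[-1] / total,
        }
    return ratios


# Toy predictions for illustration; not model output.
preds = [
    (date(2024, 1, 1), 1),    # Monday, ISO week 1
    (date(2024, 1, 2), -1),
    (date(2024, 1, 3), 1),
    (date(2024, 1, 4), 0),
    (date(2024, 1, 8), -1),   # Monday, ISO week 2
]
ratios = weekly_sentiment_ratios(preds)
for week, r in ratios.items():
    print(week, r)
```

The resulting per-week ratios are the kind of structured input that could then be placed in a prompt for the summarization model in step 3.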

---

## Intended Use

This repository is intended for:

- Research on embedding-based sentiment classification
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
- Demonstrating weekly sentiment aggregation
- Benchmarking classical ML approaches for small financial text datasets

---

## Limitations

- Small dataset (349 samples)
- Potential overfitting of classical models
- Not designed for automated trading or real-time systems
- Weekly summaries rely on LLM outputs and may include stylistic bias
- No direct price prediction or financial forecasting

This model is best used for experimentation and learning.

---

## Citation

If you use this work, please cite:

> Roy, Joyjit. “NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach.” arXiv preprint, 2025.
>
> Dataset: Joyjit Roy. “NLP Stock Sentiment Analysis - AI for Market Signals.” Zenodo. https://doi.org/10.5281/zenodo.17510735

---