	---
	license: mit
	language:
	- en
	metrics:
	- accuracy
	- precision
	- recall
	- f1
	pipeline_tag: text-classification
	library_name: sklearn
	tags:
	- finance
	- sentiment-analysis
	- embeddings
	- gradient-boosting
	- classical-ml
	- market-analysis
	- nlp
	- weekly-sentiment
	---
	<p align="left">
	<a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
	<img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
	</a>
	<a href="https://doi.org/10.5281/zenodo.17510735">
	<img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
	</a>
	<a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
	<img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
	</a>
	<a href="#">
	<img src="https://img.shields.io/badge/arXiv-Active-blue?logo=arxiv" />
	</a>
	<a href="#">
	<img src="https://img.shields.io/badge/License-MIT-green" />
	</a>
	</p>

	# 📰 NLP Stock Sentiment Analysis — Embedding-Based Models for Market Signals

	![Model3_Thumbnail](https://cdn-uploads.huggingface.co/production/uploads/68faf2c58b9b8d06b47b769c/uuw0OJrmZ3No5S17TJ5ti.png)

	This model card documents the resources and workflow described in the paper:

	**“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach”
	(Preprint submitted to arXiv, 2025)**
	Author: Joyjit Roy
	Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
	GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis

	This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

	---

	## 🔍 Project Overview

	The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:

	### 1. Word2Vec (300D)
	- Trained locally on the news corpus
	- Mean pooled per headline

	### 2. GloVe (100D)
	- Pretrained vectors
	- Mean pooled per headline
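Mean pooling, as used for the Word2Vec and GloVe variants, averages the word vectors of a headline's tokens into one fixed-length vector. The sketch below uses only the standard library. The toy 3-dimensional vectors, the whitespace tokenizer, and the zero-vector fallback for headlines with no known words are assumptions for illustration, not details taken from the paper.

```python
from statistics import fmean

# Hypothetical 3-dimensional vectors standing in for real Word2Vec/GloVe embeddings.
TOY_VECTORS = {
    "stocks": [0.2, 0.4, -0.1],
    "rally": [0.6, 0.0, 0.3],
    "fall": [-0.4, 0.2, 0.1],
}


def mean_pool(headline, vectors, dim):
    """Average the vectors of all in-vocabulary tokens in a headline."""
    tokens = headline.lower().split()  # assumption: simple whitespace tokenization
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        # Assumption: headlines with no known words map to a zero vector.
        return [0.0] * dim
    # zip(*found) groups the values of each dimension across tokens.
    return [fmean(component) for component in zip(*found)]


# "today" is out of vocabulary, so only "stocks" and "rally" are averaged.
pooled = mean_pool("Stocks rally today", TOY_VECTORS, dim=3)
```

With real embeddings, `dim` would be 300 for the Word2Vec variant and 100 for the GloVe variant.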

	### 3. SentenceTransformer Embeddings (384D)
	- Model: `all-MiniLM-L6-v2`
	- Direct sentence embeddings without fine-tuning

	For each embedding approach, a Gradient Boosting classifier is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
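The training step might look like the sketch below, which assumes scikit-learn's `GradientBoostingClassifier`. The random features are a stand-in for pooled headline embeddings, and the 80/20 stratified split, `n_estimators=50`, and the random seed are illustrative assumptions rather than the tuned configuration from the paper. Because the labels here are random, the resulting accuracy is meaningless; only the workflow is the point.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 349 headlines x 100 dimensions (GloVe-sized).
X = rng.normal(size=(349, 100))
# Random labels using the card's encoding: 1 positive, 0 neutral, -1 negative.
y = rng.choice([-1, 0, 1], size=349)

# Assumption: a stratified 80/20 train/validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumption: near-default hyperparameters, not the tuned values.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

val_pred = clf.predict(X_val)
```

Swapping `X` for Word2Vec, GloVe, or SentenceTransformer features while keeping the split fixed gives the like-for-like comparison described above.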

	> Note: These are scikit-learn models and are not deployed as Hugging Face inference endpoints.

	---

	## 💽 Dataset Summary

	The dataset contains 349 financial news headlines, each paired with OHLCV market indicators and a sentiment label:
	`1` = positive, `0` = neutral, `-1` = negative.

	The dataset is available via Zenodo:
	👉 https://doi.org/10.5281/zenodo.17510735

	### 📊 Data Dictionary

	| Column   | Description                                             |
	|----------|---------------------------------------------------------|
	| `Date`   | Date the news item was released                         |
	| `News`   | Headline or snippet text                                |
	| `Open`   | Opening price (USD)                                     |
	| `High`   | Highest price (USD) of the day                          |
	| `Low`    | Lowest price (USD) of the day                           |
	| `Close`  | Adjusted closing price (USD)                            |
	| `Volume` | Total shares traded                                     |
	| `Label`  | Sentiment (`1`=positive, `0`=neutral, `-1`=negative)    |
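As a rough illustration of the schema above, the following sketch parses rows into typed records and rejects labels outside `{-1, 0, 1}`. The two sample rows are invented, and the CSV layout and ISO date format are assumptions; the actual file on Zenodo may be formatted differently.

```python
import csv
import io
from datetime import date

# Hypothetical sample rows; values are invented for illustration.
SAMPLE_CSV = """Date,News,Open,High,Low,Close,Volume,Label
2019-01-07,Chipmaker cuts its revenue forecast,100.0,102.5,99.0,101.2,1500000,-1
2019-01-08,Retailer reports record holiday sales,101.2,104.0,100.8,103.5,1750000,1
"""

VALID_LABELS = {-1, 0, 1}  # negative, neutral, positive


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed fields."""
    label = int(row["Label"])
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected sentiment label: {label}")
    return {
        "Date": date.fromisoformat(row["Date"]),  # assumption: ISO dates
        "News": row["News"].strip(),
        "Open": float(row["Open"]),
        "High": float(row["High"]),
        "Low": float(row["Low"]),
        "Close": float(row["Close"]),
        "Volume": int(row["Volume"]),
        "Label": label,
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```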

	---

	## 📊 Model Performance (Validation)

	Across all embedding variants, the tuned GloVe + Gradient Boosting model achieved the strongest validation results:

	- Accuracy: 0.714
	- Precision: 0.758
	- Recall: 0.714
	- F1 Score: 0.694
	- Error Rate: 0.286

	All models reach perfect training accuracy, which on a dataset this small points to overfitting, so validation metrics are used for comparison. Full results appear in Table 3 of the associated paper.
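For readers who want to see how such numbers relate to each other, here is a small standard-library sketch of the five metrics. It assumes support-weighted averaging across the three classes; the card does not state the averaging mode, although the reported recall equalling accuracy is consistent with it, since weighted recall always equals accuracy. The toy labels are invented.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy, error rate, and support-weighted precision/recall/F1."""
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / count
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = count / n  # each class contributes in proportion to its support
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1.0 - accuracy,
    }


# Invented toy labels: 6 of 8 predictions are correct.
y_true = [1, 1, 1, 0, 0, -1, -1, -1]
y_pred = [1, 1, 0, 0, 0, -1, 1, -1]
metrics = weighted_metrics(y_true, y_pred)
```

The error rate is simply one minus accuracy, which matches the reported pair of 0.714 and 0.286.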

	---

	## 🧠 Weekly Sentiment Summaries

	To show how sentiment predictions can support market interpretation:

	1. Daily predictions are aggregated by week
	2. Sentiment ratios (positive / neutral / negative) are computed
	3. Weekly summaries are generated using Mistral-7B-Instruct
	4. These summaries provide narrative insight into market mood

	This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
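Steps 1 and 2 can be sketched with the standard library as follows. Grouping by ISO calendar week is an assumption, as are the invented dates and predictions; the project may define weeks differently. The LLM summarization in step 3 is not shown.

```python
from collections import Counter, defaultdict
from datetime import date

# Invented (date, predicted label) pairs for illustration.
predictions = [
    (date(2019, 1, 7), 1),
    (date(2019, 1, 8), -1),
    (date(2019, 1, 9), 1),
    (date(2019, 1, 10), 0),
    (date(2019, 1, 14), -1),
    (date(2019, 1, 15), -1),
]

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_ratios(preds):
    """Group predictions by ISO (year, week) and compute the share of each label."""
    by_week = defaultdict(Counter)
    for day, label in preds:
        iso = day.isocalendar()
        by_week[(iso[0], iso[1])][label] += 1

    ratios = {}
    for week, counts in sorted(by_week.items()):
        total = sum(counts.values())
        ratios[week] = {
            name: counts[label] / total for label, name in LABEL_NAMES.items()
        }
    return ratios


ratios = weekly_ratios(predictions)
```

Each week's ratio dictionary could then be formatted into a prompt for the summarization model.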

	---

	## 🌟 Intended Use

	This repository is intended for:

	- Research on embedding-based sentiment classification
	- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
	- Demonstrating weekly sentiment aggregation
	- Benchmarking classical ML approaches for small financial text datasets

	---

	## ⚠️ Limitations

	- Small dataset (349 samples)
	- Potential overfitting of classical models
	- Not designed for automated trading or real-time systems
	- Weekly summaries rely on LLM outputs and may include stylistic bias
	- No direct price prediction or financial forecasting

	This model is best used for experimentation and learning.

	---

	## 📘 Citation

	If you use this work, please cite:

	> Roy, Joyjit. “NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach.” arXiv preprint, 2025.
	> Dataset: Joyjit Roy. “NLP Stock Sentiment Analysis — AI for Market Signals.” Zenodo. https://doi.org/10.5281/zenodo.17510735

	---