license: mit
language:
  - en
metrics:
  - accuracy
  - precision
  - recall
  - f1
pipeline_tag: text-classification
library_name: sklearn
tags:
  - finance
  - sentiment-analysis
  - embeddings
  - gradient-boosting
  - classical-ml
  - market-analysis
  - nlp
  - weekly-sentiment

📰 NLP Stock Sentiment Analysis — Embedding-Based Models for Market Signals

This model card documents the resources and workflow described in the paper:

“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach”
(Preprint submitted to arXiv, 2025)
Author: Joyjit Roy
Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis

This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

🔍 Project Overview

The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:

1. Word2Vec (300D)

Trained locally on the news corpus
Mean pooled per headline

2. GloVe (100D)

Pretrained vectors
Mean pooled per headline

3. SentenceTransformer Embeddings (384D)

Model: all-MiniLM-L6-v2
Direct sentence embeddings without fine-tuning

For each embedding approach, a Gradient Boosting classifier is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
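A minimal sketch of the mean-pooling and Gradient Boosting steps, under stated assumptions. The `word_vectors` table is a randomly generated stand-in for pretrained GloVe or locally trained Word2Vec vectors, and the 8-dimensional size is a toy value. The headlines, labels, whitespace tokenizer, and hyperparameters are illustrative and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

DIM = 8  # toy dimensionality; the card's GloVe vectors are 100-D

# Hypothetical stand-in for a pretrained lookup table (token -> vector).
rng = np.random.default_rng(0)
vocab = ["shares", "rise", "fall", "profit", "loss", "steady", "outlook", "after"]
word_vectors = {tok: rng.normal(size=DIM) for tok in vocab}


def mean_pool(headline, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none match."""
    hits = [vectors[tok] for tok in headline.lower().split() if tok in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)


# Illustrative headlines and labels (1 = positive, 0 = neutral, -1 = negative).
headlines = [
    "Shares rise after profit",
    "Profit outlook rise",
    "Outlook steady",
    "Shares steady after outlook",
    "Shares fall after loss",
    "Loss outlook fall",
]
labels = [1, 1, 0, 0, -1, -1]

# One pooled vector per headline, stacked into a feature matrix.
X = np.vstack([mean_pool(h, word_vectors, DIM) for h in headlines])

# Hyperparameters are placeholders, not the tuned values from the paper.
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, labels)
print(X.shape, clf.predict(X[:2]))
```

Mapping headlines with no known tokens to a zero vector is a choice made for this sketch; the card does not say how out-of-vocabulary headlines were handled.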

Note: These are scikit-learn models; they are not deployed as Hugging Face inference endpoints.

💽 Dataset Summary

The dataset contains 349 financial news headlines, each paired with OHLCV market indicators and a sentiment label:
1 = positive, 0 = neutral, -1 = negative.

The dataset is available via Zenodo:
👉 https://doi.org/10.5281/zenodo.17510735

📊 Data Dictionary

| Column | Description |
|---|---|
| `Date` | Date the news item was released |
| `News` | Headline or snippet text |
| `Open` | Opening price (USD) |
| `High` | Highest price (USD) of the day |
| `Low` | Lowest price (USD) of the day |
| `Close` | Adjusted closing price (USD) |
| `Volume` | Total shares traded |
| `Label` | Sentiment (`1`=positive, `0`=neutral, `-1`=negative) |

📊 Model Performance (Validation)

Across all embedding variants, the tuned GloVe + Gradient Boosting model achieved the strongest validation results:

Accuracy: 0.714
Precision: 0.758
Recall: 0.714
F1 Score: 0.694
Error Rate: 0.286

All models reach perfect training accuracy, which likely reflects overfitting on such a small dataset, so validation metrics are used for comparison. Full results appear in the associated paper (Table 3).
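The card does not state how precision, recall, and F1 were averaged across the three classes. A recall equal to the accuracy (0.714) is what support-weighted averaging produces, so the sketch below assumes that mode. The function name `weighted_metrics` and the toy labels are illustrative and are not the paper's validation split.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)    # true examples per class
    predicted = Counter(y_pred)  # predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class

    precision = recall = f1 = 0.0
    for cls, n_cls in support.items():
        p = tp[cls] / predicted[cls] if predicted[cls] else 0.0
        r = tp[cls] / n_cls
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n_cls / n  # each class counts in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(tp.values()) / n
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1 - accuracy,
    }


# Illustrative labels only (1 = positive, 0 = neutral, -1 = negative).
y_true = [1, 1, 1, 0, 0, 0, -1, -1]
y_pred = [1, 1, 0, 0, 0, 1, -1, 1]
scores = weighted_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

With weighted averaging, each class recall is multiplied by that class's share of the samples, so the sum collapses to correct predictions over total samples, which is the accuracy. The error rate is one minus accuracy, matching the 0.286 reported alongside 0.714.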

🧠 Weekly Sentiment Summaries

To show how sentiment predictions can support market interpretation:

Daily predictions are aggregated by week
Sentiment ratios (positive / neutral / negative) are computed
Weekly summaries are generated using Mistral-7B-Instruct
These summaries provide narrative insight into market mood

This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
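The aggregation steps listed above can be sketched as follows. This is a rough illustration under assumptions: the card does not say how weeks are delimited, so ISO calendar weeks are used here, and the function name, dates, and labels are made up. The Mistral-7B-Instruct summarization step is not shown.

```python
from collections import Counter, defaultdict
from datetime import date

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_sentiment_ratios(daily_predictions):
    """Group (date, label) pairs by ISO week and return per-week label shares."""
    counts = defaultdict(Counter)
    for day, label in daily_predictions:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])][label] += 1  # key: (ISO year, ISO week)

    ratios = {}
    for week, counter in sorted(counts.items()):
        total = sum(counter.values())
        ratios[week] = {
            name: counter[label] / total for label, name in LABEL_NAMES.items()
        }
    return ratios


# Illustrative predictions only; dates and labels are invented for the example.
predictions = [
    (date(2024, 1, 1), 1), (date(2024, 1, 2), 1),
    (date(2024, 1, 3), 0), (date(2024, 1, 4), -1),
    (date(2024, 1, 8), -1), (date(2024, 1, 9), -1),
]
weekly = weekly_sentiment_ratios(predictions)
for week, shares in weekly.items():
    print(week, shares)
```

The resulting per-week ratios are the kind of structured input that could be placed into a prompt for the summarization model.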

🌟 Intended Use

This repository is intended for:

Research on embedding-based sentiment classification
Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
Demonstrating weekly sentiment aggregation
Benchmarking classical ML approaches for small financial text datasets

⚠️ Limitations

Small dataset (349 samples)
Potential overfitting of classical models
Not designed for automated trading or real-time systems
Weekly summaries rely on LLM outputs and may include stylistic bias
No direct price prediction or financial forecasting

This model is best used for experimentation and learning.

📘 Citation

If you use this work, please cite:

Roy, Joyjit. “NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach.” arXiv preprint, 2025.
Dataset: Joyjit Roy. “NLP Stock Sentiment Analysis — AI for Market Signals.” Zenodo. https://doi.org/10.5281/zenodo.17510735