metadata
license: mit
language:
  - en
metrics:
  - accuracy
  - precision
  - recall
  - f1
pipeline_tag: text-classification
library_name: sklearn
tags:
  - finance
  - sentiment-analysis
  - embeddings
  - gradient-boosting
  - classical-ml
  - market-analysis
  - nlp
  - weekly-sentiment

📰 NLP Stock Sentiment Analysis: Embedding-Based Models for Market Signals


This model card documents the resources and workflow described in the paper:

“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach”
(Preprint submitted to arXiv, 2025)

Author: Joyjit Roy
Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis

This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.


🔍 Project Overview

The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:

1. Word2Vec (300D)

  • Trained locally on the news corpus
  • Mean-pooled per headline

2. GloVe (100D)

  • Pretrained vectors
  • Mean-pooled per headline

3. SentenceTransformer Embeddings (384D)

  • Model: all-MiniLM-L6-v2
  • Direct sentence embeddings without fine-tuning
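Mean pooling, as used for the Word2Vec and GloVe variants above, averages the vectors of a headline's tokens into one fixed-length vector. The sketch below is a minimal illustration, not the project's code: the whitespace tokenizer, the zero-vector fallback for headlines with no known tokens, and the 3-dimensional toy vectors are all assumptions made for the example.

```python
import numpy as np

def mean_pool(headline, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none match."""
    tokens = headline.lower().split()  # assumed tokenizer: lowercase + whitespace
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Toy 3-dimensional vectors, invented for illustration only.
# Real Word2Vec / GloVe vectors would be 300- or 100-dimensional.
toy_vectors = {
    "shares": np.array([1.0, 0.0, 2.0]),
    "rally": np.array([3.0, 2.0, 0.0]),
}

# "sharply" is out of vocabulary and is skipped.
pooled = mean_pool("Shares rally sharply", toy_vectors, dim=3)
print(pooled)  # [2. 1. 1.]
```

Every headline maps to a vector of the same length regardless of word count, which is what lets a standard classifier consume the result.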

For each embedding approach, a Gradient Boosting classifier is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
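A hedged sketch of the training step might look like the following. It uses scikit-learn's `GradientBoostingClassifier` on synthetic features standing in for headline embeddings. The 80/20 stratified split, `n_estimators=50`, and the random data are assumptions for illustration; the card does not state the actual split or tuned hyperparameters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for headline embeddings: 120 rows of 8 features.
# Real inputs would be the 300-, 100-, or 384-dimensional vectors.
X = rng.normal(size=(120, 8))
y = rng.choice([-1, 0, 1], size=120)  # same label scheme as the dataset

# Assumed split: 80% train, 20% validation, stratified by label.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed hyperparameters; the tuned values are not given in this card.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

val_pred = clf.predict(X_val)
val_accuracy = clf.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Because the features here are random noise, the validation accuracy is not meaningful; the point is only the shape of the fit-and-evaluate loop that would be repeated once per embedding type.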

Note: These are scikit-learn models and are not deployed as Hugging Face inference endpoints.


💽 Dataset Summary

The dataset contains 349 financial news headlines, each paired with OHLCV market indicators and a sentiment label:
1 = positive, 0 = neutral, -1 = negative.

The dataset is available via Zenodo:
👉 https://doi.org/10.5281/zenodo.17510735

📊 Data Dictionary

| Column | Description |
| --- | --- |
| Date | Date the news item was released |
| News | Headline or snippet text |
| Open | Opening price (USD) |
| High | Highest price of the day (USD) |
| Low | Lowest price of the day (USD) |
| Close | Adjusted closing price (USD) |
| Volume | Total shares traded |
| Label | Sentiment (1 = positive, 0 = neutral, -1 = negative) |
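As a rough sketch of how the columns above could be read and checked, the example below parses a CSV with the standard library and validates the header and label values. The CSV layout, the date format, and the two sample rows are invented for illustration; the actual file on Zenodo may be organized differently.

```python
import csv
import io

EXPECTED_COLUMNS = ["Date", "News", "Open", "High", "Low", "Close", "Volume", "Label"]
LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}

# Two invented rows standing in for the real file.
sample_csv = io.StringIO(
    "Date,News,Open,High,Low,Close,Volume,Label\n"
    "2019-01-02,Example headline one,10.0,11.0,9.5,10.5,1000,1\n"
    "2019-01-03,Example headline two,10.5,10.8,9.9,10.0,1200,-1\n"
)

def load_rows(handle):
    """Parse rows, failing loudly on an unexpected header or label value."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for raw in reader:
        label = int(raw["Label"])
        if label not in LABEL_NAMES:
            raise ValueError(f"unknown label: {label}")
        rows.append({
            "date": raw["Date"],
            "news": raw["News"],
            "close": float(raw["Close"]),
            "volume": int(raw["Volume"]),
            "label": label,
        })
    return rows

rows = load_rows(sample_csv)
print(len(rows), LABEL_NAMES[rows[0]["label"]])  # 2 positive
```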

📊 Model Performance (Validation)

Across all embedding variants, the tuned GloVe + Gradient Boosting model achieved the strongest validation results:

  • Accuracy: 0.714
  • Precision: 0.758
  • Recall: 0.714
  • F1 Score: 0.694
  • Error Rate: 0.286

All models reach perfect training accuracy, which on a dataset this small points to overfitting, so validation metrics are used for comparison. Full results appear in the associated paper (Table 3).
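The reported error rate is simply 1 minus accuracy (1 - 0.714 = 0.286). The reported recall also equals the accuracy, which is what support-weighted averaging always produces; the card does not state the averaging mode, so weighted averaging is an assumption in the sketch below. The labels in the example are invented and unrelated to the actual validation set.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy, error rate, and support-weighted precision / recall / F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / n  # each class weighted by its share of true labels
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1.0 - accuracy,
    }

# Invented labels for illustration (6 of 8 correct).
y_true = [1, 1, 1, 0, 0, -1, -1, -1]
y_pred = [1, 1, 0, 0, 0, -1, 1, -1]
scores = weighted_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

In this toy case weighted recall and accuracy both come out to 0.75, mirroring the accuracy/recall match in the reported figures.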


🧠 Weekly Sentiment Summaries

To show how sentiment predictions can support market interpretation:

  1. Daily predictions are aggregated by week
  2. Sentiment ratios (positive / neutral / negative) are computed
  3. Weekly summaries are generated using Mistral-7B-Instruct
  4. These summaries provide narrative insight into market mood

This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
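Steps 1 and 2 of the workflow above can be sketched with the standard library as follows. Grouping by ISO calendar week is an assumption, since the card does not say how weeks are defined, and the dated predictions are invented. The LLM summarization step is left out.

```python
from collections import Counter, defaultdict
from datetime import date

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}

def weekly_ratios(predictions):
    """Map (ISO year, ISO week) to the share of each sentiment class."""
    counts = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()  # (year, week, weekday)
        counts[(iso[0], iso[1])][LABEL_NAMES[label]] += 1

    ratios = {}
    for week, counter in sorted(counts.items()):
        total = sum(counter.values())
        ratios[week] = {
            name: counter[name] / total for name in LABEL_NAMES.values()
        }
    return ratios

# Invented (date, predicted label) pairs for illustration.
predictions = [
    (date(2024, 1, 1), 1),
    (date(2024, 1, 2), 1),
    (date(2024, 1, 3), -1),
    (date(2024, 1, 4), 0),
    (date(2024, 1, 8), -1),
    (date(2024, 1, 9), -1),
]

for week, shares in weekly_ratios(predictions).items():
    print(week, shares)
# (2024, 1) {'positive': 0.5, 'neutral': 0.25, 'negative': 0.25}
# (2024, 2) {'positive': 0.0, 'neutral': 0.0, 'negative': 1.0}
```

The per-week ratio dictionaries are the kind of structured input that could then be formatted into a prompt for the summarization model.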


🌟 Intended Use

This repository is intended for:

  • Research on embedding-based sentiment classification
  • Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
  • Demonstrating weekly sentiment aggregation
  • Benchmarking classical ML approaches for small financial text datasets

⚠️ Limitations

  • Small dataset (349 samples)
  • Potential overfitting of classical models
  • Not designed for automated trading or real-time systems
  • Weekly summaries rely on LLM outputs and may include stylistic bias
  • No direct price prediction or financial forecasting

This model is best used for experimentation and learning.


📘 Citation

If you use this work, please cite:

Roy, Joyjit. “NLP Stock Sentiment Analysis: A Comparative Embedding-Based Approach.” arXiv preprint, 2025.
Dataset: Roy, Joyjit. “NLP Stock Sentiment Analysis - AI for Market Signals.” Zenodo. https://doi.org/10.5281/zenodo.17510735