---
pipeline_tag: text-classification
tags:
- text-classification
- cefr
- word2vec
- doc2vec
- nlp
language:
- en
license: mit
---

# CEFR Doc2Vec Classifier

A Doc2Vec-based neural network model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency level.

The source code used to train this model is available at: https://github.com/luantran/One-model-to-grade-them-all

## Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels. The Doc2Vec classifier feeds document embeddings into a fully connected neural network to capture semantic patterns characteristic of different proficiency levels.

The other models in this ensemble are:

- https://huggingface.co/theluantran/cefr-naive-bayes
- https://huggingface.co/theluantran/cefr-bert-classifier

## Labels

The model classifies text into 5 CEFR proficiency levels:

* **A1**: Beginner
* **A2**: Elementary
* **B1**: Intermediate
* **B2**: Upper Intermediate
* **C1/C2**: Advanced

## Model Details

* **Type**: Doc2Vec + fully connected neural network
* **Frameworks**: gensim (Doc2Vec), PyTorch (neural network)
* **Task**: Multi-class text classification
* **Architecture**:
  * Doc2Vec embedding: document vectors of the dimension given by `embedding_dim` in `config.json`
  * Neural network: 128 hidden units with dropout (0.3)
  * Output: 5-class softmax classification
* **Input**: Raw text strings
* **Output**: Class predictions (0-4) with probability distributions
* **Files**:
  * `doc2vec_model.bin`: Trained Doc2Vec model (gensim binary format)
  * `nn_weights.pth`: Neural network state dictionary (PyTorch)
  * `config.json`: Model configuration (embedding_dim, hidden_dim, num_classes, dropout_rate)

## Usage

### Basic Prediction

```python
from huggingface_hub import snapshot_download
from gensim.models import Doc2Vec
import torch
import torch.nn as nn
import numpy as np
import json
import os

# Download model files
local_dir = "./doc2vec_model"
snapshot_download(
    repo_id="theluantran/cefr-doc2vec",
    local_dir=local_dir,
    local_dir_use_symlinks=False,
    allow_patterns=[
        "doc2vec_model*",
        "*.json",
        "nn_weights.pth"
    ]
)

# Define the neural network architecture
class Doc2VecClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, num_classes, dropout=0.3):
        super(Doc2VecClassifier, self).__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Load the Doc2Vec model
doc2vec_model = Doc2Vec.load(os.path.join(local_dir, "doc2vec_model.bin"))

# Load the configuration
with open(os.path.join(local_dir, "config.json"), 'r') as f:
    config = json.load(f)

# The Doc2Vec vector size and the classifier's input dimension must agree
assert doc2vec_model.vector_size == config['embedding_dim']

# Reconstruct and load the neural network
neural_network = Doc2VecClassifier(
    embedding_dim=config['embedding_dim'],
    hidden_dim=config['hidden_dim'],
    num_classes=config['num_classes'],
    dropout=config['dropout_rate']
)
neural_network.load_state_dict(
    torch.load(os.path.join(local_dir, "nn_weights.pth"), map_location="cpu")
)
neural_network.eval()

# Predict
text = "This is a sample text to classify"
vector = doc2vec_model.infer_vector(text.split())

with torch.no_grad():
    tensor = torch.FloatTensor(vector).unsqueeze(0)
    output = neural_network(tensor)
    probabilities = torch.softmax(output, dim=1)

probs_array = probabilities.numpy()[0]
prediction = int(np.argmax(probs_array))

# Map the numeric prediction to a CEFR level
level_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
predicted_level = level_map[prediction]

print(f"Predicted level: {predicted_level}")
print(f"Confidence: {max(probs_array):.2%}")
print(f"All probabilities: {dict(zip(level_map.values(), probs_array))}")
```

## Model Configuration

The `config.json` file contains the following parameters:

```json
{
  "embedding_dim": 100,
  "hidden_dim": 128,
  "num_classes": 5,
  "dropout_rate": 0.3
}
```

## Training

This model was trained on proprietary CEFR-labeled text data. The training process involves three steps, sketched in the example below:

1. **Doc2Vec Embedding Training**: Training Doc2Vec embeddings on the corpus with 10 epochs and a minimum word count of 1
2. **Document Vector Generation**: Generating a document vector (of dimension `embedding_dim`, see `config.json`) for every training sample
3. **Neural Network Training**: Training a fully connected neural network classifier on these embeddings
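
The exact training script lives in the GitHub repository linked above; the following is a minimal sketch of these three steps, reusing the `Doc2VecClassifier` class from the Usage example. The toy corpus, optimizer, learning rate, and classifier epoch count are illustrative assumptions, not documented hyperparameters.

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus: parallel lists of raw texts and integer labels (0-4)
texts = ["I like my school .", "The mitigation of systemic risk demands scrutiny ."]
labels = [0, 4]

# Step 1: train Doc2Vec embeddings (epochs=10 and min_count=1 per this card;
# vector_size should match embedding_dim in config.json)
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]
doc2vec = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=10)

# Step 2: generate a document vector for every training sample
X = torch.FloatTensor(np.array([doc2vec.infer_vector(t.split()) for t in texts]))
y = torch.LongTensor(labels)

# Step 3: train the classifier (optimizer, learning rate, and epoch count
# here are assumptions, not confirmed by the card)
clf = Doc2VecClassifier(embedding_dim=100, hidden_dim=128,
                        num_classes=5, dropout=0.3)
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

clf.train()
for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(clf(X), y)
    loss.backward()
    optimizer.step()

# Persist both pieces in the formats this repository ships
doc2vec.save("doc2vec_model.bin")
torch.save(clf.state_dict(), "nn_weights.pth")
```

Setting `min_count=1` keeps every token in the Doc2Vec vocabulary, so even rare words in learner text contribute to the document vectors.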
## License

This model is released under the MIT license (see the metadata above) for research and educational purposes. The training data is proprietary and is not included.