---
library_name: transformers
license: mit
base_model: BAAI/bge-small-en-v1.5
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: bge-small-en-v1.5-ultrafineweb-vs-pile-classifier
  results: []
datasets:
- openbmb/Ultra-FineWeb
- EleutherAI/the_pile_deduplicated
language:
- en
pipeline_tag: text-classification
---

# bge-small-en-v1.5-ultrafineweb-vs-pile-classifier

> [!IMPORTANT]
> **Note:** This model is provided for reference and reproducibility, not for standalone use.

This model is a fine-tuned version of [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) that classifies text as high or low quality for AI training.

- Trained on 100k samples from [openbmb/Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) (high quality) and 100k samples from [EleutherAI/the_pile_deduplicated](https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated) (low quality)
- 80% training / 20% validation split

On the validation set:

- Loss: 0.2926
- Accuracy: 0.9061
- Combined Score: 2.1448
- Input tokens seen: 102,184,960

## Example

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="agentlans/bge-small-en-v1.5-ultrafineweb-vs-pile-classifier",
)
print(classifier("Your text here."))
```

For scoring many documents at once, see the batch-scoring sketch at the end of this card.

## Limitations

- Tends to be overly strict, labelling most texts outside its training distribution as low quality
- English only
- May be biased against some text types, such as source code and personal blogs

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 5.0

A hedged sketch of a training script consistent with these settings appears at the end of this card.

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | Combined Score | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:--------------:|:-----------------:|
| 0.2893        | 1.0   | 19958 | 0.2926          | 0.9061   | 2.1448         | 20,436,992        |
| 0.2397        | 2.0   | 39916 | 0.3127          | 0.9076   | 2.1194         | 40,873,984        |
| 0.2           | 3.0   | 59874 | 0.3279          | 0.9109   | 2.0605         | 61,310,976        |
| 0.1576        | 4.0   | 79832 | 0.3887          | 0.9080   | 2.1119         | 81,747,968        |
| 0.1127        | 5.0   | 99790 | 0.4688          | 0.9069   | 2.1308         | 102,184,960       |

### Framework versions

- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
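
## Batch scoring (sketch)

The single-call example above covers quick checks. For filtering a corpus you typically want batched inference with truncation. The sketch below is illustrative rather than canonical: the card does not document the classifier's label names or a recommended decision threshold, so the code simply prints whatever labels the pipeline returns; `batch_size` and the sample documents are arbitrary choices.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="agentlans/bge-small-en-v1.5-ultrafineweb-vs-pile-classifier",
)

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "click here FREE download!!! best prices best prices best prices",
]

# truncation=True clips each input to the encoder's maximum length
# (512 tokens for BGE-small); batch_size groups documents per forward pass.
for doc, result in zip(docs, classifier(docs, truncation=True, batch_size=32)):
    # Each result is a dict like {"label": ..., "score": ...}. The label
    # names come from the model's config and are not documented on this card.
    print(f"{result['label']}\t{result['score']:.3f}\t{doc[:60]}")
```

Given the strictness noted under Limitations, the score is best treated as a ranking signal for corpus filtering rather than a calibrated probability of quality.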
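
## Reproducing the fine-tuning (sketch)

The original training script is not included with this card. The following is a minimal sketch consistent with the data description and hyperparameters above, not the author's actual code. In particular, the dataset splits (`en` for Ultra-FineWeb, `train` for the Pile), the text field names (`content` and `text`), the label convention (1 = high quality, 0 = low quality), and the 512-token truncation length are assumptions; the "Combined Score" metric is omitted because its definition is not given on this card.

```python
from itertools import islice

from datasets import Dataset, concatenate_datasets, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

BASE = "BAAI/bge-small-en-v1.5"
N = 100_000  # samples per class, per the description above


def take(name, split, text_field, label, n=N):
    """Stream the first n examples and attach a binary quality label."""
    stream = load_dataset(name, split=split, streaming=True)
    rows = [{"text": ex[text_field], "label": label} for ex in islice(stream, n)]
    return Dataset.from_list(rows)


# Splits and field names are assumptions; check the dataset pages first.
high = take("openbmb/Ultra-FineWeb", "en", "content", 1)            # high quality
low = take("EleutherAI/the_pile_deduplicated", "train", "text", 0)  # low quality

# 80/20 split, as described above (train_test_split shuffles by default).
data = concatenate_datasets([high, low]).train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained(BASE)
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)


def compute_metrics(p):
    preds = p.predictions.argmax(-1)
    return {"accuracy": float((preds == p.label_ids).mean())}


args = TrainingArguments(
    output_dir="bge-small-en-v1.5-ultrafineweb-vs-pile-classifier",
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5.0,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    seed=42,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # keep the lowest-validation-loss epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```

Note that the reported validation metrics on this card (loss 0.2926, accuracy 0.9061) match the epoch-1 row of the results table, which is consistent with keeping the checkpoint with the lowest validation loss.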