# Model Card

## Model Description

This is not a real model - just a test of publishing evals of [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0).
## Evaluation Results

### Hellaswag

- HellaSwag measures a model's ability to pick the most plausible completion of a sentence (commonsense inference).
- website
- eleuther task
| Metric | Value | Stderr |
|---|---|---|
| acc | 0.2872 | 0.0045 |
| acc_norm | 0.3082 | 0.0046 |
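The table reports both acc and acc_norm. As I understand the harness's scoring, acc picks the answer with the highest raw log-likelihood, while acc_norm divides each candidate's log-likelihood by its byte length first, removing the bias toward short answers. A minimal sketch with invented scores and endings:

```python
# Sketch of acc vs. acc_norm scoring for a multiple-choice task like HellaSwag.
# The log-likelihoods and endings below are made up for illustration.
def pick(loglikelihoods, endings, normalize):
    scores = [
        ll / len(e.encode("utf-8")) if normalize else ll
        for ll, e in zip(loglikelihoods, endings)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

endings = ["a short ending", "a much longer, more detailed ending"]
lls = [-12.0, -18.0]  # raw log-likelihood favors the short ending
print(pick(lls, endings, normalize=False))  # acc-style choice -> 0
print(pick(lls, endings, normalize=True))   # acc_norm-style choice -> 1
```

The two metrics can disagree, as here: length normalization flips the choice to the longer ending.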
### GLUE

- GLUE is a multi-task language-understanding benchmark with a number of [subtasks](https://gluebenchmark.com/tasks)
- eleuther task
| Tasks | Metric | Value | Stderr |
|---|---|---|---|
| cola | mcc | 0.0000 | 0.0000 |
| mnli | acc | 0.3484 | 0.0048 |
| mnli_mismatch | acc | 0.3463 | 0.0048 |
| mrpc | acc | 0.6838 | 0.0230 |
| mrpc | f1 | 0.8122 | 0.0163 |
| qnli | acc | 0.4959 | 0.0068 |
| qqp | acc | 0.3678 | 0.0024 |
| qqp | f1 | 0.5373 | 0.0026 |
| rte | acc | 0.5271 | 0.0301 |
| sst2 | acc | 0.5092 | 0.0169 |
| wnli | acc | 0.4225 | 0.0590 |
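The table mixes three metric types: acc, f1 (mrpc, qqp), and mcc (cola). A minimal sketch of how f1 and mcc are computed from binary-classification counts, with invented numbers, including one way a model can score exactly 0 mcc (always predicting the majority class), which is what cola's 0.0000 above suggests:

```python
import math

# Toy sketch of the two non-accuracy metrics in the GLUE table above.
# All counts below are invented for illustration.

def f1_score(tp, fp, fn):
    """F1 (reported for mrpc and qqp): harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (reported for cola)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(f1_score(tp=2, fp=1, fn=0))     # 0.8
# Predicting "acceptable" for every sentence gives mcc == 0.
print(mcc(tp=50, tn=0, fp=50, fn=0))  # 0.0
```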
Subtask breakdown, courtesy of ChatGPT:
#### GLUE Benchmark Breakdown

GLUE (General Language Understanding Evaluation) is a collection of tasks designed to evaluate natural language understanding (NLU) models. The benchmark includes various subtasks that test different aspects of language comprehension.

#### Subtasks
1. CoLA (Corpus of Linguistic Acceptability)
   - Task: Sentence acceptability judgment (grammaticality).
   - Goal: Determine if a sentence is grammatically acceptable.
   - Input: A single sentence.
   - Output: Binary classification (grammatically correct or not).
2. SST-2 (Stanford Sentiment Treebank)
   - Task: Sentiment analysis.
   - Goal: Classify a sentence's sentiment as positive or negative.
   - Input: A single sentence.
   - Output: Binary classification (positive or negative sentiment).
3. MRPC (Microsoft Research Paraphrase Corpus)
   - Task: Paraphrase detection.
   - Goal: Determine if two sentences are semantically equivalent.
   - Input: Two sentences.
   - Output: Binary classification (paraphrase or not).
4. STS-B (Semantic Textual Similarity Benchmark)
   - Task: Sentence similarity assessment.
   - Goal: Assess the degree of similarity between two sentences.
   - Input: Two sentences.
   - Output: Real-valued score (0 to 5) indicating similarity.
5. QQP (Quora Question Pairs)
   - Task: Question pair similarity.
   - Goal: Determine whether two questions are semantically equivalent.
   - Input: Two questions.
   - Output: Binary classification (equivalent or not).
6. MNLI (Multi-Genre Natural Language Inference)
   - Task: Textual entailment.
   - Goal: Determine the relationship between a premise and a hypothesis (entailment, contradiction, or neutral).
   - Input: A premise and a hypothesis.
   - Output: Three-way classification (entailment, contradiction, or neutral).
   - EH NOTE: "Genre" here refers to the type of language source (telephone calls, fiction writing, etc.).
   - Variants:
     - Matched: In-domain evaluation (test data from the same genres as the training data).
     - Mismatched: Out-of-domain evaluation (test data from different genres than the training data).
7. QNLI (Question Natural Language Inference)
   - Task: Question answering in inference format.
   - Goal: Determine if a sentence answers a question.
   - Input: A question and a sentence.
   - Output: Binary classification (entailment or not entailment).
8. RTE (Recognizing Textual Entailment)
   - Task: Textual entailment.
   - Goal: Determine if a premise entails a hypothesis.
   - Input: A premise and a hypothesis.
   - Output: Binary classification (entailment or not).
9. WNLI (Winograd Natural Language Inference)
   - Task: Coreference resolution.
   - Goal: Determine whether a pronoun in a sentence refers to a given noun.
   - Input: A sentence with a pronoun and a candidate antecedent.
   - Output: Binary classification (correct or incorrect coreference).
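Of the subtasks above, STS-B is the only regression-style task: it is typically scored with Pearson (and Spearman) correlation between model scores and human ratings rather than accuracy. A minimal Pearson sketch with invented score pairs:

```python
import math

# Pearson correlation, the usual metric for STS-B's real-valued
# similarity scores. All numbers below are invented for illustration.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [0.8, 2.5, 3.0, 4.8]  # human similarity ratings on the 0-5 scale
pred = [1.2, 2.0, 3.5, 4.5]  # hypothetical model predictions
print(pearson(gold, pred))   # close to 1.0: predictions track the ratings
```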
#### Summary of Subtasks
- CoLA: Grammaticality judgment (binary classification).
- SST-2: Sentiment analysis (binary classification).
- MRPC: Paraphrase detection (binary classification).
- STS-B: Sentence similarity (regression score).
- QQP: Question pair similarity (binary classification).
- MNLI: Textual entailment (three-way classification).
  - Matched: In-domain data.
  - Mismatched: Out-of-domain data.
- QNLI: Question answering entailment (binary classification).
- RTE: Textual entailment (binary classification).
- WNLI: Coreference resolution (binary classification).
## How to Use
... don't use this, it's just a test ...