
Model Card

Model Description

This is not a real model; it is just a test publishing evals of TinyLlama/TinyLlama-1.1B-Chat-v1.0.

Evaluation Results

Hellaswag

| Metric   | Value  | Stderr |
|----------|--------|--------|
| acc      | 0.2872 | 0.0045 |
| acc_norm | 0.3082 | 0.0046 |
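The reported standard errors are consistent with the usual binomial estimate for a proportion, se = sqrt(p(1 - p) / n). A minimal sketch (the sample size of 10,042 is HellaSwag's validation-split size, assumed here rather than read from the eval output):

```python
import math

def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n samples."""
    return math.sqrt(p * (1 - p) / n)

# HellaSwag's validation split has 10,042 examples (assumed here).
se = binomial_stderr(0.2872, 10_042)
print(round(se, 4))  # 0.0045, matching the reported stderr for acc
```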

GLUE

| Task          | Metric | Value  | Stderr |
|---------------|--------|--------|--------|
| cola          | mcc    | 0.0000 | 0.0000 |
| mnli          | acc    | 0.3484 | 0.0048 |
| mnli_mismatch | acc    | 0.3463 | 0.0048 |
| mrpc          | acc    | 0.6838 | 0.0230 |
| mrpc          | f1     | 0.8122 | 0.0163 |
| qnli          | acc    | 0.4959 | 0.0068 |
| qqp           | acc    | 0.3678 | 0.0024 |
| qqp           | f1     | 0.5373 | 0.0026 |
| rte           | acc    | 0.5271 | 0.0301 |
| sst2          | acc    | 0.5092 | 0.0169 |
| wnli          | acc    | 0.4225 | 0.0590 |

Subtask breakdown, courtesy of ChatGPT:

GLUE Benchmark Breakdown

GLUE (General Language Understanding Evaluation) is a collection of tasks designed to evaluate natural language understanding (NLU) models. The benchmark includes various subtasks that test different aspects of language comprehension.

Subtasks

1. CoLA (Corpus of Linguistic Acceptability)

  • Task: Sentence acceptability judgment (grammaticality).
  • Goal: Determine if a sentence is grammatically acceptable.
  • Input: A single sentence.
  • Output: Binary classification (grammatically correct or not).
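CoLA is scored with the Matthews correlation coefficient (MCC), which is 0 for a degenerate classifier that always predicts one class, consistent with the 0.0000 score in the table above. A minimal sketch of the metric (toy labels, not actual CoLA data):

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is 0 when any confusion-matrix margin is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# A classifier that always answers "acceptable" gets MCC 0.
print(mcc([1, 0, 1, 1, 0], [1, 1, 1, 1, 1]))  # 0.0
```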

2. SST-2 (Stanford Sentiment Treebank)

  • Task: Sentiment analysis.
  • Goal: Classify a sentence's sentiment as positive or negative.
  • Input: A single sentence.
  • Output: Binary classification (positive or negative sentiment).

3. MRPC (Microsoft Research Paraphrase Corpus)

  • Task: Paraphrase detection.
  • Goal: Determine if two sentences are semantically equivalent.
  • Input: Two sentences.
  • Output: Binary classification (paraphrase or not).
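MRPC reports both accuracy and F1 because its label distribution is skewed toward the positive (paraphrase) class. F1 over the positive class can be sketched as (toy labels for illustration):

```python
def f1_score(y_true, y_pred):
    """F1 over the positive class (paraphrase = 1) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.8
```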

4. STS-B (Semantic Textual Similarity Benchmark)

  • Task: Sentence similarity assessment.
  • Goal: Assess the degree of similarity between two sentences.
  • Input: Two sentences.
  • Output: Real-valued score (0 to 5) indicating similarity.
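Because STS-B is a regression task, it is scored by correlating predicted similarity scores with human ratings (Pearson and Spearman correlation) rather than by accuracy. A minimal Pearson sketch (toy scores, not actual STS-B data):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Predictions that track the gold ratings closely correlate near 1.
print(pearson([1.0, 2.5, 4.0, 5.0], [1.2, 2.0, 3.8, 4.9]))
```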

5. QQP (Quora Question Pairs)

  • Task: Question pair similarity.
  • Goal: Determine whether two questions are semantically equivalent.
  • Input: Two questions.
  • Output: Binary classification (equivalent or not).

6. MNLI (Multi-Genre Natural Language Inference)

  • Task: Textual entailment.
  • Goal: Determine the relationship between a premise and a hypothesis (entailment, contradiction, or neutral).
  • Input: A premise and a hypothesis.
  • Output: Three-way classification (entailment, contradiction, or neutral).
  • EH NOTE: Genre here refers to the type of language source (telephone calls, fiction writing, etc.).
  • Variants:
    • Matched: In-domain evaluation (test data from the same genres as training data).
    • Mismatched: Out-of-domain evaluation (test data from different genres than training data).
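The MNLI scores in the table above (~0.348 matched, ~0.346 mismatched) sit at the ~1/3 chance floor for a roughly balanced three-way task, which is what a constant-prediction baseline achieves. A minimal sketch (toy labels, assumed balanced as MNLI approximately is):

```python
from collections import Counter

def majority_baseline_acc(y_true):
    """Accuracy of always predicting the most frequent label."""
    _, count = Counter(y_true).most_common(1)[0]
    return count / len(y_true)

labels = ["entailment", "neutral", "contradiction"] * 100  # toy, balanced
print(majority_baseline_acc(labels))  # 1/3
```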

7. QNLI (Question Natural Language Inference)

  • Task: Question answering in inference format.
  • Goal: Determine if a sentence answers a question.
  • Input: A question and a sentence.
  • Output: Binary classification (entailment or not entailment).

8. RTE (Recognizing Textual Entailment)

  • Task: Textual entailment.
  • Goal: Determine if a premise entails a hypothesis.
  • Input: A premise and a hypothesis.
  • Output: Binary classification (entailment or not).

9. WNLI (Winograd Natural Language Inference)

  • Task: Coreference resolution.
  • Goal: Determine whether a pronoun in a sentence refers to a given noun.
  • Input: A sentence with a pronoun and a candidate antecedent.
  • Output: Binary classification (correct or incorrect coreference).

Summary of Subtasks:

  • CoLA: Grammaticality judgment (binary classification).
  • SST-2: Sentiment analysis (binary classification).
  • MRPC: Paraphrase detection (binary classification).
  • STS-B: Sentence similarity (regression score).
  • QQP: Question pair similarity (binary classification).
  • MNLI: Textual entailment (three-way classification).
    • Matched: In-domain data.
    • Mismatched: Out-of-domain data.
  • QNLI: Question answering entailment (binary classification).
  • RTE: Textual entailment (binary classification).
  • WNLI: Coreference resolution (binary classification).

How to Use

... don't use this, it's just a test ...
