
Model Card

Model Description

This is not a real model; it is just a test publishing evals of TinyLlama/TinyLlama-1.1B-Chat-v1.0.

Evaluation Results

Hellaswag

| Metric   | Value  | Stderr |
|----------|--------|--------|
| acc      | 0.2872 | 0.0045 |
| acc_norm | 0.3082 | 0.0046 |
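The reported standard errors are consistent with the usual binomial estimate for a proportion, se = sqrt(p(1 - p) / n). A minimal sketch (the sample size of 10,042 is HellaSwag's validation-split size, assumed here rather than read from the eval output):

```python
import math

def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n samples."""
    return math.sqrt(p * (1 - p) / n)

# HellaSwag's validation split has 10,042 examples (assumed here).
se = binomial_stderr(0.2872, 10_042)
print(round(se, 4))  # 0.0045, matching the reported stderr for acc
```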

GLUE

| Task          | Metric | Value  | Stderr |
|---------------|--------|--------|--------|
| cola          | mcc    | 0.0000 | 0.0000 |
| mnli          | acc    | 0.3484 | 0.0048 |
| mnli_mismatch | acc    | 0.3463 | 0.0048 |
| mrpc          | acc    | 0.6838 | 0.0230 |
| mrpc          | f1     | 0.8122 | 0.0163 |
| qnli          | acc    | 0.4959 | 0.0068 |
| qqp           | acc    | 0.3678 | 0.0024 |
| qqp           | f1     | 0.5373 | 0.0026 |
| rte           | acc    | 0.5271 | 0.0301 |
| sst2          | acc    | 0.5092 | 0.0169 |
| wnli          | acc    | 0.4225 | 0.0590 |

Subtask breakdown, courtesy of ChatGPT:

GLUE Benchmark Breakdown

GLUE (General Language Understanding Evaluation) is a collection of tasks designed to evaluate natural language understanding (NLU) models. The benchmark includes various subtasks that test different aspects of language comprehension.

Subtasks

1. CoLA (Corpus of Linguistic Acceptability)

  • Task: Sentence acceptability judgment (grammaticality).
  • Goal: Determine if a sentence is grammatically acceptable.
  • Input: A single sentence.
  • Output: Binary classification (grammatically correct or not).
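CoLA is scored with the Matthews correlation coefficient (MCC), which is 0 for a degenerate classifier that always predicts one class, consistent with the 0.0000 score in the table above. A minimal sketch of the metric (toy labels, not actual CoLA data):

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is 0 when any confusion-matrix margin is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# A classifier that always answers "acceptable" gets MCC 0.
print(mcc([1, 0, 1, 1, 0], [1, 1, 1, 1, 1]))  # 0.0
```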

2. SST-2 (Stanford Sentiment Treebank)

  • Task: Sentiment analysis.
  • Goal: Classify a sentence's sentiment as positive or negative.
  • Input: A single sentence.
  • Output: Binary classification (positive or negative sentiment).

3. MRPC (Microsoft Research Paraphrase Corpus)

  • Task: Paraphrase detection.
  • Goal: Determine if two sentences are semantically equivalent.
  • Input: Two sentences.
  • Output: Binary classification (paraphrase or not).
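MRPC reports both accuracy and F1 because its label distribution is skewed toward the positive (paraphrase) class. F1 over the positive class can be sketched as (toy labels for illustration):

```python
def f1_score(y_true, y_pred):
    """F1 over the positive class (paraphrase = 1) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.8
```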

4. STS-B (Semantic Textual Similarity Benchmark)

  • Task: Sentence similarity assessment.
  • Goal: Assess the degree of similarity between two sentences.
  • Input: Two sentences.
  • Output: Real-valued score (0 to 5) indicating similarity.
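Because STS-B is a regression task, it is scored by correlating predicted similarity scores with human ratings (Pearson and Spearman correlation) rather than by accuracy. A minimal Pearson sketch (toy scores, not actual STS-B data):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Predictions that track the gold ratings closely correlate near 1.
print(pearson([1.0, 2.5, 4.0, 5.0], [1.2, 2.0, 3.8, 4.9]))
```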

5. QQP (Quora Question Pairs)

  • Task: Question pair similarity.
  • Goal: Determine whether two questions are semantically equivalent.
  • Input: Two questions.
  • Output: Binary classification (equivalent or not).

6. MNLI (Multi-Genre Natural Language Inference)

  • Task: Textual entailment.
  • Goal: Determine the relationship between a premise and a hypothesis (entailment, contradiction, or neutral).
  • Input: A premise and a hypothesis.
  • Output: Three-way classification (entailment, contradiction, or neutral).
  • EH NOTE: Genre here refers to the type of language source (telephone calls, fiction writing, etc.).
  • Variants:
    • Matched: In-domain evaluation (test data from the same genres as training data).
    • Mismatched: Out-of-domain evaluation (test data from different genres than training data).
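The MNLI scores in the table above (~0.348 matched, ~0.346 mismatched) sit at the ~1/3 chance floor for a roughly balanced three-way task, which is what a constant-prediction baseline achieves. A minimal sketch (toy labels, assumed balanced as MNLI approximately is):

```python
from collections import Counter

def majority_baseline_acc(y_true):
    """Accuracy of always predicting the most frequent label."""
    _, count = Counter(y_true).most_common(1)[0]
    return count / len(y_true)

labels = ["entailment", "neutral", "contradiction"] * 100  # toy, balanced
print(majority_baseline_acc(labels))  # 1/3
```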

7. QNLI (Question Natural Language Inference)

  • Task: Question answering in inference format.
  • Goal: Determine if a sentence answers a question.
  • Input: A question and a sentence.
  • Output: Binary classification (entailment or not entailment).

8. RTE (Recognizing Textual Entailment)

  • Task: Textual entailment.
  • Goal: Determine if a premise entails a hypothesis.
  • Input: A premise and a hypothesis.
  • Output: Binary classification (entailment or not).

9. WNLI (Winograd Natural Language Inference)

  • Task: Coreference resolution.
  • Goal: Determine whether a pronoun in a sentence refers to a given noun.
  • Input: A sentence with a pronoun and a candidate antecedent.
  • Output: Binary classification (correct or incorrect coreference).

Summary of Subtasks:

  • CoLA: Grammaticality judgment (binary classification).
  • SST-2: Sentiment analysis (binary classification).
  • MRPC: Paraphrase detection (binary classification).
  • STS-B: Sentence similarity (regression score).
  • QQP: Question pair similarity (binary classification).
  • MNLI: Textual entailment (three-way classification).
    • Matched: In-domain data.
    • Mismatched: Out-of-domain data.
  • QNLI: Question answering entailment (binary classification).
  • RTE: Textual entailment (binary classification).
  • WNLI: Coreference resolution (binary classification).

How to Use

... don't use this, it's just a test ...
