Initial dolly-v2-7b Olive Optimized

Files changed (11)

README.md +173 -0
_gpt_neox_layers.0_attention_rotary_emb_Constant_5_attr__value +0 -0
_gpt_neox_layers.0_attention_rotary_emb_Constant_attr__value +0 -0
config.json +33 -0
decoder_model_merged.onnx +3 -0
decoder_model_merged.onnx_data +3 -0
generation_config.json +6 -0
instruct_pipeline.py +208 -0
special_tokens_map.json +11 -0
tokenizer.json +0 -0
tokenizer_config.json +9 -0

README.md CHANGED

@@ -1,3 +1,176 @@
 ---
 license: mit
+language:
+- en
+library_name: transformers
+inference: false
+datasets:
+- databricks/databricks-dolly-15k
 ---
+# dolly-v2-7b Model Card
+## Summary
+Databricks’ `dolly-v2-7b` is an instruction-following large language model trained on the Databricks machine learning platform
+that is licensed for commercial use. Based on `pythia-6.9b`, Dolly is trained on ~15k instruction/response fine-tuning records
+[`databricks-dolly-15k`](https://github.com/databrickslabs/dolly/tree/master/data) generated
+by Databricks employees in capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation,
+information extraction, open QA and summarization. `dolly-v2-7b` is not a state-of-the-art model, but does exhibit surprisingly
+high-quality instruction-following behavior not characteristic of the foundation model on which it is based.
+Dolly v2 is also available in these other model sizes:
+* [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b), a 12 billion parameter model based on `pythia-12b`
+* [dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b), a 2.8 billion parameter model based on `pythia-2.8b`
+Please refer to the [dolly GitHub repo](https://github.com/databrickslabs/dolly#getting-started-with-response-generation) for tips on
+running inference for various GPU configurations.
+**Owner**: Databricks, Inc.
+## Model Overview
+`dolly-v2-7b` is a 6.9 billion parameter causal language model created by [Databricks](https://databricks.com/) that is derived from
+[EleutherAI’s](https://www.eleuther.ai/) [Pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b) and fine-tuned
+on a [~15K record instruction corpus](https://github.com/databrickslabs/dolly/tree/master/data) generated by Databricks employees and released under a permissive license (CC-BY-SA).
+## Usage
+To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers` and `accelerate` libraries installed.
+In a Databricks notebook you could run:
+```python
+%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
+```
+The instruction following pipeline can be loaded using the `pipeline` function as shown below.  This loads a custom `InstructionTextGenerationPipeline`
+found in the model repo [here](https://huggingface.co/databricks/dolly-v2-3b/blob/main/instruct_pipeline.py), which is why `trust_remote_code=True` is required.
+Including `torch_dtype=torch.bfloat16` is generally recommended if this type is supported in order to reduce memory usage.  It does not appear to impact output quality.
+It is also fine to remove it if there is sufficient memory.
+```python
+import torch
+from transformers import pipeline
+generate_text = pipeline(model="databricks/dolly-v2-7b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
+```
+You can then use the pipeline to answer instructions:
+```python
+res = generate_text("Explain to me the difference between nuclear fission and fusion.")
+print(res[0]["generated_text"])
+```
+Alternatively, if you prefer not to use `trust_remote_code=True`, you can download [instruct_pipeline.py](https://huggingface.co/databricks/dolly-v2-3b/blob/main/instruct_pipeline.py),
+store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:
+```python
+import torch
+from instruct_pipeline import InstructionTextGenerationPipeline
+from transformers import AutoModelForCausalLM, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
+model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.bfloat16)
+generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
+```
+### LangChain Usage
+To use the pipeline with LangChain, you must set `return_full_text=True`, as LangChain expects the full text to be returned
+and the default for the pipeline is to only return the new text.
+```python
+import torch
+from transformers import pipeline
+generate_text = pipeline(model="databricks/dolly-v2-7b", torch_dtype=torch.bfloat16,
+                         trust_remote_code=True, device_map="auto", return_full_text=True)
+```
+You can create a prompt that either has only an instruction or has an instruction with context:
+```python
+from langchain import PromptTemplate, LLMChain
+from langchain.llms import HuggingFacePipeline
+# template for an instruction with no input
+prompt = PromptTemplate(
+    input_variables=["instruction"],
+    template="{instruction}")
+# template for an instruction with input
+prompt_with_context = PromptTemplate(
+    input_variables=["instruction", "context"],
+    template="{instruction}\n\nInput:\n{context}")
+hf_pipeline = HuggingFacePipeline(pipeline=generate_text)
+llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
+llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
+```
+Example predicting using a simple instruction:
+```python
+print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())
+```
+Example predicting using an instruction with context:
+```python
+context = """George Washington (February 22, 1732[b] – December 14, 1799) was an American military officer, statesman,
+and Founding Father who served as the first president of the United States from 1789 to 1797."""
+print(llm_context_chain.predict(instruction="When was George Washington president?", context=context).lstrip())
+```
+## Known Limitations
+### Performance Limitations
+**`dolly-v2-7b` is not a state-of-the-art generative language model** and, though quantitative benchmarking is ongoing, is not designed to perform
+competitively with more modern model architectures or models subject to larger pretraining corpuses.
+The Dolly model family is under active development, and so any list of shortcomings is unlikely to be exhaustive, but we include known limitations and misfires here as a means to document and share our preliminary findings with the community.
+In particular, `dolly-v2-7b` struggles with: syntactically complex prompts, programming problems, mathematical operations, factual errors,
+dates and times, open-ended question answering, hallucination, enumerating lists of specific length, stylistic mimicry, having a sense of humor, etc.
+Moreover, we find that `dolly-v2-7b` does not have some capabilities, such as well-formatted letter writing, present in the original model.
+### Dataset Limitations
+Like all language models, `dolly-v2-7b` reflects the content and limitations of its training corpuses.
+- **The Pile**: Pythia’s pre-training corpus contains content mostly collected from the public internet, and like most web-scale datasets,
+it contains content many users would find objectionable. As such, the model is likely to reflect these shortcomings, potentially overtly
+in the case it is explicitly asked to produce objectionable content, and sometimes subtly, as in the case of biased or harmful implicit
+associations.
+- **`databricks-dolly-15k`**: The training data on which `dolly-v2-7b` is instruction tuned represents natural language instructions generated
+by Databricks employees during a period spanning March and April 2023 and includes passages from Wikipedia as reference passages
+for instruction categories like closed QA and summarization. To our knowledge it does not contain obscenity, intellectual property or
+personally identifying information about non-public figures, but it may contain typos and factual errors.
+The dataset may also reflect biases found in Wikipedia. Finally, the dataset likely reflects
+the interests and semantic choices of Databricks employees, a demographic which is not representative of the global population at large.
+Databricks is committed to ongoing research and development efforts to develop helpful, honest and harmless AI technologies that
+maximize the potential of all individuals and organizations.
+### Benchmark Metrics
+Below you'll find various models' benchmark performance on the [EleutherAI LLM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness);
+model results are sorted by geometric mean to produce an intelligible ordering. As outlined above, these results demonstrate that `dolly-v2-7b` is not state of the art,
+and in fact underperforms `dolly-v1-6b` in some evaluation benchmarks. We believe this owes to the composition and size of the underlying fine tuning datasets,
+but a robust statement as to the sources of these variations requires further study.
+|  model                            |   openbookqa |   arc_easy |   winogrande |   hellaswag |   arc_challenge |     piqa |    boolq |    gmean |
+| --------------------------------- | ------------ | ---------- | ------------ | ----------- | --------------- | -------- | -------- | ---------|
+| EleutherAI/pythia-2.8b            |        0.348 |   0.585859 |     0.589582 |    0.591217 |        0.323379 | 0.73395  | 0.638226 | 0.523431 |
+| EleutherAI/pythia-6.9b            |        0.368 |   0.604798 |     0.608524 |    0.631548 |        0.343857 | 0.761153 | 0.6263   | 0.543567 |
+| databricks/dolly-v2-3b            |        0.384 |   0.611532 |     0.589582 |    0.650767 |        0.370307 | 0.742655 | 0.575535 | 0.544886 |
+| EleutherAI/pythia-12b             |        0.364 |   0.627104 |     0.636148 |    0.668094 |        0.346416 | 0.760065 | 0.673394 | 0.559676 |
+| EleutherAI/gpt-j-6B               |        0.382 |   0.621633 |     0.651144 |    0.662617 |        0.363481 | 0.761153 | 0.655963 | 0.565936 |
+| databricks/dolly-v2-12b           |        0.408 |   0.63931  |     0.616417 |    0.707927 |        0.388225 | 0.757889 | 0.568196 | 0.56781  |
+| databricks/dolly-v2-7b            |        0.392 |   0.633838 |     0.607735 |    0.686517 |        0.406997 | 0.750816 | 0.644037 | 0.573487 |
+| databricks/dolly-v1-6b            |        0.41  |   0.62963  |     0.643252 |    0.676758 |        0.384812 | 0.773667 | 0.687768 | 0.583431 |
+| EleutherAI/gpt-neox-20b           |        0.402 |   0.683923 |     0.656669 |    0.7142   |        0.408703 | 0.784004 | 0.695413 | 0.602236 |
+# Happy Hacking!
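
The `gmean` column in the benchmark table above can be reproduced from the seven task scores in each row. The following is a minimal sketch, assuming `gmean` is the plain geometric mean of the listed accuracies (the README names the sort key but does not spell out the computation). The helper name `geometric_mean` is ours and is not part of the repository.

```python
import math


def geometric_mean(scores):
    """Return the geometric mean of a sequence of positive numbers."""
    # Averaging the logs is equivalent to taking the n-th root of the product.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Scores for databricks/dolly-v2-7b, copied from the table in column order:
# openbookqa, arc_easy, winogrande, hellaswag, arc_challenge, piqa, boolq
dolly_v2_7b = [0.392, 0.633838, 0.607735, 0.686517, 0.406997, 0.750816, 0.644037]

gmean = geometric_mean(dolly_v2_7b)
print(f"{gmean:.6f}")  # compare against the 0.573487 reported in the table
```

Because the geometric mean multiplies the scores together, a weak result on one task (for example `boolq` for `dolly-v2-12b`) lowers the ranking more than it would under an arithmetic mean.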

_gpt_neox_layers.0_attention_rotary_emb_Constant_5_attr__value ADDED

Binary file (262 kB).

_gpt_neox_layers.0_attention_rotary_emb_Constant_attr__value ADDED

Binary file (262 kB).

config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "_name_or_path": "databricks/dolly-v2-7b",
+  "architectures": [
+    "GPTNeoXForCausalLM"
+  ],
+  "bos_token_id": 0,
+  "classifier_dropout": 0.1,
+  "custom_pipelines": {
+    "text-generation": {
+      "impl": "instruct_pipeline.InstructionTextGenerationPipeline",
+      "pt": "AutoModelForCausalLM",
+      "tf": "TFAutoModelForCausalLM"
+    }
+  },
+  "eos_token_id": 0,
+  "hidden_act": "gelu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 16384,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 2048,
+  "model_type": "gpt_neox",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "rotary_emb_base": 10000,
+  "rotary_pct": 0.25,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.29.0",
+  "use_cache": true,
+  "use_parallel_residual": true,
+  "vocab_size": 50280
+}
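
A few derived quantities follow directly from the `config.json` fields above. The sketch below computes the per-head dimension, the number of rotary dimensions, and a rough parameter count. The per-layer layout (fused QKV projection, output projection, two-layer MLP, two LayerNorms, all with biases, and an output embedding without bias) is an assumption about the GPT-NeoX architecture rather than something stated in this commit, so treat the total as an estimate.

```python
# Values copied from config.json above.
config = {
    "hidden_size": 4096,
    "intermediate_size": 16384,
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "rotary_pct": 0.25,
    "vocab_size": 50280,
}

h = config["hidden_size"]
ffn = config["intermediate_size"]

# Each attention head works on an equal slice of the hidden state.
head_dim = h // config["num_attention_heads"]
# Only a fraction of each head's dimensions receive rotary position embeddings.
rotary_dims = int(head_dim * config["rotary_pct"])

# Assumed per-layer layout (weights + biases).
attention = (h * 3 * h + 3 * h) + (h * h + h)   # fused QKV + output projection
mlp = (h * ffn + ffn) + (ffn * h + h)           # up projection + down projection
layer_norms = 2 * (2 * h)                       # two LayerNorms, weight and bias each
per_layer = attention + mlp + layer_norms

# Input and output embeddings are counted separately because
# tie_word_embeddings is false in the config.
embeddings = 2 * config["vocab_size"] * h
final_layer_norm = 2 * h

n_params = config["num_hidden_layers"] * per_layer + embeddings + final_layer_norm

print(f"head_dim={head_dim}, rotary_dims={rotary_dims}")
print(f"approx. parameters: {n_params / 1e9:.2f}B")
print(f"approx. size at 2 bytes/param: {n_params * 2 / 2**30:.2f} GiB")
```

Under these assumptions the estimate lands near the 6.9 billion parameters quoted in the README, and the 2-bytes-per-parameter size is in the same range as the `decoder_model_merged.onnx_data` file size listed in this commit.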

decoder_model_merged.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0ea722b6e7020eae65c844c168d5a97279d3c0d00fccce5bdcfa65688f6e96d6
+size 4169900

decoder_model_merged.onnx_data ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5b6b1e949e90e91c080f89c3aed058c04b8eb842684d9c8502b0b455ac36ef8c
+size 13716054016
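
The two `.onnx` entries in this commit are Git LFS pointer files, not the model weights themselves: each records a spec version, a content hash, and the size of the real object in bytes. A minimal sketch of reading such a pointer, assuming the three-line `key value` layout shown above; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from decoder_model_merged.onnx_data above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5b6b1e949e90e91c080f89c3aed058c04b8eb842684d9c8502b0b455ac36ef8c
size 13716054016
"""

info = parse_lfs_pointer(pointer)
size_gib = info["size_bytes"] / 2**30
print(f"{info['hash_algorithm']} {info['digest'][:12]}... {size_gib:.2f} GiB")
```

The size field is useful for checking free disk space before fetching the external weight file.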

generation_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 0,
+  "eos_token_id": 0,
+  "transformers_version": "4.29.0"
+}

instruct_pipeline.py ADDED

	@@ -0,0 +1,208 @@

+import logging
+import re
+from typing import List
+import numpy as np
+from transformers import Pipeline, PreTrainedTokenizer
+from transformers.utils import is_tf_available
+from transformers import TextStreamer
+if is_tf_available():
+    import tensorflow as tf
+logger = logging.getLogger(__name__)
+INSTRUCTION_KEY = "### Instruction:"
+RESPONSE_KEY = "### Response:"
+END_KEY = "### End"
+INTRO_BLURB = (
+    "Below is an instruction that describes a task. Write a response that appropriately completes the request."
+)
+# This is the prompt that is used for generating responses using an already trained model.  It ends with the response
+# key, where the job of the model is to provide the completion that follows it (i.e. the response itself).
+PROMPT_FOR_GENERATION_FORMAT = """{intro}
+{instruction_key}
+{instruction}
+{response_key}
+""".format(
+    intro=INTRO_BLURB,
+    instruction_key=INSTRUCTION_KEY,
+    instruction="{instruction}",
+    response_key=RESPONSE_KEY,
+)
+def get_special_token_id(tokenizer: PreTrainedTokenizer, key: str) -> int:
+    """Gets the token ID for a given string that has been added to the tokenizer as a special token.
+    When training, we configure the tokenizer so that the sequences like "### Instruction:" and "### End" are
+    treated specially and converted to a single, new token.  This retrieves the token ID each of these keys map to.
+    Args:
+        tokenizer (PreTrainedTokenizer): the tokenizer
+        key (str): the key to convert to a single token
+    Raises:
+        ValueError: if more than one ID was generated
+    Returns:
+        int: the token ID for the given key
+    """
+    token_ids = tokenizer.encode(key)
+    if len(token_ids) > 1:
+        raise ValueError(f"Expected only a single token for '{key}' but found {token_ids}")
+    return token_ids[0]
+class InstructionTextGenerationPipeline(Pipeline):
+    def __init__(
+        self, *args, do_sample: bool = True, max_new_tokens: int = 256, streamer: TextStreamer = None,
+        top_p: float = 0.92, top_k: int = 0, **kwargs
+    ):
+        """Initialize the pipeline
+        Args:
+            do_sample (bool, optional): Whether or not to use sampling. Defaults to True.
+            max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 256.
+            streamer (TextStreamer, optional): Streamer passed to `generate` so tokens can be emitted as they are
+                produced. Defaults to None (no streaming).
+            top_p (float, optional): If set to float < 1, only the smallest set of most probable tokens with
+                probabilities that add up to top_p or higher are kept for generation. Defaults to 0.92.
+            top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k-filtering.
+                Defaults to 0.
+        """
+        super().__init__(*args, do_sample=do_sample, max_new_tokens=max_new_tokens, top_p=top_p, top_k=top_k,
+                         **kwargs)
+        self.streamer = streamer
+    def _sanitize_parameters(self,
+                             return_full_text: bool = None,
+                             **generate_kwargs):
+        preprocess_params = {}
+        # newer versions of the tokenizer configure the response key as a special token.  newer versions still may
+        # append a newline to yield a single token.  find whatever token is configured for the response key.
+        tokenizer_response_key = next(
+            (token for token in self.tokenizer.additional_special_tokens if token.startswith(RESPONSE_KEY)), None
+        )
+        response_key_token_id = None
+        end_key_token_id = None
+        if tokenizer_response_key:
+            try:
+                response_key_token_id = get_special_token_id(self.tokenizer, tokenizer_response_key)
+                end_key_token_id = get_special_token_id(self.tokenizer, END_KEY)
+                # Ensure generation stops once it generates "### End"
+                generate_kwargs["eos_token_id"] = end_key_token_id
+            except ValueError:
+                pass
+        forward_params = generate_kwargs
+        postprocess_params = {
+            "response_key_token_id": response_key_token_id,
+            "end_key_token_id": end_key_token_id
+        }
+        if return_full_text is not None:
+            postprocess_params["return_full_text"] = return_full_text
+        return preprocess_params, forward_params, postprocess_params
+    def preprocess(self, instruction_text, **generate_kwargs):
+        prompt_text = PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction_text)
+        inputs = self.tokenizer(
+            prompt_text,
+            return_tensors="pt",
+        )
+        inputs["prompt_text"] = prompt_text
+        inputs["instruction_text"] = instruction_text
+        return inputs
+    def _forward(self, model_inputs, **generate_kwargs):
+        input_ids = model_inputs["input_ids"]
+        attention_mask = model_inputs.get("attention_mask", None)
+        if input_ids.shape[1] == 0:
+            input_ids = None
+            attention_mask = None
+            in_b = 1
+        else:
+            in_b = input_ids.shape[0]
+        generated_sequence = self.model.generate(
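+            # input_ids is set to None above for an empty prompt, so guard the device move.
+            input_ids=input_ids.to(self.model.device) if input_ids is not None else None,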
+            attention_mask=attention_mask.to(self.model.device) if attention_mask is not None else None,
+            pad_token_id=self.tokenizer.pad_token_id,
+            streamer=self.streamer,
+            **generate_kwargs,
+        )
+        out_b = generated_sequence.shape[0]
+        if self.framework == "pt":
+            generated_sequence = generated_sequence.reshape(in_b, out_b // in_b, *generated_sequence.shape[1:])
+        elif self.framework == "tf":
+            generated_sequence = tf.reshape(generated_sequence, (in_b, out_b // in_b, *generated_sequence.shape[1:]))
+        instruction_text = model_inputs.pop("instruction_text")
+        return {"generated_sequence": generated_sequence, "input_ids": input_ids, "instruction_text": instruction_text}
+    def postprocess(self, model_outputs, response_key_token_id, end_key_token_id, return_full_text: bool = False):
+        generated_sequence = model_outputs["generated_sequence"][0]
+        instruction_text = model_outputs["instruction_text"]
+        generated_sequence: List[List[int]] = generated_sequence.numpy().tolist()
+        records = []
+        for sequence in generated_sequence:
+            # The response will be set to this variable if we can identify it.
+            decoded = None
+            # If we have token IDs for the response and end, then we can find the tokens and only decode between them.
+            if response_key_token_id and end_key_token_id:
+                # Find where "### Response:" is first found in the generated tokens.  Considering this is part of the
+                # prompt, we should definitely find it.  We will return the tokens found after this token.
+                try:
+                    response_pos = sequence.index(response_key_token_id)
+                except ValueError:
+                    logger.warning(f"Could not find response key {response_key_token_id} in: {sequence}")
+                    response_pos = None
+                if response_pos is not None:
+                    # Next find where "### End" is located.  The model has been trained to end its responses with this
+                    # sequence (or actually, the token ID it maps to, since it is a special token).  We may not find
+                    # this token, as the response could be truncated.  If we don't find it then just return everything
+                    # to the end.  Note that even though we set eos_token_id, we still see this token at the end.
+                    try:
+                        end_pos = sequence.index(end_key_token_id)
+                    except ValueError:
+                        end_pos = None
+                    decoded = self.tokenizer.decode(sequence[response_pos + 1 : end_pos]).strip()
+            if not decoded:
+                # Otherwise we'll decode everything and use a regex to find the response and end.
+                fully_decoded = self.tokenizer.decode(sequence)
+                # The response appears after "### Response:".  The model has been trained to append "### End" at the
+                # end.
+                m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", fully_decoded, flags=re.DOTALL)
+                if m:
+                    decoded = m.group(1).strip()
+                else:
+                    # The model might not generate the "### End" sequence before reaching the max tokens.  In this case,
+                    # return everything after "### Response:".
+                    m = re.search(r"#+\s*Response:\s*(.+)", fully_decoded, flags=re.DOTALL)
+                    if m:
+                        decoded = m.group(1).strip()
+                    else:
+                        logger.warning(f"Failed to find response in:\n{fully_decoded}")
+            # If the full text is requested, then append the decoded text to the original instruction.
+            # This technically isn't the full text, as we format the instruction in the prompt the model has been
+            # trained on, but to the client it will appear to be the full text.
+            if return_full_text:
+                decoded = f"{instruction_text}\n{decoded}"
+            rec = {"generated_text": decoded}
+            records.append(rec)
+        return records
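
The prompt construction in `preprocess` and the regex fallback in `postprocess` can be exercised without loading the model. The sketch below reuses the key strings and the two regular expressions from `instruct_pipeline.py` as shown in the diff above; the function names `build_prompt` and `extract_response` are ours, and the "model output" is a hand-written string, not real generation.

```python
import re

INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_prompt(instruction):
    """Wrap an instruction in the prompt layout shown in the diff above."""
    return f"{INTRO_BLURB}\n{INSTRUCTION_KEY}\n{instruction}\n{RESPONSE_KEY}\n"


def extract_response(fully_decoded):
    """Return the text after the response key, up to the end key if present."""
    # First try to match a complete response terminated by "### End".
    m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", fully_decoded, flags=re.DOTALL)
    if m is None:
        # Generation may stop at max_new_tokens before the end key appears,
        # so fall back to everything after the response key.
        m = re.search(r"#+\s*Response:\s*(.+)", fully_decoded, flags=re.DOTALL)
    return m.group(1).strip() if m else None


# Simulated decoder output: the prompt followed by a hand-written completion.
complete = build_prompt("Name a primary color.") + "Red.\n\n" + END_KEY
truncated = build_prompt("Name a primary color.") + "Red is a primary"

print(extract_response(complete))
print(extract_response(truncated))
```

The pipeline prefers the token-ID path when the tokenizer exposes the special tokens; the regex path sketched here only runs when that path yields nothing.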

special_tokens_map.json ADDED

	@@ -0,0 +1,11 @@

+{
+  "additional_special_tokens": [
+    "### End",
+    "### Instruction:",
+    "### Response:"
+  ],
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "pad_token": "<|endoftext|>",
+  "unk_token": "<|endoftext|>"
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "add_prefix_space": false,
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 1000000000000000019884624838656,
+  "tokenizer_class": "GPTNeoXTokenizer",
+  "unk_token": "<|endoftext|>"
+}
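
The `model_max_length` value above looks arbitrary but is exactly what Python produces when the float `1e30` is converted to an integer. It appears to act as a "no limit" placeholder rather than a real context length, which is our reading and not something this commit states. A caller who wants truncation would likely need to fall back to `max_position_embeddings` from `config.json`. A small sketch; `effective_max_length` is a hypothetical helper.

```python
# Value copied from tokenizer_config.json above.
model_max_length = 1000000000000000019884624838656
# Value copied from config.json earlier in this commit.
max_position_embeddings = 2048

# 1e30 is not exactly representable as a double; converting the nearest
# representable value to int exposes the trailing digits seen in the file.
is_float_sentinel = model_max_length == int(1e30)


def effective_max_length(tokenizer_limit, model_limit):
    """Hypothetical helper: use the tighter of the tokenizer and model limits."""
    return min(tokenizer_limit, model_limit)


print(is_float_sentinel)
print(effective_max_length(model_max_length, max_position_embeddings))
```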