Add code snippets, metadata tags (#3)
- Add code snippets, metadata tags (2428a783737ceece641c973c90ccbcfcdd1c822a)
- Update README.md (609c6f20d6f10b2f7f7f8e967b3d4e13047ae748)
Co-authored-by: Niels Rogge <[email protected]>
    	
README.md CHANGED

@@ -1,6 +1,8 @@
@@ -13,6 +15,53 @@ Kosmos-2.5-chat is a model specifically trained for Visual Question Answering (V
 ---
 language: en
 license: mit
+library_name: transformers
+pipeline_tag: image-text-to-text
 ---
 # Kosmos-2.5-chat

…

 [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419)

+## Usage
+
+KOSMOS-2.5 is supported in Transformers >= 4.56. Find the docs [here](https://huggingface.co/docs/transformers/main/en/model_doc/kosmos2_5).
+
+```python
+import torch
+import requests
+from PIL import Image
+from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
+
+repo = "microsoft/kosmos-2.5-chat"
+device = "cuda:0"
+dtype = torch.bfloat16
+
+model = Kosmos2_5ForConditionalGeneration.from_pretrained(
+    repo,
+    device_map=device,
+    torch_dtype=dtype,
+    attn_implementation="flash_attention_2",
+)
+processor = AutoProcessor.from_pretrained(repo)
+
+# sample image
+url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
+image = Image.open(requests.get(url, stream=True).raw)
+
+# build the chat-style prompt the model expects
+question = "What is the sub total of the receipt?"
+template = "<md>A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"
+prompt = template.format(question)
+inputs = processor(text=prompt, images=image, return_tensors="pt")
+
+# scale factors between the processed image and the original image
+height, width = inputs.pop("height"), inputs.pop("width")
+raw_width, raw_height = image.size
+scale_height = raw_height / height
+scale_width = raw_width / width
+
+inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
+inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
+generated_ids = model.generate(
+    **inputs,
+    max_new_tokens=1024,
+)
+
+generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
+print(generated_text[0])
+```
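The usage snippet computes `scale_height` and `scale_width` but never uses them; in the base KOSMOS-2.5 examples, factors like these map bounding boxes generated in the processor's resized-image space back to the original image. A minimal sketch of that rescaling, assuming that intent (`rescale_box` and all numbers are hypothetical, not part of the library):

```python
# Hypothetical helper: map a box from the processor's resized-image
# coordinate space back to the original image's pixel coordinates.
def rescale_box(box, scale_width, scale_height):
    """Scale an (x0, y0, x1, y1) box from processed to original coordinates."""
    x0, y0, x1, y1 = box
    return (x0 * scale_width, y0 * scale_height,
            x1 * scale_width, y1 * scale_height)

# Made-up example: a 1000x2000 image resized by the processor to 500x1000,
# so both scale factors come out to 2.0.
scale_width, scale_height = 1000 / 500, 2000 / 1000
print(rescale_box((10, 20, 30, 40), scale_width, scale_height))
# → (20.0, 40.0, 60.0, 80.0)
```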

 ## NOTE:
 Since this is a generative model, there is a risk of **hallucination** during the generation process, and the model **cannot** guarantee the accuracy of all results in the images.
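Because the prompt template ends with `ASSISTANT:`, the decoded sequence contains the echoed prompt followed by the model's reply. A minimal, model-free sketch of pulling out just the answer (`extract_answer` and the sample string are hypothetical, for illustration only):

```python
# Model-free sketch: the decoded text echoes the prompt, so the reply is
# whatever follows the final "ASSISTANT:" marker. A real run would pass the
# output of processor.batch_decode here instead of this made-up string.
def extract_answer(decoded: str) -> str:
    """Return the text after the last 'ASSISTANT:' marker, stripped."""
    return decoded.rsplit("ASSISTANT:", 1)[-1].strip()

decoded = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: What is the sub total of the receipt? "
    "ASSISTANT: The sub total is $256.00."
)
print(extract_answer(decoded))  # The sub total is $256.00.
```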

