import streamlit as st
from streamlit_extras.switch_page_button import switch_page


translations = {
    'en': {
        'title': 'LLaVA-NeXT',
        'original_tweet':
            """
            [Original tweet](https://twitter.com/mervenoyann/status/1770832875551682563) (March 21, 2024)
            """,
        'tweet_1':
            """
            LLaVA-NeXT was recently merged into 🤗 Transformers and it outperforms many proprietary models like Gemini on various benchmarks! 🤩
            For those who don't know LLaVA, it's a language model that can also take images as input 💬
            Let's take a look at how it works, with a demo and more below.
            """,
        'tweet_2':
            """
            LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨
            LLaVA 1.5 was released with Vicuna, but LLaVA-NeXT (1.6) was released with four different LLMs:
            - Nous-Hermes-Yi-34B
            - Mistral-7B
            - Vicuna 7B & 13B
            """,
        'tweet_3':
            """
            Thanks to the 🤗 Transformers integration, it is very easy to use LLaVA-NeXT, not only standalone but also with 4-bit loading and Flash Attention 2 💜
            See below for standalone usage 👇
            """,
        'tweet_4':
            """
            To fit large models and make inference faster and more memory efficient, you can enable Flash Attention 2 and load the model in 4-bit using bitsandbytes ⚡️ Transformers makes it very easy to do this! See below 👇
            """,
        'tweet_5':
            """
            If you want to try the code right away, here's the [notebook](https://t.co/NvoxvY9z1u).
            Lastly, you can directly play with the Mistral-7B-based LLaVA-NeXT through the demo [here](https://t.co/JTDlqMUwEh) 🤗
            """,
        'ressources':
            """
            Resources:
            [LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
            by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024)
            [GitHub](https://github.com/haotian-liu/LLaVA/tree/main)
            [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/llava_next)
            """
    },
    'fr': {
        'title': 'LLaVA-NeXT',
        'original_tweet':
            """
            [Tweet de base](https://twitter.com/mervenoyann/status/1770832875551682563) (en anglais) (21 mars 2024)
            """,
        'tweet_1':
            """
            LLaVA-NeXT a récemment été intégré à 🤗 Transformers et surpasse de nombreux modèles propriétaires comme Gemini sur différents benchmarks ! 🤩
            Pour ceux qui ne connaissent pas LLaVA, il s'agit d'un modèle de langage qui peut aussi prendre des images en entrée 💬
            """,
        'tweet_2':
            """
            LLaVA est essentiellement un modèle langage/vision qui se compose d'un encodeur CLIP basé sur ViT, d'une projection MLP et de Vicuna en tant que décodeur ✨
            LLaVA 1.5 a été publié avec Vicuna, mais LLaVA-NeXT (1.6) a été publié avec quatre LLM différents :
            - Nous-Hermes-Yi-34B
            - Mistral-7B
            - Vicuna 7B & 13B
            """,
        'tweet_3':
            """
            Grâce à l'intégration dans 🤗 Transformers, il est très facile d'utiliser LLaVA-NeXT, non seulement en mode autonome mais aussi avec un chargement 4 bits et Flash Attention 2 💜
            Voir ci-dessous pour l'utilisation autonome 👇
            """,
        'tweet_4':
            """
            Pour faire tenir de grands modèles en mémoire et rendre l'inférence encore plus rapide et plus économe en mémoire, vous pouvez activer Flash Attention 2 et charger le modèle en 4 bits à l'aide de bitsandbytes ⚡️ Transformers rend cela très facile ! Voir ci-dessous 👇
            """,
        'tweet_5':
            """
            Si vous voulez essayer le code tout de suite, voici le [notebook](https://t.co/NvoxvY9z1u).
            Enfin, vous pouvez directement jouer avec le LLaVA-NeXT reposant sur Mistral-7B grâce à cette [démo](https://t.co/JTDlqMUwEh) 🤗
            """,
        'ressources':
            """
            Ressources :
            [LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
            de Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024)
            [GitHub](https://github.com/haotian-liu/LLaVA/tree/main)
            [Documentation d'Hugging Face](https://huggingface.co/docs/transformers/model_doc/llava_next)
            """
    }
}

def language_selector():
    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
    selected_lang = st.selectbox('', options=list(languages.keys()), format_func=lambda x: languages[x], key='lang_selector')
    return 'en' if selected_lang == 'EN' else 'fr'


left_column, right_column = st.columns([5, 1])

# Add a selector to the right column
with right_column:
    lang = language_selector()

# Add a title to the left column
with left_column:
    st.title(translations[lang]["title"])

st.success(translations[lang]["original_tweet"], icon="ℹ️")
st.markdown(""" """)
st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/LLaVA-NeXT/image_1.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/LLaVA-NeXT/image_2.jpeg", use_column_width=True)
st.markdown(""" """)
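# Added example (not in the original thread): a minimal sketch of how the three pieces described
# above map onto the 🤗 Transformers model object. The attribute names (vision_tower,
# multi_modal_projector, language_model) are assumptions based on LlavaNextForConditionalGeneration
# and may differ across transformers versions.
with st.expander("Code (architecture sketch)"):
    st.code("""
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

print(model.vision_tower)           # ViT-based CLIP vision encoder
print(model.multi_modal_projector)  # MLP projecting image features into the LLM embedding space
print(model.language_model)         # LLM decoder (Mistral-7B for this checkpoint)
""")
st.markdown(""" """)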
st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/LLaVA-NeXT/image_3.jpeg", use_column_width=True)
st.markdown(""" """)
with st.expander("Code"):
    st.code("""
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests
import torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")

# example image and prompt (the prompt format is checkpoint-specific; this is the Mistral instruction format)
image = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
prompt = "[INST] <image>\\nWhat is shown in this image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
""")
st.markdown(""" """)
st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/LLaVA-NeXT/image_4.jpeg", use_column_width=True)
st.markdown(""" """)
with st.expander("Code"):
    st.code("""
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig
import torch

# 4-bit quantization with bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16)
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", quantization_config=quantization_config, device_map="auto")

# Flash Attention 2
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True, use_flash_attention_2=True).to(0)
""")
st.markdown(""" """)
st.markdown(translations[lang]["tweet_5"], unsafe_allow_html=True)
st.markdown(""" """)
st.video("pages/LLaVA-NeXT/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.info(translations[lang]["ressources"], icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)

col1, col2, col3 = st.columns(3)
with col1:
    if lang == "en":
        if st.button('Previous paper', use_container_width=True):
            switch_page("UDOP")
    else:
        if st.button('Papier précédent', use_container_width=True):
            switch_page("UDOP")
with col2:
    if lang == "en":
        if st.button("Home", use_container_width=True):
            switch_page("Home")
    else:
        if st.button("Accueil", use_container_width=True):
            switch_page("Home")
with col3:
    if lang == "en":
        if st.button("Next paper", use_container_width=True):
            switch_page("Painter")
    else:
        if st.button("Papier suivant", use_container_width=True):
            switch_page("Painter")