Commit ab03e32 · 1 Parent(s): 74ab428
committed by im

init

Files changed:
- .gitignore +165 -0
- .streamlit/config.toml +3 -0
- app.py +246 -0
- requirements.txt +3 -0
.gitignore
ADDED
@@ -0,0 +1,165 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# flask
flask_session
*.log
datasets/

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
.streamlit/config.toml
ADDED
@@ -0,0 +1,3 @@
[theme]
base="dark"
font="sans serif"
app.py
ADDED
@@ -0,0 +1,246 @@
import streamlit as st

# TODO: move to 'utils'
mystyle = '''
<style>
    p {
        text-align: justify;
    }
</style>
'''
st.markdown(mystyle, unsafe_allow_html=True)


def divider():
    _, c, _ = st.columns(3)
    c.divider()


st.title("Transformers: Tokenisers and Embeddings")

preface_image, preface_text = st.columns(2)
# preface_image.image("https://static.streamlit.io/examples/dice.jpg")
# preface_image.image("""https://assets.digitalocean.com/articles/alligator/boo.svg""")
preface_text.write("""*Transformers represent a revolutionary class of machine learning architectures that have sparked
immense interest. While numerous insightful tutorials are available, the evolution of transformer architectures over
the last few years has led to significant simplifications. These advancements have made it increasingly
straightforward to understand their inner workings. In this series of articles, I aim to provide a direct, clear explanation of
how and why modern transformers function, unburdened by the historical complexities associated with their inception.*
""")

divider()

st.write("""In order to understand the recent success in AI, we need to understand the Transformer architecture. Its
rise in the field of Natural Language Processing (NLP) is largely attributed to a combination of several key
advancements:

- Tokenisers and Embeddings
- Attention and Self-Attention
- Encoder-Decoder architecture

Understanding these foundational concepts is crucial to comprehending the overall structure and function of the
Transformer model. They are the building blocks from which the rest of the model is constructed, and their roles
within the architecture are essential to the model's ability to process and generate language.

Given the importance and complexity of these concepts, I have chosen to dedicate the first article in this series
solely to Tokenisation and Embeddings. The decision to separate the topics into individual articles is driven by a
desire to provide a thorough and in-depth understanding of each component of the Transformer model.
""")

with st.expander("Copernicus Museum in Warsaw"):
    st.write("""
Have you ever visited the Copernicus Museum in Warsaw? It's an engaging interactive hub that allows
you to familiarise yourself with various scientific topics. The experience is both entertaining and educational,
providing the opportunity to explore different concepts firsthand. **They even feature a small neural network that
illustrates the neuron activation process during the recognition of handwritten digits!**

Taking inspiration from this approach, we'll embark on our journey into the world of Transformer models by first
establishing a firm understanding of Tokenisation and embeddings. This foundation will equip us with the knowledge
needed to delve into the more complex aspects of these models later on.

I encourage you not to hesitate in modifying parameters or experimenting with different models in the provided
examples. This hands-on exploration can significantly enhance your learning experience. So, let's begin our journey
through this virtual, interactive museum of AI. Enjoy the exploration!
""")
    st.image("https://i.pinimg.com/originals/04/11/2c/04112c791a859d07a01001ac4f436e59.jpg")

divider()

st.header("Tokenisers and Tokenisation")

st.write("""Tokenisation is the initial step in the data preprocessing pipeline for natural language processing (NLP)
models. It involves breaking down a piece of text—whether a sentence, paragraph, or document—into smaller units,
known as "tokens". In English and many other languages, a token often corresponds to a word, but it can also be a
subword, character, or n-gram. The choice of token size depends on various factors, including the task at hand and
the language of the text.
""")

from transformers import AutoTokenizer

sentence = st.text_input("Sentence to explore (you can change it):", value="Tokenising text is a fundamental step for NLP models.")
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
# Encode without the automatically added [CLS]/[SEP] so tokens and ids pair up one-to-one.
sentence_encode_bert = tokenizer.encode(sentence, add_special_tokens=False)
sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))

st.write("""
Consider the sentence:
""")
st.code(f"""
"{sentence}"
""")

st.write("""
A basic word-level Tokenisation would produce tokens:
""")
st.code(f"""
{sentence_split}
""")

st.write("""
However, a more sophisticated algorithm, with several optimisations, might generate a different set of tokens:
""")
st.code(f"""
{sentence_tokenise_bert}
""")

with st.expander("click to look at the code:"):
    st.code(f"""\
from transformers import AutoTokenizer

sentence = st.text_input("Sentence to explore (you can change it):", value="{sentence}")
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
sentence_encode_bert = tokenizer.encode(sentence, add_special_tokens=False)
sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))
""", language='python')

st.write("""
As machine learning models, including Transformers, work with numbers rather than words, each vocabulary
entry is assigned a corresponding numerical value. Here is a potential key-value, vocabulary-based representation of
the input (so-called 'token ids'):
""")

st.code(f"""
{sentence_encode_bert}
""")

st.write("""
What distinguishes subword Tokenisation is its reliance on statistical rules and algorithms, learned from
the pretraining corpus. The resulting Tokeniser creates a vocabulary, which usually represents the most frequently
used words and subwords. For example, Byte Pair Encoding (BPE) first encodes the most frequent words as single
tokens, while less frequent words are represented by multiple tokens, each representing a word part.

There are numerous different Tokenisers available, including spaCy, Moses, Byte-Pair Encoding (BPE),
Byte-level BPE, WordPiece, Unigram, and SentencePiece. It's crucial to choose a specific Tokeniser and stick with it.
Changing the Tokeniser is akin to altering the model's language on the fly—imagine studying physics in English and
then taking the exam in French or Spanish. You might get lucky, but it's a considerable risk.
""")

with st.expander("""Let's train a tokeniser using our own dataset"""):
    training_dataset = """\
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
"""
    training_dataset = st.text_area("*Training Dataset - Vocabulary:*", value=training_dataset, height=200)
    training_dataset = training_dataset.split('\n')
    vocabulary_size = st.number_input("Vocabulary Size:", value=100000)

    # TODO: add more tokenisers
    from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    # tokenizer = Tokenizer(models.Unigram())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocabulary_size)

    # trainer = trainers.UnigramTrainer(
    #     vocab_size=20000,
    #     initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    #     special_tokens=["<PAD>", "<BOS>", "<EOS>"],
    # )

    tokenizer.train_from_iterator(training_dataset, trainer=trainer)

    sentence = st.text_input("*Text to tokenise:*", value="[CLS] Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")
    output = tokenizer.encode(sentence)

    st.write("*Tokens:*")
    st.code(f"""{output.tokens}""")
    st.code(f"""\
ids: {output.ids}
attention_mask: {output.attention_mask}
""")

    st.subheader("Try Yourself:")
    st.write("""*Aim to find or create a comprehensive vocabulary (training dataset) for Tokenisation, which can enhance the
efficiency of the process. This approach helps to eliminate unknown tokens, thereby making the token sequence
more understandable and composed of fewer tokens.*
""")

    st.caption("Special tokens meaning:")
    st.write("""
\\#\\# prefix: It means that the preceding string is not whitespace; any token with this prefix should be
merged with the previous token when you convert the tokens back to a string.

[UNK]: Stands for "unknown". This token is used to represent any word that is not in the model's vocabulary. Since
most models have a fixed-size vocabulary, it's not possible to have a unique token for every possible word. The [UNK]
token is used as a catch-all for any words the model hasn't seen before. E.g. in our example we 'decided' that the Large
Language (LL) abbreviation is not part of the model's vocabulary.

[CLS]: Stands for "classification". In models like BERT, this token is added at the beginning of every input
sequence. The representation (embedding) of this token is used as the aggregate sequence representation for
classification tasks. In other words, the model is trained to encode the meaning of the entire sequence into this token.

[SEP]: Stands for "separator". This token is used to separate different sequences when the model needs to take more
than one input sequence. For example, in question-answering tasks, the model takes two inputs: a question and a
passage that contains the answer. The two inputs are separated by a [SEP] token.

[MASK]: This token is specific to models like BERT, which are trained with a masked language modelling objective.
During training, some percentage of the input tokens are replaced with the [MASK] token, and the model's goal is to
predict the original value of the masked tokens.

[PAD]: Stands for "padding". This token is used to fill in the extra spaces when batching sequences of different
lengths together. Since models require input sequences to be the same length, shorter sequences are extended with
[PAD] tokens. In our example, we extended the length of the input sequence to 16 tokens.
""")
    st.caption("Python code:")
    st.code(f"""
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size={vocabulary_size})
training_dataset = {training_dataset}
tokenizer.train_from_iterator(training_dataset, trainer=trainer)
output = tokenizer.encode("{sentence}")
""", language='python')


with st.expander("References:"):
    st.write("""\
- https://huggingface.co/docs/transformers/tokenizer_summary
- https://huggingface.co/docs/tokenizers/training_from_memory
- https://en.wikipedia.org/wiki/Byte_pair_encoding
""")

divider()
st.header("Embeddings")
st.caption("TBD...")
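The expander above trains its BPE tokeniser through Streamlit widgets. As a rough companion, a minimal standalone sketch of the same `tokenizers` calls could be run as a plain script; the toy corpus, vocabulary size, and example sentence below are illustrative placeholders, not values taken from the app.

# Illustrative sketch of the BPE-training flow (not part of the committed app.py).
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

training_dataset = [                      # placeholder corpus
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Readability counts.",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=1000,                      # placeholder size
)
tokenizer.train_from_iterator(training_dataset, trainer=trainer)

output = tokenizer.encode("[CLS] Readability counts. [SEP] [PAD] [PAD]")
print(output.tokens)           # subword tokens, including the special tokens typed above
print(output.ids)              # ids from the freshly trained vocabulary
print(output.attention_mask)   # all 1s here: the [PAD] tokens were typed in, not added by padding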
requirements.txt
ADDED
@@ -0,0 +1,3 @@
streamlit~=1.21.0
tokenizers~=0.13.3
transformers~=4.31.0
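With these version pins, the app can usually be reproduced locally with `pip install -r requirements.txt` followed by `streamlit run app.py`; both are standard pip and Streamlit commands rather than anything specific to this Space.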