flowpoint
This post is a good and valid reminder of the need for good science around tokenization.
However, my dislike of tokenizers stems more from their practical implications.
Tokenizers:
- Are another software component that can and does go wrong.
- Are rarely finetuned, and doing so is more problematic than finetuning the model weights.
- Mostly don't run on GPU/TPU.
...
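As an illustration of the first point, even a trivial tokenizer can silently lose information. A hypothetical minimal sketch (toy whitespace tokenizer, not any specific library):

```python
# Toy whitespace tokenizer: a hypothetical sketch of how tokenization
# can silently go wrong (here: the round trip is not lossless).
def tokenize(text: str) -> list[str]:
    return text.split()

def detokenize(tokens: list[str]) -> str:
    return " ".join(tokens)

original = "two  spaces\tand a tab"
roundtrip = detokenize(tokenize(original))
print(roundtrip == original)  # False: whitespace detail is lost
```

Real subword tokenizers are more careful than this, but normalization, special tokens, and detokenization rules are all extra places where such mismatches creep in.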
Many of these are solvable implementation problems, but the bitter lesson would imply that we should instead train/search/learn tokenization inside the networks themselves.
The increased cost of doing so can be mitigated within the network architecture and with performance optimizations.
Performance and interpretability are strong points for tokenizers, but they trade off against those implementation problems and possibly lower model quality.
Additionally, it's fair to call a model tokenizer-free when no specific component of the software is actually a tokenizer; a bare str.encode hardly deserves the name.
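For comparison, the str.encode case really is that trivial — a minimal sketch of byte-level input, where UTF-8 bytes serve directly as token IDs:

```python
# Byte-level "tokenization": UTF-8 bytes as token IDs, vocab size 256,
# no separate tokenizer component anywhere in the pipeline.
text = "héllo"
ids = list(text.encode("utf-8"))    # one ID per byte
decoded = bytes(ids).decode("utf-8")
print(ids)
print(decoded == text)  # True: trivially lossless round trip
```

There is nothing here to version, finetune, or keep in sync with the model, which is exactly why calling such models tokenizer-free seems appropriate.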