Thanks for the thoughtful comment! For now, I'm of the opinion that SaaS embedding APIs are cheap enough that even a large dataset can be re-vectorised. For example, embedding the 143k chunks cost somewhere between $6 and $30 (from memory), and that's every High Court judgment up to 2023 in Australia. Personally, I think of the vectors themselves as essentially disposable, since better models come out every month or so. I know not everyone shares that mindset, and for ultimate control you'd definitely want to go local.
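The rough arithmetic behind that cost range can be sketched as follows. The tokens-per-chunk figure and per-token price below are illustrative assumptions, not numbers from the post; real embedding APIs vary widely in pricing:

```python
# Back-of-envelope cost of re-embedding a corpus via a hosted embedding API.
# Both inputs are assumptions for illustration only.
chunks = 143_000            # corpus size from the post
tokens_per_chunk = 500      # assumed average chunk length
price_per_million = 0.10    # hypothetical $/1M tokens

total_tokens = chunks * tokens_per_chunk
cost = total_tokens / 1_000_000 * price_per_million
print(f"~{total_tokens / 1e6:.1f}M tokens, ~${cost:.2f}")
```

At a higher hypothetical rate (say $0.40/M tokens) the same corpus lands near the top of the quoted $6–$30 range, which is why the exact model chosen dominates the bill.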
Adrian Lucas Malec (adlumal), replying to their post:
I benchmarked embedding APIs for speed, compared local vs hosted models, and tuned USearch for sub-millisecond retrieval on 143k chunks using only CPU. The post walks through the results, trade-offs, and what I learned about embedding API terms of service.
The main motivation for using USearch is that CPU compute is cheap and easy to scale.
Blog post: https://huggingface.co/blog/adlumal/lightning-fast-vector-search-for-legal-documents