Pleias-RAG-350M is not simply a cost-effective version of larger models. We found that it correctly answered several hundred questions from HotPotQA that neither Llama-3-8b nor Qwen-2.5-7b could solve. We therefore encourage its use as part of multi-model RAG systems.
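
One possible cascade pattern (a sketch only; the two callables below are hypothetical placeholders, not part of our library) queries the small model first and escalates to a larger model only when it abstains:

```python
from typing import Callable, Optional

# Hypothetical multi-model RAG cascade: run the cheap 350M model first,
# fall back to a larger model only when the small one abstains.
# Both callables are placeholders for your actual model invocations.
def cascade_answer(
    query: str,
    sources: list[str],
    small_model: Callable[[str, list[str]], Optional[str]],
    large_model: Callable[[str, list[str]], str],
) -> str:
    answer = small_model(query, sources)      # cheap first pass (Pleias-RAG-350M)
    if answer is None:                        # None signals abstention
        answer = large_model(query, sources)  # costly fallback (e.g. a 7-8b model)
    return answer
```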
## Deployment
The easiest way to deploy Pleias-RAG-350M is through [our official library](https://github.com/Pleias/Pleias-RAG-Library). It features an API-like workflow with standardized export of the structured reasoning/answer output in JSON format. A [Colab Notebook](https://colab.research.google.com/drive/1oG0qq0I1fSEV35ezSah-a335bZqmo4_7?usp=sharing) is available for easy testing and experimentation.
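
As a minimal sketch of this workflow (the `RAGWithCitations` class and the response fields below are assumptions based on the library's repository and may change; check its README for the current interface):

```python
# Minimal sketch; class and field names are assumptions taken from the
# Pleias-RAG-Library repository — verify against its README.
import json
from rag_library import RAGWithCitations

rag = RAGWithCitations("PleIAs/Pleias-RAG-350M")

query = "What is the capital of France?"
sources = [
    {
        "text": "Paris is the capital and most populous city of France.",
        "metadata": {"source": "example-encyclopedia"},  # illustrative source
    },
]

# generate() returns the structured reasoning/answer output as a dict,
# which can be serialized straight to JSON for downstream pipelines.
response = rag.generate(query, sources)
print(json.dumps(response, indent=2))
```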
With only 350 million parameters, Pleias-RAG-350M is classified among the *phone-sized SLMs*, a niche with very few alternatives (SmolLM, Qwen-0.5) and none that currently works well for retrieval-augmented generation.
We also release an unquantized GGUF version for deployment on CPU. Our internal performance benchmarks suggest that waiting times are acceptable for most use cases, even under constrained RAM: about 20 seconds for a complex generation including reasoning traces on 8 GB of RAM or less. Since the model is unquantized, text generation quality should be identical to that of the original model.
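
For example, a CPU-only run with llama-cpp-python might look like the following sketch (the GGUF file name and prompt are placeholders; follow the model card for the expected RAG prompt format):

```python
# CPU-only sketch using llama-cpp-python; the GGUF file name is a
# placeholder — download the actual file from the model repository.
from llama_cpp import Llama

llm = Llama(
    model_path="pleias-rag-350m.gguf",  # placeholder path to the GGUF file
    n_ctx=4096,   # context window; adjust to your RAM budget
    n_threads=4,  # number of CPU threads
)

# Placeholder prompt: build the real one (query + sources) per the model card.
prompt = "..."
output = llm(prompt, max_tokens=1024)
print(output["choices"][0]["text"])
```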