Pleias-RAG-350M is not simply a cost-effective version of larger models. We found that it correctly answered several hundred questions from HotPotQA that neither Llama-3-8b nor Qwen-2.5-7b could solve. We therefore encourage its use as part of multi-model RAG systems.
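
One possible cascade pattern (a sketch only; the two callables below are hypothetical placeholders, not part of our library) queries the small model first and escalates to a larger model only when it abstains:

```python
from typing import Callable, Optional

# Hypothetical multi-model RAG cascade: run the cheap 350M model first,
# fall back to a larger model only when the small one abstains.
# Both callables are placeholders for your actual model invocations.
def cascade_answer(
    query: str,
    sources: list[str],
    small_model: Callable[[str, list[str]], Optional[str]],
    large_model: Callable[[str, list[str]], str],
) -> str:
    answer = small_model(query, sources)      # cheap first pass (Pleias-RAG-350M)
    if answer is None:                        # None signals abstention
        answer = large_model(query, sources)  # costly fallback (e.g. a 7-8b model)
    return answer
```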
## Deployment
The easiest way to deploy Pleias-RAG-350M is through [our official library](https://github.com/Pleias/Pleias-RAG-Library). It features an API-like workflow with standardized export of the structured reasoning/answer output in JSON format. A [Colab Notebook](https://colab.research.google.com/drive/1oG0qq0I1fSEV35ezSah-a335bZqmo4_7?usp=sharing) is available for easy testing and experimentation.
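
As a minimal sketch of this workflow (the `RAGWithCitations` class and the response fields below are assumptions based on the library's repository and may change; check its README for the current interface):

```python
# Minimal sketch; class and field names are assumptions taken from the
# Pleias-RAG-Library repository — verify against its README.
import json
from rag_library import RAGWithCitations

rag = RAGWithCitations("PleIAs/Pleias-RAG-350M")

query = "What is the capital of France?"
sources = [
    {
        "text": "Paris is the capital and most populous city of France.",
        "metadata": {"source": "example-encyclopedia"},  # illustrative source
    },
]

# generate() returns the structured reasoning/answer output as a dict,
# which can be serialized straight to JSON for downstream pipelines.
response = rag.generate(query, sources)
print(json.dumps(response, indent=2))
```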
With only 350 million parameters, Pleias-RAG-350M is classified among the *phone-sized SLMs*, a niche with very few alternatives (SmolLM, Qwen-0.5) and none that currently works well for retrieval-augmented generation.
We also release an unquantized GGUF version for deployment on CPU. Our internal performance benchmarks suggest that waiting times are acceptable for most use cases, even under constrained RAM: about 20 seconds for a complex generation including reasoning traces on 8 GB of RAM or less. Since the model is unquantized, text generation quality should be identical to that of the original model.
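
For example, a CPU-only run with llama-cpp-python might look like the following sketch (the GGUF file name and prompt are placeholders; follow the model card for the expected RAG prompt format):

```python
# CPU-only sketch using llama-cpp-python; the GGUF file name is a
# placeholder — download the actual file from the model repository.
from llama_cpp import Llama

llm = Llama(
    model_path="pleias-rag-350m.gguf",  # placeholder path to the GGUF file
    n_ctx=4096,   # context window; adjust to your RAM budget
    n_threads=4,  # number of CPU threads
)

# Placeholder prompt: build the real one (query + sources) per the model card.
prompt = "..."
output = llm(prompt, max_tokens=1024)
print(output["choices"][0]["text"])
```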