---
language:
- en
license: mit
library_name: peft
tags:
- reranking
- information-retrieval
- pointwise
- lora
- peft
- efficient
- ranknet
base_model: meta-llama/Llama-3.2-3B
datasets:
- Tevatron/msmarco-passage
- abdoelsayed/DeAR-COT
pipeline_tag: text-classification
---

# DeAR-3B-Reranker-RankNet-LoRA-v1

## Model Description

**DeAR-3B-Reranker-RankNet-LoRA-v1** is a LoRA adapter for the most efficient reranker in the DeAR family. This ultra-lightweight adapter (~40MB) achieves fast inference speeds while maintaining competitive accuracy, making it ideal for resource-constrained production environments.

## Model Details

- **Model Type:** LoRA Adapter for Pointwise Reranking
- **Base Model:** meta-llama/Llama-3.2-3B
- **Adapter Size:** ~40MB
- **Training Method:** LoRA with RankNet Loss + Knowledge Distillation
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Trainable Parameters:** 25M (0.8% of total)

## Key Features

- ✅ **Ultra Lightweight:** Only 40MB of storage
- ✅ **Fastest Inference:** 1.5s for 100 documents
- ✅ **Memory Efficient:** 10GB of GPU memory for inference
- ✅ **Easy Deployment:** Quick adapter loading
- ✅ **Cost Effective:** Minimal compute requirements

## Usage

### Load and Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

# Load LoRA adapter
adapter_path = "abdoelsayed/dear-3b-reranker-ranknet-lora-v1"
config = PeftConfig.from_pretrained(adapter_path)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    torch_dtype=torch.bfloat16
)

# Load and merge LoRA
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()
model.eval().cuda()

# Use model
query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence..."
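# Pointwise scoring: the (query, document) pair is encoded as a single
# text / text_pair input, and the model's single output logit is the relevance score.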
inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length"
)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(f"Relevance score: {score}")
```

### Helper Function

```python
import torch
from typing import List, Tuple
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig


def load_3b_lora_ranker(adapter_path: str):
    """Load 3B LoRA adapter efficiently."""
    config = PeftConfig.from_pretrained(adapter_path)
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    base = AutoModelForSequenceClassification.from_pretrained(
        config.base_model_name_or_path,
        num_labels=1,
        torch_dtype=torch.bfloat16
    )
    model = PeftModel.from_pretrained(base, adapter_path)
    model = model.merge_and_unload()
    model.eval().cuda()
    return tokenizer, model


# Load once
tokenizer, model = load_3b_lora_ranker("abdoelsayed/dear-3b-reranker-ranknet-lora-v1")


# Rerank function
@torch.inference_mode()
def rerank(tokenizer, model, query: str, docs: List[Tuple[str, str]], batch_size=128):
    """Score (title, passage) pairs against a query and return indices sorted by relevance."""
    scores = []
    device = next(model.parameters()).device
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {t} {p}" for t, p in batch]
        inputs = tokenizer(queries, documents, return_tensors="pt",
                           truncation=True, max_length=228, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
```

## Training Details

### LoRA Configuration

```python
{
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "SEQ_CLS"
}
```

### Training Hyperparameters

- **Learning Rate:** 1e-4
- **Batch Size:** 8 (larger batches are possible thanks to the lower memory footprint)
- **Gradient Accumulation:** 2
- **Epochs:** 2
- **Hardware:** 4x A100 (40GB)
- **Training Time:** ~6 hours (3x faster than full fine-tuning)
- **Memory:** ~18GB per GPU

## Efficiency Comparison

### Storage Efficiency

```
LoRA Adapter:   40MB
Full 3B Model:  6GB
Full 8B Model:  16GB

Ratio: 0.67% of 3B, 0.25% of 8B
```

### Inference Speed

```
3B LoRA: 1.5s (100 docs)
8B Full: 2.2s (100 docs)

Speedup: 1.47x faster than 8B
```

### Memory Usage

```
3B LoRA: 10GB GPU
3B Full: 12GB GPU
8B Full: 18GB GPU
```

## When to Use

**Best for:**
- ✅ Extreme resource constraints
- ✅ Multiple domain-specific versions
- ✅ Fast iteration cycles
- ✅ Edge deployment
- ✅ Maximum throughput

**Use full 3B for:**
- ❌ Slightly better accuracy needed
- ❌ No storage constraints

**Use 8B for:**
- ❌ +3 NDCG@10 accuracy gain needed

## Deployment Example

```python
# Minimal memory deployment
import torch
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

adapter_path = "abdoelsayed/dear-3b-reranker-ranknet-lora-v1"

# Load with memory optimization
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    low_cpu_mem_usage=True
)

# Load adapter
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload()
model.eval()

# Optional: Compile for speedup
if hasattr(torch, 'compile'):
    model = torch.compile(model, mode="max-autotune")
```
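For completeness, here is a minimal scoring sketch for the deployed model above. `score_pair` is a hypothetical helper: it loads the base model's tokenizer and reuses the same `query:` / `document:` input format and `max_length` as the usage examples, with a single warm-up call so the optional `torch.compile` step runs before real traffic arrives.

```python
from transformers import AutoTokenizer

# Tokenizer for the base model (the LoRA adapter does not change the vocabulary).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


@torch.inference_mode()
def score_pair(query: str, document: str) -> float:
    """Hypothetical helper: relevance score for one (query, document) pair."""
    inputs = tokenizer(
        f"query: {query}",
        f"document: {document}",
        return_tensors="pt",
        truncation=True,
        max_length=228,
        padding="max_length",
    )
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    return model(**inputs).logits.squeeze().item()


# One warm-up call triggers compilation (if enabled) before serving requests.
_ = score_pair("what is machine learning", "Machine learning is a subset of AI...")
```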
## Performance vs Size

```
Model Size vs NDCG@10 (TREC DL19):
├─ Teacher-13B:   73.8  (26GB)
├─ DeAR-8B-Full:  74.5  (16GB)
├─ DeAR-8B-LoRA:  74.2  (100MB + base)
├─ DeAR-3B-Full:  71.2  (6GB)
└─ DeAR-3B-LoRA:  70.9  (40MB + base)  ← This model

Best efficiency: ~95% of the 8B model's accuracy at 0.25% of its size!
```

## Related Models

**Full Version:**
- [DeAR-3B-RankNet](https://huggingface.co/abdoelsayed/dear-3b-reranker-ranknet-v1)

**Same Size (3B):**
- [DeAR-3B-CE-LoRA](https://huggingface.co/abdoelsayed/dear-3b-reranker-ce-lora-v1)

**Larger (8B):**
- [DeAR-8B-RankNet-LoRA](https://huggingface.co/abdoelsayed/dear-8b-reranker-ranknet-lora-v1)

**Resources:**
- [DeAR-COT Dataset](https://huggingface.co/datasets/abdoelsayed/DeAR-COT)
- [Teacher Model](https://huggingface.co/abdoelsayed/llama2-13b-rankllama-teacher)

## Citation

```bibtex
@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}
```

## License

MIT License

## More Information

- **GitHub:** [DataScienceUIBK/DeAR-Reranking](https://github.com/DataScienceUIBK/DeAR-Reranking)
- **Paper:** [arXiv:2508.16998](https://arxiv.org/abs/2508.16998)
- **Collection:** [DeAR Models](https://huggingface.co/collections/abdoelsayed/dear-reranking)