Prompt-Kala: A Multimodal Conversational Agent for E-Commerce Built on Dual-Retrieval RAG Architecture
Abstract
Effectively harnessing the vast and unstructured data from customer comments is a critical challenge in modern e-commerce. An intelligent system that can accurately interpret and respond to nuanced, multimodal user queries is essential for enhancing customer experience and providing scalable support. We propose a novel, dual-phase Retrieval-Augmented Generation (RAG) system that integrates both textual and visual information to power a conversational chatbot. Our empirical results demonstrate a significant performance uplift, with question-answering accuracy increasing by up to 20 percentage points when visual context is provided alongside text. This work establishes a robust framework for transforming raw customer feedback into a dynamic, interactive, and reliable knowledge base for e-commerce applications. The code for this project is available at https://github.com/NLP-Final-Projects/digikala rag.
Index Terms—Retrieval-Augmented Generation (RAG), Natural Language Processing (NLP), Knowledge base/External knowledge, Vector database, Prompt engineering
Model Variants
The adapters are organized by their training configuration. The naming convention is clip_lora_adapters_{epochs}e{rank}r, followed by a suffix identifying the checkpoint, with one subdirectory per training checkpoint (a small helper for constructing these names is sketched after the directory listing below).
- r (Rank): The rank of the LoRA decomposition. Higher ranks can capture more complex patterns but increase the number of trainable parameters. We provide adapters with ranks 16 and 32.
- e (Epochs): The total number of training epochs. All primary models were trained for 80 epochs.
- Cut: Checkpoints saved at intermediate epochs (e.g., 30eCut, 50eCut). These can be useful if the model starts to overfit in later epochs.
- ES (Early Stopping): The final adapter saved based on the best validation score using an early stopping mechanism.
Adapter Directory Structure:
- clip_lora_adapters_80e16r_ES: Final LoRA adapter with rank 16, trained for 80 epochs with early stopping.
- clip_lora_adapters_80e16r_30eCut: Checkpoint from the same run at 30 epochs.
- clip_lora_adapters_80e16r_50eCut: Checkpoint at 50 epochs.
- clip_lora_adapters_80e16r_70eCut: Checkpoint at 70 epochs.
- clip_lora_adapters_80e32r_ES: Final LoRA adapter with rank 32, trained for 80 epochs with early stopping.
- clip_lora_adapters_80e32r_30eCut: Checkpoint at 30 epochs.
- clip_lora_adapters_80e32r_50eCut: Checkpoint at 50 epochs.
- clip_lora_adapters_80e32r_70eCut: Checkpoint at 70 epochs.
- glot-contrastive-final-lora: A curated final version, recommended for general use (symbolic link to the best-performing adapter, e.g., clip_lora_adapters_80e32r_ES).
- glot-mlm-adapted: An experimental version of the adapter further fine-tuned with a Masked Language Modeling (MLM) objective on the text encoder.
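To make the naming convention concrete, the short sketch below assembles an adapter directory name from an epoch/rank configuration; the adapter_dir helper is ours and purely illustrative.

def adapter_dir(epochs: int, rank: int, checkpoint: str = "ES") -> str:
    # Build a directory name following clip_lora_adapters_{epochs}e{rank}r_{checkpoint}.
    # adapter_dir(80, 32)           -> "clip_lora_adapters_80e32r_ES"
    # adapter_dir(80, 16, "50eCut") -> "clip_lora_adapters_80e16r_50eCut"
    return f"clip_lora_adapters_{epochs}e{rank}r_{checkpoint}"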
How to Use
To use these LoRA adapters, install the transformers, peft, and torch libraries (the CLIPFaLORA example below additionally relies on torchvision, Pillow, and requests). First, load the corresponding base model, then attach the desired LoRA adapter from this repository, as in the minimal sketch below and the full classes that follow.
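A minimal sketch of that pattern, assuming cis-lmu/glot500-base as the GLOT500 base checkpoint and glot-contrastive-final-lora as the adapter directory (the CLIP adapters additionally require the CombinedContrastive wrapper shown in the class below):

from transformers import AutoModel
from peft import PeftModel

# Load the base encoder, then attach a LoRA adapter directory from this repository.
base_model = AutoModel.from_pretrained("cis-lmu/glot500-base")
model = PeftModel.from_pretrained(base_model, "glot-contrastive-final-lora")
model.eval()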
CLIPFaLORA
import torch
from torchvision import transforms
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer
from peft import PeftModel
from .CombinedContrastive import CombinedContrastive  # wrapper from this repository that pairs the vision and text encoders
import requests
from io import BytesIO
from typing import List


class CLIPFaLORA:
    def __init__(self, name: str, path: str):
        self.name = name
        self.path = path
        self.device = "cuda:0"
        # Wrap the CLIP-fa vision and text encoders, then attach the LoRA adapter stored at `path`.
        self.model = PeftModel.from_pretrained(
            CombinedContrastive(
                CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision"),
                RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text"),
            ),
            self.path,
        )
        self.model = self.model.to(self.device)
        self.model.eval()
        self.text_transform = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")
        # Image preprocessing: resize to the 224x224 CLIP input size and normalize
        # with custom channel statistics.
        self.image_transform = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.8544, 0.8390, 0.8298], std=[0.2618, 0.2729, 0.2855]
                ),
            ]
        )

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        # Tokenize a batch of strings and return the text encoder's pooled embeddings.
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)
        with torch.no_grad():
            embeddings = self.model.text_encoder(**inputs).pooler_output
        return embeddings.cpu().numpy().tolist()

    def get_image_embedding(self, images: List[str]) -> List[List[float]]:
        # `images` is a list of local file paths.
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)
        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output
        return embeddings.cpu().numpy().tolist()

    def get_image_embedding_url(self, images: List[str]) -> List[List[float]]:
        # `images` is a list of image URLs; download each one before preprocessing.
        contents = [requests.get(image).content for image in images]
        images = [BytesIO(content) for content in contents]
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)
        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output
        return embeddings.cpu().numpy().tolist()
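A hypothetical usage sketch for the class above; the adapter directory and image path are placeholders to replace with your own, and a CUDA device is assumed because the class hard-codes cuda:0.

# Instantiate with one of the adapter directories listed above (placeholder values).
encoder = CLIPFaLORA(name="clip-fa-lora", path="clip_lora_adapters_80e32r_ES")

# Embed review text and product images into the shared space.
text_vecs = encoder.get_text_embedding(["Is the camera quality good in low light?"])
image_vecs = encoder.get_image_embedding(["product_photo.jpg"])  # local file path
print(len(text_vecs[0]), len(image_vecs[0]))  # embedding dimensionality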
GLOT500LORA
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from typing import List


class GLOT500LORA:
    def __init__(self, name: str, base: str, adapters: str):
        self.name = name
        self.base = base
        self.adapters = adapters
        self.device = "cuda:0"
        # Load the base text encoder and attach the LoRA adapter directory.
        self.model = PeftModel.from_pretrained(
            AutoModel.from_pretrained(base), adapters
        )
        self.model.to(self.device)
        self.text_transform = AutoTokenizer.from_pretrained(base, use_fast=False)

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
            # Mean-pool the token embeddings, masking out padding positions so they
            # do not contribute to the sentence embedding.
            embeddings = outputs.last_hidden_state
            mask = (
                inputs["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
            )
            embeddings = torch.sum(embeddings * mask, 1) / torch.clamp(
                mask.sum(1), min=1e-9
            )
        return embeddings.cpu().numpy().tolist()
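A corresponding hypothetical sketch for the text-only encoder, again with assumed identifiers (cis-lmu/glot500-base as the base checkpoint, glot-contrastive-final-lora as the adapter directory):

text_encoder = GLOT500LORA(
    name="glot500-lora",
    base="cis-lmu/glot500-base",             # assumed GLOT500 base checkpoint
    adapters="glot-contrastive-final-lora",  # adapter directory from this repository
)
embeddings = text_encoder.get_text_embedding(["How long does the battery last?"])
print(len(embeddings), len(embeddings[0]))   # number of sentences, embedding dimension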
Model tree for parsi-ai-nlpclass/Digikala_RAG
Base model: openai/clip-vit-large-patch14