Prompt-Kala: A Multimodal Conversational Agent for E-Commerce Built on Dual-Retrieval RAG Architecture
Abstract
Effectively harnessing the vast and unstructured data from customer comments is a critical challenge in modern e-commerce. An intelligent system that can accurately interpret and respond to nuanced, multimodal user queries is essential for enhancing customer experience and providing scalable support. We propose a novel, dual-phase Retrieval-Augmented Generation (RAG) system that integrates both textual and visual information to power a conversational chatbot. Our empirical results demonstrate a significant performance uplift, with question-answering accuracy increasing by up to 20 percentage points when visual context is provided alongside text. This work establishes a robust framework for transforming raw customer feedback into a dynamic, interactive, and reliable knowledge base for e-commerce applications. The code for this project is available at https://github.com/NLP-Final-Projects/digikala rag.
Index Terms—Retrieval-Augmented Generation (RAG), Natural Language Processing (NLP), Knowledge base/External knowledge, Vector database, Prompt engineering
Model Variants
The adapters are organized by their training configuration. The naming convention is clip_lora_adapters_{epochs}e{rank}r, followed by a suffix identifying the checkpoint, with one subdirectory per training checkpoint (a small helper for constructing these names is sketched after the directory listing below).
- r (Rank): The rank of the LoRA decomposition. Higher ranks can capture more complex patterns but increase the number of trainable parameters. We provide adapters with ranks 16 and 32.
- e (Epochs): The total number of training epochs. All primary models were trained for 80 epochs.
- Cut: Checkpoints saved at intermediate epochs (e.g., 30eCut, 50eCut). These can be useful if the model starts to overfit in later epochs.
- ES (Early Stopping): The final adapter saved based on the best validation score using an early stopping mechanism.
Adapter Directory Structure:
- clip_lora_adapters_80e16r_ES: Final LoRA adapter with rank 16, trained for 80 epochs with early stopping.
- clip_lora_adapters_80e16r_30eCut: Checkpoint from the same run at 30 epochs.
- clip_lora_adapters_80e16r_50eCut: Checkpoint at 50 epochs.
- clip_lora_adapters_80e16r_70eCut: Checkpoint at 70 epochs.
- clip_lora_adapters_80e32r_ES: Final LoRA adapter with rank 32, trained for 80 epochs with early stopping.
- clip_lora_adapters_80e32r_30eCut: Checkpoint at 30 epochs.
- clip_lora_adapters_80e32r_50eCut: Checkpoint at 50 epochs.
- clip_lora_adapters_80e32r_70eCut: Checkpoint at 70 epochs.
- glot-contrastive-final-lora: A curated final version, recommended for general use (symbolic link to the best-performing adapter, e.g., clip_lora_adapters_80e32r_ES).
- glot-mlm-adapted: An experimental version of the adapter further fine-tuned with a Masked Language Modeling (MLM) objective on the text encoder.
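To make the naming convention concrete, the short sketch below assembles an adapter directory name from an epoch/rank configuration; the adapter_dir helper is ours and purely illustrative.

def adapter_dir(epochs: int, rank: int, checkpoint: str = "ES") -> str:
    # Build a directory name following clip_lora_adapters_{epochs}e{rank}r_{checkpoint}.
    # adapter_dir(80, 32)           -> "clip_lora_adapters_80e32r_ES"
    # adapter_dir(80, 16, "50eCut") -> "clip_lora_adapters_80e16r_50eCut"
    return f"clip_lora_adapters_{epochs}e{rank}r_{checkpoint}"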
How to Use
To use these LoRA adapters, install the transformers, peft, and torch libraries (the CLIPFaLORA example below additionally relies on torchvision, Pillow, and requests). First, load the corresponding base model, then attach the desired LoRA adapter from this repository, as in the minimal sketch below and the full classes that follow.
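A minimal sketch of that pattern, assuming cis-lmu/glot500-base as the GLOT500 base checkpoint and glot-contrastive-final-lora as the adapter directory (the CLIP adapters additionally require the CombinedContrastive wrapper shown in the class below):

from transformers import AutoModel
from peft import PeftModel

# Load the base encoder, then attach a LoRA adapter directory from this repository.
base_model = AutoModel.from_pretrained("cis-lmu/glot500-base")
model = PeftModel.from_pretrained(base_model, "glot-contrastive-final-lora")
model.eval()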
CLIPFaLORA
import torch
from torchvision import transforms
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer
from peft import PeftModel
from .CombinedContrastive import CombinedContrastive  # wrapper from this repository that pairs the vision and text encoders
import requests
from io import BytesIO
from typing import List


class CLIPFaLORA:
    def __init__(self, name: str, path: str):
        self.name = name
        self.path = path
        self.device = "cuda:0"
        # Wrap the CLIP-fa vision and text encoders, then attach the LoRA adapter stored at `path`.
        self.model = PeftModel.from_pretrained(
            CombinedContrastive(
                CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision"),
                RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text"),
            ),
            self.path,
        )
        self.model = self.model.to(self.device)
        self.model.eval()
        self.text_transform = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")
        # Image preprocessing: resize to the 224x224 CLIP input size and normalize
        # with custom channel statistics.
        self.image_transform = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.8544, 0.8390, 0.8298], std=[0.2618, 0.2729, 0.2855]
                ),
            ]
        )

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        # Tokenize a batch of strings and return the text encoder's pooled embeddings.
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)
        with torch.no_grad():
            embeddings = self.model.text_encoder(**inputs).pooler_output
        return embeddings.cpu().numpy().tolist()

    def get_image_embedding(self, images: List[str]) -> List[List[float]]:
        # `images` is a list of local file paths.
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)
        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output
        return embeddings.cpu().numpy().tolist()

    def get_image_embedding_url(self, images: List[str]) -> List[List[float]]:
        # `images` is a list of image URLs; download each one before preprocessing.
        contents = [requests.get(image).content for image in images]
        images = [BytesIO(content) for content in contents]
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)
        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output
        return embeddings.cpu().numpy().tolist()
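A hypothetical usage sketch for the class above; the adapter directory and image path are placeholders to replace with your own, and a CUDA device is assumed because the class hard-codes cuda:0.

# Instantiate with one of the adapter directories listed above (placeholder values).
encoder = CLIPFaLORA(name="clip-fa-lora", path="clip_lora_adapters_80e32r_ES")

# Embed review text and product images into the shared space.
text_vecs = encoder.get_text_embedding(["Is the camera quality good in low light?"])
image_vecs = encoder.get_image_embedding(["product_photo.jpg"])  # local file path
print(len(text_vecs[0]), len(image_vecs[0]))  # embedding dimensionality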
GLOT500LORA
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from typing import List


class GLOT500LORA:
    def __init__(self, name: str, base: str, adapters: str):
        self.name = name
        self.base = base
        self.adapters = adapters
        self.device = "cuda:0"
        # Load the base text encoder and attach the LoRA adapter directory.
        self.model = PeftModel.from_pretrained(
            AutoModel.from_pretrained(base), adapters
        )
        self.model.to(self.device)
        self.text_transform = AutoTokenizer.from_pretrained(base, use_fast=False)

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
            # Mean-pool the token embeddings, masking out padding positions so they
            # do not contribute to the sentence embedding.
            embeddings = outputs.last_hidden_state
            mask = (
                inputs["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
            )
            embeddings = torch.sum(embeddings * mask, 1) / torch.clamp(
                mask.sum(1), min=1e-9
            )
        return embeddings.cpu().numpy().tolist()
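A corresponding hypothetical sketch for the text-only encoder, again with assumed identifiers (cis-lmu/glot500-base as the base checkpoint, glot-contrastive-final-lora as the adapter directory):

text_encoder = GLOT500LORA(
    name="glot500-lora",
    base="cis-lmu/glot500-base",             # assumed GLOT500 base checkpoint
    adapters="glot-contrastive-final-lora",  # adapter directory from this repository
)
embeddings = text_encoder.get_text_embedding(["How long does the battery last?"])
print(len(embeddings), len(embeddings[0]))   # number of sentences, embedding dimension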
Model tree for parsi-ai-nlpclass/Digikala_RAG
Base model: openai/clip-vit-large-patch14