First version uploaded
Browse files- README.md +32 -3
- rf_minilm.pkl +3 -0
- rf_octoai.pkl +3 -0
README.md
CHANGED
|
@@ -1,3 +1,32 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Model Description
|
| 2 |
+
|
| 3 |
+
The purpose of our trained Random Forest models is to identify malicious prompts given the prompt embeddings derived from [OpenAI](https://huggingface.co/datasets/ahsanayub/malicious-prompts-openai-embeddings), [OctoAI](https://huggingface.co/datasets/ahsanayub/malicious-prompts-octoai-embeddings), and [MiniLM](https://huggingface.co/datasets/ahsanayub/malicious-prompts-minilm-embeddings). The models are trained with 373,120 benign and malicious prompts. We split this dataset into 80% training and 20% test sets. To ensure equal proportion of the malicious and benign labels across splits, we use stratified sampling.
|
| 4 |
+
|
| 5 |
+
Embeddings consist of fixed-length numerical representations. OpenAI generates an embedding vector consisting of 1,536 floating-point numbers for each prompt. Similarly, the embedding datasets for OctoAI and MiniLM consist of 1,027 and 387 features, respectively.
|
| 6 |
+
|
| 7 |
+
# Model Evaluation
|
| 8 |
+
|
| 9 |
+
The binary classification performance of embedding-based random forest models is shared below:
|
| 10 |
+
|
| 11 |
+
| Embedding | Precision | Recall | F1-score | AUC |
|
| 12 |
+
|-----------|-----------|--------|----------|-------|
|
| 13 |
+
| OpenAI | 0.867 | 0.867 | 0.867 | 0.764 |
|
| 14 |
+
| OctoAI | 0.849 | 0.853 | 0.851 | 0.731 |
|
| 15 |
+
| MiniLM | 0.849 | 0.853 | 0.851 | 0.730 |
|
| 16 |
+
|
| 17 |
+
## How to Use the Model
|
| 18 |
+
|
| 19 |
+
We have shared three versions of random forest models in this repository. We used the following embedding models: `text-embedding-3-small` from OpenAI, and the open-source models `gte-large` hosted on OctoAI, as well as the well-known `all-MiniLM-L6-v2`. Therefore, you need to covert the prompts to its respective embeddings before querying the model to obtain its prediction: `0` for benign and `1` for malicous.
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
## Citing This Work
|
| 23 |
+
Our implementation, along with the curated datasets used for evaluation, is available on [GitHub](https://github.com/AhsanAyub/malicious-prompt-detection). Additionaly, if you use our implementation for scientific research, you are highly encouraged to cite [our paper](https://arxiv.org/abs/2410.22284).
|
| 24 |
+
|
| 25 |
+
```
|
| 26 |
+
@article{ayub2024embedding,
|
| 27 |
+
title={Embedding-based classifiers can detect prompt injection attacks},
|
| 28 |
+
author={Ayub, Md Ahsan and Majumdar, Subhabrata},
|
| 29 |
+
booktitle={CAMLIS},
|
| 30 |
+
year={2024}
|
| 31 |
+
}
|
| 32 |
+
```
|
rf_minilm.pkl
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:5a16a5effa8d82d820dcdc84b4cea9514309dc6423cbd3082949dec78e0e1a43
|
| 3 |
+
size 386568804
|
rf_octoai.pkl
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:11001475d3fe2d296ba84609994b951754627c2798adbc7090e62495014bbb88
|
| 3 |
+
size 358151204
|