Optimizing Dark Web Data Collection: A Comparative Evaluation of Machine Learning Approaches
I.C. Bakermans, dr. G. Cascavilla, Prof. Dr. ing. Z.J.M.H. Geradts, D. de Pascale
This repository accompanies the research Optimizing Dark Web Data Collection: A Comparative Evaluation of Machine Learning Approaches and contains the following assets: data, code, models, and the images used in the thesis. This README describes the structure of the repository and where each asset can be found. We share these assets to support future research and to provide guidelines for navigating the repository.
Usage and License Notices: The data and model checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, ChatGPT, GLiNER, and UniversalNER.
The repository is organized into the folders described below.
Let's start
This repository provides several functionalities. Below, we describe the steps needed to use the technologies developed for each of them.
Clone this repository
git clone https://github.com/ICBKRMNS/Optimizing-Dark-Web-Data-Collection.git
Crator
This folder contains the Dark Web crawler used during the research. The crawler, named Crator, is derived from the technologies developed by D. de Pascale. To use the crawler, clone this repository, navigate to the Crator folder, and install the required packages:
- Navigate to Crator folder
cd Crator
- Install the required packages
pip install -r requirements.txt
After installing the required packages, navigate to the resources folder. There, the crator.yml file can be modified to set up the crawler preferences, such as the link depth and the maximum number of crawled links. The seeds.txt file can be modified to define the seed URLs to be crawled. After setting up the preferences, the crawler can be executed by running crator.py, which is located in the python folder.
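For illustration, a typical run from the Crator folder might look as follows; the exact paths and invocation are assumptions based on the folder layout described above and may differ slightly:
- Edit the crawler preferences (the available keys are defined in crator.yml)
nano resources/crator.yml
- Edit the seed URLs, one per line
nano resources/seeds.txt
- Run the crawler from the Crator folder
python python/crator.py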
Data Preprocessing
The Data-Preprocessing folder contains the Jupyter notebooks used to create the market-specific datasets and the final dataset. These notebooks include tailor-made RegEx patterns per market, data cleaning procedures, and exploratory analysis of the data.
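As a hedged illustration of what such a market-specific RegEx extraction looks like (the pattern, field names, and example listing below are hypothetical, not the ones used in the notebooks):

import re

# Hypothetical pattern for extracting a price such as "USD 12.50" from a listing title
price_pattern = re.compile(r"(?P<currency>USD|EUR|BTC)\s?(?P<amount>\d+(?:\.\d+)?)")

listing = "Premium account bundle USD 12.50 - instant delivery"
match = price_pattern.search(listing)
if match:
    print(match.group("currency"), match.group("amount"))  # USD 12.50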
Datasets
All data utilized during this research is located in the datasets folder. The folder consists of three subfolders named dumps, final dataset, and preprocessed datasets. The dumps folder contains the raw HTML source code documents crawled from the darknet markets' (DNM) product listing pages. The final dataset folder contains the final dataset used for training the proposed models and the robustness test dataset, both in CSV format. The last folder, preprocessed datasets, contains the preprocessed dataset per DNM in CSV format.
ELMo-BiLSTM-CNN
Install Requirements The ELMo-BiLSTM-CNN model is the first approach tested in this thesis and is derived from the approach developed by Shah et al. (2022). The local ELMoForManyLangs directory is used to generate the weights.txt file, which serves as the ELMo embedding layer when training the model. To use the embeddings, the package needs to be installed by executing the following commands:
- Navigate to ELMoForManyLangs-local folder
cd ELMo-BiLSTM-CNN/ELMoForManyLangs-local
- Install the embedding package
python setup.py install
Set up the config_path As stated by Che et al. (2018), after unzipping the model, you will find a JSON file ${lang}.model/config.json. Please change the "config_path" field to the relative path to the model configuration cnn_50_100_512_4096_sample.json. For example, if your ELMo model is zht.model/config.json and your model configuration is zht.model/cnn_50_100_512_4096_sample.json, you need to change "config_path" in zht.model/config.json to cnn_50_100_512_4096_sample.json.
If there is no configuration cnn_50_100_512_4096_sample.json under ${lang}.model, you can copy the elmoformanylangs/configs/cnn_50_100_512_4096_sample.json into ${lang}.model, or change the "config_path" into elmoformanylangs/configs/cnn_50_100_512_4096_sample.json
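The following minimal sketch automates this edit; the model directory name (nl.model) is an illustrative placeholder for your unzipped ${lang}.model folder, and the target configuration path follows the fallback described above:

import json

model_dir = "nl.model"  # illustrative language code; replace with your ${lang}.model folder
config_file = f"{model_dir}/config.json"

with open(config_file) as f:
    config = json.load(f)

# Point config_path to the bundled model configuration
config["config_path"] = "elmoformanylangs/configs/cnn_50_100_512_4096_sample.json"

with open(config_file, "w") as f:
    json.dump(config, f, indent=2)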
Modelling By following the steps described in the feature embedding notebook in the ELMoForManyLangs-local folder, the weights.txt document can be obtained, which contains the ELMo embedding for each token in the corpus. We use this embedding document as an embedding layer when building the ELMo-BiLSTM-CNN model. We strongly advise training the model on an A100, since this reduces the training time by 98% compared with an Apple silicon M3 Max (using the code in the ELMoForManyLangs-snellius folder). The final model can be found in the same folder under the name elmo_model_V2.keras. The model can be loaded using the following steps:
- Navigate to ELMoForManyLangs-snellius folder
cd ELMo-BiLSTM-CNN/ELMoForManyLangs-snellius
- Load the fine-tuned model
from tensorflow import keras

model = keras.models.load_model("elmo_model_V2.keras")
- Note: when using the model, make sure to apply the data preprocessing steps described in the Jupyter notebook.
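For reference, generating per-token ELMo vectors with the elmoformanylangs package generally follows the pattern below; the model directory and sentences are illustrative, and the thesis notebook may wrap this differently when producing weights.txt:

from elmoformanylangs import Embedder

# Path to the unzipped ELMo model directory (illustrative)
embedder = Embedder("nl.model")

sentences = [["new", "account", "bundle"], ["instant", "delivery"]]
# Returns one array of shape (num_tokens, 1024) per sentence
embeddings = embedder.sents2elmo(sentences)
print(embeddings[0].shape)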
UniversalNER
Installation The second approach developed is the fine-tuned UniversalNER model introduced by Zhou et al. (2024). The UniversalNER project depends on vllm. Therefore, ensure you have gcc version 5 or later and a CUDA version between 11.0 and 11.8, as specified in the installation requirements for vllm. For installation, follow these steps:
- Navigate to UniversalNER folder
cd UniversalNER
- Install the required packages
pip install -r requirements.txt
Data preparation The fine-tuning of the UniversalNER model started with data preparation, since UniversalNER requires its data in a specific template. We therefore introduce the data preprocessing pipeline, which consists of two Jupyter notebooks located in the UniversalNER-Data-Preparation folder. Run the notebooks as follows (a hedged sketch of the resulting record format is shown after the list):
- training_data_creation.ipynb / robustness_evaluation_data_creation.ipynb
This notebook creates the required data in JSON format. It can be used to create training data or evaluation data (e.g., robustness check).
- test-train-data-creation.ipynb
This notebook creates a train-test split of the provided data for training the model and ensures that the data is formatted correctly.
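For orientation, a single training example in UniversalNER's conversation-style template looks roughly like the sketch below; the exact wording of the turns, the entity type, and the text are assumptions and should be checked against the notebooks and the UniversalNER repository:

import json

example = {
    "conversations": [
        {"from": "human", "value": "Text: Premium vendor ships worldwide, 5g for 40 EUR."},
        {"from": "gpt", "value": "I've read this text."},
        {"from": "human", "value": "What describes price in the text?"},
        {"from": "gpt", "value": "[\"40 EUR\"]"},
    ]
}

with open("example.json", "w") as f:
    json.dump([example], f, indent=2)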
Fine-tuning UniversalNER The fine-tuning of UniversalNER is done differently than advised on the UniversalNER project page (the original page can be found here). To make the fine-tuning process more efficient, we used Low-Rank Adaptation of Large Language Models (LoRA), introduced by Hu et al. (2021), through the PEFT library. With this approach, we needed 1x A100 40GB instead of 8x A100 80GB, a significant efficiency gain. The fine-tuning notebook is placed in the UniversalNER-Finetuning folder, and it uses the data in the data folder, which is obtained via the previous data preparation step.
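A minimal sketch of attaching a LoRA adapter with PEFT is shown below; the base model name, rank, and target modules are illustrative assumptions, not the exact settings used in the fine-tuning notebook:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base checkpoint is illustrative; the notebook loads the UniversalNER model used in the thesis
base_model = AutoModelForCausalLM.from_pretrained("Universal-NER/UniNER-7B-type")

lora_config = LoraConfig(
    r=8,                                   # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # illustrative target modules
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()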
Evaluation UniversalNER The assets for creating and evaluating the fine-tuned model are located in the UniversalNER-Validation folder. The notebook shows how to merge the base model UniversalNER7B-type with the LoRA layer and how to evaluate the model using the following command:
!python -m src.eval.evaluate \
--model_path ./universalNer-ft-V3 \
--data_path ./src/test-dataset-final.json \
--tensor_parallel_size 1
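As a hedged sketch of the merge step mentioned above (the base model name and adapter path are illustrative; the notebook in UniversalNER-Validation is the authoritative version):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Universal-NER/UniNER-7B-type")  # illustrative base model
peft_model = PeftModel.from_pretrained(base, "./universalNer-ft-V3-lora")    # illustrative adapter path

# Fold the LoRA weights into the base model and save a standalone checkpoint
merged = peft_model.merge_and_unload()
merged.save_pretrained("./universalNer-ft-V3")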
GLiNER
Installation The last approach evaluated is the fine-tuned GLiNER approach. GLiNER is a Named Entity Recognition (NER) model developed by Zaratiana et al. (2023). It can identify any entity type using a bidirectional transformer encoder (BERT-like). To begin using the GLiNER model, first install the required packages:
- Navigate to GLiNER/GLiNER folder
cd GLiNER/GLiNER
- Install the required packages
pip install -r requirements.txt
Data preparation & modeling GLiNER requires the data to be formatted in a specific way. The data preparation is included in the same notebook that is used for training the model; the preprocessing steps are presented in the Jupyter notebook in the GLiNER folder. The required template consists of the tokenized sentences in a dictionary with the position of each annotated entity. The preprocessing produces a data file called data.json (robustness-data.json for the robustness test). After the data preparation, we can use the JSON data to fine-tune the GLiNER model. We tuned different GLiNER variants with different hyperparameter settings and saved the best-performing ones as Pickle files.
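For orientation, a single GLiNER training record generally looks like the sketch below, with token-level start and end indices per entity; the tokens, labels, and exact field names are illustrative and should be checked against the notebook:

example = {
    "tokenized_text": ["Premium", "vendor", "ships", "5g", "for", "40", "EUR"],
    # Each entry: [start_token_index, end_token_index, entity_label]
    "ner": [
        [3, 3, "quantity"],
        [5, 6, "price"],
    ],
}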
Models The fine-tuned models are stored in the GLiNER-Models-V2 folder and can be loaded via the following commands:
- Import Pickle and load model
import pickle

with open("GLiNER-Large-FT-500-epoch-V2.pkl", "rb") as f:
    load_model = pickle.load(f)
Note Recently, GLiNER has been updated to v2.1. With this new version, the fine-tuned model can be saved and loaded as follows:
trainer.model.save_pretrained("GLiNER-Large-FT-500-epoch-V2")
md = GLiNER.from_pretrained("GLiNER-Large-FT-500-epoch-V2", local_files_only=True)
This new approach ran into out-of-memory (OOM) issues in our setup, so we could not include it in our research. The notebook for this approach is stored in the archive.
Sources
Shah, S. A. A., Masood, M. A., & Yasin, A. (2022). Dark Web: E-Commerce Information Extraction Based on Name Entity Recognition Using Bidirectional-LSTM. IEEE Access, 10. https://doi.org/10.1109/ACCESS.2022.3206539.
Che, W., Liu, Y., Wang, Y., Zheng, B., & Liu, T. (2018). Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 55-64). Brussels, Belgium: Association for Computational Linguistics. http://www.aclweb.org/anthology/K18-2005.
Zhou, W., Zhang, S., Gu, Y., Chen, M., & Poon, H. (2024). UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. ICLR 2024. arXiv:2308.03279 [cs.CL]. https://doi.org/10.48550/arXiv.2308.03279.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL]. https://arxiv.org/abs/2106.09685.
Zaratiana, U., Tomeh, N., Holat, P., & Charnois, T. (2023). GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer. arXiv:2311.08526 [cs.CL]. https://doi.org/10.48550/arXiv.2311.08526.