CyberMetric Dataset Processing Script
This repository contains the processing script used to convert the original CyberMetric JSON files into well-structured Hugging Face datasets with comprehensive documentation.
🎯 Overview
The script processes 4 different CyberMetric JSON files and uploads them as separate, documented datasets:
- cybermetric_80_v1 - 80 cybersecurity questions (quick evaluation)
- cybermetric_500_v1 - 500 cybersecurity questions (standard benchmark)
- cybermetric_2000_v1 - 2,000 cybersecurity questions (comprehensive evaluation)
- cybermetric_10000_v1 - 10,180 cybersecurity questions (extensive training set)
📊 Processed Datasets
All processed datasets are available at: https://huggingface.co/tuandunghcmut
🔍 Dataset Content
Topic Coverage
- 🔐 Cryptography: Random Bit Generators, Key Derivation Functions, Encryption algorithms
- 💳 PCI DSS: Payment Card Industry Data Security Standards compliance
- 🛡️ Security Controls: Access controls, privilege management, authentication mechanisms
- 🔍 Threat Intelligence: Attack patterns, security frameworks, vulnerability management
- ⚠️ Risk Assessment: Security weaknesses, mitigations, and incident response
Question Examples
Cryptography:
Q: What is the primary requirement for a Random Bit Generator's (RBG) output to be used for generating cryptographic keys?
A) Length matching target data
B) Computationally indistinguishable from random bits with sufficient entropy ✓
C) Maximum possible length
D) Exact symmetric key length
PCI DSS:
Q: What is the primary purpose of segmentation in the context of PCI DSS?
A) Reduce applicable requirements
B) Limit assessment scope and minimize breach potential ✓
C) Remove PCI DSS applicability
D) Eliminate need for controls
🚀 Usage
Prerequisites
pip install datasets huggingface_hub
Authentication
huggingface-cli login
# or
hf auth login
Running the Script
- Clone the original CyberMetric repository:
git clone https://github.com/cybermetric/CyberMetric.git cyber_metric
- Run the processing script:
python process_cybermetric_with_docs.py --username YOUR_HF_USERNAME
Command Line Options
- --username: Your Hugging Face username (default: tuandunghcmut)
- --token: Hugging Face token (optional if already logged in)
- --data-dir: Path to CyberMetric data directory (default: cyber_metric)
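The options above map naturally onto Python's `argparse`. The following is a minimal sketch of what such a parser could look like, not the script's actual code: the helper name `build_parser` is hypothetical, and only the flag names and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Process CyberMetric JSON files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", default="tuandunghcmut",
                        help="Hugging Face username that will own the datasets")
    parser.add_argument("--token", default=None,
                        help="Hugging Face token (optional if already logged in)")
    parser.add_argument("--data-dir", default="cyber_metric",
                        help="Path to the CyberMetric data directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
# argparse exposes --data-dir as the attribute data_dir.
print(args.username, args.data_dir)
```

Leaving `--token` at `None` lets the Hugging Face libraries fall back to the credentials stored by the login command.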
🔧 Features
Data Processing
- ✅ Consistent Schema: Standardized field structure across all datasets
- ✅ Rich Metadata: Includes formatted questions and answer options
- ✅ Quality Assurance: Validates JSON structure and question format
- ✅ Scalable Processing: Handles datasets from 80 to 10,000+ questions
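To illustrate the quality-assurance step, here is a minimal validation sketch. It assumes each raw file holds a top-level `questions` list whose items carry `question`, an `answers` object keyed `A` to `D`, and a `solution` letter. That layout and the function names `validate_record` and `load_valid_questions` are assumptions for illustration, not confirmed details of the script.

```python
import json
from typing import Any

REQUIRED_OPTIONS = ("A", "B", "C", "D")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return the problems found in one raw question (empty list if valid)."""
    problems = []
    question = record.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("missing or empty 'question'")
    answers = record.get("answers")
    if not isinstance(answers, dict):
        problems.append("'answers' is not an object")
    else:
        for key in REQUIRED_OPTIONS:
            if not isinstance(answers.get(key), str):
                problems.append(f"missing option {key}")
    if record.get("solution") not in REQUIRED_OPTIONS:
        problems.append("'solution' is not one of A, B, C, D")
    return problems


def load_valid_questions(path: str) -> list[dict[str, Any]]:
    """Load a raw file (assumed layout) and keep only records that validate."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    return [r for r in payload["questions"] if not validate_record(r)]
```

Returning a list of problems rather than raising on the first one makes it easier to log every defect in a large file in a single pass.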
Documentation
- 📚 Comprehensive READMEs: Each dataset includes detailed documentation:
  - Dataset statistics and topic coverage
  - Usage examples and code snippets
  - Question format and structure explanations
  - Application scenarios and use cases
  - Citation information
- 🎯 Topic-Specific Details: Covers cryptography, PCI DSS, security controls
- 📖 Ready-to-Use Examples: Python code for immediate dataset usage
Upload Features
- 🚀 Automated Processing: Handles all 4 datasets in sequence
- 📤 Smart Upload: Checks for existing datasets and updates appropriately
- 📝 Documentation Integration: Automatically generates and uploads README files
- ⚡ Progress Monitoring: Detailed logging with upload statistics
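The upload flow described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the script's real implementation: the upstream file names in `SOURCES` and the helpers `repo_id_for` and `upload_all` are hypothetical. The Hub calls used are `HfApi.repo_exists`, `Dataset.from_list`, and `Dataset.push_to_hub`.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cybermetric")

# Assumed mapping from upstream file names to target dataset names.
SOURCES = {
    "CyberMetric-80-v1.json": "cybermetric_80_v1",
    "CyberMetric-500-v1.json": "cybermetric_500_v1",
    "CyberMetric-2000-v1.json": "cybermetric_2000_v1",
    "CyberMetric-10000-v1.json": "cybermetric_10000_v1",
}


def repo_id_for(username: str, dataset_name: str) -> str:
    """Build the Hub repository id, e.g. 'user/cybermetric_80_v1'."""
    return f"{username}/{dataset_name}"


def upload_all(rows_by_file: dict, username: str, token=None) -> None:
    """Push each processed dataset in sequence, logging what happens."""
    # Imported lazily so the module loads without network-facing dependencies.
    from datasets import Dataset
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    for file_name, dataset_name in SOURCES.items():
        rows = rows_by_file[file_name]
        repo_id = repo_id_for(username, dataset_name)
        # Check whether the dataset repository already exists on the Hub.
        exists = api.repo_exists(repo_id, repo_type="dataset")
        action = "Updating" if exists else "Creating"
        log.info("%s %s with %d questions", action, repo_id, len(rows))
        Dataset.from_list(rows).push_to_hub(repo_id, token=token)
```

Here `rows_by_file` is expected to map each file name to a list of already-processed question dictionaries. `upload_all` needs network access and valid credentials, so it is defined but not called in the sketch.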
📁 Dataset Structure
Each processed dataset contains these fields:
{
    'question': str,            # The cybersecurity question text
    'option_a': str,            # Multiple choice option A
    'option_b': str,            # Multiple choice option B
    'option_c': str,            # Multiple choice option C
    'option_d': str,            # Multiple choice option D
    'correct_answer': str,      # Correct answer (A, B, C, or D)
    'all_options': List[str],   # Formatted list of all options
    'formatted_question': str   # Complete question with options and answer
}
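As a rough illustration of how one raw question might be mapped onto these eight fields, consider the sketch below. The raw layout (`question`, `answers` keyed `A` to `D`, `solution`), the helper name `to_schema`, and the exact text layout of `all_options` and `formatted_question` are all assumptions, since the precise formatting is not specified here.

```python
from typing import Any

LETTERS = ("A", "B", "C", "D")


def to_schema(raw: dict[str, Any]) -> dict[str, Any]:
    """Map one raw question (assumed layout) onto the documented fields."""
    answers = raw["answers"]
    # Assumed formatting: one "A) text" string per option.
    all_options = [f"{letter}) {answers[letter]}" for letter in LETTERS]
    # Assumed formatting: question, options, then the answer, one per line.
    formatted = "\n".join(
        [f"Question: {raw['question']}", *all_options, f"Answer: {raw['solution']}"]
    )
    return {
        "question": raw["question"],
        "option_a": answers["A"],
        "option_b": answers["B"],
        "option_c": answers["C"],
        "option_d": answers["D"],
        "correct_answer": raw["solution"],
        "all_options": all_options,
        "formatted_question": formatted,
    }


example = to_schema({
    "question": "What is the primary purpose of segmentation in the context of PCI DSS?",
    "answers": {
        "A": "Reduce applicable requirements",
        "B": "Limit assessment scope and minimize breach potential",
        "C": "Remove PCI DSS applicability",
        "D": "Eliminate need for controls",
    },
    "solution": "B",
})
print(example["formatted_question"])
```

Keeping the individual `option_*` columns alongside the pre-formatted strings lets consumers either build their own prompts or use the ready-made text.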
Usage Example
from datasets import load_dataset
# Load any CyberMetric dataset
dataset = load_dataset("tuandunghcmut/cybermetric_500_v1")
# Access a question
sample = dataset['train'][0]
print(f"Question: {sample['question']}")
print("Options:")
# chr(64 + i) maps 1..4 to the letters A..D
for i, option_key in enumerate(['option_a', 'option_b', 'option_c', 'option_d'], 1):
    print(f"  {chr(64 + i)}) {sample[option_key]}")
print(f"Correct Answer: {sample['correct_answer']}")
# Use the pre-formatted version
print("\n" + sample['formatted_question'])
📊 Dataset Statistics
| Dataset | Questions | Topics | Difficulty | Use Case |
|---|---|---|---|---|
| 80 V1 | 80 | Core concepts | Professional | Quick evaluation |
| 500 V1 | 500 | Comprehensive | Professional | Standard benchmark |
| 2000 V1 | 2,000 | Extensive | Professional | Comprehensive testing |
| 10000 V1 | 10,180 | Complete | Professional | Training & research |
Total: 12,760 cybersecurity questions 🔐
🎓 Applications
These datasets are well suited to:
- ✅ Cybersecurity Education: Professional training and certification prep
- ✅ Model Evaluation: Testing LLM cybersecurity knowledge and reasoning
- ✅ Benchmark Development: Creating standardized security assessments
- ✅ Research: Academic studies in AI security and knowledge evaluation
- ✅ Certification Prep: CISSP, CISM, CISA, and other security certifications
- ✅ Corporate Training: Employee security awareness and skill assessment
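For the model-evaluation use case, a minimal accuracy scorer might look like the sketch below. The `predict` callable is a hypothetical stand-in for whatever model is being tested, and the toy rows are illustrative rather than real dataset entries. Only the `correct_answer` field name comes from the dataset structure.

```python
from typing import Callable, Iterable, Mapping


def accuracy(rows: Iterable[Mapping[str, str]],
             predict: Callable[[Mapping[str, str]], str]) -> float:
    """Fraction of rows where the predicted letter matches correct_answer."""
    total = 0
    correct = 0
    for row in rows:
        total += 1
        # Normalize case and whitespace before comparing letters.
        guess = predict(row).strip().upper()
        if guess == row["correct_answer"].strip().upper():
            correct += 1
    return correct / total if total else 0.0


# Toy rows and a trivial baseline that always answers "B".
rows = [
    {"question": "q1", "correct_answer": "B"},
    {"question": "q2", "correct_answer": "A"},
]


def always_b(row: Mapping[str, str]) -> str:
    return "B"


print(accuracy(rows, always_b))  # 0.5
```

With a real dataset, the `train` split returned by `load_dataset` can be passed as `rows`, and `predict` would wrap a call to the model under test.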
🏆 Quality Assurance
- Expert Curated: Questions developed by cybersecurity professionals
- Standards Aligned: Based on industry frameworks (NIST, ISO 27001, PCI DSS)
- Difficulty Calibrated: Professional certification level complexity
- Comprehensive Coverage: Major cybersecurity domains and practices
- Regular Updates: Maintained with current threat landscape
📄 Citation
If you use these datasets in your research or applications, please cite:
@misc{cybermetric2024,
title={CyberMetric: Cybersecurity Knowledge Assessment Datasets},
author={CyberMetric Contributors},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/tuandunghcmut}
}
🔗 Original Source
This processing script is based on the CyberMetric dataset from:
- Repository: https://github.com/cybermetric/CyberMetric
- Paper: CyberMetric benchmark suite for cybersecurity knowledge evaluation
🤝 Contributing
Contributions are welcome! Please feel free to:
- Submit bug reports or feature requests
- Improve documentation or code quality
- Add support for additional formats or features
📜 License
This script and processed datasets are released under the same license terms as the original CyberMetric repository.
🔐 Ready for cybersecurity knowledge evaluation and professional training!
Empowering the next generation of cybersecurity professionals through comprehensive, high-quality assessment data. 🛡️