
CyberMetric Dataset Processing Script

This repository contains the processing script used to convert the original CyberMetric JSON files into well-structured Hugging Face datasets with comprehensive documentation.

🎯 Overview

The script processes 4 different CyberMetric JSON files and uploads them as separate, documented datasets:

  1. cybermetric_80_v1 - 80 cybersecurity questions (quick evaluation)
  2. cybermetric_500_v1 - 500 cybersecurity questions (standard benchmark)
  3. cybermetric_2000_v1 - 2,000 cybersecurity questions (comprehensive evaluation)
  4. cybermetric_10000_v1 - 10,180 cybersecurity questions (extensive training set)

📊 Processed Datasets

All processed datasets are available under the Hugging Face account tuandunghcmut (https://huggingface.co/tuandunghcmut).

🔍 Dataset Content

Topic Coverage

  • 🔐 Cryptography: Random Bit Generators, Key Derivation Functions, Encryption algorithms
  • 💳 PCI DSS: Payment Card Industry Data Security Standards compliance
  • 🛡️ Security Controls: Access controls, privilege management, authentication mechanisms
  • 🔍 Threat Intelligence: Attack patterns, security frameworks, vulnerability management
  • ⚠️ Risk Assessment: Security weaknesses, mitigations, and incident response

Question Examples

Cryptography:

Q: What is the primary requirement for a Random Bit Generator's (RBG) output to be used for generating cryptographic keys?
A) Length matching target data
B) Computationally indistinguishable from random bits with sufficient entropy ✓
C) Maximum possible length
D) Exact symmetric key length

PCI DSS:

Q: What is the primary purpose of segmentation in the context of PCI DSS?
A) Reduce applicable requirements
B) Limit assessment scope and minimize breach potential ✓
C) Remove PCI DSS applicability
D) Eliminate need for controls

🚀 Usage

Prerequisites

pip install datasets huggingface_hub

Authentication

huggingface-cli login
# or
hf auth login

Running the Script

  1. Clone the original CyberMetric repository:
git clone https://github.com/cybermetric/CyberMetric.git cyber_metric
  2. Run the processing script:
python process_cybermetric_with_docs.py --username YOUR_HF_USERNAME

Command Line Options

  • --username: Your Hugging Face username (default: tuandunghcmut)
  • --token: Hugging Face token (optional if already logged in)
  • --data-dir: Path to CyberMetric data directory (default: cyber_metric)
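
The options above could be declared with argparse roughly as follows. This is a minimal sketch, not the script's actual source: the function name build_parser and the help strings are assumptions, and only the three flags and their defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the three documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Process CyberMetric JSON files and upload them to the Hugging Face Hub."
    )
    parser.add_argument(
        "--username",
        default="tuandunghcmut",
        help="Hugging Face username that will own the uploaded datasets",
    )
    parser.add_argument(
        "--token",
        default=None,
        help="Hugging Face token (optional if already logged in)",
    )
    parser.add_argument(
        "--data-dir",
        default="cyber_metric",
        help="Path to the CyberMetric data directory",
    )
    return parser


# Parse an explicit argument list so the example is deterministic.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(f"username={args.username} data_dir={args.data_dir} token_set={args.token is not None}")
```

argparse stores --data-dir as args.data_dir, so the hyphen in the flag becomes an underscore in the attribute name.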

🔧 Features

Data Processing

  • Consistent Schema: Standardized field structure across all datasets
  • Rich Metadata: Includes formatted questions and answer options
  • Quality Assurance: Validates JSON structure and question format
  • Scalable Processing: Handles datasets from 80 to 10,000+ questions
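
As an illustration of the kind of check that "validates JSON structure and question format" might involve, here is a hedged sketch. It assumes each raw file has a top-level "questions" list whose records carry "question", "answers" (keyed A to D), and "solution"; that layout, the function name validate_record, and the sample records are assumptions for illustration, not taken from the script.

```python
import json

REQUIRED_OPTIONS = ("A", "B", "C", "D")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one raw record (empty if it looks valid)."""
    problems = []
    question = record.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("missing or empty 'question'")
    answers = record.get("answers")
    if not isinstance(answers, dict):
        problems.append("'answers' is not a mapping")
    else:
        for key in REQUIRED_OPTIONS:
            value = answers.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"missing or empty option {key}")
    if record.get("solution") not in REQUIRED_OPTIONS:
        problems.append("'solution' is not one of A, B, C, D")
    return problems


# Made-up records in the assumed raw layout; not taken from CyberMetric.
raw = json.loads("""
{"questions": [
  {"question": "Placeholder question?",
   "answers": {"A": "first", "B": "second", "C": "third", "D": "fourth"},
   "solution": "B"},
  {"question": "", "answers": {"A": "only one option"}, "solution": "E"}
]}
""")

for index, record in enumerate(raw["questions"]):
    print(index, validate_record(record) or "ok")
```

Returning a list of problems instead of raising on the first one makes it easier to log every malformed record in a single pass over a large file.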

Documentation

  • 📚 Comprehensive READMEs: Each dataset includes detailed documentation:
    • Dataset statistics and topic coverage
    • Usage examples and code snippets
    • Question format and structure explanations
    • Application scenarios and use cases
    • Citation information
  • 🎯 Topic-Specific Details: Covers cryptography, PCI DSS, security controls
  • 📖 Ready-to-Use Examples: Python code for immediate dataset usage

Upload Features

  • 🚀 Automated Processing: Handles all 4 datasets in sequence
  • 📤 Smart Upload: Checks for existing datasets and updates appropriately
  • 📝 Documentation Integration: Automatically generates and uploads README files
  • Progress Monitoring: Detailed logging with upload statistics

📁 Dataset Structure

Each processed dataset contains these fields:

{
    'question': str,              # The cybersecurity question text
    'option_a': str,              # Multiple choice option A
    'option_b': str,              # Multiple choice option B  
    'option_c': str,              # Multiple choice option C
    'option_d': str,              # Multiple choice option D
    'correct_answer': str,        # Correct answer (A, B, C, or D)
    'all_options': List[str],     # Formatted list of all options
    'formatted_question': str     # Complete question with options and answer
}
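
To make the field list concrete, the sketch below builds one processed row from a raw record. The raw layout ("question", "answers" keyed A to D, "solution"), the helper name to_processed_row, and the exact text layout of all_options and formatted_question are assumptions; only the output field names come from the structure above.

```python
LETTERS = ("A", "B", "C", "D")


def to_processed_row(record: dict) -> dict:
    """Map one raw record (assumed layout) onto the documented output fields."""
    answers = record["answers"]
    # Assumed formatting: "A) text" per option.
    all_options = [f"{letter}) {answers[letter]}" for letter in LETTERS]
    # Assumed formatting: question, options, then the answer, one per line.
    formatted_question = "\n".join(
        [f"Question: {record['question']}", *all_options,
         f"Correct Answer: {record['solution']}"]
    )
    return {
        "question": record["question"],
        "option_a": answers["A"],
        "option_b": answers["B"],
        "option_c": answers["C"],
        "option_d": answers["D"],
        "correct_answer": record["solution"],
        "all_options": all_options,
        "formatted_question": formatted_question,
    }


# Made-up record for illustration; not taken from CyberMetric.
example = {
    "question": "Placeholder question?",
    "answers": {"A": "first", "B": "second", "C": "third", "D": "fourth"},
    "solution": "B",
}
row = to_processed_row(example)
print(row["formatted_question"])
```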

Usage Example

from datasets import load_dataset

# Load any CyberMetric dataset
dataset = load_dataset("tuandunghcmut/cybermetric_500_v1")

# Access a question
sample = dataset['train'][0]
print(f"Question: {sample['question']}")
print("Options:")
for letter in "abcd":
    print(f"  {letter.upper()}) {sample[f'option_{letter}']}")
print(f"Correct Answer: {sample['correct_answer']}")

# Use formatted version
print("\n" + sample['formatted_question'])
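
For model evaluation, rows like the one loaded above can be scored with a simple accuracy loop. This is a sketch under stated assumptions: build_prompt, accuracy, and always_b are hypothetical names, the prompt wording is arbitrary, and the demo rows are made-up placeholders that only mimic the documented fields so the example runs without network access.

```python
from typing import Callable, Iterable


def build_prompt(row: dict) -> str:
    """Render one processed row as a multiple-choice prompt without the answer."""
    lines = [row["question"]]
    for letter in "ABCD":
        lines.append(f"{letter}) {row['option_' + letter.lower()]}")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def accuracy(rows: Iterable[dict], predict: Callable[[str], str]) -> float:
    """Fraction of rows where the first letter of the prediction matches correct_answer."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(
        predict(build_prompt(row)).strip().upper()[:1] == row["correct_answer"]
        for row in rows
    )
    return hits / len(rows)


# Made-up rows standing in for dataset['train']; not taken from CyberMetric.
demo_rows = [
    {"question": "Placeholder question 1?", "option_a": "w", "option_b": "x",
     "option_c": "y", "option_d": "z", "correct_answer": "B"},
    {"question": "Placeholder question 2?", "option_a": "w", "option_b": "x",
     "option_c": "y", "option_d": "z", "correct_answer": "D"},
]


def always_b(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "B"


print(f"Accuracy: {accuracy(demo_rows, always_b):.2f}")
```

Replacing always_b with a function that queries a model, and demo_rows with dataset['train'], gives a basic benchmark loop. The prompt deliberately leaves out formatted_question, since that field includes the answer.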

📊 Dataset Statistics

Dataset  | Questions | Topics        | Difficulty   | Use Case
---------|-----------|---------------|--------------|----------------------
80 V1    | 80        | Core concepts | Professional | Quick evaluation
500 V1   | 500       | Comprehensive | Professional | Standard benchmark
2000 V1  | 2,000     | Extensive     | Professional | Comprehensive testing
10000 V1 | 10,180    | Complete      | Professional | Training & research

Total: 12,760 cybersecurity questions 🔐

🎓 Applications

These datasets are suited to:

  • Cybersecurity Education: Professional training and certification prep
  • Model Evaluation: Testing LLM cybersecurity knowledge and reasoning
  • Benchmark Development: Creating standardized security assessments
  • Research: Academic studies in AI security and knowledge evaluation
  • Certification Prep: CISSP, CISM, CISA, and other security certifications
  • Corporate Training: Employee security awareness and skill assessment

🏆 Quality Assurance

  • Expert Curated: Questions developed by cybersecurity professionals
  • Standards Aligned: Based on industry frameworks (NIST, ISO 27001, PCI DSS)
  • Difficulty Calibrated: Professional certification level complexity
  • Comprehensive Coverage: Major cybersecurity domains and practices
  • Regular Updates: Maintained to reflect the current threat landscape

📄 Citation

If you use these datasets in your research or applications, please cite:

@misc{cybermetric2024,
  title={CyberMetric: Cybersecurity Knowledge Assessment Datasets},
  author={CyberMetric Contributors},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/tuandunghcmut}
}

🔗 Original Source

This processing script is based on the CyberMetric dataset from:

🤝 Contributing

Contributions are welcome! Please feel free to:

  • Submit bug reports or feature requests
  • Improve documentation or code quality
  • Add support for additional formats or features

📜 License

This script and processed datasets are released under the same license terms as the original CyberMetric repository.


🔐 Ready for cybersecurity knowledge evaluation and professional training!

Empowering the next generation of cybersecurity professionals through comprehensive, high-quality assessment data. 🛡️
