ByteFlow Net - 1 Level Regex Rate Distortion
This is a ByteFlow Net model trained with regex rate distortion chunking.
Model Details
- Model Name: BFlowNet_1B_1levels_regex_rate_100bt
- Architecture: BFlowNet with adaptive layers
- Parameters: ~1B
- Training Step: 350,000
- Sequence Length: 8192
- Vocabulary Size: 258
Architecture Details
Model Configuration
- Dimensions: [512, 2048]
- Head Dimensions: [64, 128]
- Layers: [6, 24]
- Sliding Windows: [512, 4096]
- Max Sequence Lengths: [8192, 3200]
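The per-stage lists above can be cross-checked with a few lines of Python. This is a minimal sketch, not part of the training code: it assumes that each list has one entry per stage and that the number of attention heads is the model dimension divided by the head dimension, which is the usual convention but is not stated in this card. The function name `heads_per_stage` is hypothetical.

```python
# Per-stage values copied from the Model Configuration list.
dims = [512, 2048]
head_dims = [64, 128]
layers = [6, 24]
sliding_windows = [512, 4096]


def heads_per_stage(dims, head_dims):
    """Return dim // head_dim per stage (assumed convention), checking divisibility."""
    heads = []
    for dim, head_dim in zip(dims, head_dims):
        if dim % head_dim != 0:
            raise ValueError(f"dim {dim} is not divisible by head_dim {head_dim}")
        heads.append(dim // head_dim)
    return heads


n_heads = heads_per_stage(dims, head_dims)
for stage, (d, h, n, w) in enumerate(zip(dims, n_heads, layers, sliding_windows)):
    print(f"stage {stage}: dim={d}, heads={h}, layers={n}, window={w}")
```

Under that assumption the two stages would use 8 and 16 heads respectively.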
Block Configuration
- Dimension: 512
- Number of Layers: 8
- RoPE Theta: 500000.0
- Norm Epsilon: 1e-05
Training Details
Data
- Dataset: fineweb_edu_100bt
- Batch Size: 19
- Tokenizer: bytes
- Chunking Type: regex_rate_distortion
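A byte-level tokenizer with a vocabulary of 258 is consistent with 256 byte values plus two special tokens. The sketch below illustrates that layout; the special-token IDs, their roles (`BOS_ID`, `EOS_ID`), and the offset of 2 are assumptions made for illustration and may not match the tokenizer used in training.

```python
# Hypothetical ID layout: two special tokens first, then the 256 byte values.
BOS_ID = 0  # assumed beginning-of-sequence ID
EOS_ID = 1  # assumed end-of-sequence ID
OFFSET = 2  # byte value b maps to ID b + OFFSET
VOCAB_SIZE = 256 + OFFSET  # 258, matching the vocabulary size in this card


def encode(text, add_bos=True, add_eos=True):
    """Map text to byte-level IDs, optionally wrapped in special tokens."""
    ids = [b + OFFSET for b in text.encode("utf-8")]
    if add_bos:
        ids.insert(0, BOS_ID)
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode(ids):
    """Drop special tokens and decode the remaining bytes as UTF-8."""
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="replace")


ids = encode("héllo")
print(len(ids), decode(ids))
```

Because the model operates on bytes, a non-ASCII character such as "é" occupies two positions in the sequence.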
Optimization
- Learning Rate: 0.0006
- Weight Decay: 0.1
- Scheduler: cosine
- Warmup Steps: 10000
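The optimization settings above can be turned into a learning-rate curve. The following is a rough sketch under several assumptions that this card does not confirm: warmup is linear from zero, the cosine decay ends at the reported training step (350,000), and the learning rate decays to zero (`MIN_LR_RATIO = 0.0`). The actual values are in config.yaml.

```python
import math

PEAK_LR = 6e-4          # Learning Rate from this card
WARMUP_STEPS = 10_000   # Warmup Steps from this card
TOTAL_STEPS = 350_000   # assumption: schedule ends at the reported training step
MIN_LR_RATIO = 0.0      # assumption: final LR as a fraction of the peak


def lr_at(step):
    """Linear warmup followed by cosine decay (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


for step in (0, 5_000, 10_000, 180_000, 350_000):
    print(step, lr_at(step))
```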
Distributed Training
- Data Parallel Replicas: 8
- Model Dtype: bf16
- FSDP Type: full_shard
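Combining the batch size, sequence length, and replica count gives an estimate of how much data each optimizer step consumes. This sketch assumes that the batch size of 19 is per data-parallel replica and that no gradient accumulation is used; neither is stated in this card, so the figures are estimates only.

```python
BATCH_SIZE = 19           # assumed to be per data-parallel replica
SEQ_LEN = 8192            # bytes per sequence (byte-level tokenizer)
DP_REPLICAS = 8
TRAINING_STEPS = 350_000

bytes_per_replica_step = BATCH_SIZE * SEQ_LEN
bytes_per_step = bytes_per_replica_step * DP_REPLICAS
total_bytes = bytes_per_step * TRAINING_STEPS

print(f"bytes per replica per step: {bytes_per_replica_step:,}")
print(f"bytes per global step:      {bytes_per_step:,}")
print(f"bytes over training:        {total_bytes:,}")
```

Under these assumptions each global step covers about 1.2 million bytes, and the full run covers roughly 436 billion bytes.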
Usage
This model uses a custom AUNet architecture with regex rate distortion chunking. The checkpoint contains distributed (sharded) model weights, which must be loaded with the same framework that was used for training.
# Example loading code would go here
# Note: This requires the specific AUNet framework used for training
Evaluation Tasks
The model was evaluated on the following tasks:
- hellaswag
- boolq
- piqa
- social_iqa
- winogrande
- openbookqa
- arc_easy
- arc_challenge
- race
- commonsense_qa
- copa
Training Configuration
The complete training configuration is preserved in the uploaded files.
Files Description
- *.distcp: Distributed checkpoint files containing model weights
- params.json: Model parameters and configuration
- train_state_*.json: Training state information, including optimizer states
- config.yaml: Complete training configuration
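Before attempting to load the weights, it can help to confirm that a downloaded checkpoint contains the files described above. The sketch below uses only the Python standard library and assumes all files sit in a single directory; the function name `inventory_checkpoint` and the example path are hypothetical, and the script does not load any weights.

```python
import json
from pathlib import Path


def inventory_checkpoint(ckpt_dir):
    """List the expected checkpoint files found in ckpt_dir (assumed flat layout)."""
    ckpt_dir = Path(ckpt_dir)
    params_path = ckpt_dir / "params.json"
    return {
        # Sharded weight files written by the distributed checkpointer.
        "shards": sorted(p.name for p in ckpt_dir.glob("*.distcp")),
        # Training state files, which include optimizer state.
        "train_states": sorted(p.name for p in ckpt_dir.glob("train_state_*.json")),
        # Parsed model parameters, or None if params.json is absent.
        "params": json.loads(params_path.read_text()) if params_path.exists() else None,
        "has_config": (ckpt_dir / "config.yaml").exists(),
    }


if __name__ == "__main__":
    # Hypothetical local path; replace with the actual download location.
    print(inventory_checkpoint("./checkpoint"))
```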
Citation
If you use this model, please cite the AUNet paper and methodology.