ByteFlow Net - 1 Level Regex Rate Distortion

This is a ByteFlow Net trained with regex rate distortion chunking.

Model Details

Model Name: BFlowNet_1B_1levels_regex_rate_100bt
Architecture: BFlowNet with adaptive layers
Parameters: ~1B
Training Step: 350,000
Sequence Length: 8192
Vocabulary Size: 258

Architecture Details

Model Configuration

Dimensions: [512, 2048]
Head Dimensions: [64, 128]
Layers: [6, 24]
Sliding Windows: [512, 4096]
Max Sequence Lengths: [8192, 3200]

Block Configuration

Dimension: 512
Number of Layers: 8
RoPE Theta: 500000.0
Norm Epsilon: 1e-05
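
To illustrate what the RoPE Theta value controls, here is a minimal sketch of the standard rotary-embedding frequency schedule. It assumes the conventional RoPE formulation and a head dimension of 64 (taken from the first entry of Head Dimensions, which may not be what the block actually uses); the function names are hypothetical and are not taken from the training code.

```python
ROPE_THETA = 500_000.0  # from "RoPE Theta" above
HEAD_DIM = 64           # assumption: first entry of "Head Dimensions"


def rope_inv_freq(head_dim: int = HEAD_DIM, theta: float = ROPE_THETA) -> list[float]:
    """Inverse frequencies theta^(-2i/d), one per pair of channels (standard RoPE)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_angles(position: int, head_dim: int = HEAD_DIM, theta: float = ROPE_THETA) -> list[float]:
    """Rotation angle applied to each channel pair at a given sequence position."""
    return [position * f for f in rope_inv_freq(head_dim, theta)]


if __name__ == "__main__":
    freqs = rope_inv_freq()
    print(f"{len(freqs)} channel pairs, slowest frequency {freqs[-1]:.3e}")
```

A larger theta lowers the slowest frequencies, so distant positions stay distinguishable over longer contexts.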

Training Details

Data

Dataset: fineweb_edu_100bt
Batch Size: 19
Tokenizer: bytes
Chunking Type: regex_rate_distortion
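
The byte tokenizer and the vocabulary size of 258 can be illustrated with a small sketch. It assumes that IDs 0 to 255 are raw UTF-8 byte values and that the two remaining IDs are special tokens; the BOS_ID and EOS_ID names and their positions are hypothetical, and the actual layout used in training may differ.

```python
VOCAB_SIZE = 258  # from "Vocabulary Size" in Model Details
BOS_ID = 256      # hypothetical: beginning-of-sequence marker
EOS_ID = 257      # hypothetical: end-of-sequence marker


def encode(text: str, add_bos: bool = True, add_eos: bool = True) -> list[int]:
    """Map text to byte IDs, optionally wrapped in the assumed special tokens."""
    ids = list(text.encode("utf-8"))
    if add_bos:
        ids.insert(0, BOS_ID)
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode(ids: list[int]) -> str:
    """Drop special tokens and decode the remaining bytes as UTF-8."""
    data = bytes(i for i in ids if i < 256)
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    ids = encode("héllo")
    print(ids)
    print(decode(ids))
```

Because every ID below 256 is a plain byte, non-ASCII characters expand to several IDs each.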

Optimization

Learning Rate: 0.0006
Weight Decay: 0.1
Scheduler: cosine
Warmup Steps: 10000
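
The optimization settings above can be sketched as a learning-rate schedule. This is a generic linear-warmup plus cosine-decay sketch, not the training code itself: it assumes the schedule spans the reported 350,000 steps and decays to zero, neither of which is stated here, and lr_at is a hypothetical helper name.

```python
import math

PEAK_LR = 6e-4          # "Learning Rate" above
WARMUP_STEPS = 10_000   # "Warmup Steps" above
TOTAL_STEPS = 350_000   # assumption: schedule length equals the reported training step
MIN_LR_RATIO = 0.0      # assumption: the decay floor is not stated


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 180_000, 350_000):
        print(step, lr_at(step))
```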

Distributed Training

Data Parallel Replicas: 8
Model Dtype: bf16
FSDP Type: full_shard
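
As a rough sanity check, the data and parallelism settings can be combined into a per-step and total byte count. The sketch assumes that Batch Size is per data-parallel replica, that there is no gradient accumulation, and that every sequence is filled to the full 8192 bytes; none of these is confirmed here, so treat the result as an estimate only.

```python
BATCH_SIZE = 19     # "Batch Size" (assumed to be per replica)
SEQ_LEN = 8192      # "Sequence Length", in bytes
DP_REPLICAS = 8     # "Data Parallel Replicas"
STEPS = 350_000     # "Training Step"


def bytes_per_step(batch_size: int, seq_len: int, replicas: int) -> int:
    """Bytes consumed in one optimizer step under the assumptions above."""
    return batch_size * seq_len * replicas


per_step = bytes_per_step(BATCH_SIZE, SEQ_LEN, DP_REPLICAS)
total = per_step * STEPS

if __name__ == "__main__":
    print(f"bytes per step: {per_step:,}")
    print(f"bytes over {STEPS:,} steps: {total:,}")
```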

Usage

This model uses a custom AUNet architecture with regex rate distortion chunking. The checkpoint is stored as sharded distributed weights (*.distcp files), so it must be loaded with the framework that was used for training.

# Example loading code would go here
# Note: This requires the specific AUNet framework used for training
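
Until framework-specific loading code is available, a downloaded checkpoint can at least be inspected with the standard library. The sketch below only lists the files named under Files Description and parses params.json; it does not load any weights, and inspect_checkpoint and the directory path are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


def inspect_checkpoint(ckpt_dir: str) -> dict:
    """Summarize a checkpoint directory without loading model weights."""
    root = Path(ckpt_dir)
    shards = sorted(p.name for p in root.glob("*.distcp"))
    train_states = sorted(p.name for p in root.glob("train_state_*.json"))
    params_path = root / "params.json"
    # params.json may be absent in a partial download, so treat it as optional.
    params = json.loads(params_path.read_text()) if params_path.is_file() else None
    return {
        "num_shards": len(shards),
        "shards": shards,
        "train_states": train_states,
        "params": params,
    }


if __name__ == "__main__":
    # Hypothetical local path to the downloaded repository.
    summary = inspect_checkpoint("./checkpoint")
    print(f"{summary['num_shards']} weight shard(s) found")
    print("params.json present:", summary["params"] is not None)
```

Loading the weights themselves still requires the model definition from the training framework, since the shards store tensors but not the architecture code.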

Evaluation Tasks

The model was evaluated on the following tasks:

hellaswag
boolq
piqa
social_iqa
winogrande
openbookqa
arc_easy
arc_challenge
race
commonsense_qa
copa

Training Configuration

The complete training configuration is preserved in the uploaded files.

Files Description

*.distcp: Distributed checkpoint files containing model weights
params.json: Model parameters and configuration
train_state_*.json: Training state information including optimizer states
config.yaml: Complete training configuration

Citation

If you use this model, please cite the AUNet paper and methodology.
