diff --git "a/train.log" "b/train.log" new file mode 100644--- /dev/null +++ "b/train.log" @@ -0,0 +1,3512 @@ +[2025-02-18 19:36:31,459] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-02-18 19:36:31,459] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-02-18 19:36:31,459] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-02-18 19:36:31,459] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-02-18 19:36:31,459] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-02-18 19:36:31,460] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-02-18 19:36:31,467] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) +INFO 02-18 19:36:38 __init__.py:190] Automatically detected platform cuda. +INFO 02-18 19:36:38 __init__.py:190] Automatically detected platform cuda. +INFO 02-18 19:36:38 __init__.py:190] Automatically detected platform cuda. +INFO 02-18 19:36:38 __init__.py:190] Automatically detected platform cuda. +INFO 02-18 19:36:38 __init__.py:190] Automatically detected platform cuda. +INFO 02-18 19:36:38 __init__.py:190] Automatically detected platform cuda. +INFO 02-18 19:36:38 __init__.py:190] Automatically detected platform cuda. 
+[2025-02-18 19:36:46,276] [INFO] [comm.py:652:init_distributed] cdb=None +[2025-02-18 19:36:46,276] [INFO] [comm.py:652:init_distributed] cdb=None +[2025-02-18 19:36:46,277] [INFO] [comm.py:652:init_distributed] cdb=None +[2025-02-18 19:36:46,288] [INFO] [comm.py:652:init_distributed] cdb=None +[2025-02-18 19:36:46,300] [INFO] [comm.py:652:init_distributed] cdb=None +[2025-02-18 19:36:46,300] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl +[2025-02-18 19:36:46,301] [INFO] [comm.py:652:init_distributed] cdb=None +[2025-02-18 19:36:46,305] [INFO] [comm.py:652:init_distributed] cdb=None + Generating train split: 0 examples [00:00, ? examples/s] Generating train split: 35000 examples [00:00, 83157.10 examples/s] Generating train split: 35000 examples [00:00, 82784.95 examples/s] + Map: 0%| | 0/35000 [00:00 + Map: 29%|██▊ | 10000/35000 [00:00<00:01, 16433.51 examples/s] Map: 96%|█████████▋| 33770/35000 [00:02<00:00, 8414.30 examples/s] p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3374383 [0] NCCL INFO cudaDriverVersion 12040 +NCCL version 2.21.5+cuda12.4 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3374388 [5] NCCL INFO cudaDriverVersion 12040 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3374387 [4] NCCL INFO cudaDriverVersion 12040 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3374385 [2] NCCL INFO cudaDriverVersion 12040 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3374386 [3] NCCL INFO cudaDriverVersion 12040 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3374387 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3374388 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3374385 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3374386 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3374387 
[4] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3374386 [3] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3374385 [2] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3374388 [5] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> + Map: 100%|██████████| 35000/35000 [00:02<00:00, 12722.79 examples/s] + Map: 34%|███▍ | 12000/35000 [00:00<00:01, 16676.60 examples/s]p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO P2P plugin IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO P2P plugin IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO P2P plugin IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO P2P plugin IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO P2P plugin IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Using non-device net plugin version 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Using network IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Using non-device net plugin version 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Using network IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Using non-device net plugin version 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Using network IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Using non-device net plugin version 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Using network IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Using non-device net plugin version 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Using network IBext_v8 +[2025-02-18 19:36:51,069] 
[INFO] [config.py:733:__init__] Config mesh_device None world_size = 7 +You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour +You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. +Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3374389 [6] NCCL INFO cudaDriverVersion 12040 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3374389 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3374389 [6] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> + Map: 41%|████ | 14298/35000 [00:01<00:02, 9063.53 examples/s] Map: 46%|████▌ | 16000/35000 [00:01<00:01, 10307.19 examples/s]p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO P2P plugin IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Using non-device net plugin version 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Using network IBext_v8 + Map: 51%|█████▏ | 
18000/35000 [00:01<00:01, 11763.61 examples/s] Map: 57%|█████▋ | 20000/35000 [00:01<00:01, 13013.50 examples/s] Map: 63%|██████▎ | 22000/35000 [00:01<00:00, 14054.38 examples/s] Map: 69%|██████▊ | 24000/35000 [00:01<00:00, 14701.56 examples/s] Map: 74%|███████▍ | 26000/35000 [00:01<00:00, 15299.05 examples/s] Map: 80%|████████ | 28000/35000 [00:02<00:00, 15772.87 examples/s] Map: 86%|████████▌ | 30000/35000 [00:02<00:00, 16125.76 examples/s] Map: 91%|█████████▏| 32000/35000 [00:02<00:00, 16396.84 examples/s] Map: 96%|█████████▋| 33770/35000 [00:02<00:00, 8883.69 examples/s] Map: 100%|██████████| 35000/35000 [00:02<00:00, 12654.37 examples/s] +[2025-02-18 19:36:52,784] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7 +You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour +You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. +Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. 
Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3374384 [1] NCCL INFO cudaDriverVersion 12040 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3374384 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3374384 [1] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO P2P plugin IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Using non-device net plugin version 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Using network IBext_v8 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO ncclCommInitRank comm 0x564cdc4c2590 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId c6000 commId 0x63e0264b3b738c57 - Init START +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO ncclCommInitRank comm 0x563dad337df0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 8f000 commId 0x63e0264b3b738c57 - Init START +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO ncclCommInitRank comm 0x562d195aa380 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8a000 commId 0x63e0264b3b738c57 - Init START +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO ncclCommInitRank comm 0x562b06bdba80 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 10000 commId 0x63e0264b3b738c57 - Init START +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO ncclCommInitRank comm 
0x5573fb9a9a70 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 4d000 commId 0x63e0264b3b738c57 - Init START +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO ncclCommInitRank comm 0x55b294e49110 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 49000 commId 0x63e0264b3b738c57 - Init START +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO ncclCommInitRank comm 0x560d11cb10c0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 16000 commId 0x63e0264b3b738c57 - Init START +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO NVLS multicast support is not available on dev 3 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO NVLS multicast support is not available on dev 4 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO NVLS multicast support is not available on dev 1 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO NVLS multicast support is not available on dev 5 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO NVLS multicast support is not available on dev 2 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO NVLS multicast support is not available on dev 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO NVLS multicast support is not available on dev 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO comm 0x562b06bdba80 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO comm 0x564cdc4c2590 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO comm 0x563dad337df0 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] 
NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO comm 0x562d195aa380 rank 4 nRanks 7 nNodes 1 localRanks 7 localRank 4 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO comm 0x5573fb9a9a70 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO comm 0x55b294e49110 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 
[3] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO comm 0x560d11cb10c0 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 00/0 : 4[4] -> 
5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 
[0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 
08/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 
13/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL 
INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL 
INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 
02/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 
14/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels 
per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3375789 [3] NCCL INFO ncclCommInitRank comm 0x5573fb9a9a70 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 4d000 commId 0x63e0264b3b738c57 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3375792 [4] NCCL INFO ncclCommInitRank comm 0x562d195aa380 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8a000 commId 0x63e0264b3b738c57 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3375791 [2] NCCL INFO ncclCommInitRank comm 0x55b294e49110 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 49000 commId 0x63e0264b3b738c57 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3375873 [1] NCCL INFO ncclCommInitRank comm 0x560d11cb10c0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 16000 commId 0x63e0264b3b738c57 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3375784 [0] NCCL INFO ncclCommInitRank comm 0x562b06bdba80 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 10000 commId 0x63e0264b3b738c57 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3375822 [6] NCCL INFO ncclCommInitRank comm 0x564cdc4c2590 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId c6000 commId 0x63e0264b3b738c57 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3375790 [5] NCCL INFO ncclCommInitRank comm 0x563dad337df0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 8f000 commId 0x63e0264b3b738c57 - Init COMPLETE +[2025-02-18 19:36:54,517] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 730, num_elems = 8.29B + Loading checkpoint shards: 0%| | 0/4 [00:00 +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] communication_data_type ...... None +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} +[2025-02-18 
19:37:20,464] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] dataloader_drop_last ......... False +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] disable_allgather ............ False +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] dump_state ................... False +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 +[2025-02-18 19:37:20,464] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] elasticity_enabled ........... False +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] flops_profiler_config ........ 
{ + "enabled": false, + "recompute_fwd_factor": 0.0, + "profile_step": 1, + "module_depth": -1, + "top_modules": 1, + "detailed": true, + "output_file": null +} +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] fp16_auto_cast ............... None +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] fp16_enabled ................. False +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] global_rank .................. 0 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] grad_accum_dtype ............. None +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] graph_harvesting ............. False +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] load_universal_checkpoint .... False +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] loss_scale ................... 1.0 +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] memory_breakdown ............. False +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False +[2025-02-18 19:37:20,465] [INFO] [config.py:1003:print] mics_shard_size .............. -1 +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] monitor_config ............... 
tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] nebula_config ................ { + "enabled": false, + "persistent_storage_path": null, + "persistent_time_interval": 100, + "num_of_version_in_retention": 2, + "enable_nebula_load": true, + "load_path": null +} +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] optimizer_name ............... None +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] optimizer_params ............. None +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] pld_enabled .................. False +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] pld_params ................... False +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] prescale_gradients ........... False +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] scheduler_name ............... None +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] scheduler_params ............. None +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] sparse_attention ............. None +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... 
False +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] steps_per_print .............. inf +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] train_batch_size ............. 14 +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1 +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] use_node_local_storage ....... False +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] wall_clock_breakdown ......... False +[2025-02-18 19:37:20,466] [INFO] [config.py:1003:print] weight_quantization_config ... None +[2025-02-18 19:37:20,467] [INFO] [config.py:1003:print] world_size ................... 7 +[2025-02-18 19:37:20,467] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False +[2025-02-18 19:37:20,467] [INFO] [config.py:1003:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True +[2025-02-18 19:37:20,467] [INFO] [config.py:1003:print] zero_enabled ................. True +[2025-02-18 19:37:20,467] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True +[2025-02-18 19:37:20,467] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
3 +[2025-02-18 19:37:20,467] [INFO] [config.py:989:print_user_config] json = { + "fp16": { + "enabled": false, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "bf16": { + "enabled": true + }, + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "none", + "pin_memory": true + }, + "offload_param": { + "device": "none", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1.000000e+09, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1.000000e+09, + "stage3_max_reuse_distance": 1.000000e+09, + "stage3_gather_16bit_weights_on_model_save": true + }, + "gradient_accumulation_steps": 2, + "gradient_clipping": 1.0, + "steps_per_print": inf, + "train_batch_size": 14, + "train_micro_batch_size_per_gpu": 1, + "wall_clock_breakdown": false, + "zero_optimization.reduce_bucket_size": 1.284506e+07, + "zero_optimization.stage3_param_persistence_threshold": 3.584000e+04, + "zero_optimization.stage3_prefetch_bucket_size": 1.156055e+07 +} +INFO 02-18 19:37:34 config.py:542] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'. +WARNING 02-18 19:37:34 arg_utils.py:1079] --enable-prefix-caching is currently not supported for multimodal models in v0 and has been disabled. 
+INFO 02-18 19:37:34 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/vlm/workspace/r1_checkpoints/qwen2_vl_7b_R1_finetune_by_clevr_math_correct_35k_cot_sft', speculative_config=None, tokenizer='/home/vlm/workspace/r1_checkpoints/qwen2_vl_7b_R1_finetune_by_clevr_math_correct_35k_cot_sft', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:7, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/vlm/workspace/r1_checkpoints/qwen2_vl_7b_R1_finetune_by_clevr_math_correct_35k_cot_sft, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, +INFO 02-18 19:37:34 cuda.py:230] Using Flash Attention backend. +INFO 02-18 19:37:35 model_runner.py:1110] Starting to load model /home/vlm/workspace/r1_checkpoints/qwen2_vl_7b_R1_finetune_by_clevr_math_correct_35k_cot_sft... 
+INFO 02-18 19:37:36 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] + Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00 4096). Running this sequence through the model will result in indexing errors +WARNING 02-18 19:42:22 profiling.py:187] The context length (32768) of the model is too short to hold the multi-modal embeddings in the worst case (49152 tokens in total, out of which {'image': 16384, 'video': 32768} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`. +INFO 02-18 19:42:26 worker.py:267] Memory profiling takes 283.46 seconds +INFO 02-18 19:42:26 worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.70) = 55.53GiB +INFO 02-18 19:42:26 worker.py:267] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 55.53GiB. +INFO 02-18 19:42:26 executor_base.py:110] # CUDA blocks: 64982, # CPU blocks: 4681 +INFO 02-18 19:42:26 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 31.73x +INFO 02-18 19:42:29 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. 
You can also reduce the `max_num_seqs` as needed to decrease memory usage. + Capturing CUDA graph shapes: 0%| | 0/35 [00:003->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO comm 0x7f9dec06fcd0 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO comm 0x7f766c071430 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO comm 0x7f4758070b90 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO comm 0x7f95d4071940 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] 
NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO comm 0x7f627c071a60 rank 4 nRanks 7 nNodes 1 localRanks 7 localRank 4 MNNVL 0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 13/16 : 0 
1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO P2P Chunksize set to 524288 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 
02/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 
07/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 
12/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Connected all rings +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 
01/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 
08/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC/read 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Channel 
15/0 : 4[4] -> 3[3] via P2P/IPC/read +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO Connected all trees 
+p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO Connected all trees +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer +p-phy-ctyun-gz-a800-node-prod-200-82:3374385:3383020 [2] NCCL INFO ncclCommSplit comm 0x7f09b806f910 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 49000 parent 0x55b294e49110 color -1326228412 key 2 commId 0x6a1746dd34fb6d88 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374386:3383023 [3] NCCL INFO ncclCommSplit comm 0x7f9b18070d70 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 4d000 parent 0x5573fb9a9a70 color -1326228412 key 3 commId 0x6a1746dd34fb6d88 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374383:3383024 [0] NCCL INFO ncclCommSplit comm 0x7f9dec06fcd0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 10000 parent 0x562b06bdba80 color -1326228412 key 0 commId 0x6a1746dd34fb6d88 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374387:3383032 [4] NCCL INFO ncclCommSplit comm 0x7f627c071a60 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8a000 parent 0x562d195aa380 color -1326228412 key 4 commId 0x6a1746dd34fb6d88 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374384:3383031 [1] NCCL INFO ncclCommSplit comm 0x7f4758070b90 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 16000 parent 0x560d11cb10c0 color -1326228412 key 1 commId 0x6a1746dd34fb6d88 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374388:3383030 [5] NCCL INFO ncclCommSplit comm 0x7f95d4071940 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 8f000 parent 
0x563dad337df0 color -1326228412 key 5 commId 0x6a1746dd34fb6d88 - Init COMPLETE +p-phy-ctyun-gz-a800-node-prod-200-82:3374389:3383033 [6] NCCL INFO ncclCommSplit comm 0x7f766c071430 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId c6000 parent 0x564cdc4c2590 color -1326228412 key 6 commId 0x6a1746dd34fb6d88 - Init COMPLETE + 0%| | 1/2500 [00:20<14:20:33, 20.66s/it] {'loss': 0.0, 'grad_norm': 1.0356209808771049, 'learning_rate': 9.996e-07, 'completion_length': 157.7857208251953, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.0, 'epoch': 0.0} + 0%| | 1/2500 [00:20<14:20:33, 20.66s/it] 0%| | 2/2500 [00:35<12:06:43, 17.46s/it] {'loss': -0.0, 'grad_norm': 0.5376474836324036, 'learning_rate': 9.992e-07, 'completion_length': 170.30358123779297, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8571429252624512, 'reward_std': 0.1539071798324585, 'kl': -1.7512589693069458e-05, 'epoch': 0.0} + 0%| | 2/2500 [00:35<12:06:43, 17.46s/it] 0%| | 3/2500 [00:50<11:20:13, 16.35s/it] {'loss': -0.0, 'grad_norm': 0.5003043348678123, 'learning_rate': 9.988e-07, 'completion_length': 158.30358123779297, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.1428571492433548, 'kl': -1.1801719665527344e-05, 'epoch': 0.0} + 0%| | 3/2500 [00:50<11:20:13, 16.35s/it] 0%| | 4/2500 [01:02<10:08:49, 14.64s/it] {'loss': -0.0, 'grad_norm': 0.17975979235819434, 'learning_rate': 9.983999999999998e-07, 'completion_length': 139.69644165039062, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': -2.7418136596679688e-05, 'epoch': 0.0} + 0%| | 4/2500 [01:02<10:08:49, 14.64s/it] 0%| | 5/2500 [01:15<9:42:58, 14.02s/it] {'loss': 0.0, 'grad_norm': 0.4424752456476232, 'learning_rate': 9.98e-07, 
'completion_length': 145.64286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 3.2782554626464844e-06, 'epoch': 0.0} + 0%| | 5/2500 [01:15<9:42:58, 14.02s/it] 0%| | 6/2500 [01:29<9:36:03, 13.86s/it] {'loss': -0.0, 'grad_norm': 0.3900128522320079, 'learning_rate': 9.976e-07, 'completion_length': 159.37500762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1539071872830391, 'kl': -2.1457672119140625e-06, 'epoch': 0.0} + 0%| | 6/2500 [01:29<9:36:03, 13.86s/it] 0%| | 7/2500 [01:41<9:10:47, 13.26s/it] {'loss': 0.0, 'grad_norm': 0.6591274892727715, 'learning_rate': 9.972e-07, 'completion_length': 146.21428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.910714328289032, 'reward_std': 0.1071428619325161, 'kl': 1.8961727619171143e-06, 'epoch': 0.0} + 0%| | 7/2500 [01:41<9:10:47, 13.26s/it] 0%| | 8/2500 [01:53<9:00:47, 13.02s/it] {'loss': -0.0, 'grad_norm': 0.4832066159738895, 'learning_rate': 9.968e-07, 'completion_length': 153.17858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': -2.127140760421753e-06, 'epoch': 0.0} + 0%| | 8/2500 [01:53<9:00:47, 13.02s/it] 0%| | 9/2500 [02:05<8:43:27, 12.61s/it] {'loss': 0.0, 'grad_norm': 0.2435361766860265, 'learning_rate': 9.964e-07, 'completion_length': 137.46429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 1.683831214904785e-05, 'epoch': 0.0} + 0%| | 9/2500 [02:05<8:43:27, 12.61s/it] 0%| | 10/2500 [02:20<9:13:23, 13.33s/it] {'loss': 0.0, 'grad_norm': 0.5369382526618449, 'learning_rate': 9.959999999999999e-07, 'completion_length': 165.19644165039062, 
'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.1428571492433548, 'kl': 4.8220157623291016e-05, 'epoch': 0.0} + 0%| | 10/2500 [02:20<9:13:23, 13.33s/it] 0%| | 11/2500 [02:32<8:56:14, 12.93s/it] {'loss': 0.0, 'grad_norm': 0.034038632550324, 'learning_rate': 9.956e-07, 'completion_length': 146.96428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 6.717443466186523e-05, 'epoch': 0.0} + 0%| | 11/2500 [02:32<8:56:14, 12.93s/it] 0%| | 12/2500 [02:45<8:56:47, 12.95s/it] {'loss': 0.0, 'grad_norm': 0.4596343164489708, 'learning_rate': 9.952e-07, 'completion_length': 155.00000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 6.079673767089844e-05, 'epoch': 0.0} + 0%| | 12/2500 [02:45<8:56:47, 12.95s/it] 1%| | 13/2500 [02:57<8:44:03, 12.64s/it] {'loss': 0.0, 'grad_norm': 0.2983485479461047, 'learning_rate': 9.948e-07, 'completion_length': 137.73214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 7.891654968261719e-05, 'epoch': 0.01} + 1%| | 13/2500 [02:57<8:44:03, 12.64s/it] 1%| | 14/2500 [03:10<8:43:05, 12.62s/it] {'loss': 0.0, 'grad_norm': 0.5451303283257383, 'learning_rate': 9.944e-07, 'completion_length': 151.67857360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8750001192092896, 'reward_std': 0.14838216826319695, 'kl': 0.00010561943054199219, 'epoch': 0.01} + 1%| | 14/2500 [03:10<8:43:05, 12.62s/it] 1%| | 15/2500 [03:23<8:53:36, 12.88s/it] {'loss': 0.0, 'grad_norm': 0.49631868167527243, 'learning_rate': 9.94e-07, 'completion_length': 176.94644165039062, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 
'reward_std': 0.1896214708685875, 'kl': 0.00018310546875, 'epoch': 0.01} + 1%| | 15/2500 [03:23<8:53:36, 12.88s/it] 1%| | 16/2500 [03:36<8:53:44, 12.89s/it] {'loss': 0.0, 'grad_norm': 0.6696675100978099, 'learning_rate': 9.936e-07, 'completion_length': 160.4107208251953, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8571429252624512, 'reward_std': 0.19514648616313934, 'kl': 0.00012993812561035156, 'epoch': 0.01} + 1%| | 16/2500 [03:36<8:53:44, 12.89s/it] 1%| | 17/2500 [03:49<8:59:12, 13.03s/it] {'loss': 0.0, 'grad_norm': 1.0224763688187306, 'learning_rate': 9.931999999999999e-07, 'completion_length': 148.12500762939453, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838215708732605, 'kl': 0.00017952919006347656, 'epoch': 0.01} + 1%| | 17/2500 [03:49<8:59:12, 13.03s/it] 1%| | 18/2500 [04:02<8:49:12, 12.79s/it] {'loss': 0.0, 'grad_norm': 0.7650051983493152, 'learning_rate': 9.928e-07, 'completion_length': 140.80358123779297, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1539071798324585, 'kl': 0.00021982192993164062, 'epoch': 0.01} + 1%| | 18/2500 [04:02<8:49:12, 12.79s/it] 1%| | 19/2500 [04:13<8:35:58, 12.48s/it] {'loss': 0.0, 'grad_norm': 0.5947399934489298, 'learning_rate': 9.923999999999998e-07, 'completion_length': 140.75000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.00019979476928710938, 'epoch': 0.01} + 1%| | 19/2500 [04:13<8:35:58, 12.48s/it] 1%| | 20/2500 [04:26<8:34:09, 12.44s/it] {'loss': 0.0, 'grad_norm': 0.24800822680683796, 'learning_rate': 9.92e-07, 'completion_length': 159.1964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 
'kl': 0.00019502639770507812, 'epoch': 0.01} + 1%| | 20/2500 [04:26<8:34:09, 12.44s/it] 1%| | 21/2500 [04:38<8:33:05, 12.42s/it] {'loss': 0.0, 'grad_norm': 0.605612035399812, 'learning_rate': 9.916e-07, 'completion_length': 146.58929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0003170967102050781, 'epoch': 0.01} + 1%| | 21/2500 [04:38<8:33:05, 12.42s/it] 1%| | 22/2500 [04:52<8:50:02, 12.83s/it] {'loss': 0.0, 'grad_norm': 0.35317820073608497, 'learning_rate': 9.912e-07, 'completion_length': 167.6964340209961, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.00028896331787109375, 'epoch': 0.01} + 1%| | 22/2500 [04:52<8:50:02, 12.83s/it] 1%| | 23/2500 [05:05<8:53:59, 12.93s/it] {'loss': 0.0, 'grad_norm': 0.6179040862663758, 'learning_rate': 9.908e-07, 'completion_length': 160.73214721679688, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7321429252624512, 'reward_std': 0.2610500380396843, 'kl': 0.0003871917724609375, 'epoch': 0.01} + 1%| | 23/2500 [05:05<8:53:59, 12.93s/it] 1%| | 24/2500 [05:19<9:10:38, 13.34s/it] {'loss': 0.0, 'grad_norm': 0.4494354480549279, 'learning_rate': 9.903999999999999e-07, 'completion_length': 152.21428680419922, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.14838216453790665, 'kl': 0.00029754638671875, 'epoch': 0.01} + 1%| | 24/2500 [05:19<9:10:38, 13.34s/it] 1%| | 25/2500 [05:32<9:00:00, 13.09s/it] {'loss': 0.0, 'grad_norm': 0.6330433316413292, 'learning_rate': 9.9e-07, 'completion_length': 149.92858123779297, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.1539071835577488, 'kl': 0.00035762786865234375, 'epoch': 0.01} + 1%| 
| 25/2500 [05:32<9:00:00, 13.09s/it] 1%| | 26/2500 [05:44<8:51:40, 12.89s/it] {'loss': 0.0, 'grad_norm': 0.36259467827182684, 'learning_rate': 9.896e-07, 'completion_length': 150.4107208251953, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00033283233642578125, 'epoch': 0.01} + 1%| | 26/2500 [05:44<8:51:40, 12.89s/it] 1%| | 27/2500 [05:57<8:50:23, 12.87s/it] {'loss': 0.0, 'grad_norm': 0.5415188613264963, 'learning_rate': 9.892e-07, 'completion_length': 165.85714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.1071428656578064, 'kl': 0.0004329681396484375, 'epoch': 0.01} + 1%| | 27/2500 [05:57<8:50:23, 12.87s/it] 1%| | 28/2500 [06:10<8:52:49, 12.93s/it] {'loss': 0.0, 'grad_norm': 0.4995715892485011, 'learning_rate': 9.888e-07, 'completion_length': 155.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.1428571492433548, 'kl': 0.0004978179931640625, 'epoch': 0.01} + 1%| | 28/2500 [06:10<8:52:49, 12.93s/it] 1%| | 29/2500 [06:22<8:44:23, 12.73s/it] {'loss': 0.0, 'grad_norm': 0.5049460258449023, 'learning_rate': 9.884e-07, 'completion_length': 137.21428680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.0004367828369140625, 'epoch': 0.01} + 1%| | 29/2500 [06:22<8:44:23, 12.73s/it] 1%| | 30/2500 [06:35<8:45:36, 12.77s/it] {'loss': 0.0, 'grad_norm': 0.17351651683413083, 'learning_rate': 9.88e-07, 'completion_length': 154.6428680419922, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00041866302490234375, 'epoch': 0.01} + 1%| | 30/2500 [06:35<8:45:36, 12.77s/it] 1%| | 31/2500 [06:48<8:43:55, 
12.73s/it] {'loss': 0.0, 'grad_norm': 0.838347764010721, 'learning_rate': 9.876e-07, 'completion_length': 147.875, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.000701904296875, 'epoch': 0.01} + 1%| | 31/2500 [06:48<8:43:55, 12.73s/it] 1%|▏ | 32/2500 [07:01<8:43:55, 12.74s/it] {'loss': 0.0, 'grad_norm': 0.8397511625597204, 'learning_rate': 9.871999999999998e-07, 'completion_length': 149.2857208251953, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.14838216826319695, 'kl': 0.000728607177734375, 'epoch': 0.01} + 1%|▏ | 32/2500 [07:01<8:43:55, 12.74s/it] 1%|▏ | 33/2500 [07:13<8:40:03, 12.65s/it] {'loss': 0.0, 'grad_norm': 0.16962680006174705, 'learning_rate': 9.868e-07, 'completion_length': 161.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.000812530517578125, 'epoch': 0.01} + 1%|▏ | 33/2500 [07:13<8:40:03, 12.65s/it] 1%|▏ | 34/2500 [07:26<8:37:06, 12.58s/it] {'loss': 0.0, 'grad_norm': 0.4605039492545489, 'learning_rate': 9.864e-07, 'completion_length': 145.96428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0005931854248046875, 'epoch': 0.01} + 1%|▏ | 34/2500 [07:26<8:37:06, 12.58s/it] 1%|▏ | 35/2500 [07:40<8:54:47, 13.02s/it] {'loss': 0.0, 'grad_norm': 0.6142923007094466, 'learning_rate': 9.86e-07, 'completion_length': 171.4821548461914, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.0006465911865234375, 'epoch': 0.01} + 1%|▏ | 35/2500 [07:40<8:54:47, 13.02s/it] 1%|▏ | 36/2500 [07:53<8:58:01, 13.10s/it] {'loss': 0.0, 'grad_norm': 0.23783785693480977, 
'learning_rate': 9.856e-07, 'completion_length': 164.0357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.000751495361328125, 'epoch': 0.01} + 1%|▏ | 36/2500 [07:53<8:58:01, 13.10s/it] 1%|▏ | 37/2500 [08:07<9:16:00, 13.54s/it] {'loss': 0.0, 'grad_norm': 1.4959054992583105, 'learning_rate': 9.852e-07, 'completion_length': 154.46429443359375, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.11266788095235825, 'kl': 0.0007476806640625, 'epoch': 0.01} + 1%|▏ | 37/2500 [08:07<9:16:00, 13.54s/it] 2%|▏ | 38/2500 [08:21<9:16:32, 13.56s/it] {'loss': 0.0, 'grad_norm': 0.8343108507904329, 'learning_rate': 9.847999999999999e-07, 'completion_length': 159.44644165039062, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.23086076974868774, 'kl': 0.0007610321044921875, 'epoch': 0.02} + 2%|▏ | 38/2500 [08:21<9:16:32, 13.56s/it] 2%|▏ | 39/2500 [08:33<8:56:01, 13.07s/it] {'loss': 0.0, 'grad_norm': 0.8069796507174406, 'learning_rate': 9.844e-07, 'completion_length': 138.58929443359375, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.1539071835577488, 'kl': 0.001155853271484375, 'epoch': 0.02} + 2%|▏ | 39/2500 [08:33<8:56:01, 13.07s/it] 2%|▏ | 40/2500 [08:44<8:36:38, 12.60s/it] {'loss': 0.0, 'grad_norm': 0.027849489917223465, 'learning_rate': 9.84e-07, 'completion_length': 138.46428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.000820159912109375, 'epoch': 0.02} + 2%|▏ | 40/2500 [08:44<8:36:38, 12.60s/it] 2%|▏ | 41/2500 [08:59<9:05:24, 13.31s/it] {'loss': 0.0, 'grad_norm': 0.4940303022656745, 'learning_rate': 9.836e-07, 'completion_length': 170.3214340209961, 'rewards/accuracy_reward': 
0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.00093841552734375, 'epoch': 0.02} + 2%|▏ | 41/2500 [08:59<9:05:24, 13.31s/it] 2%|▏ | 42/2500 [09:14<9:26:30, 13.83s/it] {'loss': 0.0, 'grad_norm': 0.6792491171844884, 'learning_rate': 9.832e-07, 'completion_length': 165.69644165039062, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.00072479248046875, 'epoch': 0.02} + 2%|▏ | 42/2500 [09:14<9:26:30, 13.83s/it] 2%|▏ | 43/2500 [09:29<9:39:25, 14.15s/it] {'loss': 0.0001, 'grad_norm': 0.6216812564119885, 'learning_rate': 9.828e-07, 'completion_length': 184.71429443359375, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.21981073170900345, 'kl': 0.001316070556640625, 'epoch': 0.02} + 2%|▏ | 43/2500 [09:29<9:39:25, 14.15s/it] 2%|▏ | 44/2500 [09:43<9:29:12, 13.91s/it] {'loss': 0.0, 'grad_norm': 0.2761214736728307, 'learning_rate': 9.824e-07, 'completion_length': 151.60714721679688, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.00106048583984375, 'epoch': 0.02} + 2%|▏ | 44/2500 [09:43<9:29:12, 13.91s/it] 2%|▏ | 45/2500 [09:54<9:00:43, 13.22s/it] {'loss': 0.0, 'grad_norm': 0.009697268041524676, 'learning_rate': 9.819999999999999e-07, 'completion_length': 134.57144165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0006542205810546875, 'epoch': 0.02} + 2%|▏ | 45/2500 [09:54<9:00:43, 13.22s/it] 2%|▏ | 46/2500 [10:06<8:45:26, 12.85s/it] {'loss': 0.0, 'grad_norm': 0.016606970325114264, 'learning_rate': 9.816e-07, 'completion_length': 138.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0009002685546875, 
'epoch': 0.02} + 2%|▏ | 46/2500 [10:06<8:45:26, 12.85s/it] 2%|▏ | 47/2500 [10:20<8:54:38, 13.08s/it] {'loss': 0.0001, 'grad_norm': 0.813776581595907, 'learning_rate': 9.811999999999998e-07, 'completion_length': 161.80358123779297, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0824786126613617, 'kl': 0.001430511474609375, 'epoch': 0.02} + 2%|▏ | 47/2500 [10:20<8:54:38, 13.08s/it] 2%|▏ | 48/2500 [10:33<8:55:39, 13.11s/it] {'loss': 0.0, 'grad_norm': 1.3195955450788919, 'learning_rate': 9.808e-07, 'completion_length': 156.25000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00107574462890625, 'epoch': 0.02} + 2%|▏ | 48/2500 [10:33<8:55:39, 13.11s/it] 2%|▏ | 49/2500 [10:45<8:45:29, 12.86s/it] {'loss': 0.0001, 'grad_norm': 0.6705436582408947, 'learning_rate': 9.804e-07, 'completion_length': 144.35714721679688, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1428571492433548, 'kl': 0.00128173828125, 'epoch': 0.02} + 2%|▏ | 49/2500 [10:45<8:45:29, 12.86s/it] 2%|▏ | 50/2500 [10:58<8:43:35, 12.82s/it] {'loss': 0.0, 'grad_norm': 0.14148649015433973, 'learning_rate': 9.8e-07, 'completion_length': 149.2321548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00118255615234375, 'epoch': 0.02} + 2%|▏ | 50/2500 [10:58<8:43:35, 12.82s/it] 2%|▏ | 51/2500 [11:10<8:35:26, 12.63s/it] {'loss': 0.0, 'grad_norm': 0.4448646299319555, 'learning_rate': 9.796e-07, 'completion_length': 154.50000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0010833740234375, 'epoch': 0.02} + 2%|▏ | 51/2500 [11:10<8:35:26, 12.63s/it] 
2%|▏ | 52/2500 [11:23<8:37:02, 12.67s/it] {'loss': 0.0001, 'grad_norm': 0.6248015328390822, 'learning_rate': 9.791999999999999e-07, 'completion_length': 160.60714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.001678466796875, 'epoch': 0.02} + 2%|▏ | 52/2500 [11:23<8:37:02, 12.67s/it] 2%|▏ | 53/2500 [11:35<8:22:47, 12.33s/it] {'loss': 0.0, 'grad_norm': 0.4261798493312172, 'learning_rate': 9.788e-07, 'completion_length': 150.23214721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.001041412353515625, 'epoch': 0.02} + 2%|▏ | 53/2500 [11:35<8:22:47, 12.33s/it] 2%|▏ | 54/2500 [11:47<8:23:31, 12.35s/it] {'loss': 0.0001, 'grad_norm': 0.22759563197966115, 'learning_rate': 9.784e-07, 'completion_length': 146.67857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0013275146484375, 'epoch': 0.02} + 2%|▏ | 54/2500 [11:47<8:23:31, 12.35s/it] 2%|▏ | 55/2500 [12:00<8:25:13, 12.40s/it] {'loss': 0.0001, 'grad_norm': 0.2659877543613785, 'learning_rate': 9.78e-07, 'completion_length': 150.64286041259766, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.001560211181640625, 'epoch': 0.02} + 2%|▏ | 55/2500 [12:00<8:25:13, 12.40s/it] 2%|▏ | 56/2500 [12:13<8:34:42, 12.64s/it] {'loss': 0.0, 'grad_norm': 0.011293234017051474, 'learning_rate': 9.776e-07, 'completion_length': 157.92857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.001155853271484375, 'epoch': 0.02} + 2%|▏ | 56/2500 [12:13<8:34:42, 12.64s/it] 2%|▏ | 57/2500 [12:26<8:41:53, 12.82s/it] {'loss': 0.0001, 
'grad_norm': 0.8683145826260764, 'learning_rate': 9.772e-07, 'completion_length': 151.62500762939453, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.1896214708685875, 'kl': 0.00160980224609375, 'epoch': 0.02} + 2%|▏ | 57/2500 [12:26<8:41:53, 12.82s/it] 2%|▏ | 58/2500 [12:39<8:39:24, 12.76s/it] {'loss': 0.0001, 'grad_norm': 0.8603414659733065, 'learning_rate': 9.768e-07, 'completion_length': 164.00000762939453, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1896214708685875, 'kl': 0.0016937255859375, 'epoch': 0.02} + 2%|▏ | 58/2500 [12:39<8:39:24, 12.76s/it] 2%|▏ | 59/2500 [12:51<8:35:38, 12.67s/it] {'loss': 0.0001, 'grad_norm': 0.9368528962558738, 'learning_rate': 9.764e-07, 'completion_length': 155.1428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0013275146484375, 'epoch': 0.02} + 2%|▏ | 59/2500 [12:51<8:35:38, 12.67s/it] 2%|▏ | 60/2500 [13:04<8:38:43, 12.76s/it] {'loss': 0.0001, 'grad_norm': 0.5335643600203626, 'learning_rate': 9.759999999999998e-07, 'completion_length': 165.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.00177764892578125, 'epoch': 0.02} + 2%|▏ | 60/2500 [13:04<8:38:43, 12.76s/it] 2%|▏ | 61/2500 [13:16<8:30:11, 12.55s/it] {'loss': 0.0001, 'grad_norm': 0.5456936120576118, 'learning_rate': 9.756e-07, 'completion_length': 153.30357360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.001556396484375, 'epoch': 0.02} + 2%|▏ | 61/2500 [13:16<8:30:11, 12.55s/it] 2%|▏ | 62/2500 [13:30<8:41:15, 12.83s/it] {'loss': 0.0001, 'grad_norm': 0.5801643639322768, 'learning_rate': 
9.752e-07, 'completion_length': 162.60714721679688, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.1071428619325161, 'kl': 0.002227783203125, 'epoch': 0.02} + 2%|▏ | 62/2500 [13:30<8:41:15, 12.83s/it] 3%|▎ | 63/2500 [13:42<8:35:37, 12.69s/it] {'loss': 0.0001, 'grad_norm': 0.8217734481619847, 'learning_rate': 9.748e-07, 'completion_length': 150.75000762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.18409645557403564, 'kl': 0.001636505126953125, 'epoch': 0.03} + 3%|▎ | 63/2500 [13:42<8:35:37, 12.69s/it] 3%|▎ | 64/2500 [13:54<8:25:36, 12.45s/it] {'loss': 0.0001, 'grad_norm': 0.6237957278740848, 'learning_rate': 9.744e-07, 'completion_length': 142.17857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.001811981201171875, 'epoch': 0.03} + 3%|▎ | 64/2500 [13:54<8:25:36, 12.45s/it] 3%|▎ | 65/2500 [14:06<8:24:39, 12.44s/it] {'loss': 0.0, 'grad_norm': 0.24567612455343127, 'learning_rate': 9.74e-07, 'completion_length': 148.21429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.001232147216796875, 'epoch': 0.03} + 3%|▎ | 65/2500 [14:06<8:24:39, 12.44s/it] 3%|▎ | 66/2500 [14:18<8:22:01, 12.38s/it] {'loss': 0.0, 'grad_norm': 0.37170265756983917, 'learning_rate': 9.735999999999999e-07, 'completion_length': 152.375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00115966796875, 'epoch': 0.03} + 3%|▎ | 66/2500 [14:18<8:22:01, 12.38s/it] 3%|▎ | 67/2500 [14:31<8:26:50, 12.50s/it] {'loss': 0.0001, 'grad_norm': 0.3420325217762366, 'learning_rate': 9.731999999999998e-07, 'completion_length': 165.75000762939453, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0018310546875, 'epoch': 0.03} + 3%|▎ | 67/2500 [14:31<8:26:50, 12.50s/it] 3%|▎ | 68/2500 [14:44<8:28:13, 12.54s/it] {'loss': 0.0001, 'grad_norm': 0.4028307600457346, 'learning_rate': 9.728e-07, 'completion_length': 154.9107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.1071428656578064, 'kl': 0.00171661376953125, 'epoch': 0.03} + 3%|▎ | 68/2500 [14:44<8:28:13, 12.54s/it] 3%|▎ | 69/2500 [14:55<8:14:32, 12.21s/it] {'loss': 0.0001, 'grad_norm': 0.3741248021745401, 'learning_rate': 9.724e-07, 'completion_length': 138.48214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.001483917236328125, 'epoch': 0.03} + 3%|▎ | 69/2500 [14:55<8:14:32, 12.21s/it] 3%|▎ | 70/2500 [15:08<8:20:59, 12.37s/it] {'loss': 0.0001, 'grad_norm': 0.4882232416467006, 'learning_rate': 9.72e-07, 'completion_length': 160.32144165039062, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.002593994140625, 'epoch': 0.03} + 3%|▎ | 70/2500 [15:08<8:20:59, 12.37s/it] 3%|▎ | 71/2500 [15:22<8:38:26, 12.81s/it] {'loss': 0.0001, 'grad_norm': 0.8117239478266075, 'learning_rate': 9.716e-07, 'completion_length': 182.3214340209961, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.002292633056640625, 'epoch': 0.03} + 3%|▎ | 71/2500 [15:22<8:38:26, 12.81s/it] 3%|▎ | 72/2500 [15:36<8:52:00, 13.15s/it] {'loss': 0.0001, 'grad_norm': 0.544332136557569, 'learning_rate': 9.712e-07, 'completion_length': 172.57144165039062, 'rewards/accuracy_reward': 0.8750000298023224, 
'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.1071428619325161, 'kl': 0.00272369384765625, 'epoch': 0.03} + 3%|▎ | 72/2500 [15:36<8:52:00, 13.15s/it] 3%|▎ | 73/2500 [15:49<8:51:19, 13.14s/it] {'loss': 0.0001, 'grad_norm': 0.6221765391177164, 'learning_rate': 9.707999999999999e-07, 'completion_length': 141.21428680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.002117156982421875, 'epoch': 0.03} + 3%|▎ | 73/2500 [15:49<8:51:19, 13.14s/it] 3%|▎ | 74/2500 [16:02<8:50:47, 13.13s/it] {'loss': 0.0001, 'grad_norm': 0.4393709128478378, 'learning_rate': 9.704e-07, 'completion_length': 159.6607208251953, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00298309326171875, 'epoch': 0.03} + 3%|▎ | 74/2500 [16:02<8:50:47, 13.13s/it] 3%|▎ | 75/2500 [16:15<8:51:12, 13.14s/it] {'loss': 0.0001, 'grad_norm': 0.4728336682159953, 'learning_rate': 9.7e-07, 'completion_length': 165.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.002471923828125, 'epoch': 0.03} + 3%|▎ | 75/2500 [16:15<8:51:12, 13.14s/it] 3%|▎ | 76/2500 [16:28<8:51:58, 13.17s/it] {'loss': 0.0001, 'grad_norm': 0.2734483193963107, 'learning_rate': 9.696e-07, 'completion_length': 167.39286041259766, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.00241851806640625, 'epoch': 0.03} + 3%|▎ | 76/2500 [16:28<8:51:58, 13.17s/it] 3%|▎ | 77/2500 [16:41<8:43:00, 12.95s/it] {'loss': 0.0001, 'grad_norm': 0.3852079093482514, 'learning_rate': 9.692e-07, 'completion_length': 145.67857360839844, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 
1.8928571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.002471923828125, 'epoch': 0.03} + 3%|▎ | 77/2500 [16:41<8:43:00, 12.95s/it] 3%|▎ | 78/2500 [16:54<8:43:59, 12.98s/it] {'loss': 0.0001, 'grad_norm': 0.2660447822029834, 'learning_rate': 9.688e-07, 'completion_length': 164.9464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.002040863037109375, 'epoch': 0.03} + 3%|▎ | 78/2500 [16:54<8:43:59, 12.98s/it] 3%|▎ | 79/2500 [17:06<8:32:44, 12.71s/it] {'loss': 0.0001, 'grad_norm': 0.49455891931866086, 'learning_rate': 9.684e-07, 'completion_length': 142.42858123779297, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1071428656578064, 'kl': 0.002277374267578125, 'epoch': 0.03} + 3%|▎ | 79/2500 [17:06<8:32:44, 12.71s/it] 3%|▎ | 80/2500 [17:21<9:00:36, 13.40s/it] {'loss': 0.0001, 'grad_norm': 0.20217167299932304, 'learning_rate': 9.679999999999999e-07, 'completion_length': 181.82144165039062, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00287628173828125, 'epoch': 0.03} + 3%|▎ | 80/2500 [17:21<9:00:36, 13.40s/it] 3%|▎ | 81/2500 [17:34<8:52:32, 13.21s/it] {'loss': 0.0001, 'grad_norm': 0.5803965414741805, 'learning_rate': 9.676e-07, 'completion_length': 155.8928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0032501220703125, 'epoch': 0.03} + 3%|▎ | 81/2500 [17:34<8:52:32, 13.21s/it] 3%|▎ | 82/2500 [17:48<9:02:55, 13.47s/it] {'loss': 0.0001, 'grad_norm': 0.013054275271155076, 'learning_rate': 9.671999999999998e-07, 'completion_length': 158.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 
0.0, 'kl': 0.002033233642578125, 'epoch': 0.03} + 3%|▎ | 82/2500 [17:48<9:02:55, 13.47s/it] 3%|▎ | 83/2500 [18:00<8:48:27, 13.12s/it] {'loss': 0.0001, 'grad_norm': 0.9146599604095818, 'learning_rate': 9.668e-07, 'completion_length': 146.46429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00174713134765625, 'epoch': 0.03} + 3%|▎ | 83/2500 [18:00<8:48:27, 13.12s/it] 3%|▎ | 84/2500 [18:12<8:38:42, 12.88s/it] {'loss': 0.0001, 'grad_norm': 0.37837489542082914, 'learning_rate': 9.664e-07, 'completion_length': 152.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.002532958984375, 'epoch': 0.03} + 3%|▎ | 84/2500 [18:12<8:38:42, 12.88s/it] 3%|▎ | 85/2500 [18:27<8:54:45, 13.29s/it] {'loss': 0.0001, 'grad_norm': 0.4441902648439005, 'learning_rate': 9.66e-07, 'completion_length': 162.58929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.002777099609375, 'epoch': 0.03} + 3%|▎ | 85/2500 [18:27<8:54:45, 13.29s/it] 3%|▎ | 86/2500 [18:39<8:45:29, 13.06s/it] {'loss': 0.0001, 'grad_norm': 0.17429642333958179, 'learning_rate': 9.656e-07, 'completion_length': 158.64286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00226593017578125, 'epoch': 0.03} + 3%|▎ | 86/2500 [18:39<8:45:29, 13.06s/it] 3%|▎ | 87/2500 [18:52<8:36:51, 12.85s/it] {'loss': 0.0001, 'grad_norm': 0.8645988011436253, 'learning_rate': 9.651999999999999e-07, 'completion_length': 159.05358123779297, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.1539071835577488, 'kl': 0.00304412841796875, 'epoch': 0.03} + 
3%|▎ | 87/2500 [18:52<8:36:51, 12.85s/it] 4%|▎ | 88/2500 [19:04<8:28:10, 12.64s/it] {'loss': 0.0001, 'grad_norm': 0.8523134666062613, 'learning_rate': 9.647999999999999e-07, 'completion_length': 146.08929443359375, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266788095235825, 'kl': 0.0022735595703125, 'epoch': 0.04} + 4%|▎ | 88/2500 [19:04<8:28:10, 12.64s/it] 4%|▎ | 89/2500 [19:17<8:32:40, 12.76s/it] {'loss': 0.0001, 'grad_norm': 1.0380925264963456, 'learning_rate': 9.644e-07, 'completion_length': 160.1964340209961, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.1428571492433548, 'kl': 0.00330352783203125, 'epoch': 0.04} + 4%|▎ | 89/2500 [19:17<8:32:40, 12.76s/it] 4%|▎ | 90/2500 [19:31<8:47:11, 13.13s/it] {'loss': 0.0001, 'grad_norm': 0.47317310359392173, 'learning_rate': 9.64e-07, 'completion_length': 184.1964340209961, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.23086077719926834, 'kl': 0.0030364990234375, 'epoch': 0.04} + 4%|▎ | 90/2500 [19:31<8:47:11, 13.13s/it] 4%|▎ | 91/2500 [19:44<8:43:47, 13.05s/it] {'loss': 0.0001, 'grad_norm': 0.4046009059623652, 'learning_rate': 9.636e-07, 'completion_length': 161.6964340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0034637451171875, 'epoch': 0.04} + 4%|▎ | 91/2500 [19:44<8:43:47, 13.05s/it] 4%|▎ | 92/2500 [19:57<8:43:19, 13.04s/it] {'loss': 0.0001, 'grad_norm': 0.7009321433707154, 'learning_rate': 9.632e-07, 'completion_length': 168.44644165039062, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428656578064, 'kl': 0.00293731689453125, 'epoch': 0.04} + 4%|▎ | 92/2500 [19:57<8:43:19, 13.04s/it] 4%|▎ | 
93/2500 [20:10<8:43:14, 13.04s/it] {'loss': 0.0001, 'grad_norm': 0.5950894477581911, 'learning_rate': 9.628e-07, 'completion_length': 168.2857208251953, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1896214783191681, 'kl': 0.00353240966796875, 'epoch': 0.04} + 4%|▎ | 93/2500 [20:10<8:43:14, 13.04s/it] 4%|▍ | 94/2500 [20:22<8:36:15, 12.87s/it] {'loss': 0.0001, 'grad_norm': 0.5997691861595185, 'learning_rate': 9.624e-07, 'completion_length': 159.08929443359375, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838215708732605, 'kl': 0.0029144287109375, 'epoch': 0.04} + 4%|▍ | 94/2500 [20:22<8:36:15, 12.87s/it] 4%|▍ | 95/2500 [20:35<8:38:00, 12.92s/it] {'loss': 0.0001, 'grad_norm': 0.2215509844794417, 'learning_rate': 9.619999999999999e-07, 'completion_length': 165.3571548461914, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0036468505859375, 'epoch': 0.04} + 4%|▍ | 95/2500 [20:35<8:38:00, 12.92s/it] 4%|▍ | 96/2500 [20:47<8:19:11, 12.46s/it] {'loss': 0.0001, 'grad_norm': 0.6961089404450269, 'learning_rate': 9.616e-07, 'completion_length': 140.6964340209961, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1539071872830391, 'kl': 0.003204345703125, 'epoch': 0.04} + 4%|▍ | 96/2500 [20:47<8:19:11, 12.46s/it] 4%|▍ | 97/2500 [20:58<8:07:46, 12.18s/it] {'loss': 0.0001, 'grad_norm': 0.27226678227501766, 'learning_rate': 9.612e-07, 'completion_length': 148.9107208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0023193359375, 'epoch': 0.04} + 4%|▍ | 97/2500 [20:58<8:07:46, 12.18s/it] 4%|▍ | 98/2500 [21:10<8:07:40, 12.18s/it] {'loss': 0.0001, 
'grad_norm': 0.3506774999871765, 'learning_rate': 9.608e-07, 'completion_length': 155.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0035552978515625, 'epoch': 0.04} + 4%|▍ | 98/2500 [21:10<8:07:40, 12.18s/it] 4%|▍ | 99/2500 [21:27<9:02:49, 13.56s/it] {'loss': 0.0001, 'grad_norm': 0.3075803918813042, 'learning_rate': 9.604e-07, 'completion_length': 142.17858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00235748291015625, 'epoch': 0.04} + 4%|▍ | 99/2500 [21:27<9:02:49, 13.56s/it] 4%|▍ | 100/2500 [21:40<8:53:00, 13.33s/it] {'loss': 0.0001, 'grad_norm': 0.21092131649220772, 'learning_rate': 9.6e-07, 'completion_length': 147.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0032806396484375, 'epoch': 0.04} + 4%|▍ | 100/2500 [21:40<8:53:00, 13.33s/it] 4%|▍ | 101/2500 [25:06<47:29:29, 71.27s/it] {'loss': 0.0001, 'grad_norm': 0.2933587519417455, 'learning_rate': 9.595999999999999e-07, 'completion_length': 164.17857360839844, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.00334930419921875, 'epoch': 0.04} + 4%|▍ | 101/2500 [25:06<47:29:29, 71.27s/it] 4%|▍ | 102/2500 [25:25<37:02:58, 55.62s/it] {'loss': 0.0002, 'grad_norm': 0.3598633153576842, 'learning_rate': 9.592e-07, 'completion_length': 171.6428680419922, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.0714285746216774, 'kl': 0.0044708251953125, 'epoch': 0.04} + 4%|▍ | 102/2500 [25:25<37:02:58, 55.62s/it] 4%|▍ | 103/2500 [25:48<30:20:37, 45.57s/it] {'loss': 0.0001, 'grad_norm': 0.3537695041458193, 
'learning_rate': 9.588e-07, 'completion_length': 164.62500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0032806396484375, 'epoch': 0.04} + 4%|▍ | 103/2500 [25:48<30:20:37, 45.57s/it] 4%|▍ | 104/2500 [26:06<24:57:39, 37.50s/it] {'loss': 0.0002, 'grad_norm': 1.1573970076682512, 'learning_rate': 9.584e-07, 'completion_length': 161.8214340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.005218505859375, 'epoch': 0.04} + 4%|▍ | 104/2500 [26:06<24:57:39, 37.50s/it] 4%|▍ | 105/2500 [26:25<21:10:04, 31.82s/it] {'loss': 0.0001, 'grad_norm': 0.16094145651339994, 'learning_rate': 9.58e-07, 'completion_length': 155.0357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00316619873046875, 'epoch': 0.04} + 4%|▍ | 105/2500 [26:25<21:10:04, 31.82s/it] 4%|▍ | 106/2500 [26:43<18:30:57, 27.84s/it] {'loss': 0.0001, 'grad_norm': 0.46134280817560286, 'learning_rate': 9.576e-07, 'completion_length': 158.55358123779297, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.00362396240234375, 'epoch': 0.04} + 4%|▍ | 106/2500 [26:43<18:30:57, 27.84s/it] 4%|▍ | 107/2500 [27:02<16:43:42, 25.17s/it] {'loss': 0.0001, 'grad_norm': 1.0438677637411633, 'learning_rate': 9.572e-07, 'completion_length': 160.6071548461914, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.00366973876953125, 'epoch': 0.04} + 4%|▍ | 107/2500 [27:02<16:43:42, 25.17s/it] 4%|▍ | 108/2500 [27:21<15:26:18, 23.24s/it] {'loss': 0.0001, 'grad_norm': 0.026424473963697473, 'learning_rate': 9.567999999999999e-07, 
'completion_length': 164.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0030364990234375, 'epoch': 0.04} + 4%|▍ | 108/2500 [27:21<15:26:18, 23.24s/it] 4%|▍ | 109/2500 [27:40<14:30:00, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.5693842571577544, 'learning_rate': 9.564e-07, 'completion_length': 160.98214721679688, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.07695358991622925, 'kl': 0.00411224365234375, 'epoch': 0.04} + 4%|▍ | 109/2500 [27:40<14:30:00, 21.83s/it] 4%|▍ | 110/2500 [27:58<13:47:37, 20.78s/it] {'loss': 0.0002, 'grad_norm': 0.5337882637058703, 'learning_rate': 9.559999999999998e-07, 'completion_length': 159.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.0042572021484375, 'epoch': 0.04} + 4%|▍ | 110/2500 [27:58<13:47:37, 20.78s/it] 4%|▍ | 111/2500 [28:17<13:31:57, 20.39s/it] {'loss': 0.0001, 'grad_norm': 0.5657959357532742, 'learning_rate': 9.556e-07, 'completion_length': 161.3571548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.00353240966796875, 'epoch': 0.04} + 4%|▍ | 111/2500 [28:17<13:31:57, 20.39s/it] 4%|▍ | 112/2500 [28:37<13:23:45, 20.20s/it] {'loss': 0.0002, 'grad_norm': 0.26108218108572673, 'learning_rate': 9.552e-07, 'completion_length': 160.6607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.004150390625, 'epoch': 0.04} + 4%|▍ | 112/2500 [28:37<13:23:45, 20.20s/it] 5%|▍ | 113/2500 [28:56<13:10:14, 19.86s/it] {'loss': 0.0001, 'grad_norm': 0.9780133352998511, 'learning_rate': 9.548e-07, 'completion_length': 158.9107208251953, 'rewards/accuracy_reward': 
0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1428571492433548, 'kl': 0.00310516357421875, 'epoch': 0.05} + 5%|▍ | 113/2500 [28:56<13:10:14, 19.86s/it] 5%|▍ | 114/2500 [29:14<12:49:48, 19.36s/it] {'loss': 0.0001, 'grad_norm': 0.02244083068944957, 'learning_rate': 9.544e-07, 'completion_length': 154.5, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.002593994140625, 'epoch': 0.05} + 5%|▍ | 114/2500 [29:14<12:49:48, 19.36s/it] 5%|▍ | 115/2500 [29:32<12:27:30, 18.81s/it] {'loss': 0.0001, 'grad_norm': 0.34191406319875456, 'learning_rate': 9.539999999999999e-07, 'completion_length': 148.71429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00363922119140625, 'epoch': 0.05} + 5%|▍ | 115/2500 [29:32<12:27:30, 18.81s/it] 5%|▍ | 116/2500 [29:49<12:11:53, 18.42s/it] {'loss': 0.0001, 'grad_norm': 0.2839931123840804, 'learning_rate': 9.536e-07, 'completion_length': 157.21429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0031585693359375, 'epoch': 0.05} + 5%|▍ | 116/2500 [29:49<12:11:53, 18.42s/it] 5%|▍ | 117/2500 [30:08<12:10:12, 18.39s/it] {'loss': 0.0002, 'grad_norm': 0.5217553868253103, 'learning_rate': 9.532e-07, 'completion_length': 168.1607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00397491455078125, 'epoch': 0.05} + 5%|▍ | 117/2500 [30:08<12:10:12, 18.39s/it] 5%|▍ | 118/2500 [30:26<12:02:57, 18.21s/it] {'loss': 0.0001, 'grad_norm': 0.36169020323167494, 'learning_rate': 9.527999999999999e-07, 'completion_length': 163.5357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.00311279296875, 'epoch': 0.05} + 5%|▍ | 118/2500 [30:26<12:02:57, 18.21s/it] 5%|▍ | 119/2500 [30:44<12:01:57, 18.19s/it] {'loss': 0.0001, 'grad_norm': 0.7735918576826517, 'learning_rate': 9.524e-07, 'completion_length': 158.5357208251953, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.0035552978515625, 'epoch': 0.05} + 5%|▍ | 119/2500 [30:44<12:01:57, 18.19s/it] 5%|▍ | 120/2500 [31:03<12:18:59, 18.63s/it] {'loss': 0.0001, 'grad_norm': 0.3782998131247698, 'learning_rate': 9.52e-07, 'completion_length': 163.37500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00366973876953125, 'epoch': 0.05} + 5%|▍ | 120/2500 [31:03<12:18:59, 18.63s/it] 5%|▍ | 121/2500 [31:22<12:17:11, 18.59s/it] {'loss': 0.0001, 'grad_norm': 0.4137054442397862, 'learning_rate': 9.515999999999999e-07, 'completion_length': 162.80358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00275421142578125, 'epoch': 0.05} + 5%|▍ | 121/2500 [31:22<12:17:11, 18.59s/it] 5%|▍ | 122/2500 [31:42<12:34:38, 19.04s/it] {'loss': 0.0001, 'grad_norm': 0.3616080867162566, 'learning_rate': 9.512e-07, 'completion_length': 173.0178680419922, 'rewards/accuracy_reward': 0.8035714328289032, 'rewards/format_reward': 1.0, 'reward': 1.8035714626312256, 'reward_std': 0.07695358991622925, 'kl': 0.0033416748046875, 'epoch': 0.05} + 5%|▍ | 122/2500 [31:42<12:34:38, 19.04s/it] 5%|▍ | 123/2500 [32:01<12:30:21, 18.94s/it] {'loss': 0.0001, 'grad_norm': 0.26317837546029915, 'learning_rate': 9.508e-07, 'completion_length': 158.01786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 
'reward_std': 0.0714285746216774, 'kl': 0.00292205810546875, 'epoch': 0.05} + 5%|▍ | 123/2500 [32:01<12:30:21, 18.94s/it] 5%|▍ | 124/2500 [32:25<13:39:56, 20.71s/it] {'loss': 0.0001, 'grad_norm': 0.4332148400151169, 'learning_rate': 9.503999999999999e-07, 'completion_length': 164.87500762939453, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.00336456298828125, 'epoch': 0.05} + 5%|▍ | 124/2500 [32:25<13:39:56, 20.71s/it] 5%|▌ | 125/2500 [32:43<13:06:58, 19.88s/it] {'loss': 0.0001, 'grad_norm': 0.8612748989979485, 'learning_rate': 9.499999999999999e-07, 'completion_length': 155.71429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.00310516357421875, 'epoch': 0.05} + 5%|▌ | 125/2500 [32:43<13:06:58, 19.88s/it] 5%|▌ | 126/2500 [33:02<12:51:11, 19.49s/it] {'loss': 0.0001, 'grad_norm': 0.8005216231570086, 'learning_rate': 9.496e-07, 'completion_length': 156.42858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.00347900390625, 'epoch': 0.05} + 5%|▌ | 126/2500 [33:02<12:51:11, 19.49s/it] 5%|▌ | 127/2500 [33:21<12:40:21, 19.23s/it] {'loss': 0.0001, 'grad_norm': 0.6406058102895139, 'learning_rate': 9.492e-07, 'completion_length': 157.5357208251953, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.00334930419921875, 'epoch': 0.05} + 5%|▌ | 127/2500 [33:21<12:40:21, 19.23s/it] 5%|▌ | 128/2500 [33:39<12:29:22, 18.96s/it] {'loss': 0.0002, 'grad_norm': 0.4941018984687601, 'learning_rate': 9.487999999999999e-07, 'completion_length': 172.75000762939453, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 
'reward_std': 0.1428571529686451, 'kl': 0.0040435791015625, 'epoch': 0.05} + 5%|▌ | 128/2500 [33:39<12:29:22, 18.96s/it] 5%|▌ | 129/2500 [33:57<12:16:39, 18.64s/it] {'loss': 0.0001, 'grad_norm': 0.5665788737513436, 'learning_rate': 9.484e-07, 'completion_length': 150.58929443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.00370025634765625, 'epoch': 0.05} + 5%|▌ | 129/2500 [33:57<12:16:39, 18.64s/it] 5%|▌ | 130/2500 [34:15<12:04:52, 18.35s/it] {'loss': 0.0001, 'grad_norm': 0.48594640273575934, 'learning_rate': 9.479999999999999e-07, 'completion_length': 142.39286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00335693359375, 'epoch': 0.05} + 5%|▌ | 130/2500 [34:15<12:04:52, 18.35s/it] 5%|▌ | 131/2500 [34:33<12:04:41, 18.35s/it] {'loss': 0.0001, 'grad_norm': 0.025197322128300455, 'learning_rate': 9.475999999999999e-07, 'completion_length': 171.21428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003509521484375, 'epoch': 0.05} + 5%|▌ | 131/2500 [34:33<12:04:41, 18.35s/it] 5%|▌ | 132/2500 [34:51<12:00:03, 18.24s/it] {'loss': 0.0001, 'grad_norm': 0.26071100900081395, 'learning_rate': 9.472e-07, 'completion_length': 147.73214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00347137451171875, 'epoch': 0.05} + 5%|▌ | 132/2500 [34:51<12:00:03, 18.24s/it] 5%|▌ | 133/2500 [35:08<11:42:09, 17.80s/it] {'loss': 0.0001, 'grad_norm': 0.01793645879929042, 'learning_rate': 9.468e-07, 'completion_length': 132.39286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00240325927734375, 'epoch': 0.05} + 
5%|▌ | 133/2500 [35:08<11:42:09, 17.80s/it] 5%|▌ | 134/2500 [35:26<11:44:59, 17.88s/it] {'loss': 0.0001, 'grad_norm': 0.025355617197258547, 'learning_rate': 9.464e-07, 'completion_length': 157.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003021240234375, 'epoch': 0.05} + 5%|▌ | 134/2500 [35:26<11:44:59, 17.88s/it] 5%|▌ | 135/2500 [35:44<11:49:19, 18.00s/it] {'loss': 0.0002, 'grad_norm': 0.2659210275659812, 'learning_rate': 9.459999999999999e-07, 'completion_length': 165.30357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00439453125, 'epoch': 0.05} + 5%|▌ | 135/2500 [35:44<11:49:19, 18.00s/it] 5%|▌ | 136/2500 [36:02<11:49:41, 18.01s/it] {'loss': 0.0001, 'grad_norm': 0.8720434004309489, 'learning_rate': 9.456e-07, 'completion_length': 161.5, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.003570556640625, 'epoch': 0.05} + 5%|▌ | 136/2500 [36:02<11:49:41, 18.01s/it] 5%|▌ | 137/2500 [36:21<11:57:49, 18.23s/it] {'loss': 0.0001, 'grad_norm': 0.028261033964772458, 'learning_rate': 9.452e-07, 'completion_length': 159.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0028839111328125, 'epoch': 0.05} + 5%|▌ | 137/2500 [36:21<11:57:49, 18.23s/it] 6%|▌ | 138/2500 [36:41<12:18:50, 18.77s/it] {'loss': 0.0002, 'grad_norm': 0.2894428691921582, 'learning_rate': 9.447999999999999e-07, 'completion_length': 165.42857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00400543212890625, 'epoch': 0.06} + 6%|▌ | 138/2500 [36:41<12:18:50, 18.77s/it] 6%|▌ | 139/2500 [36:59<12:12:34, 18.62s/it] {'loss': 0.0001, 'grad_norm': 
0.20102352786921007, 'learning_rate': 9.444e-07, 'completion_length': 161.64286041259766, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00362396240234375, 'epoch': 0.06} + 6%|▌ | 139/2500 [36:59<12:12:34, 18.62s/it] 6%|▌ | 140/2500 [37:17<12:07:06, 18.49s/it] {'loss': 0.0001, 'grad_norm': 0.3216064710597968, 'learning_rate': 9.439999999999999e-07, 'completion_length': 149.3571548461914, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0824786126613617, 'kl': 0.00345611572265625, 'epoch': 0.06} + 6%|▌ | 140/2500 [37:17<12:07:06, 18.49s/it] 6%|▌ | 141/2500 [37:36<12:10:28, 18.58s/it] {'loss': 0.0002, 'grad_norm': 0.6626676108124474, 'learning_rate': 9.436e-07, 'completion_length': 166.12500762939453, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.11266787722706795, 'kl': 0.00421142578125, 'epoch': 0.06} + 6%|▌ | 141/2500 [37:36<12:10:28, 18.58s/it] 6%|▌ | 142/2500 [37:55<12:17:31, 18.77s/it] {'loss': 0.0002, 'grad_norm': 0.47937101993630565, 'learning_rate': 9.432e-07, 'completion_length': 169.8214340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.00411224365234375, 'epoch': 0.06} + 6%|▌ | 142/2500 [37:55<12:17:31, 18.77s/it] 6%|▌ | 143/2500 [38:14<12:23:04, 18.92s/it] {'loss': 0.0002, 'grad_norm': 0.6338709297472527, 'learning_rate': 9.427999999999999e-07, 'completion_length': 161.50000762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8392857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.00385284423828125, 'epoch': 0.06} + 6%|▌ | 143/2500 [38:14<12:23:04, 18.92s/it] 6%|▌ | 144/2500 [38:32<12:10:25, 18.60s/it] {'loss': 0.0002, 'grad_norm': 
0.6176932897396372, 'learning_rate': 9.424e-07, 'completion_length': 151.1607208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.00396728515625, 'epoch': 0.06} + 6%|▌ | 144/2500 [38:32<12:10:25, 18.60s/it] 6%|▌ | 145/2500 [38:50<12:03:16, 18.43s/it] {'loss': 0.0001, 'grad_norm': 0.24856087598059218, 'learning_rate': 9.419999999999999e-07, 'completion_length': 147.98214721679688, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00278472900390625, 'epoch': 0.06} + 6%|▌ | 145/2500 [38:50<12:03:16, 18.43s/it] 6%|▌ | 146/2500 [39:08<11:52:10, 18.15s/it] {'loss': 0.0001, 'grad_norm': 0.19622414924222453, 'learning_rate': 9.415999999999999e-07, 'completion_length': 155.6607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00296783447265625, 'epoch': 0.06} + 6%|▌ | 146/2500 [39:08<11:52:10, 18.15s/it] 6%|▌ | 147/2500 [39:26<11:56:26, 18.27s/it] {'loss': 0.0001, 'grad_norm': 0.8475374217253502, 'learning_rate': 9.412e-07, 'completion_length': 161.0357208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0030670166015625, 'epoch': 0.06} + 6%|▌ | 147/2500 [39:26<11:56:26, 18.27s/it] 6%|▌ | 148/2500 [39:45<11:59:43, 18.36s/it] {'loss': 0.0002, 'grad_norm': 0.8900783003055508, 'learning_rate': 9.408e-07, 'completion_length': 160.7857208251953, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.2253357619047165, 'kl': 0.0058746337890625, 'epoch': 0.06} + 6%|▌ | 148/2500 [39:45<11:59:43, 18.36s/it] 6%|▌ | 149/2500 [40:03<11:59:50, 18.37s/it] {'loss': 0.0002, 'grad_norm': 0.030162718638747917, 
'learning_rate': 9.403999999999999e-07, 'completion_length': 156.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00376129150390625, 'epoch': 0.06} + 6%|▌ | 149/2500 [40:03<11:59:50, 18.37s/it] 6%|▌ | 150/2500 [40:22<12:06:08, 18.54s/it] {'loss': 0.0002, 'grad_norm': 0.042709485533498334, 'learning_rate': 9.399999999999999e-07, 'completion_length': 162.73214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.004669189453125, 'epoch': 0.06} + 6%|▌ | 150/2500 [40:22<12:06:08, 18.54s/it] 6%|▌ | 151/2500 [40:41<12:05:10, 18.52s/it] {'loss': 0.0001, 'grad_norm': 0.24179301017453822, 'learning_rate': 9.396e-07, 'completion_length': 160.5714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.003448486328125, 'epoch': 0.06} + 6%|▌ | 151/2500 [40:41<12:05:10, 18.52s/it] 6%|▌ | 152/2500 [40:58<11:51:44, 18.19s/it] {'loss': 0.0001, 'grad_norm': 1.2112105000265225, 'learning_rate': 9.391999999999999e-07, 'completion_length': 145.6964340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.00360870361328125, 'epoch': 0.06} + 6%|▌ | 152/2500 [40:58<11:51:44, 18.19s/it] 6%|▌ | 153/2500 [41:18<12:14:47, 18.78s/it] {'loss': 0.0002, 'grad_norm': 0.8875463631334103, 'learning_rate': 9.387999999999999e-07, 'completion_length': 159.46429443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.0045623779296875, 'epoch': 0.06} + 6%|▌ | 153/2500 [41:18<12:14:47, 18.78s/it] 6%|▌ | 154/2500 [41:38<12:20:49, 18.95s/it] {'loss': 0.0002, 'grad_norm': 0.2676357499991341, 'learning_rate': 9.384e-07, 'completion_length': 
160.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00395965576171875, 'epoch': 0.06} + 6%|▌ | 154/2500 [41:38<12:20:49, 18.95s/it] 6%|▌ | 155/2500 [41:56<12:15:34, 18.82s/it] {'loss': 0.0001, 'grad_norm': 0.2024598160423818, 'learning_rate': 9.379999999999998e-07, 'completion_length': 143.4821548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00337982177734375, 'epoch': 0.06} + 6%|▌ | 155/2500 [41:56<12:15:34, 18.82s/it] 6%|▌ | 156/2500 [42:14<12:06:53, 18.61s/it] {'loss': 0.0001, 'grad_norm': 0.4140544640652578, 'learning_rate': 9.375999999999999e-07, 'completion_length': 149.05357360839844, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0824786126613617, 'kl': 0.00337982177734375, 'epoch': 0.06} + 6%|▌ | 156/2500 [42:14<12:06:53, 18.61s/it] 6%|▋ | 157/2500 [42:32<11:59:37, 18.43s/it] {'loss': 0.0002, 'grad_norm': 0.03120622168613659, 'learning_rate': 9.372e-07, 'completion_length': 155.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004425048828125, 'epoch': 0.06} + 6%|▋ | 157/2500 [42:32<11:59:37, 18.43s/it] 6%|▋ | 158/2500 [42:51<11:58:21, 18.40s/it] {'loss': 0.0002, 'grad_norm': 0.022431447753869703, 'learning_rate': 9.368e-07, 'completion_length': 162.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00463104248046875, 'epoch': 0.06} + 6%|▋ | 158/2500 [42:51<11:58:21, 18.40s/it] 6%|▋ | 159/2500 [43:08<11:49:24, 18.18s/it] {'loss': 0.0002, 'grad_norm': 0.4489156288532507, 'learning_rate': 9.363999999999999e-07, 'completion_length': 151.89286041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 
1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00376129150390625, 'epoch': 0.06} + 6%|▋ | 159/2500 [43:08<11:49:24, 18.18s/it] 6%|▋ | 160/2500 [43:27<11:59:24, 18.45s/it] {'loss': 0.0002, 'grad_norm': 0.03727952531272341, 'learning_rate': 9.36e-07, 'completion_length': 160.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00445556640625, 'epoch': 0.06} + 6%|▋ | 160/2500 [43:27<11:59:24, 18.45s/it] 6%|▋ | 161/2500 [43:46<11:55:24, 18.35s/it] {'loss': 0.0002, 'grad_norm': 0.47640981314367187, 'learning_rate': 9.356e-07, 'completion_length': 153.83929443359375, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.8214285969734192, 'reward_std': 0.04123930633068085, 'kl': 0.004364013671875, 'epoch': 0.06} + 6%|▋ | 161/2500 [43:46<11:55:24, 18.35s/it] 6%|▋ | 162/2500 [44:03<11:43:37, 18.06s/it] {'loss': 0.0002, 'grad_norm': 0.5993230118030086, 'learning_rate': 9.352e-07, 'completion_length': 144.92858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.004119873046875, 'epoch': 0.06} + 6%|▋ | 162/2500 [44:03<11:43:37, 18.06s/it] 7%|▋ | 163/2500 [44:23<12:12:07, 18.80s/it] {'loss': 0.0002, 'grad_norm': 0.584868431437334, 'learning_rate': 9.347999999999999e-07, 'completion_length': 177.7678680419922, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.1649572253227234, 'kl': 0.0061187744140625, 'epoch': 0.07} + 7%|▋ | 163/2500 [44:23<12:12:07, 18.80s/it] 7%|▋ | 164/2500 [44:42<12:04:42, 18.61s/it] {'loss': 0.0002, 'grad_norm': 0.2638733150875771, 'learning_rate': 9.344e-07, 'completion_length': 149.4107208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 
0.0037994384765625, 'epoch': 0.07} + 7%|▋ | 164/2500 [44:42<12:04:42, 18.61s/it] 7%|▋ | 165/2500 [45:00<12:02:28, 18.56s/it] {'loss': 0.0001, 'grad_norm': 0.612515955142151, 'learning_rate': 9.34e-07, 'completion_length': 137.6607208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.00372314453125, 'epoch': 0.07} + 7%|▋ | 165/2500 [45:00<12:02:28, 18.56s/it] 7%|▋ | 166/2500 [45:18<11:58:43, 18.48s/it] {'loss': 0.0002, 'grad_norm': 0.5778012785736512, 'learning_rate': 9.335999999999999e-07, 'completion_length': 146.96428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.0051116943359375, 'epoch': 0.07} + 7%|▋ | 166/2500 [45:18<11:58:43, 18.48s/it] 7%|▋ | 167/2500 [45:37<11:59:41, 18.51s/it] {'loss': 0.0002, 'grad_norm': 0.32827395220820293, 'learning_rate': 9.332e-07, 'completion_length': 156.67857360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00434112548828125, 'epoch': 0.07} + 7%|▋ | 167/2500 [45:37<11:59:41, 18.51s/it] 7%|▋ | 168/2500 [45:55<11:49:57, 18.27s/it] {'loss': 0.0002, 'grad_norm': 1.0332398745029978, 'learning_rate': 9.327999999999999e-07, 'completion_length': 142.32144165039062, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0047454833984375, 'epoch': 0.07} + 7%|▋ | 168/2500 [45:55<11:49:57, 18.27s/it] 7%|▋ | 169/2500 [46:14<11:59:45, 18.53s/it] {'loss': 0.0001, 'grad_norm': 0.604975834336903, 'learning_rate': 9.324e-07, 'completion_length': 150.76786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.003509521484375, 
'epoch': 0.07} + 7%|▋ | 169/2500 [46:14<11:59:45, 18.53s/it] 7%|▋ | 170/2500 [46:32<11:55:00, 18.41s/it] {'loss': 0.0001, 'grad_norm': 0.49147711929238036, 'learning_rate': 9.32e-07, 'completion_length': 157.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0032958984375, 'epoch': 0.07} + 7%|▋ | 170/2500 [46:32<11:55:00, 18.41s/it] 7%|▋ | 171/2500 [46:51<11:59:27, 18.53s/it] {'loss': 0.0001, 'grad_norm': 0.32884336054786417, 'learning_rate': 9.315999999999999e-07, 'completion_length': 154.26786041259766, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.00347900390625, 'epoch': 0.07} + 7%|▋ | 171/2500 [46:51<11:59:27, 18.53s/it] 7%|▋ | 172/2500 [47:10<12:11:54, 18.86s/it] {'loss': 0.0001, 'grad_norm': 0.5301647090541541, 'learning_rate': 9.312e-07, 'completion_length': 149.73214721679688, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.00348663330078125, 'epoch': 0.07} + 7%|▋ | 172/2500 [47:10<12:11:54, 18.86s/it] 7%|▋ | 173/2500 [47:29<12:10:29, 18.84s/it] {'loss': 0.0003, 'grad_norm': 0.0268717783556879, 'learning_rate': 9.307999999999999e-07, 'completion_length': 162.1964340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00628662109375, 'epoch': 0.07} + 7%|▋ | 173/2500 [47:29<12:10:29, 18.84s/it] 7%|▋ | 174/2500 [47:47<11:59:34, 18.56s/it] {'loss': 0.0002, 'grad_norm': 0.4470400821181589, 'learning_rate': 9.303999999999999e-07, 'completion_length': 151.73214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0048980712890625, 'epoch': 0.07} + 7%|▋ | 
174/2500 [47:47<11:59:34, 18.56s/it] 7%|▋ | 175/2500 [48:05<11:55:11, 18.46s/it] {'loss': 0.0001, 'grad_norm': 0.02242521944009264, 'learning_rate': 9.3e-07, 'completion_length': 159.21429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00372314453125, 'epoch': 0.07} + 7%|▋ | 175/2500 [48:05<11:55:11, 18.46s/it] 7%|▋ | 176/2500 [48:23<11:44:44, 18.19s/it] {'loss': 0.0001, 'grad_norm': 0.47562536793337395, 'learning_rate': 9.296e-07, 'completion_length': 136.60714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00278472900390625, 'epoch': 0.07} + 7%|▋ | 176/2500 [48:23<11:44:44, 18.19s/it] 7%|▋ | 177/2500 [48:41<11:41:57, 18.13s/it] {'loss': 0.0001, 'grad_norm': 0.03108289469234393, 'learning_rate': 9.292e-07, 'completion_length': 161.55358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00283050537109375, 'epoch': 0.07} + 7%|▋ | 177/2500 [48:41<11:41:57, 18.13s/it] 7%|▋ | 178/2500 [49:00<11:56:39, 18.52s/it] {'loss': 0.0002, 'grad_norm': 1.6068158207484944, 'learning_rate': 9.287999999999999e-07, 'completion_length': 173.62500762939453, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.11266787722706795, 'kl': 0.0050506591796875, 'epoch': 0.07} + 7%|▋ | 178/2500 [49:00<11:56:39, 18.52s/it] 7%|▋ | 179/2500 [49:18<11:48:49, 18.32s/it] {'loss': 0.0002, 'grad_norm': 0.366207382976694, 'learning_rate': 9.284e-07, 'completion_length': 142.25000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0040283203125, 'epoch': 0.07} + 7%|▋ | 179/2500 [49:18<11:48:49, 18.32s/it] 7%|▋ | 180/2500 [49:35<11:30:37, 17.86s/it] {'loss': 
0.0001, 'grad_norm': 0.4735332399116362, 'learning_rate': 9.28e-07, 'completion_length': 145.8571548461914, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0037384033203125, 'epoch': 0.07} + 7%|▋ | 180/2500 [49:35<11:30:37, 17.86s/it] 7%|▋ | 181/2500 [49:53<11:28:40, 17.82s/it] {'loss': 0.0002, 'grad_norm': 0.021735997533872196, 'learning_rate': 9.275999999999999e-07, 'completion_length': 140.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00431060791015625, 'epoch': 0.07} + 7%|▋ | 181/2500 [49:53<11:28:40, 17.82s/it] 7%|▋ | 182/2500 [50:10<11:28:16, 17.82s/it] {'loss': 0.0001, 'grad_norm': 0.2707382285679621, 'learning_rate': 9.272e-07, 'completion_length': 147.87500762939453, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.003326416015625, 'epoch': 0.07} + 7%|▋ | 182/2500 [50:10<11:28:16, 17.82s/it] 7%|▋ | 183/2500 [50:28<11:19:22, 17.59s/it] {'loss': 0.0001, 'grad_norm': 0.018419579999029122, 'learning_rate': 9.268e-07, 'completion_length': 148.60714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.003509521484375, 'epoch': 0.07} + 7%|▋ | 183/2500 [50:28<11:19:22, 17.59s/it] 7%|▋ | 184/2500 [50:45<11:23:08, 17.70s/it] {'loss': 0.0002, 'grad_norm': 0.4342475290020211, 'learning_rate': 9.263999999999999e-07, 'completion_length': 152.55358123779297, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1428571492433548, 'kl': 0.0048828125, 'epoch': 0.07} + 7%|▋ | 184/2500 [50:45<11:23:08, 17.70s/it] 7%|▋ | 185/2500 [51:04<11:29:49, 17.88s/it] {'loss': 0.0002, 'grad_norm': 0.019341947431612466, 'learning_rate': 9.26e-07, 
'completion_length': 150.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0039825439453125, 'epoch': 0.07} + 7%|▋ | 185/2500 [51:04<11:29:49, 17.88s/it] 7%|▋ | 186/2500 [51:22<11:39:14, 18.13s/it] {'loss': 0.0002, 'grad_norm': 0.7527706604952392, 'learning_rate': 9.256e-07, 'completion_length': 154.71429443359375, 'rewards/accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.8392857313156128, 'reward_std': 0.07695358991622925, 'kl': 0.00470733642578125, 'epoch': 0.07} + 7%|▋ | 186/2500 [51:22<11:39:14, 18.13s/it] 7%|▋ | 187/2500 [51:41<11:41:38, 18.20s/it] {'loss': 0.0001, 'grad_norm': 0.7921321039250159, 'learning_rate': 9.251999999999999e-07, 'completion_length': 160.33929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.003406524658203125, 'epoch': 0.07} + 7%|▋ | 187/2500 [51:41<11:41:38, 18.20s/it] 8%|▊ | 188/2500 [51:59<11:39:28, 18.15s/it] {'loss': 0.0001, 'grad_norm': 0.6939426551456743, 'learning_rate': 9.247999999999999e-07, 'completion_length': 153.7678680419922, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.1071428619325161, 'kl': 0.00373077392578125, 'epoch': 0.08} + 8%|▊ | 188/2500 [51:59<11:39:28, 18.15s/it] 8%|▊ | 189/2500 [52:15<11:20:01, 17.66s/it] {'loss': 0.0001, 'grad_norm': 0.2026466041395766, 'learning_rate': 9.244e-07, 'completion_length': 136.6607208251953, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.00336456298828125, 'epoch': 0.08} + 8%|▊ | 189/2500 [52:15<11:20:01, 17.66s/it] 8%|▊ | 190/2500 [52:33<11:20:00, 17.66s/it] {'loss': 0.0001, 'grad_norm': 0.4220953878241498, 'learning_rate': 9.24e-07, 'completion_length': 139.9107208251953, 
'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.0034637451171875, 'epoch': 0.08} + 8%|▊ | 190/2500 [52:33<11:20:00, 17.66s/it] 8%|▊ | 191/2500 [52:52<11:30:53, 17.95s/it] {'loss': 0.0002, 'grad_norm': 3.6786340800575243, 'learning_rate': 9.235999999999999e-07, 'completion_length': 157.3214340209961, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.15943220257759094, 'kl': 0.0057830810546875, 'epoch': 0.08} + 8%|▊ | 191/2500 [52:52<11:30:53, 17.95s/it] 8%|▊ | 192/2500 [53:10<11:39:32, 18.19s/it] {'loss': 0.0002, 'grad_norm': 0.052745441797679975, 'learning_rate': 9.232e-07, 'completion_length': 154.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0044403076171875, 'epoch': 0.08} + 8%|▊ | 192/2500 [53:10<11:39:32, 18.19s/it] 8%|▊ | 193/2500 [53:31<12:01:03, 18.75s/it] {'loss': 0.0002, 'grad_norm': 0.4283628524899491, 'learning_rate': 9.227999999999999e-07, 'completion_length': 175.35714721679688, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.0714285746216774, 'kl': 0.0046539306640625, 'epoch': 0.08} + 8%|▊ | 193/2500 [53:31<12:01:03, 18.75s/it] 8%|▊ | 194/2500 [53:48<11:50:21, 18.48s/it] {'loss': 0.0002, 'grad_norm': 0.28788813783774864, 'learning_rate': 9.224e-07, 'completion_length': 142.5178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0049896240234375, 'epoch': 0.08} + 8%|▊ | 194/2500 [53:48<11:50:21, 18.48s/it] 8%|▊ | 195/2500 [54:06<11:40:23, 18.23s/it] {'loss': 0.0002, 'grad_norm': 0.07793617877503926, 'learning_rate': 9.22e-07, 'completion_length': 156.92858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0040130615234375, 'epoch': 0.08} + 8%|▊ | 195/2500 [54:06<11:40:23, 18.23s/it] 8%|▊ | 196/2500 [54:24<11:37:50, 18.17s/it] {'loss': 0.0001, 'grad_norm': 0.014787280617003738, 'learning_rate': 9.215999999999999e-07, 'completion_length': 161.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003509521484375, 'epoch': 0.08} + 8%|▊ | 196/2500 [54:24<11:37:50, 18.17s/it] 8%|▊ | 197/2500 [54:43<11:46:42, 18.41s/it] {'loss': 0.0002, 'grad_norm': 0.4860194839805526, 'learning_rate': 9.212e-07, 'completion_length': 169.92858123779297, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.1428571529686451, 'kl': 0.0046844482421875, 'epoch': 0.08} + 8%|▊ | 197/2500 [54:43<11:46:42, 18.41s/it] 8%|▊ | 198/2500 [55:03<12:03:35, 18.86s/it] {'loss': 0.0002, 'grad_norm': 0.19999636052920786, 'learning_rate': 9.207999999999999e-07, 'completion_length': 162.4107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00402069091796875, 'epoch': 0.08} + 8%|▊ | 198/2500 [55:03<12:03:35, 18.86s/it] 8%|▊ | 199/2500 [55:21<11:51:32, 18.55s/it] {'loss': 0.0002, 'grad_norm': 0.43118689415696493, 'learning_rate': 9.203999999999999e-07, 'completion_length': 147.17858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0049896240234375, 'epoch': 0.08} + 8%|▊ | 199/2500 [55:21<11:51:32, 18.55s/it] 8%|▊ | 200/2500 [55:39<11:52:00, 18.57s/it] {'loss': 0.0001, 'grad_norm': 0.23073039123076872, 'learning_rate': 9.2e-07, 'completion_length': 159.9464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 
'reward_std': 0.0357142873108387, 'kl': 0.00345611572265625, 'epoch': 0.08} + 8%|▊ | 200/2500 [55:39<11:52:00, 18.57s/it] 8%|▊ | 201/2500 [58:47<44:09:22, 69.14s/it] {'loss': 0.0001, 'grad_norm': 0.2557484655536785, 'learning_rate': 9.196e-07, 'completion_length': 153.3571548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00345611572265625, 'epoch': 0.08} + 8%|▊ | 201/2500 [58:47<44:09:22, 69.14s/it] 8%|▊ | 202/2500 [59:05<34:26:38, 53.96s/it] {'loss': 0.0002, 'grad_norm': 1.081972536189656, 'learning_rate': 9.192e-07, 'completion_length': 183.85714721679688, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.11266787722706795, 'kl': 0.006072998046875, 'epoch': 0.08} + 8%|▊ | 202/2500 [59:05<34:26:38, 53.96s/it] 8%|▊ | 203/2500 [59:24<27:38:53, 43.33s/it] {'loss': 0.0001, 'grad_norm': 0.8804321333757363, 'learning_rate': 9.187999999999999e-07, 'completion_length': 156.5357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00339508056640625, 'epoch': 0.08} + 8%|▊ | 203/2500 [59:24<27:38:53, 43.33s/it] 8%|▊ | 204/2500 [59:41<22:44:36, 35.66s/it] {'loss': 0.0002, 'grad_norm': 0.3794923391210515, 'learning_rate': 9.184e-07, 'completion_length': 164.87500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0039520263671875, 'epoch': 0.08} + 8%|▊ | 204/2500 [59:41<22:44:36, 35.66s/it] 8%|▊ | 205/2500 [59:59<19:22:18, 30.39s/it] {'loss': 0.0002, 'grad_norm': 0.33497112380899796, 'learning_rate': 9.18e-07, 'completion_length': 162.6071548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 
'kl': 0.0049896240234375, 'epoch': 0.08} + 8%|▊ | 205/2500 [59:59<19:22:18, 30.39s/it] 8%|▊ | 206/2500 [1:00:18<17:00:50, 26.70s/it] {'loss': 0.0002, 'grad_norm': 0.40268240246734566, 'learning_rate': 9.175999999999999e-07, 'completion_length': 160.5178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0039825439453125, 'epoch': 0.08} + 8%|▊ | 206/2500 [1:00:18<17:00:50, 26.70s/it] 8%|▊ | 207/2500 [1:00:35<15:16:47, 23.99s/it] {'loss': 0.0003, 'grad_norm': 0.5448821354797603, 'learning_rate': 9.172e-07, 'completion_length': 152.76786041259766, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.006591796875, 'epoch': 0.08} + 8%|▊ | 207/2500 [1:00:35<15:16:47, 23.99s/it] 8%|▊ | 208/2500 [1:00:54<14:11:30, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.13599967232228624, 'learning_rate': 9.168e-07, 'completion_length': 147.12500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0042877197265625, 'epoch': 0.08} + 8%|▊ | 208/2500 [1:00:54<14:11:30, 22.29s/it] 8%|▊ | 209/2500 [1:01:12<13:24:23, 21.07s/it] {'loss': 0.0002, 'grad_norm': 0.44145168950292496, 'learning_rate': 9.163999999999999e-07, 'completion_length': 164.83929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005706787109375, 'epoch': 0.08} + 8%|▊ | 209/2500 [1:01:12<13:24:23, 21.07s/it] 8%|▊ | 210/2500 [1:01:31<13:05:49, 20.59s/it] {'loss': 0.0002, 'grad_norm': 0.37381160890376397, 'learning_rate': 9.16e-07, 'completion_length': 168.60714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00494384765625, 'epoch': 0.08} + 8%|▊ | 
210/2500 [1:01:31<13:05:49, 20.59s/it] 8%|▊ | 211/2500 [1:01:50<12:48:57, 20.16s/it] {'loss': 0.0001, 'grad_norm': 0.0573648335191267, 'learning_rate': 9.156e-07, 'completion_length': 143.71429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0037384033203125, 'epoch': 0.08} + 8%|▊ | 211/2500 [1:01:50<12:48:57, 20.16s/it] 8%|▊ | 212/2500 [1:02:08<12:22:03, 19.46s/it] {'loss': 0.0002, 'grad_norm': 0.29708454704496745, 'learning_rate': 9.151999999999999e-07, 'completion_length': 160.85714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0045166015625, 'epoch': 0.08} + 8%|▊ | 212/2500 [1:02:08<12:22:03, 19.46s/it] 9%|▊ | 213/2500 [1:02:26<12:07:29, 19.09s/it] {'loss': 0.0002, 'grad_norm': 0.3408685489773149, 'learning_rate': 9.147999999999999e-07, 'completion_length': 146.9107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005035400390625, 'epoch': 0.09} + 9%|▊ | 213/2500 [1:02:26<12:07:29, 19.09s/it] 9%|▊ | 214/2500 [1:02:46<12:07:56, 19.11s/it] {'loss': 0.0002, 'grad_norm': 1.4517915825237, 'learning_rate': 9.144e-07, 'completion_length': 166.83928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0054779052734375, 'epoch': 0.09} + 9%|▊ | 214/2500 [1:02:46<12:07:56, 19.11s/it] 9%|▊ | 215/2500 [1:03:04<11:56:14, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.2617856643121301, 'learning_rate': 9.14e-07, 'completion_length': 151.4464340209961, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.04123930633068085, 'kl': 0.0052032470703125, 'epoch': 0.09} + 9%|▊ | 215/2500 
[1:03:04<11:56:14, 18.81s/it] 9%|▊ | 216/2500 [1:03:23<12:06:51, 19.09s/it] {'loss': 0.0002, 'grad_norm': 0.2967599895337942, 'learning_rate': 9.135999999999999e-07, 'completion_length': 175.73214721679688, 'rewards/accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.0714285746216774, 'kl': 0.00467681884765625, 'epoch': 0.09} + 9%|▊ | 216/2500 [1:03:23<12:06:51, 19.09s/it] 9%|▊ | 217/2500 [1:03:41<11:52:17, 18.72s/it] {'loss': 0.0002, 'grad_norm': 0.20822264198483423, 'learning_rate': 9.132e-07, 'completion_length': 152.14286041259766, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00421905517578125, 'epoch': 0.09} + 9%|▊ | 217/2500 [1:03:41<11:52:17, 18.72s/it] 9%|▊ | 218/2500 [1:04:00<11:55:29, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.41823406127021295, 'learning_rate': 9.127999999999999e-07, 'completion_length': 160.35714721679688, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.003875732421875, 'epoch': 0.09} + 9%|▊ | 218/2500 [1:04:00<11:55:29, 18.81s/it] 9%|▉ | 219/2500 [1:04:18<11:47:58, 18.62s/it] {'loss': 0.0002, 'grad_norm': 0.01635938607143911, 'learning_rate': 9.123999999999999e-07, 'completion_length': 148.42858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0044403076171875, 'epoch': 0.09} + 9%|▉ | 219/2500 [1:04:18<11:47:58, 18.62s/it] 9%|▉ | 220/2500 [1:04:37<11:47:49, 18.63s/it] {'loss': 0.0001, 'grad_norm': 0.044972386854977314, 'learning_rate': 9.12e-07, 'completion_length': 155.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00356292724609375, 'epoch': 0.09} + 9%|▉ | 220/2500 [1:04:37<11:47:49, 18.63s/it] 9%|▉ | 221/2500 [1:04:57<12:06:08, 
19.12s/it] {'loss': 0.0002, 'grad_norm': 0.7073382149820401, 'learning_rate': 9.115999999999999e-07, 'completion_length': 153.3928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00376129150390625, 'epoch': 0.09} + 9%|▉ | 221/2500 [1:04:57<12:06:08, 19.12s/it] 9%|▉ | 222/2500 [1:05:16<12:00:36, 18.98s/it] {'loss': 0.0002, 'grad_norm': 0.32812048764157653, 'learning_rate': 9.112e-07, 'completion_length': 156.6964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0039825439453125, 'epoch': 0.09} + 9%|▉ | 222/2500 [1:05:16<12:00:36, 18.98s/it] 9%|▉ | 223/2500 [1:05:35<12:00:13, 18.98s/it] {'loss': 0.0002, 'grad_norm': 3.0430624964319994, 'learning_rate': 9.108e-07, 'completion_length': 154.00000762939453, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0037689208984375, 'epoch': 0.09} + 9%|▉ | 223/2500 [1:05:35<12:00:13, 18.98s/it] 9%|▉ | 224/2500 [1:05:53<11:50:12, 18.72s/it] {'loss': 0.0002, 'grad_norm': 0.021806383617334554, 'learning_rate': 9.103999999999999e-07, 'completion_length': 155.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004425048828125, 'epoch': 0.09} + 9%|▉ | 224/2500 [1:05:53<11:50:12, 18.72s/it] 9%|▉ | 225/2500 [1:06:11<11:37:21, 18.39s/it] {'loss': 0.0002, 'grad_norm': 0.025548062250303545, 'learning_rate': 9.1e-07, 'completion_length': 151.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004486083984375, 'epoch': 0.09} + 9%|▉ | 225/2500 [1:06:11<11:37:21, 18.39s/it] 9%|▉ | 226/2500 [1:06:29<11:40:22, 18.48s/it] {'loss': 0.0002, 'grad_norm': 0.06122639022412931, 'learning_rate': 
9.095999999999999e-07, 'completion_length': 165.67858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0044708251953125, 'epoch': 0.09} + 9%|▉ | 226/2500 [1:06:29<11:40:22, 18.48s/it] 9%|▉ | 227/2500 [1:06:47<11:31:46, 18.26s/it] {'loss': 0.0002, 'grad_norm': 0.4357653346136257, 'learning_rate': 9.092e-07, 'completion_length': 164.94644165039062, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266788095235825, 'kl': 0.0052337646484375, 'epoch': 0.09} + 9%|▉ | 227/2500 [1:06:47<11:31:46, 18.26s/it] 9%|▉ | 228/2500 [1:07:06<11:37:01, 18.41s/it] {'loss': 0.0002, 'grad_norm': 0.5170240014854106, 'learning_rate': 9.088e-07, 'completion_length': 163.85714721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.14838216826319695, 'kl': 0.004791259765625, 'epoch': 0.09} + 9%|▉ | 228/2500 [1:07:06<11:37:01, 18.41s/it] 9%|▉ | 229/2500 [1:07:25<11:48:47, 18.73s/it] {'loss': 0.0002, 'grad_norm': 1.6992578306894208, 'learning_rate': 9.084e-07, 'completion_length': 165.10714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.00555419921875, 'epoch': 0.09} + 9%|▉ | 229/2500 [1:07:25<11:48:47, 18.73s/it] 9%|▉ | 230/2500 [1:07:48<12:28:33, 19.79s/it] {'loss': 0.0002, 'grad_norm': 0.6061431775286364, 'learning_rate': 9.08e-07, 'completion_length': 166.19644165039062, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8750000596046448, 'reward_std': 0.21124479919672012, 'kl': 0.004364013671875, 'epoch': 0.09} + 9%|▉ | 230/2500 [1:07:48<12:28:33, 19.79s/it] 9%|▉ | 231/2500 [1:08:07<12:20:21, 19.58s/it] {'loss': 0.0002, 'grad_norm': 0.2876127343375, 'learning_rate': 
9.075999999999999e-07, 'completion_length': 141.1607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0040130615234375, 'epoch': 0.09} + 9%|▉ | 231/2500 [1:08:07<12:20:21, 19.58s/it] 9%|▉ | 232/2500 [1:08:29<12:45:01, 20.24s/it] {'loss': 0.0002, 'grad_norm': 2.075046623217286, 'learning_rate': 9.072e-07, 'completion_length': 177.37500762939453, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.0047760009765625, 'epoch': 0.09} + 9%|▉ | 232/2500 [1:08:29<12:45:01, 20.24s/it] 9%|▉ | 233/2500 [1:08:46<12:15:34, 19.47s/it] {'loss': 0.0002, 'grad_norm': 0.2936935128127444, 'learning_rate': 9.068e-07, 'completion_length': 158.87500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0046844482421875, 'epoch': 0.09} + 9%|▉ | 233/2500 [1:08:46<12:15:34, 19.47s/it] 9%|▉ | 234/2500 [1:09:04<11:55:57, 18.96s/it] {'loss': 0.0001, 'grad_norm': 0.9010155961706476, 'learning_rate': 9.063999999999999e-07, 'completion_length': 148.9464340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.00304412841796875, 'epoch': 0.09} + 9%|▉ | 234/2500 [1:09:04<11:55:57, 18.96s/it] 9%|▉ | 235/2500 [1:09:22<11:44:13, 18.65s/it] {'loss': 0.0002, 'grad_norm': 0.8123406661540453, 'learning_rate': 9.06e-07, 'completion_length': 156.21429443359375, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.14838216453790665, 'kl': 0.0059967041015625, 'epoch': 0.09} + 9%|▉ | 235/2500 [1:09:22<11:44:13, 18.65s/it] 9%|▉ | 236/2500 [1:09:41<11:47:41, 18.75s/it] {'loss': 0.0002, 'grad_norm': 0.7866143460954664, 'learning_rate': 
9.056e-07, 'completion_length': 163.75000762939453, 'rewards/accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.1071428619325161, 'kl': 0.0042266845703125, 'epoch': 0.09} + 9%|▉ | 236/2500 [1:09:41<11:47:41, 18.75s/it] 9%|▉ | 237/2500 [1:09:59<11:37:47, 18.50s/it] {'loss': 0.0001, 'grad_norm': 0.40750155904694585, 'learning_rate': 9.051999999999999e-07, 'completion_length': 146.73214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0034027099609375, 'epoch': 0.09} + 9%|▉ | 237/2500 [1:09:59<11:37:47, 18.50s/it] 10%|▉ | 238/2500 [1:10:18<11:48:46, 18.80s/it] {'loss': 0.0002, 'grad_norm': 0.4029680871973038, 'learning_rate': 9.048e-07, 'completion_length': 156.71429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0041961669921875, 'epoch': 0.1} + 10%|▉ | 238/2500 [1:10:18<11:48:46, 18.80s/it] 10%|▉ | 239/2500 [1:10:37<11:41:56, 18.63s/it] {'loss': 0.0002, 'grad_norm': 0.34302090774937716, 'learning_rate': 9.044e-07, 'completion_length': 145.05358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0041046142578125, 'epoch': 0.1} + 10%|▉ | 239/2500 [1:10:37<11:41:56, 18.63s/it] 10%|▉ | 240/2500 [1:10:54<11:31:45, 18.37s/it] {'loss': 0.0001, 'grad_norm': 0.03113076172068115, 'learning_rate': 9.039999999999999e-07, 'completion_length': 143.92857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00321197509765625, 'epoch': 0.1} + 10%|▉ | 240/2500 [1:10:54<11:31:45, 18.37s/it] 10%|▉ | 241/2500 [1:11:13<11:36:53, 18.51s/it] {'loss': 0.0002, 'grad_norm': 0.02410537078775126, 'learning_rate': 9.035999999999999e-07, 
'completion_length': 161.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005035400390625, 'epoch': 0.1} + 10%|▉ | 241/2500 [1:11:13<11:36:53, 18.51s/it] 10%|▉ | 242/2500 [1:11:33<11:49:59, 18.87s/it] {'loss': 0.0002, 'grad_norm': 0.20216537432405296, 'learning_rate': 9.032e-07, 'completion_length': 154.92857360839844, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0041351318359375, 'epoch': 0.1} + 10%|▉ | 242/2500 [1:11:33<11:49:59, 18.87s/it] 10%|▉ | 243/2500 [1:11:51<11:38:22, 18.57s/it] {'loss': 0.0002, 'grad_norm': 0.22338971721993203, 'learning_rate': 9.028e-07, 'completion_length': 156.75, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0043182373046875, 'epoch': 0.1} + 10%|▉ | 243/2500 [1:11:51<11:38:22, 18.57s/it] 10%|▉ | 244/2500 [1:12:09<11:31:29, 18.39s/it] {'loss': 0.0001, 'grad_norm': 0.20110646917821773, 'learning_rate': 9.023999999999999e-07, 'completion_length': 157.50000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0035552978515625, 'epoch': 0.1} + 10%|▉ | 244/2500 [1:12:09<11:31:29, 18.39s/it] 10%|▉ | 245/2500 [1:12:29<11:47:34, 18.83s/it] {'loss': 0.0002, 'grad_norm': 0.8387050876876841, 'learning_rate': 9.02e-07, 'completion_length': 160.12500762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.14838216826319695, 'kl': 0.004364013671875, 'epoch': 0.1} + 10%|▉ | 245/2500 [1:12:29<11:47:34, 18.83s/it] 10%|▉ | 246/2500 [1:12:47<11:39:10, 18.61s/it] {'loss': 0.0002, 'grad_norm': 0.3913844873702737, 'learning_rate': 9.015999999999999e-07, 'completion_length': 162.98214721679688, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.004669189453125, 'epoch': 0.1} + 10%|▉ | 246/2500 [1:12:47<11:39:10, 18.61s/it] 10%|▉ | 247/2500 [1:13:06<11:42:29, 18.71s/it] {'loss': 0.0002, 'grad_norm': 0.763930162406954, 'learning_rate': 9.011999999999999e-07, 'completion_length': 162.17858123779297, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.0052337646484375, 'epoch': 0.1} + 10%|▉ | 247/2500 [1:13:06<11:42:29, 18.71s/it] 10%|▉ | 248/2500 [1:13:25<11:53:43, 19.02s/it] {'loss': 0.0002, 'grad_norm': 0.49823477521493653, 'learning_rate': 9.008e-07, 'completion_length': 164.2857208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0047760009765625, 'epoch': 0.1} + 10%|▉ | 248/2500 [1:13:25<11:53:43, 19.02s/it] 10%|▉ | 249/2500 [1:13:44<11:51:39, 18.97s/it] {'loss': 0.0002, 'grad_norm': 0.5245711708247978, 'learning_rate': 9.004e-07, 'completion_length': 157.62500762939453, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0043792724609375, 'epoch': 0.1} + 10%|▉ | 249/2500 [1:13:44<11:51:39, 18.97s/it] 10%|█ | 250/2500 [1:14:02<11:42:17, 18.73s/it] {'loss': 0.0002, 'grad_norm': 0.3700129132313073, 'learning_rate': 9e-07, 'completion_length': 143.92858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.004364013671875, 'epoch': 0.1} + 10%|█ | 250/2500 [1:14:02<11:42:17, 18.73s/it] 10%|█ | 251/2500 [1:14:20<11:30:17, 18.42s/it] {'loss': 0.0002, 'grad_norm': 0.06020877359526779, 'learning_rate': 8.995999999999999e-07, 'completion_length': 156.46429443359375, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00423431396484375, 'epoch': 0.1} + 10%|█ | 251/2500 [1:14:20<11:30:17, 18.42s/it] 10%|█ | 252/2500 [1:14:39<11:32:43, 18.49s/it] {'loss': 0.0002, 'grad_norm': 0.04856726477204105, 'learning_rate': 8.992e-07, 'completion_length': 130.1428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0055084228515625, 'epoch': 0.1} + 10%|█ | 252/2500 [1:14:39<11:32:43, 18.49s/it] 10%|█ | 253/2500 [1:14:57<11:28:20, 18.38s/it] {'loss': 0.0002, 'grad_norm': 1.9038972880617604, 'learning_rate': 8.988e-07, 'completion_length': 151.89286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00397491455078125, 'epoch': 0.1} + 10%|█ | 253/2500 [1:14:57<11:28:20, 18.38s/it] 10%|█ | 254/2500 [1:15:16<11:31:38, 18.48s/it] {'loss': 0.0002, 'grad_norm': 0.07863888138755397, 'learning_rate': 8.983999999999999e-07, 'completion_length': 150.71428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0037841796875, 'epoch': 0.1} + 10%|█ | 254/2500 [1:15:16<11:31:38, 18.48s/it] 10%|█ | 255/2500 [1:15:34<11:26:16, 18.34s/it] {'loss': 0.0002, 'grad_norm': 0.02455953275548817, 'learning_rate': 8.98e-07, 'completion_length': 149.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051116943359375, 'epoch': 0.1} + 10%|█ | 255/2500 [1:15:34<11:26:16, 18.34s/it] 10%|█ | 256/2500 [1:15:53<11:36:01, 18.61s/it] {'loss': 0.0002, 'grad_norm': 0.4180621290858132, 'learning_rate': 8.975999999999999e-07, 'completion_length': 169.9107208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 
0.0052947998046875, 'epoch': 0.1} + 10%|█ | 256/2500 [1:15:53<11:36:01, 18.61s/it] 10%|█ | 257/2500 [1:16:13<11:57:35, 19.20s/it] {'loss': 0.0003, 'grad_norm': 0.5451231895563889, 'learning_rate': 8.972e-07, 'completion_length': 182.6607208251953, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.006927490234375, 'epoch': 0.1} + 10%|█ | 257/2500 [1:16:13<11:57:35, 19.20s/it] 10%|█ | 258/2500 [1:16:32<11:48:03, 18.95s/it] {'loss': 0.0002, 'grad_norm': 0.7681245515557953, 'learning_rate': 8.968e-07, 'completion_length': 163.83929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048065185546875, 'epoch': 0.1} + 10%|█ | 258/2500 [1:16:32<11:48:03, 18.95s/it] 10%|█ | 259/2500 [1:16:49<11:29:18, 18.46s/it] {'loss': 0.0002, 'grad_norm': 0.33918609417874307, 'learning_rate': 8.963999999999999e-07, 'completion_length': 133.41072463989258, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0040435791015625, 'epoch': 0.1} + 10%|█ | 259/2500 [1:16:49<11:29:18, 18.46s/it] 10%|█ | 260/2500 [1:17:07<11:24:28, 18.33s/it] {'loss': 0.0003, 'grad_norm': 0.4812896255085057, 'learning_rate': 8.96e-07, 'completion_length': 158.17858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.007080078125, 'epoch': 0.1} + 10%|█ | 260/2500 [1:17:07<11:24:28, 18.33s/it] 10%|█ | 261/2500 [1:17:26<11:32:04, 18.55s/it] {'loss': 0.0002, 'grad_norm': 0.2611819691624644, 'learning_rate': 8.955999999999999e-07, 'completion_length': 155.87500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 
0.0051727294921875, 'epoch': 0.1} + 10%|█ | 261/2500 [1:17:26<11:32:04, 18.55s/it] 10%|█ | 262/2500 [1:17:46<11:43:13, 18.85s/it] {'loss': 0.0002, 'grad_norm': 0.42340698803460036, 'learning_rate': 8.951999999999999e-07, 'completion_length': 159.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.005035400390625, 'epoch': 0.1} + 10%|█ | 262/2500 [1:17:46<11:43:13, 18.85s/it] 11%|█ | 263/2500 [1:18:05<11:51:47, 19.09s/it] {'loss': 0.0001, 'grad_norm': 0.24058914621390515, 'learning_rate': 8.948e-07, 'completion_length': 141.75000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00335693359375, 'epoch': 0.11} + 11%|█ | 263/2500 [1:18:05<11:51:47, 19.09s/it] 11%|█ | 264/2500 [1:18:24<11:45:46, 18.94s/it] {'loss': 0.0002, 'grad_norm': 0.42713325329560703, 'learning_rate': 8.944e-07, 'completion_length': 152.42857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.0048828125, 'epoch': 0.11} + 11%|█ | 264/2500 [1:18:24<11:45:46, 18.94s/it] 11%|█ | 265/2500 [1:18:43<11:43:27, 18.88s/it] {'loss': 0.0002, 'grad_norm': 0.38348218873819667, 'learning_rate': 8.939999999999999e-07, 'completion_length': 151.6607208251953, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00403594970703125, 'epoch': 0.11} + 11%|█ | 265/2500 [1:18:43<11:43:27, 18.88s/it] 11%|█ | 266/2500 [1:19:02<11:47:47, 19.01s/it] {'loss': 0.0002, 'grad_norm': 0.7467897134979447, 'learning_rate': 8.935999999999999e-07, 'completion_length': 169.42857360839844, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 
0.11266788095235825, 'kl': 0.005218505859375, 'epoch': 0.11} + 11%|█ | 266/2500 [1:19:02<11:47:47, 19.01s/it] 11%|█ | 267/2500 [1:19:22<11:53:59, 19.18s/it] {'loss': 0.0002, 'grad_norm': 0.273620363902959, 'learning_rate': 8.932e-07, 'completion_length': 155.00000762939453, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0051116943359375, 'epoch': 0.11} + 11%|█ | 267/2500 [1:19:22<11:53:59, 19.18s/it] 11%|█ | 268/2500 [1:19:40<11:49:22, 19.07s/it] {'loss': 0.0002, 'grad_norm': 0.4827860482387603, 'learning_rate': 8.928e-07, 'completion_length': 164.87500762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.0052032470703125, 'epoch': 0.11} + 11%|█ | 268/2500 [1:19:40<11:49:22, 19.07s/it] 11%|█ | 269/2500 [1:19:58<11:34:44, 18.68s/it] {'loss': 0.0002, 'grad_norm': 0.34380515128049033, 'learning_rate': 8.923999999999999e-07, 'completion_length': 159.62500762939453, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0053558349609375, 'epoch': 0.11} + 11%|█ | 269/2500 [1:19:58<11:34:44, 18.68s/it] 11%|█ | 270/2500 [1:20:18<11:43:39, 18.93s/it] {'loss': 0.0002, 'grad_norm': 0.060020597122707395, 'learning_rate': 8.92e-07, 'completion_length': 145.0178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051422119140625, 'epoch': 0.11} + 11%|█ | 270/2500 [1:20:18<11:43:39, 18.93s/it] 11%|█ | 271/2500 [1:20:36<11:35:40, 18.73s/it] {'loss': 0.0002, 'grad_norm': 0.2825119596895801, 'learning_rate': 8.915999999999999e-07, 'completion_length': 152.39286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 
0.00458526611328125, 'epoch': 0.11} + 11%|█ | 271/2500 [1:20:36<11:35:40, 18.73s/it] 11%|█ | 272/2500 [1:20:58<12:10:15, 19.67s/it] {'loss': 0.0002, 'grad_norm': 0.6110376302330078, 'learning_rate': 8.911999999999999e-07, 'completion_length': 164.55358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00506591796875, 'epoch': 0.11} + 11%|█ | 272/2500 [1:20:58<12:10:15, 19.67s/it] 11%|█ | 273/2500 [1:21:17<12:04:02, 19.51s/it] {'loss': 0.0003, 'grad_norm': 0.3137082166016969, 'learning_rate': 8.908e-07, 'completion_length': 165.6964340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.006927490234375, 'epoch': 0.11} + 11%|█ | 273/2500 [1:21:17<12:04:02, 19.51s/it] 11%|█ | 274/2500 [1:21:35<11:52:50, 19.21s/it] {'loss': 0.0002, 'grad_norm': 0.24472604472685733, 'learning_rate': 8.904e-07, 'completion_length': 143.96429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00531005859375, 'epoch': 0.11} + 11%|█ | 274/2500 [1:21:35<11:52:50, 19.21s/it] 11%|█ | 275/2500 [1:21:57<12:15:04, 19.82s/it] {'loss': 0.0003, 'grad_norm': 0.5459780289107774, 'learning_rate': 8.9e-07, 'completion_length': 175.37500762939453, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0065765380859375, 'epoch': 0.11} + 11%|█ | 275/2500 [1:21:57<12:15:04, 19.82s/it] 11%|█ | 276/2500 [1:22:16<12:10:40, 19.71s/it] {'loss': 0.0003, 'grad_norm': 0.019912761435049855, 'learning_rate': 8.895999999999999e-07, 'completion_length': 169.87500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00714111328125, 'epoch': 0.11} + 11%|█ 
| 276/2500 [1:22:16<12:10:40, 19.71s/it] 11%|█ | 277/2500 [1:22:34<11:53:10, 19.25s/it] {'loss': 0.0002, 'grad_norm': 0.3672642828044069, 'learning_rate': 8.892e-07, 'completion_length': 160.00000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0057373046875, 'epoch': 0.11} + 11%|█ | 277/2500 [1:22:34<11:53:10, 19.25s/it] 11%|█ | 278/2500 [1:22:52<11:37:22, 18.83s/it] {'loss': 0.0001, 'grad_norm': 0.2545314077967072, 'learning_rate': 8.888e-07, 'completion_length': 140.7857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00357818603515625, 'epoch': 0.11} + 11%|█ | 278/2500 [1:22:52<11:37:22, 18.83s/it] 11%|█ | 279/2500 [1:23:11<11:38:55, 18.88s/it] {'loss': 0.0004, 'grad_norm': 0.3593273763353753, 'learning_rate': 8.883999999999999e-07, 'completion_length': 165.17858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.009002685546875, 'epoch': 0.11} + 11%|█ | 279/2500 [1:23:11<11:38:55, 18.88s/it] 11%|█ | 280/2500 [1:23:30<11:41:57, 18.97s/it] {'loss': 0.0002, 'grad_norm': 0.33842366927147943, 'learning_rate': 8.88e-07, 'completion_length': 158.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0049591064453125, 'epoch': 0.11} + 11%|█ | 280/2500 [1:23:30<11:41:57, 18.97s/it] 11%|█ | 281/2500 [1:23:49<11:36:25, 18.83s/it] {'loss': 0.0002, 'grad_norm': 0.04731317818054488, 'learning_rate': 8.875999999999999e-07, 'completion_length': 143.1428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0057373046875, 'epoch': 0.11} + 11%|█ | 281/2500 
[1:23:49<11:36:25, 18.83s/it] 11%|█▏ | 282/2500 [1:24:07<11:27:21, 18.59s/it] {'loss': 0.0003, 'grad_norm': 0.5088116931208526, 'learning_rate': 8.872e-07, 'completion_length': 162.08929443359375, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1428571492433548, 'kl': 0.0068359375, 'epoch': 0.11} + 11%|█▏ | 282/2500 [1:24:07<11:27:21, 18.59s/it] 11%|█▏ | 283/2500 [1:24:26<11:29:14, 18.65s/it] {'loss': 0.0002, 'grad_norm': 0.02188767196836689, 'learning_rate': 8.868e-07, 'completion_length': 154.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048370361328125, 'epoch': 0.11} + 11%|█▏ | 283/2500 [1:24:26<11:29:14, 18.65s/it] 11%|█▏ | 284/2500 [1:24:45<11:38:20, 18.91s/it] {'loss': 0.0002, 'grad_norm': 0.5578304626278748, 'learning_rate': 8.863999999999999e-07, 'completion_length': 154.9464340209961, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.00583648681640625, 'epoch': 0.11} + 11%|█▏ | 284/2500 [1:24:45<11:38:20, 18.91s/it] 11%|█▏ | 285/2500 [1:25:04<11:32:43, 18.76s/it] {'loss': 0.0005, 'grad_norm': 0.5412295255631462, 'learning_rate': 8.86e-07, 'completion_length': 157.55358123779297, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.11266787722706795, 'kl': 0.012298583984375, 'epoch': 0.11} + 11%|█▏ | 285/2500 [1:25:04<11:32:43, 18.76s/it] 11%|█▏ | 286/2500 [1:25:22<11:33:46, 18.80s/it] {'loss': 0.0003, 'grad_norm': 0.1856467774235777, 'learning_rate': 8.856e-07, 'completion_length': 154.67858123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.006439208984375, 'epoch': 0.11} + 11%|█▏ | 286/2500 [1:25:22<11:33:46, 18.80s/it] 11%|█▏ | 
287/2500 [1:25:41<11:27:02, 18.63s/it] {'loss': 0.0002, 'grad_norm': 0.3270703738139708, 'learning_rate': 8.851999999999999e-07, 'completion_length': 151.1607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00543212890625, 'epoch': 0.11} + 11%|█▏ | 287/2500 [1:25:41<11:27:02, 18.63s/it] 12%|█▏ | 288/2500 [1:26:00<11:36:05, 18.88s/it] {'loss': 0.0002, 'grad_norm': 0.39447999264031935, 'learning_rate': 8.848e-07, 'completion_length': 161.98214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0056304931640625, 'epoch': 0.12} + 12%|█▏ | 288/2500 [1:26:00<11:36:05, 18.88s/it] 12%|█▏ | 289/2500 [1:26:22<12:10:55, 19.84s/it] {'loss': 0.0003, 'grad_norm': 0.35913271299204463, 'learning_rate': 8.844e-07, 'completion_length': 164.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.007965087890625, 'epoch': 0.12} + 12%|█▏ | 289/2500 [1:26:22<12:10:55, 19.84s/it] 12%|█▏ | 290/2500 [1:26:41<12:01:54, 19.60s/it] {'loss': 0.0002, 'grad_norm': 1.1751335048141034, 'learning_rate': 8.839999999999999e-07, 'completion_length': 161.4464340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0060272216796875, 'epoch': 0.12} + 12%|█▏ | 290/2500 [1:26:41<12:01:54, 19.60s/it] 12%|█▏ | 291/2500 [1:27:02<12:10:24, 19.84s/it] {'loss': 0.0002, 'grad_norm': 0.4520233791464316, 'learning_rate': 8.836e-07, 'completion_length': 152.14286041259766, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.005462646484375, 'epoch': 0.12} + 12%|█▏ | 291/2500 [1:27:02<12:10:24, 
19.84s/it] 12%|█▏ | 292/2500 [1:27:22<12:10:17, 19.84s/it] {'loss': 0.0003, 'grad_norm': 0.4577103265729601, 'learning_rate': 8.832e-07, 'completion_length': 161.33929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0062713623046875, 'epoch': 0.12} + 12%|█▏ | 292/2500 [1:27:22<12:10:17, 19.84s/it] 12%|█▏ | 293/2500 [1:27:41<12:01:25, 19.61s/it] {'loss': 0.0002, 'grad_norm': 0.04598586550224527, 'learning_rate': 8.827999999999999e-07, 'completion_length': 149.8928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00445556640625, 'epoch': 0.12} + 12%|█▏ | 293/2500 [1:27:41<12:01:25, 19.61s/it] 12%|█▏ | 294/2500 [1:28:01<12:07:26, 19.79s/it] {'loss': 0.0003, 'grad_norm': 0.32037140491859156, 'learning_rate': 8.823999999999999e-07, 'completion_length': 167.71428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.006561279296875, 'epoch': 0.12} + 12%|█▏ | 294/2500 [1:28:01<12:07:26, 19.79s/it] 12%|█▏ | 295/2500 [1:28:18<11:40:23, 19.06s/it] {'loss': 0.0002, 'grad_norm': 0.03624311685920557, 'learning_rate': 8.82e-07, 'completion_length': 142.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050201416015625, 'epoch': 0.12} + 12%|█▏ | 295/2500 [1:28:18<11:40:23, 19.06s/it] 12%|█▏ | 296/2500 [1:28:36<11:28:49, 18.75s/it] {'loss': 0.0002, 'grad_norm': 0.023413439139619702, 'learning_rate': 8.816000000000001e-07, 'completion_length': 137.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00383758544921875, 'epoch': 0.12} + 12%|█▏ | 296/2500 [1:28:36<11:28:49, 18.75s/it] 12%|█▏ | 297/2500 [1:28:55<11:32:28, 18.86s/it] {'loss': 0.0002, 'grad_norm': 
0.8826772149621402, 'learning_rate': 8.811999999999999e-07, 'completion_length': 158.10714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0057830810546875, 'epoch': 0.12} + 12%|█▏ | 297/2500 [1:28:55<11:32:28, 18.86s/it] 12%|█▏ | 298/2500 [1:29:13<11:24:30, 18.65s/it] {'loss': 0.0003, 'grad_norm': 0.3794528094770087, 'learning_rate': 8.808e-07, 'completion_length': 155.6428680419922, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00653076171875, 'epoch': 0.12} + 12%|█▏ | 298/2500 [1:29:13<11:24:30, 18.65s/it] 12%|█▏ | 299/2500 [1:29:33<11:35:04, 18.95s/it] {'loss': 0.0003, 'grad_norm': 0.33843503843677974, 'learning_rate': 8.804e-07, 'completion_length': 172.0178680419922, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.006500244140625, 'epoch': 0.12} + 12%|█▏ | 299/2500 [1:29:33<11:35:04, 18.95s/it] 12%|█▏ | 300/2500 [1:29:52<11:35:43, 18.97s/it] {'loss': 0.0003, 'grad_norm': 0.4326534422251437, 'learning_rate': 8.799999999999999e-07, 'completion_length': 166.55358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.00628662109375, 'epoch': 0.12} + 12%|█▏ | 300/2500 [1:29:52<11:35:43, 18.97s/it] 12%|█▏ | 301/2500 [1:33:20<46:09:14, 75.56s/it] {'loss': 0.0002, 'grad_norm': 0.4242344368119915, 'learning_rate': 8.796e-07, 'completion_length': 173.69644165039062, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.00604248046875, 'epoch': 0.12} + 12%|█▏ | 301/2500 [1:33:20<46:09:14, 75.56s/it] 12%|█▏ | 302/2500 [1:33:38<35:39:14, 58.40s/it] {'loss': 0.0002, 
'grad_norm': 0.8019122119986012, 'learning_rate': 8.792e-07, 'completion_length': 159.0178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005859375, 'epoch': 0.12} + 12%|█▏ | 302/2500 [1:33:38<35:39:14, 58.40s/it] 12%|█▏ | 303/2500 [1:33:57<28:24:30, 46.55s/it] {'loss': 0.0002, 'grad_norm': 0.03675631645660095, 'learning_rate': 8.788e-07, 'completion_length': 161.42858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005279541015625, 'epoch': 0.12} + 12%|█▏ | 303/2500 [1:33:57<28:24:30, 46.55s/it] 12%|█▏ | 304/2500 [1:34:15<23:14:09, 38.09s/it] {'loss': 0.0002, 'grad_norm': 0.027389951323039237, 'learning_rate': 8.783999999999999e-07, 'completion_length': 147.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00439453125, 'epoch': 0.12} + 12%|█▏ | 304/2500 [1:34:15<23:14:09, 38.09s/it] 12%|█▏ | 305/2500 [1:34:34<19:34:41, 32.11s/it] {'loss': 0.0002, 'grad_norm': 0.8130509979947798, 'learning_rate': 8.78e-07, 'completion_length': 146.01786041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.00579833984375, 'epoch': 0.12} + 12%|█▏ | 305/2500 [1:34:34<19:34:41, 32.11s/it] 12%|█▏ | 306/2500 [1:34:52<17:00:34, 27.91s/it] {'loss': 0.0002, 'grad_norm': 0.02271454963985769, 'learning_rate': 8.776e-07, 'completion_length': 147.3214340209961, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0054779052734375, 'epoch': 0.12} + 12%|█▏ | 306/2500 [1:34:52<17:00:34, 27.91s/it] 12%|█▏ | 307/2500 [1:35:10<15:20:12, 25.18s/it] {'loss': 0.0003, 'grad_norm': 0.475626023664785, 'learning_rate': 8.771999999999999e-07, 
'completion_length': 151.58929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0070648193359375, 'epoch': 0.12} + 12%|█▏ | 307/2500 [1:35:10<15:20:12, 25.18s/it] 12%|█▏ | 308/2500 [1:35:30<14:22:13, 23.60s/it] {'loss': 0.0003, 'grad_norm': 0.6509275649889771, 'learning_rate': 8.768e-07, 'completion_length': 167.71429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0076446533203125, 'epoch': 0.12} + 12%|█▏ | 308/2500 [1:35:30<14:22:13, 23.60s/it] 12%|█▏ | 309/2500 [1:35:48<13:18:50, 21.88s/it] {'loss': 0.0002, 'grad_norm': 0.22020573597701848, 'learning_rate': 8.763999999999999e-07, 'completion_length': 159.21429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0049896240234375, 'epoch': 0.12} + 12%|█▏ | 309/2500 [1:35:48<13:18:50, 21.88s/it] 12%|█▏ | 310/2500 [1:36:08<12:59:07, 21.35s/it] {'loss': 0.0002, 'grad_norm': 0.37625623160180466, 'learning_rate': 8.76e-07, 'completion_length': 162.87500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0062255859375, 'epoch': 0.12} + 12%|█▏ | 310/2500 [1:36:08<12:59:07, 21.35s/it] 12%|█▏ | 311/2500 [1:36:27<12:31:11, 20.59s/it] {'loss': 0.0002, 'grad_norm': 0.7634834116796175, 'learning_rate': 8.756e-07, 'completion_length': 160.91072845458984, 'rewards/accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.1071428619325161, 'kl': 0.006256103515625, 'epoch': 0.12} + 12%|█▏ | 311/2500 [1:36:27<12:31:11, 20.59s/it] 12%|█▏ | 312/2500 [1:36:46<12:11:19, 20.05s/it] {'loss': 0.0002, 'grad_norm': 0.18680950482167114, 'learning_rate': 
8.751999999999999e-07, 'completion_length': 172.07144165039062, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00531005859375, 'epoch': 0.12} + 12%|█▏ | 312/2500 [1:36:46<12:11:19, 20.05s/it] 13%|█▎ | 313/2500 [1:37:04<11:45:03, 19.34s/it] {'loss': 0.0002, 'grad_norm': 0.023956939709960167, 'learning_rate': 8.748e-07, 'completion_length': 139.25000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0051116943359375, 'epoch': 0.13} + 13%|█▎ | 313/2500 [1:37:04<11:45:03, 19.34s/it] 13%|█▎ | 314/2500 [1:37:22<11:36:38, 19.12s/it] {'loss': 0.0002, 'grad_norm': 0.42136534342084686, 'learning_rate': 8.743999999999999e-07, 'completion_length': 148.48214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.004608154296875, 'epoch': 0.13} + 13%|█▎ | 314/2500 [1:37:22<11:36:38, 19.12s/it] 13%|█▎ | 315/2500 [1:37:40<11:24:35, 18.80s/it] {'loss': 0.0003, 'grad_norm': 0.3478262749293571, 'learning_rate': 8.739999999999999e-07, 'completion_length': 151.3928680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.00677490234375, 'epoch': 0.13} + 13%|█▎ | 315/2500 [1:37:40<11:24:35, 18.80s/it] 13%|█▎ | 316/2500 [1:38:00<11:29:12, 18.93s/it] {'loss': 0.0003, 'grad_norm': 0.8578804189050392, 'learning_rate': 8.736e-07, 'completion_length': 165.35714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0067901611328125, 'epoch': 0.13} + 13%|█▎ | 316/2500 [1:38:00<11:29:12, 18.93s/it] 13%|█▎ | 317/2500 [1:38:18<11:19:05, 18.66s/it] {'loss': 0.0003, 'grad_norm': 0.3982848178323086, 
'learning_rate': 8.732e-07, 'completion_length': 161.2321548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00848388671875, 'epoch': 0.13} + 13%|█▎ | 317/2500 [1:38:18<11:19:05, 18.66s/it] 13%|█▎ | 318/2500 [1:38:36<11:12:29, 18.49s/it] {'loss': 0.0002, 'grad_norm': 0.23126113306963939, 'learning_rate': 8.728e-07, 'completion_length': 155.4464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00439453125, 'epoch': 0.13} + 13%|█▎ | 318/2500 [1:38:36<11:12:29, 18.49s/it] 13%|█▎ | 319/2500 [1:38:54<11:11:54, 18.48s/it] {'loss': 0.0004, 'grad_norm': 0.707660733080653, 'learning_rate': 8.723999999999999e-07, 'completion_length': 173.21428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.010467529296875, 'epoch': 0.13} + 13%|█▎ | 319/2500 [1:38:54<11:11:54, 18.48s/it] 13%|█▎ | 320/2500 [1:39:11<10:57:14, 18.09s/it] {'loss': 0.0003, 'grad_norm': 0.8647919626879068, 'learning_rate': 8.72e-07, 'completion_length': 158.4464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006439208984375, 'epoch': 0.13} + 13%|█▎ | 320/2500 [1:39:11<10:57:14, 18.09s/it] 13%|█▎ | 321/2500 [1:39:29<10:52:00, 17.95s/it] {'loss': 0.0002, 'grad_norm': 0.38684252677850717, 'learning_rate': 8.716e-07, 'completion_length': 157.39286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.00588226318359375, 'epoch': 0.13} + 13%|█▎ | 321/2500 [1:39:29<10:52:00, 17.95s/it] 13%|█▎ | 322/2500 [1:39:47<10:55:34, 18.06s/it] {'loss': 0.0003, 'grad_norm': 0.295837960095206, 
'learning_rate': 8.711999999999999e-07, 'completion_length': 170.4464340209961, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0064239501953125, 'epoch': 0.13} + 13%|█▎ | 322/2500 [1:39:47<10:55:34, 18.06s/it] 13%|█▎ | 323/2500 [1:40:04<10:45:22, 17.79s/it] {'loss': 0.0002, 'grad_norm': 0.3616676866572269, 'learning_rate': 8.708e-07, 'completion_length': 147.23214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00531005859375, 'epoch': 0.13} + 13%|█▎ | 323/2500 [1:40:04<10:45:22, 17.79s/it] 13%|█▎ | 324/2500 [1:40:23<10:59:40, 18.19s/it] {'loss': 0.0003, 'grad_norm': 0.06263740715859051, 'learning_rate': 8.704e-07, 'completion_length': 165.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006439208984375, 'epoch': 0.13} + 13%|█▎ | 324/2500 [1:40:23<10:59:40, 18.19s/it] 13%|█▎ | 325/2500 [1:40:42<11:03:53, 18.31s/it] {'loss': 0.0003, 'grad_norm': 0.28817475991930613, 'learning_rate': 8.699999999999999e-07, 'completion_length': 162.23214721679688, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0069580078125, 'epoch': 0.13} + 13%|█▎ | 325/2500 [1:40:42<11:03:53, 18.31s/it] 13%|█▎ | 326/2500 [1:41:01<11:10:01, 18.49s/it] {'loss': 0.0002, 'grad_norm': 0.02308612237977786, 'learning_rate': 8.696e-07, 'completion_length': 163.67857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00537109375, 'epoch': 0.13} + 13%|█▎ | 326/2500 [1:41:01<11:10:01, 18.49s/it] 13%|█▎ | 327/2500 [1:41:18<10:54:19, 18.07s/it] {'loss': 0.0002, 'grad_norm': 0.01958273177828779, 'learning_rate': 8.692e-07, 'completion_length': 
140.0, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045623779296875, 'epoch': 0.13} + 13%|█▎ | 327/2500 [1:41:18<10:54:19, 18.07s/it] 13%|█▎ | 328/2500 [1:41:36<10:57:33, 18.16s/it] {'loss': 0.0003, 'grad_norm': 0.5239016717551936, 'learning_rate': 8.687999999999999e-07, 'completion_length': 151.35714721679688, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1428571492433548, 'kl': 0.0069427490234375, 'epoch': 0.13} + 13%|█▎ | 328/2500 [1:41:36<10:57:33, 18.16s/it] 13%|█▎ | 329/2500 [1:41:55<10:59:39, 18.23s/it] {'loss': 0.0002, 'grad_norm': 0.2760071225636064, 'learning_rate': 8.683999999999999e-07, 'completion_length': 165.46429443359375, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0054779052734375, 'epoch': 0.13} + 13%|█▎ | 329/2500 [1:41:55<10:59:39, 18.23s/it] 13%|█▎ | 330/2500 [1:42:13<11:01:48, 18.30s/it] {'loss': 0.0002, 'grad_norm': 0.5083788113615119, 'learning_rate': 8.68e-07, 'completion_length': 163.33928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.004791259765625, 'epoch': 0.13} + 13%|█▎ | 330/2500 [1:42:13<11:01:48, 18.30s/it] 13%|█▎ | 331/2500 [1:42:32<11:05:23, 18.41s/it] {'loss': 0.0002, 'grad_norm': 0.02945934139962814, 'learning_rate': 8.676e-07, 'completion_length': 145.76786041259766, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.004974365234375, 'epoch': 0.13} + 13%|█▎ | 331/2500 [1:42:32<11:05:23, 18.41s/it] 13%|█▎ | 332/2500 [1:42:51<11:10:54, 18.57s/it] {'loss': 0.0003, 'grad_norm': 0.30438141347821984, 'learning_rate': 8.671999999999999e-07, 'completion_length': 163.87500762939453, 
'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0063323974609375, 'epoch': 0.13} + 13%|█▎ | 332/2500 [1:42:51<11:10:54, 18.57s/it] 13%|█▎ | 333/2500 [1:43:12<11:33:33, 19.20s/it] {'loss': 0.0002, 'grad_norm': 0.7586404714856847, 'learning_rate': 8.668e-07, 'completion_length': 177.12500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0060882568359375, 'epoch': 0.13} + 13%|█▎ | 333/2500 [1:43:12<11:33:33, 19.20s/it] 13%|█▎ | 334/2500 [1:43:30<11:28:09, 19.06s/it] {'loss': 0.0003, 'grad_norm': 1.5514251409937758, 'learning_rate': 8.663999999999999e-07, 'completion_length': 160.5357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0070648193359375, 'epoch': 0.13} + 13%|█▎ | 334/2500 [1:43:30<11:28:09, 19.06s/it] 13%|█▎ | 335/2500 [1:43:49<11:28:34, 19.08s/it] {'loss': 0.0003, 'grad_norm': 0.32010370728011556, 'learning_rate': 8.659999999999999e-07, 'completion_length': 153.21429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0063323974609375, 'epoch': 0.13} + 13%|█▎ | 335/2500 [1:43:49<11:28:34, 19.08s/it] 13%|█▎ | 336/2500 [1:44:09<11:30:28, 19.14s/it] {'loss': 0.0002, 'grad_norm': 0.5270691042575821, 'learning_rate': 8.656e-07, 'completion_length': 151.83929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00440216064453125, 'epoch': 0.13} + 13%|█▎ | 336/2500 [1:44:09<11:30:28, 19.14s/it] 13%|█▎ | 337/2500 [1:44:29<11:43:05, 19.50s/it] {'loss': 0.0003, 'grad_norm': 0.54861150093727, 'learning_rate': 8.651999999999999e-07, 
'completion_length': 157.67857360839844, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.14838216453790665, 'kl': 0.007476806640625, 'epoch': 0.13} + 13%|█▎ | 337/2500 [1:44:29<11:43:05, 19.50s/it] 14%|█▎ | 338/2500 [1:44:47<11:30:44, 19.17s/it] {'loss': 0.0003, 'grad_norm': 0.3881914239775421, 'learning_rate': 8.648e-07, 'completion_length': 165.46429443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0070037841796875, 'epoch': 0.14} + 14%|█▎ | 338/2500 [1:44:47<11:30:44, 19.17s/it] 14%|█▎ | 339/2500 [1:45:04<11:04:42, 18.46s/it] {'loss': 0.0002, 'grad_norm': 0.4423086093545395, 'learning_rate': 8.643999999999999e-07, 'completion_length': 136.6428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00506591796875, 'epoch': 0.14} + 14%|█▎ | 339/2500 [1:45:04<11:04:42, 18.46s/it] 14%|█▎ | 340/2500 [1:45:23<11:04:35, 18.46s/it] {'loss': 0.0003, 'grad_norm': 0.26833096396723277, 'learning_rate': 8.639999999999999e-07, 'completion_length': 162.05358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0062713623046875, 'epoch': 0.14} + 14%|█▎ | 340/2500 [1:45:23<11:04:35, 18.46s/it] 14%|█▎ | 341/2500 [1:45:42<11:14:35, 18.75s/it] {'loss': 0.0003, 'grad_norm': 0.32825261927849464, 'learning_rate': 8.636e-07, 'completion_length': 164.05358123779297, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00634765625, 'epoch': 0.14} + 14%|█▎ | 341/2500 [1:45:42<11:14:35, 18.75s/it] 14%|█▎ | 342/2500 [1:46:00<11:08:09, 18.58s/it] {'loss': 0.0003, 'grad_norm': 0.4992390086634576, 'learning_rate': 
8.632e-07, 'completion_length': 159.57144165039062, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00634765625, 'epoch': 0.14} + 14%|█▎ | 342/2500 [1:46:00<11:08:09, 18.58s/it] 14%|█▎ | 343/2500 [1:46:19<11:04:10, 18.48s/it] {'loss': 0.0002, 'grad_norm': 0.38126447591042895, 'learning_rate': 8.628e-07, 'completion_length': 159.4464340209961, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.00616455078125, 'epoch': 0.14} + 14%|█▎ | 343/2500 [1:46:19<11:04:10, 18.48s/it] 14%|█▍ | 344/2500 [1:46:37<11:05:09, 18.51s/it] {'loss': 0.0003, 'grad_norm': 0.051580953824847064, 'learning_rate': 8.624e-07, 'completion_length': 150.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0063018798828125, 'epoch': 0.14} + 14%|█▍ | 344/2500 [1:46:37<11:05:09, 18.51s/it] 14%|█▍ | 345/2500 [1:46:54<10:48:42, 18.06s/it] {'loss': 0.0002, 'grad_norm': 0.1143786330285186, 'learning_rate': 8.62e-07, 'completion_length': 146.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00531005859375, 'epoch': 0.14} + 14%|█▍ | 345/2500 [1:46:54<10:48:42, 18.06s/it] 14%|█▍ | 346/2500 [1:47:12<10:44:17, 17.95s/it] {'loss': 0.0002, 'grad_norm': 2.05535109501599, 'learning_rate': 8.616e-07, 'completion_length': 155.6071548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0060577392578125, 'epoch': 0.14} + 14%|█▍ | 346/2500 [1:47:12<10:44:17, 17.95s/it] 14%|█▍ | 347/2500 [1:47:31<10:54:41, 18.25s/it] {'loss': 0.0002, 'grad_norm': 0.41879405661701136, 'learning_rate': 8.611999999999999e-07, 'completion_length': 150.76786041259766, 'rewards/accuracy_reward': 
0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0047760009765625, 'epoch': 0.14} + 14%|█▍ | 347/2500 [1:47:31<10:54:41, 18.25s/it] 14%|█▍ | 348/2500 [1:47:48<10:47:54, 18.06s/it] {'loss': 0.0002, 'grad_norm': 0.1825018552106633, 'learning_rate': 8.608e-07, 'completion_length': 145.5357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004364013671875, 'epoch': 0.14} + 14%|█▍ | 348/2500 [1:47:48<10:47:54, 18.06s/it] 14%|█▍ | 349/2500 [1:48:07<10:55:54, 18.30s/it] {'loss': 0.0002, 'grad_norm': 1.0661397181279575, 'learning_rate': 8.604000000000001e-07, 'completion_length': 154.96429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00617218017578125, 'epoch': 0.14} + 14%|█▍ | 349/2500 [1:48:07<10:55:54, 18.30s/it] 14%|█▍ | 350/2500 [1:48:26<11:01:25, 18.46s/it] {'loss': 0.0003, 'grad_norm': 0.4580428828106186, 'learning_rate': 8.599999999999999e-07, 'completion_length': 168.75000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.00640869140625, 'epoch': 0.14} + 14%|█▍ | 350/2500 [1:48:26<11:01:25, 18.46s/it] 14%|█▍ | 351/2500 [1:48:44<10:51:44, 18.20s/it] {'loss': 0.0002, 'grad_norm': 0.32003258794778533, 'learning_rate': 8.596e-07, 'completion_length': 152.6964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.004241943359375, 'epoch': 0.14} + 14%|█▍ | 351/2500 [1:48:44<10:51:44, 18.20s/it] 14%|█▍ | 352/2500 [1:49:01<10:42:37, 17.95s/it] {'loss': 0.0002, 'grad_norm': 0.5367140129692651, 'learning_rate': 8.592e-07, 'completion_length': 153.80358123779297, 
'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.006011962890625, 'epoch': 0.14} + 14%|█▍ | 352/2500 [1:49:01<10:42:37, 17.95s/it] 14%|█▍ | 353/2500 [1:49:19<10:46:32, 18.07s/it] {'loss': 0.0003, 'grad_norm': 4.255072632884478, 'learning_rate': 8.587999999999999e-07, 'completion_length': 157.08929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.006378173828125, 'epoch': 0.14} + 14%|█▍ | 353/2500 [1:49:19<10:46:32, 18.07s/it] 14%|█▍ | 354/2500 [1:49:38<10:50:27, 18.19s/it] {'loss': 0.0003, 'grad_norm': 0.9734044539247716, 'learning_rate': 8.584e-07, 'completion_length': 162.5178680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.1428571492433548, 'kl': 0.006439208984375, 'epoch': 0.14} + 14%|█▍ | 354/2500 [1:49:38<10:50:27, 18.19s/it] 14%|█▍ | 355/2500 [1:49:55<10:41:36, 17.95s/it] {'loss': 0.0003, 'grad_norm': 0.2249009777779209, 'learning_rate': 8.58e-07, 'completion_length': 150.6964340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0065155029296875, 'epoch': 0.14} + 14%|█▍ | 355/2500 [1:49:55<10:41:36, 17.95s/it] 14%|█▍ | 356/2500 [1:50:12<10:31:30, 17.67s/it] {'loss': 0.0002, 'grad_norm': 0.49198155789045195, 'learning_rate': 8.576e-07, 'completion_length': 144.75000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005645751953125, 'epoch': 0.14} + 14%|█▍ | 356/2500 [1:50:12<10:31:30, 17.67s/it] 14%|█▍ | 357/2500 [1:50:31<10:38:47, 17.88s/it] {'loss': 0.0003, 'grad_norm': 0.48678932645549877, 'learning_rate': 8.571999999999999e-07, 'completion_length': 
170.83929443359375, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.14838215708732605, 'kl': 0.0068511962890625, 'epoch': 0.14} + 14%|█▍ | 357/2500 [1:50:31<10:38:47, 17.88s/it] 14%|█▍ | 358/2500 [1:50:50<10:53:51, 18.32s/it] {'loss': 0.0002, 'grad_norm': 0.2872992329874426, 'learning_rate': 8.568e-07, 'completion_length': 156.80358123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0048675537109375, 'epoch': 0.14} + 14%|█▍ | 358/2500 [1:50:50<10:53:51, 18.32s/it] 14%|█▍ | 359/2500 [1:51:08<10:45:23, 18.09s/it] {'loss': 0.0002, 'grad_norm': 0.4573811336946604, 'learning_rate': 8.564e-07, 'completion_length': 150.7857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0055084228515625, 'epoch': 0.14} + 14%|█▍ | 359/2500 [1:51:08<10:45:23, 18.09s/it] 14%|█▍ | 360/2500 [1:51:27<10:54:12, 18.34s/it] {'loss': 0.0002, 'grad_norm': 0.3970798472293535, 'learning_rate': 8.559999999999999e-07, 'completion_length': 162.6964340209961, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0050201416015625, 'epoch': 0.14} + 14%|█▍ | 360/2500 [1:51:27<10:54:12, 18.34s/it] 14%|█▍ | 361/2500 [1:51:45<11:00:24, 18.52s/it] {'loss': 0.0002, 'grad_norm': 0.1594186678931565, 'learning_rate': 8.556e-07, 'completion_length': 158.08929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048828125, 'epoch': 0.14} + 14%|█▍ | 361/2500 [1:51:45<11:00:24, 18.52s/it] 14%|█▍ | 362/2500 [1:52:04<10:59:40, 18.51s/it] {'loss': 0.0004, 'grad_norm': 0.4007243943243891, 'learning_rate': 8.551999999999999e-07, 
'completion_length': 170.4107208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.010101318359375, 'epoch': 0.14} + 14%|█▍ | 362/2500 [1:52:04<10:59:40, 18.51s/it] 15%|█▍ | 363/2500 [1:52:23<11:08:02, 18.76s/it] {'loss': 0.0002, 'grad_norm': 0.020041329463809494, 'learning_rate': 8.548e-07, 'completion_length': 158.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054931640625, 'epoch': 0.15} + 15%|█▍ | 363/2500 [1:52:23<11:08:02, 18.76s/it] 15%|█▍ | 364/2500 [1:52:41<10:55:07, 18.40s/it] {'loss': 0.0002, 'grad_norm': 0.8682114564605734, 'learning_rate': 8.544e-07, 'completion_length': 148.0714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00592041015625, 'epoch': 0.15} + 15%|█▍ | 364/2500 [1:52:41<10:55:07, 18.40s/it] 15%|█▍ | 365/2500 [1:52:58<10:43:58, 18.10s/it] {'loss': 0.0003, 'grad_norm': 0.36362953485073385, 'learning_rate': 8.539999999999999e-07, 'completion_length': 152.1964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0067901611328125, 'epoch': 0.15} + 15%|█▍ | 365/2500 [1:52:58<10:43:58, 18.10s/it] 15%|█▍ | 366/2500 [1:53:16<10:44:26, 18.12s/it] {'loss': 0.0003, 'grad_norm': 0.5840826952719076, 'learning_rate': 8.536e-07, 'completion_length': 160.71429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.007568359375, 'epoch': 0.15} + 15%|█▍ | 366/2500 [1:53:16<10:44:26, 18.12s/it] 15%|█▍ | 367/2500 [1:53:35<10:48:52, 18.25s/it] {'loss': 0.0002, 'grad_norm': 0.022178790394260357, 'learning_rate': 8.531999999999999e-07, 'completion_length': 
159.1428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005096435546875, 'epoch': 0.15} + 15%|█▍ | 367/2500 [1:53:35<10:48:52, 18.25s/it] 15%|█▍ | 368/2500 [1:53:53<10:49:17, 18.27s/it] {'loss': 0.0003, 'grad_norm': 0.6672953348455006, 'learning_rate': 8.528e-07, 'completion_length': 156.85714721679688, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.0062713623046875, 'epoch': 0.15} + 15%|█▍ | 368/2500 [1:53:53<10:49:17, 18.27s/it] 15%|█▍ | 369/2500 [1:54:13<11:03:14, 18.67s/it] {'loss': 0.0003, 'grad_norm': 0.3599577132025652, 'learning_rate': 8.524e-07, 'completion_length': 169.30357360839844, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.0079193115234375, 'epoch': 0.15} + 15%|█▍ | 369/2500 [1:54:13<11:03:14, 18.67s/it] 15%|█▍ | 370/2500 [1:54:32<11:03:15, 18.68s/it] {'loss': 0.0003, 'grad_norm': 0.5025124424453162, 'learning_rate': 8.52e-07, 'completion_length': 168.5178680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00787353515625, 'epoch': 0.15} + 15%|█▍ | 370/2500 [1:54:32<11:03:15, 18.68s/it] 15%|█▍ | 371/2500 [1:54:49<10:54:34, 18.45s/it] {'loss': 0.0002, 'grad_norm': 0.01914292309330038, 'learning_rate': 8.516e-07, 'completion_length': 149.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00482177734375, 'epoch': 0.15} + 15%|█▍ | 371/2500 [1:54:50<10:54:34, 18.45s/it] 15%|█▍ | 372/2500 [1:55:09<11:03:46, 18.72s/it] {'loss': 0.0003, 'grad_norm': 1.1435559868143794, 'learning_rate': 8.511999999999999e-07, 'completion_length': 163.39286041259766, 'rewards/accuracy_reward': 0.9107142984867096, 
'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0066986083984375, 'epoch': 0.15} + 15%|█▍ | 372/2500 [1:55:09<11:03:46, 18.72s/it] 15%|█▍ | 373/2500 [1:55:27<11:01:14, 18.65s/it] {'loss': 0.0003, 'grad_norm': 0.21876698042858028, 'learning_rate': 8.508e-07, 'completion_length': 165.75000762939453, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0069580078125, 'epoch': 0.15} + 15%|█▍ | 373/2500 [1:55:27<11:01:14, 18.65s/it] 15%|█▍ | 374/2500 [1:55:45<10:53:10, 18.43s/it] {'loss': 0.0002, 'grad_norm': 0.2553008914546184, 'learning_rate': 8.504e-07, 'completion_length': 150.96429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00527191162109375, 'epoch': 0.15} + 15%|█▍ | 374/2500 [1:55:45<10:53:10, 18.43s/it] 15%|█▌ | 375/2500 [1:56:05<11:11:13, 18.95s/it] {'loss': 0.0003, 'grad_norm': 0.2505567589151752, 'learning_rate': 8.499999999999999e-07, 'completion_length': 167.00000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00627899169921875, 'epoch': 0.15} + 15%|█▌ | 375/2500 [1:56:05<11:11:13, 18.95s/it] 15%|█▌ | 376/2500 [1:56:24<11:09:40, 18.92s/it] {'loss': 0.0003, 'grad_norm': 0.2631021483361884, 'learning_rate': 8.496e-07, 'completion_length': 167.1607208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.008209228515625, 'epoch': 0.15} + 15%|█▌ | 376/2500 [1:56:24<11:09:40, 18.92s/it] 15%|█▌ | 377/2500 [1:56:44<11:13:14, 19.03s/it] {'loss': 0.0002, 'grad_norm': 0.2811456047810402, 'learning_rate': 8.492e-07, 'completion_length': 165.6071548461914, 'rewards/accuracy_reward': 
0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.005889892578125, 'epoch': 0.15} + 15%|█▌ | 377/2500 [1:56:44<11:13:14, 19.03s/it] 15%|█▌ | 378/2500 [1:57:02<11:08:02, 18.89s/it] {'loss': 0.0003, 'grad_norm': 0.02007312303531381, 'learning_rate': 8.487999999999999e-07, 'completion_length': 162.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00665283203125, 'epoch': 0.15} + 15%|█▌ | 378/2500 [1:57:02<11:08:02, 18.89s/it] 15%|█▌ | 379/2500 [1:57:19<10:47:21, 18.31s/it] {'loss': 0.0002, 'grad_norm': 0.01860412348913907, 'learning_rate': 8.484e-07, 'completion_length': 145.875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005157470703125, 'epoch': 0.15} + 15%|█▌ | 379/2500 [1:57:19<10:47:21, 18.31s/it] 15%|█▌ | 380/2500 [1:57:37<10:43:30, 18.21s/it] {'loss': 0.0003, 'grad_norm': 0.6601840907875447, 'learning_rate': 8.48e-07, 'completion_length': 164.6428680419922, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.785714328289032, 'reward_std': 0.1539071798324585, 'kl': 0.00848388671875, 'epoch': 0.15} + 15%|█▌ | 380/2500 [1:57:37<10:43:30, 18.21s/it] 15%|█▌ | 381/2500 [1:57:55<10:45:31, 18.28s/it] {'loss': 0.0002, 'grad_norm': 0.8406677511238668, 'learning_rate': 8.475999999999999e-07, 'completion_length': 146.42857360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0061492919921875, 'epoch': 0.15} + 15%|█▌ | 381/2500 [1:57:55<10:45:31, 18.28s/it] 15%|█▌ | 382/2500 [1:58:14<10:49:00, 18.39s/it] {'loss': 0.0002, 'grad_norm': 0.15629809242553866, 'learning_rate': 8.471999999999999e-07, 'completion_length': 160.3214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 
1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059356689453125, 'epoch': 0.15} + 15%|█▌ | 382/2500 [1:58:14<10:49:00, 18.39s/it] 15%|█▌ | 383/2500 [1:58:33<10:50:53, 18.45s/it] {'loss': 0.0003, 'grad_norm': 0.29960601010161636, 'learning_rate': 8.468e-07, 'completion_length': 166.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0064239501953125, 'epoch': 0.15} + 15%|█▌ | 383/2500 [1:58:33<10:50:53, 18.45s/it] 15%|█▌ | 384/2500 [1:58:51<10:52:34, 18.50s/it] {'loss': 0.0002, 'grad_norm': 0.5572756702922187, 'learning_rate': 8.464e-07, 'completion_length': 146.0357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.00414276123046875, 'epoch': 0.15} + 15%|█▌ | 384/2500 [1:58:51<10:52:34, 18.50s/it] 15%|█▌ | 385/2500 [1:59:10<10:50:26, 18.45s/it] {'loss': 0.0002, 'grad_norm': 0.42567519761857237, 'learning_rate': 8.459999999999999e-07, 'completion_length': 151.42857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0056610107421875, 'epoch': 0.15} + 15%|█▌ | 385/2500 [1:59:10<10:50:26, 18.45s/it] 15%|█▌ | 386/2500 [1:59:28<10:44:05, 18.28s/it] {'loss': 0.0002, 'grad_norm': 1.1613665989656201, 'learning_rate': 8.456e-07, 'completion_length': 146.23214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00518798828125, 'epoch': 0.15} + 15%|█▌ | 386/2500 [1:59:28<10:44:05, 18.28s/it] 15%|█▌ | 387/2500 [1:59:46<10:45:32, 18.33s/it] {'loss': 0.0002, 'grad_norm': 0.2237581276174186, 'learning_rate': 8.451999999999999e-07, 'completion_length': 142.44644165039062, 'rewards/accuracy_reward': 0.9642857313156128, 
'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00445556640625, 'epoch': 0.15} + 15%|█▌ | 387/2500 [1:59:46<10:45:32, 18.33s/it] 16%|█▌ | 388/2500 [2:00:04<10:37:29, 18.11s/it] {'loss': 0.0002, 'grad_norm': 0.025061175889581108, 'learning_rate': 8.447999999999999e-07, 'completion_length': 143.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056304931640625, 'epoch': 0.16} + 16%|█▌ | 388/2500 [2:00:04<10:37:29, 18.11s/it] 16%|█▌ | 389/2500 [2:00:22<10:45:19, 18.34s/it] {'loss': 0.0002, 'grad_norm': 0.3323353668527655, 'learning_rate': 8.444e-07, 'completion_length': 158.21429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0052947998046875, 'epoch': 0.16} + 16%|█▌ | 389/2500 [2:00:22<10:45:19, 18.34s/it] 16%|█▌ | 390/2500 [2:00:40<10:39:42, 18.19s/it] {'loss': 0.0002, 'grad_norm': 0.20756509851530297, 'learning_rate': 8.439999999999999e-07, 'completion_length': 149.17858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0056915283203125, 'epoch': 0.16} + 16%|█▌ | 390/2500 [2:00:40<10:39:42, 18.19s/it] 16%|█▌ | 391/2500 [2:00:58<10:29:57, 17.92s/it] {'loss': 0.0002, 'grad_norm': 0.036374953417738644, 'learning_rate': 8.436e-07, 'completion_length': 158.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00457763671875, 'epoch': 0.16} + 16%|█▌ | 391/2500 [2:00:58<10:29:57, 17.92s/it] 16%|█▌ | 392/2500 [2:01:17<10:40:59, 18.24s/it] {'loss': 0.0002, 'grad_norm': 0.26209808345810093, 'learning_rate': 8.431999999999999e-07, 'completion_length': 146.46428680419922, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 
1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0052642822265625, 'epoch': 0.16} + 16%|█▌ | 392/2500 [2:01:17<10:40:59, 18.24s/it] 16%|█▌ | 393/2500 [2:01:35<10:36:58, 18.14s/it] {'loss': 0.0002, 'grad_norm': 0.3788571800620408, 'learning_rate': 8.428e-07, 'completion_length': 155.1607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0058746337890625, 'epoch': 0.16} + 16%|█▌ | 393/2500 [2:01:35<10:36:58, 18.14s/it] 16%|█▌ | 394/2500 [2:01:53<10:41:10, 18.27s/it] {'loss': 0.0003, 'grad_norm': 0.027764935211995818, 'learning_rate': 8.424e-07, 'completion_length': 157.87500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00738525390625, 'epoch': 0.16} + 16%|█▌ | 394/2500 [2:01:53<10:41:10, 18.27s/it] 16%|█▌ | 395/2500 [2:02:11<10:36:13, 18.13s/it] {'loss': 0.0003, 'grad_norm': 0.02883981533265674, 'learning_rate': 8.419999999999999e-07, 'completion_length': 153.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0067901611328125, 'epoch': 0.16} + 16%|█▌ | 395/2500 [2:02:11<10:36:13, 18.13s/it] 16%|█▌ | 396/2500 [2:02:30<10:44:06, 18.37s/it] {'loss': 0.0002, 'grad_norm': 1.2846551836773816, 'learning_rate': 8.416e-07, 'completion_length': 151.96429443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0060882568359375, 'epoch': 0.16} + 16%|█▌ | 396/2500 [2:02:30<10:44:06, 18.37s/it] 16%|█▌ | 397/2500 [2:02:48<10:38:15, 18.21s/it] {'loss': 0.0002, 'grad_norm': 0.026159118184487185, 'learning_rate': 8.411999999999999e-07, 'completion_length': 157.83929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.006134033203125, 
'epoch': 0.16} + 16%|█▌ | 397/2500 [2:02:48<10:38:15, 18.21s/it] 16%|█▌ | 398/2500 [2:03:04<10:21:38, 17.74s/it] {'loss': 0.0002, 'grad_norm': 0.09609348626992868, 'learning_rate': 8.408e-07, 'completion_length': 140.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0039825439453125, 'epoch': 0.16} + 16%|█▌ | 398/2500 [2:03:04<10:21:38, 17.74s/it] 16%|█▌ | 399/2500 [2:03:23<10:35:34, 18.15s/it] {'loss': 0.0003, 'grad_norm': 0.026039160796668275, 'learning_rate': 8.404e-07, 'completion_length': 150.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0069732666015625, 'epoch': 0.16} + 16%|█▌ | 399/2500 [2:03:23<10:35:34, 18.15s/it] 16%|█▌ | 400/2500 [2:03:41<10:27:59, 17.94s/it] {'loss': 0.0002, 'grad_norm': 0.5326658575033939, 'learning_rate': 8.399999999999999e-07, 'completion_length': 154.3571548461914, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.006195068359375, 'epoch': 0.16} + 16%|█▌ | 400/2500 [2:03:41<10:27:59, 17.94s/it] 16%|█▌ | 401/2500 [2:07:04<42:48:40, 73.43s/it] {'loss': 0.0002, 'grad_norm': 0.018314771022362442, 'learning_rate': 8.396e-07, 'completion_length': 142.83928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00472259521484375, 'epoch': 0.16} + 16%|█▌ | 401/2500 [2:07:04<42:48:40, 73.43s/it] 16%|█▌ | 402/2500 [2:07:22<33:13:12, 57.00s/it] {'loss': 0.0003, 'grad_norm': 0.16247346266875368, 'learning_rate': 8.391999999999999e-07, 'completion_length': 162.00000762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.006317138671875, 'epoch': 0.16} + 16%|█▌ | 402/2500 [2:07:22<33:13:12, 57.00s/it] 16%|█▌ | 403/2500 [2:07:41<26:27:54, 45.43s/it] 
{'loss': 0.0002, 'grad_norm': 0.19682845955317124, 'learning_rate': 8.387999999999999e-07, 'completion_length': 164.0357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00616455078125, 'epoch': 0.16} + 16%|█▌ | 403/2500 [2:07:41<26:27:54, 45.43s/it] 16%|█▌ | 404/2500 [2:08:00<21:48:23, 37.45s/it] {'loss': 0.0002, 'grad_norm': 0.022587846495832765, 'learning_rate': 8.384e-07, 'completion_length': 166.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0061492919921875, 'epoch': 0.16} + 16%|█▌ | 404/2500 [2:08:00<21:48:23, 37.45s/it] 16%|█▌ | 405/2500 [2:08:18<18:23:04, 31.59s/it] {'loss': 0.0002, 'grad_norm': 0.04087081533771532, 'learning_rate': 8.38e-07, 'completion_length': 147.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004730224609375, 'epoch': 0.16} + 16%|█▌ | 405/2500 [2:08:18<18:23:04, 31.59s/it] 16%|█▌ | 406/2500 [2:08:35<15:54:52, 27.36s/it] {'loss': 0.0002, 'grad_norm': 0.03659005352013964, 'learning_rate': 8.375999999999999e-07, 'completion_length': 145.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054779052734375, 'epoch': 0.16} + 16%|█▌ | 406/2500 [2:08:35<15:54:52, 27.36s/it] 16%|█▋ | 407/2500 [2:08:53<14:14:03, 24.48s/it] {'loss': 0.0003, 'grad_norm': 0.34208945240402516, 'learning_rate': 8.372e-07, 'completion_length': 155.08929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0084381103515625, 'epoch': 0.16} + 16%|█▋ | 407/2500 [2:08:53<14:14:03, 24.48s/it] 16%|█▋ | 408/2500 [2:09:10<13:00:31, 22.39s/it] {'loss': 0.0003, 'grad_norm': 0.03160983006053219, 'learning_rate': 8.368e-07, 'completion_length': 
161.21428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0075225830078125, 'epoch': 0.16} + 16%|█▋ | 408/2500 [2:09:10<13:00:31, 22.39s/it] 16%|█▋ | 409/2500 [2:09:29<12:15:38, 21.11s/it] {'loss': 0.0004, 'grad_norm': 0.39846121174186855, 'learning_rate': 8.363999999999999e-07, 'completion_length': 172.60714721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1181928999722004, 'kl': 0.009552001953125, 'epoch': 0.16} + 16%|█▋ | 409/2500 [2:09:29<12:15:38, 21.11s/it] 16%|█▋ | 410/2500 [2:09:46<11:38:51, 20.06s/it] {'loss': 0.0002, 'grad_norm': 0.19795952366315245, 'learning_rate': 8.359999999999999e-07, 'completion_length': 155.55358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0051422119140625, 'epoch': 0.16} + 16%|█▋ | 410/2500 [2:09:46<11:38:51, 20.06s/it] 16%|█▋ | 411/2500 [2:10:06<11:38:57, 20.08s/it] {'loss': 0.0002, 'grad_norm': 0.20234158300442215, 'learning_rate': 8.356e-07, 'completion_length': 165.17858123779297, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0050506591796875, 'epoch': 0.16} + 16%|█▋ | 411/2500 [2:10:06<11:38:57, 20.08s/it] 16%|█▋ | 412/2500 [2:10:24<11:15:43, 19.42s/it] {'loss': 0.0002, 'grad_norm': 0.014815439070875451, 'learning_rate': 8.352000000000001e-07, 'completion_length': 154.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0060882568359375, 'epoch': 0.16} + 16%|█▋ | 412/2500 [2:10:24<11:15:43, 19.42s/it] 17%|█▋ | 413/2500 [2:10:43<11:05:45, 19.14s/it] {'loss': 0.0003, 'grad_norm': 0.577623103899781, 'learning_rate': 8.347999999999999e-07, 'completion_length': 
162.98214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.0068511962890625, 'epoch': 0.17} + 17%|█▋ | 413/2500 [2:10:43<11:05:45, 19.14s/it] 17%|█▋ | 414/2500 [2:11:01<10:53:12, 18.79s/it] {'loss': 0.0002, 'grad_norm': 0.01817837404807306, 'learning_rate': 8.344e-07, 'completion_length': 168.0178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00567626953125, 'epoch': 0.17} + 17%|█▋ | 414/2500 [2:11:01<10:53:12, 18.79s/it] 17%|█▋ | 415/2500 [2:11:19<10:45:11, 18.57s/it] {'loss': 0.0003, 'grad_norm': 0.44276924930738115, 'learning_rate': 8.34e-07, 'completion_length': 146.80358123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00665283203125, 'epoch': 0.17} + 17%|█▋ | 415/2500 [2:11:19<10:45:11, 18.57s/it] 17%|█▋ | 416/2500 [2:11:37<10:46:06, 18.60s/it] {'loss': 0.0003, 'grad_norm': 0.2392254543491441, 'learning_rate': 8.335999999999999e-07, 'completion_length': 166.25000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00640869140625, 'epoch': 0.17} + 17%|█▋ | 416/2500 [2:11:37<10:46:06, 18.60s/it] 17%|█▋ | 417/2500 [2:11:55<10:35:44, 18.31s/it] {'loss': 0.0002, 'grad_norm': 0.4110183202613614, 'learning_rate': 8.332e-07, 'completion_length': 140.3214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0048675537109375, 'epoch': 0.17} + 17%|█▋ | 417/2500 [2:11:55<10:35:44, 18.31s/it] 17%|█▋ | 418/2500 [2:12:13<10:31:56, 18.21s/it] {'loss': 0.0002, 'grad_norm': 0.3109162395043459, 'learning_rate': 8.328e-07, 'completion_length': 152.25000762939453, 
'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0059967041015625, 'epoch': 0.17} + 17%|█▋ | 418/2500 [2:12:13<10:31:56, 18.21s/it] 17%|█▋ | 419/2500 [2:12:33<10:46:07, 18.63s/it] {'loss': 0.0002, 'grad_norm': 0.5964645256538879, 'learning_rate': 8.324e-07, 'completion_length': 155.9107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.0059814453125, 'epoch': 0.17} + 17%|█▋ | 419/2500 [2:12:33<10:46:07, 18.63s/it] 17%|█▋ | 420/2500 [2:12:50<10:36:32, 18.36s/it] {'loss': 0.0003, 'grad_norm': 3.561026249627075, 'learning_rate': 8.319999999999999e-07, 'completion_length': 146.50000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00738525390625, 'epoch': 0.17} + 17%|█▋ | 420/2500 [2:12:50<10:36:32, 18.36s/it] 17%|█▋ | 421/2500 [2:13:07<10:23:15, 17.99s/it] {'loss': 0.0002, 'grad_norm': 0.43273922645526175, 'learning_rate': 8.316e-07, 'completion_length': 139.23214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00482177734375, 'epoch': 0.17} + 17%|█▋ | 421/2500 [2:13:07<10:23:15, 17.99s/it] 17%|█▋ | 422/2500 [2:13:26<10:30:12, 18.20s/it] {'loss': 0.0002, 'grad_norm': 0.48139668127545404, 'learning_rate': 8.312e-07, 'completion_length': 161.7678680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0052947998046875, 'epoch': 0.17} + 17%|█▋ | 422/2500 [2:13:26<10:30:12, 18.20s/it] 17%|█▋ | 423/2500 [2:13:45<10:38:34, 18.45s/it] {'loss': 0.0002, 'grad_norm': 0.34820010241241595, 'learning_rate': 8.308e-07, 'completion_length': 
155.4107208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00616455078125, 'epoch': 0.17} + 17%|█▋ | 423/2500 [2:13:45<10:38:34, 18.45s/it] 17%|█▋ | 424/2500 [2:14:02<10:25:35, 18.08s/it] {'loss': 0.0003, 'grad_norm': 1.1747154781874494, 'learning_rate': 8.304e-07, 'completion_length': 139.58929443359375, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.00689697265625, 'epoch': 0.17} + 17%|█▋ | 424/2500 [2:14:02<10:25:35, 18.08s/it] 17%|█▋ | 425/2500 [2:14:20<10:16:08, 17.82s/it] {'loss': 0.0002, 'grad_norm': 0.4164156684368373, 'learning_rate': 8.299999999999999e-07, 'completion_length': 148.05358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0047607421875, 'epoch': 0.17} + 17%|█▋ | 425/2500 [2:14:20<10:16:08, 17.82s/it] 17%|█▋ | 426/2500 [2:14:39<10:29:58, 18.22s/it] {'loss': 0.0002, 'grad_norm': 0.05309524498768394, 'learning_rate': 8.296e-07, 'completion_length': 149.8214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00567626953125, 'epoch': 0.17} + 17%|█▋ | 426/2500 [2:14:39<10:29:58, 18.22s/it] 17%|█▋ | 427/2500 [2:14:56<10:15:49, 17.82s/it] {'loss': 0.0002, 'grad_norm': 0.1491957729554956, 'learning_rate': 8.292e-07, 'completion_length': 147.83929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006195068359375, 'epoch': 0.17} + 17%|█▋ | 427/2500 [2:14:56<10:15:49, 17.82s/it] 17%|█▋ | 428/2500 [2:15:14<10:18:11, 17.90s/it] {'loss': 0.0002, 'grad_norm': 0.4549682167109727, 'learning_rate': 8.287999999999999e-07, 'completion_length': 157.85714721679688, 'rewards/accuracy_reward': 
0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0052337646484375, 'epoch': 0.17} + 17%|█▋ | 428/2500 [2:15:14<10:18:11, 17.90s/it] 17%|█▋ | 429/2500 [2:15:36<11:03:10, 19.21s/it] {'loss': 0.0002, 'grad_norm': 0.4925356301056704, 'learning_rate': 8.284e-07, 'completion_length': 165.4821548461914, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8928571939468384, 'reward_std': 0.10410194098949432, 'kl': 0.00543212890625, 'epoch': 0.17} + 17%|█▋ | 429/2500 [2:15:36<11:03:10, 19.21s/it] 17%|█▋ | 430/2500 [2:15:54<10:48:32, 18.80s/it] {'loss': 0.0002, 'grad_norm': 1.685148723383936, 'learning_rate': 8.28e-07, 'completion_length': 149.58928680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0045318603515625, 'epoch': 0.17} + 17%|█▋ | 430/2500 [2:15:54<10:48:32, 18.80s/it] 17%|█▋ | 431/2500 [2:16:11<10:36:36, 18.46s/it] {'loss': 0.0003, 'grad_norm': 0.01913371250109772, 'learning_rate': 8.275999999999999e-07, 'completion_length': 157.07144165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007476806640625, 'epoch': 0.17} + 17%|█▋ | 431/2500 [2:16:11<10:36:36, 18.46s/it] 17%|█▋ | 432/2500 [2:16:29<10:24:38, 18.12s/it] {'loss': 0.0003, 'grad_norm': 0.2856110099832292, 'learning_rate': 8.272e-07, 'completion_length': 156.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0064849853515625, 'epoch': 0.17} + 17%|█▋ | 432/2500 [2:16:29<10:24:38, 18.12s/it] 17%|█▋ | 433/2500 [2:16:46<10:17:25, 17.92s/it] {'loss': 0.0003, 'grad_norm': 0.26935687348161275, 'learning_rate': 8.268e-07, 'completion_length': 149.33929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 
'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007232666015625, 'epoch': 0.17} + 17%|█▋ | 433/2500 [2:16:46<10:17:25, 17.92s/it] 17%|█▋ | 434/2500 [2:17:03<10:10:08, 17.72s/it] {'loss': 0.0003, 'grad_norm': 0.38910965920253004, 'learning_rate': 8.263999999999999e-07, 'completion_length': 145.33929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.1071428656578064, 'kl': 0.006439208984375, 'epoch': 0.17} + 17%|█▋ | 434/2500 [2:17:03<10:10:08, 17.72s/it] 17%|█▋ | 435/2500 [2:17:23<10:25:45, 18.18s/it] {'loss': 0.0003, 'grad_norm': 0.36279740644692055, 'learning_rate': 8.259999999999999e-07, 'completion_length': 169.17858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.007080078125, 'epoch': 0.17} + 17%|█▋ | 435/2500 [2:17:23<10:25:45, 18.18s/it] 17%|█▋ | 436/2500 [2:17:41<10:25:59, 18.20s/it] {'loss': 0.0002, 'grad_norm': 0.020759212858682805, 'learning_rate': 8.256e-07, 'completion_length': 166.8928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00447845458984375, 'epoch': 0.17} + 17%|█▋ | 436/2500 [2:17:41<10:25:59, 18.20s/it] 17%|█▋ | 437/2500 [2:18:00<10:38:13, 18.56s/it] {'loss': 0.0003, 'grad_norm': 0.02057888185434288, 'learning_rate': 8.252000000000001e-07, 'completion_length': 155.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006317138671875, 'epoch': 0.17} + 17%|█▋ | 437/2500 [2:18:00<10:38:13, 18.56s/it] 18%|█▊ | 438/2500 [2:18:18<10:27:01, 18.25s/it] {'loss': 0.0002, 'grad_norm': 1.728664387889621, 'learning_rate': 8.247999999999999e-07, 'completion_length': 157.4464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 
1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.005859375, 'epoch': 0.18} + 18%|█▊ | 438/2500 [2:18:18<10:27:01, 18.25s/it] 18%|█▊ | 439/2500 [2:18:36<10:29:55, 18.34s/it] {'loss': 0.0002, 'grad_norm': 0.0226563163415976, 'learning_rate': 8.244e-07, 'completion_length': 152.42858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0055999755859375, 'epoch': 0.18} + 18%|█▊ | 439/2500 [2:18:36<10:29:55, 18.34s/it] 18%|█▊ | 440/2500 [2:18:55<10:32:03, 18.41s/it] {'loss': 0.0002, 'grad_norm': 0.35007095479636335, 'learning_rate': 8.24e-07, 'completion_length': 158.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.00592041015625, 'epoch': 0.18} + 18%|█▊ | 440/2500 [2:18:55<10:32:03, 18.41s/it] 18%|█▊ | 441/2500 [2:19:14<10:35:08, 18.51s/it] {'loss': 0.0002, 'grad_norm': 0.029422705991305937, 'learning_rate': 8.235999999999999e-07, 'completion_length': 141.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0041351318359375, 'epoch': 0.18} + 18%|█▊ | 441/2500 [2:19:14<10:35:08, 18.51s/it] 18%|█▊ | 442/2500 [2:19:34<10:52:40, 19.03s/it] {'loss': 0.0002, 'grad_norm': 0.4476293273413355, 'learning_rate': 8.232e-07, 'completion_length': 159.60714721679688, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1539071798324585, 'kl': 0.006103515625, 'epoch': 0.18} + 18%|█▊ | 442/2500 [2:19:34<10:52:40, 19.03s/it] 18%|█▊ | 443/2500 [2:19:53<10:56:01, 19.14s/it] {'loss': 0.0003, 'grad_norm': 0.4084140954128988, 'learning_rate': 8.228e-07, 'completion_length': 146.8214340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 
0.0063323974609375, 'epoch': 0.18} + 18%|█▊ | 443/2500 [2:19:53<10:56:01, 19.14s/it] 18%|█▊ | 444/2500 [2:20:12<10:46:19, 18.86s/it] {'loss': 0.0002, 'grad_norm': 0.01918990495310585, 'learning_rate': 8.224e-07, 'completion_length': 164.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00445556640625, 'epoch': 0.18} + 18%|█▊ | 444/2500 [2:20:12<10:46:19, 18.86s/it] 18%|█▊ | 445/2500 [2:20:30<10:41:59, 18.74s/it] {'loss': 0.0003, 'grad_norm': 0.41522024889254483, 'learning_rate': 8.219999999999999e-07, 'completion_length': 172.1607208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0066680908203125, 'epoch': 0.18} + 18%|█▊ | 445/2500 [2:20:30<10:41:59, 18.74s/it] 18%|█▊ | 446/2500 [2:20:49<10:38:14, 18.64s/it] {'loss': 0.0002, 'grad_norm': 1.7280816239617367, 'learning_rate': 8.216e-07, 'completion_length': 146.82144165039062, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0046539306640625, 'epoch': 0.18} + 18%|█▊ | 446/2500 [2:20:49<10:38:14, 18.64s/it] 18%|█▊ | 447/2500 [2:21:06<10:27:27, 18.34s/it] {'loss': 0.0002, 'grad_norm': 0.022872385685801192, 'learning_rate': 8.212e-07, 'completion_length': 150.55358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0060882568359375, 'epoch': 0.18} + 18%|█▊ | 447/2500 [2:21:06<10:27:27, 18.34s/it] 18%|█▊ | 448/2500 [2:21:24<10:27:26, 18.35s/it] {'loss': 0.0003, 'grad_norm': 0.3320825068401449, 'learning_rate': 8.207999999999999e-07, 'completion_length': 171.5714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0078582763671875, 'epoch': 0.18} + 
18%|█▊ | 448/2500 [2:21:24<10:27:26, 18.35s/it] 18%|█▊ | 449/2500 [2:21:43<10:26:29, 18.33s/it] {'loss': 0.0003, 'grad_norm': 0.3370396044362829, 'learning_rate': 8.204e-07, 'completion_length': 159.8928680419922, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428656578064, 'kl': 0.006561279296875, 'epoch': 0.18} + 18%|█▊ | 449/2500 [2:21:43<10:26:29, 18.33s/it] 18%|█▊ | 450/2500 [2:22:01<10:23:31, 18.25s/it] {'loss': 0.0002, 'grad_norm': 0.19694825402502952, 'learning_rate': 8.199999999999999e-07, 'completion_length': 154.67857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004241943359375, 'epoch': 0.18} + 18%|█▊ | 450/2500 [2:22:01<10:23:31, 18.25s/it] 18%|█▊ | 451/2500 [2:22:19<10:20:12, 18.16s/it] {'loss': 0.0002, 'grad_norm': 0.4152666431197066, 'learning_rate': 8.196e-07, 'completion_length': 151.17858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.00434112548828125, 'epoch': 0.18} + 18%|█▊ | 451/2500 [2:22:19<10:20:12, 18.16s/it] 18%|█▊ | 452/2500 [2:22:38<10:30:04, 18.46s/it] {'loss': 0.0003, 'grad_norm': 1.4155466626026056, 'learning_rate': 8.192e-07, 'completion_length': 159.73214721679688, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.04123930633068085, 'kl': 0.0072479248046875, 'epoch': 0.18} + 18%|█▊ | 452/2500 [2:22:38<10:30:04, 18.46s/it] 18%|█▊ | 453/2500 [2:22:56<10:29:42, 18.46s/it] {'loss': 0.0002, 'grad_norm': 0.41231811067252805, 'learning_rate': 8.187999999999999e-07, 'completion_length': 171.39286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0054931640625, 
'epoch': 0.18} + 18%|█▊ | 453/2500 [2:22:56<10:29:42, 18.46s/it] 18%|█▊ | 454/2500 [2:23:15<10:35:48, 18.65s/it] {'loss': 0.0003, 'grad_norm': 0.24134568020241054, 'learning_rate': 8.184e-07, 'completion_length': 171.6964340209961, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.0714285746216774, 'kl': 0.0068817138671875, 'epoch': 0.18} + 18%|█▊ | 454/2500 [2:23:15<10:35:48, 18.65s/it] 18%|█▊ | 455/2500 [2:23:34<10:31:09, 18.52s/it] {'loss': 0.0002, 'grad_norm': 0.045276781694889495, 'learning_rate': 8.179999999999999e-07, 'completion_length': 166.07144165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056610107421875, 'epoch': 0.18} + 18%|█▊ | 455/2500 [2:23:34<10:31:09, 18.52s/it] 18%|█▊ | 456/2500 [2:23:54<10:50:25, 19.09s/it] {'loss': 0.0002, 'grad_norm': 0.32296192751762615, 'learning_rate': 8.175999999999999e-07, 'completion_length': 149.67858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0038604736328125, 'epoch': 0.18} + 18%|█▊ | 456/2500 [2:23:54<10:50:25, 19.09s/it] 18%|█▊ | 457/2500 [2:24:16<11:20:16, 19.98s/it] {'loss': 0.0002, 'grad_norm': 0.8311281601035484, 'learning_rate': 8.172e-07, 'completion_length': 148.2857208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0056304931640625, 'epoch': 0.18} + 18%|█▊ | 457/2500 [2:24:16<11:20:16, 19.98s/it] 18%|█▊ | 458/2500 [2:24:40<11:54:57, 21.01s/it] {'loss': 0.0002, 'grad_norm': 0.24957039438805517, 'learning_rate': 8.168e-07, 'completion_length': 175.42857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0051422119140625, 'epoch': 0.18} + 18%|█▊ 
| 458/2500 [2:24:40<11:54:57, 21.01s/it] 18%|█▊ | 459/2500 [2:25:01<11:58:19, 21.12s/it] {'loss': 0.0002, 'grad_norm': 0.23031899663451547, 'learning_rate': 8.163999999999999e-07, 'completion_length': 146.8928680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0042724609375, 'epoch': 0.18} + 18%|█▊ | 459/2500 [2:25:01<11:58:19, 21.12s/it] 18%|█▊ | 460/2500 [2:25:24<12:13:51, 21.58s/it] {'loss': 0.0002, 'grad_norm': 0.03383512139468016, 'learning_rate': 8.159999999999999e-07, 'completion_length': 158.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045318603515625, 'epoch': 0.18} + 18%|█▊ | 460/2500 [2:25:24<12:13:51, 21.58s/it] 18%|█▊ | 461/2500 [2:25:46<12:16:40, 21.68s/it] {'loss': 0.0002, 'grad_norm': 0.383704233876472, 'learning_rate': 8.156e-07, 'completion_length': 162.80358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0061187744140625, 'epoch': 0.18} + 18%|█▊ | 461/2500 [2:25:46<12:16:40, 21.68s/it] 18%|█▊ | 462/2500 [2:26:07<12:18:40, 21.75s/it] {'loss': 0.0002, 'grad_norm': 0.23175656509930267, 'learning_rate': 8.152e-07, 'completion_length': 163.25000762939453, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.006195068359375, 'epoch': 0.18} + 18%|█▊ | 462/2500 [2:26:07<12:18:40, 21.75s/it] 19%|█▊ | 463/2500 [2:26:30<12:28:44, 22.05s/it] {'loss': 0.0002, 'grad_norm': 0.038391899317583744, 'learning_rate': 8.147999999999999e-07, 'completion_length': 155.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050811767578125, 'epoch': 0.19} + 19%|█▊ | 463/2500 [2:26:30<12:28:44, 22.05s/it] 19%|█▊ | 464/2500 
[2:26:53<12:40:38, 22.42s/it] {'loss': 0.0002, 'grad_norm': 0.03062137038778199, 'learning_rate': 8.144e-07, 'completion_length': 155.83929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00494384765625, 'epoch': 0.19} + 19%|█▊ | 464/2500 [2:26:53<12:40:38, 22.42s/it] 19%|█▊ | 465/2500 [2:27:17<12:47:29, 22.63s/it] {'loss': 0.0002, 'grad_norm': 0.34250543372725023, 'learning_rate': 8.14e-07, 'completion_length': 153.25000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0057525634765625, 'epoch': 0.19} + 19%|█▊ | 465/2500 [2:27:17<12:47:29, 22.63s/it] 19%|█▊ | 466/2500 [2:27:39<12:44:05, 22.54s/it] {'loss': 0.0002, 'grad_norm': 0.2804080653074512, 'learning_rate': 8.135999999999999e-07, 'completion_length': 156.83929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00409698486328125, 'epoch': 0.19} + 19%|█▊ | 466/2500 [2:27:39<12:44:05, 22.54s/it] 19%|█▊ | 467/2500 [2:28:02<12:45:40, 22.60s/it] {'loss': 0.0002, 'grad_norm': 0.051696979425763005, 'learning_rate': 8.132e-07, 'completion_length': 158.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0057220458984375, 'epoch': 0.19} + 19%|█▊ | 467/2500 [2:28:02<12:45:40, 22.60s/it] 19%|█▊ | 468/2500 [2:28:24<12:41:42, 22.49s/it] {'loss': 0.0002, 'grad_norm': 0.3726609276606978, 'learning_rate': 8.128e-07, 'completion_length': 139.07144165039062, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00414276123046875, 'epoch': 0.19} + 19%|█▊ | 468/2500 [2:28:24<12:41:42, 22.49s/it] 19%|█▉ | 469/2500 [2:28:47<12:43:37, 22.56s/it] {'loss': 0.0002, 
'grad_norm': 0.20206879950604417, 'learning_rate': 8.123999999999999e-07, 'completion_length': 153.21429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005828857421875, 'epoch': 0.19} + 19%|█▉ | 469/2500 [2:28:47<12:43:37, 22.56s/it] 19%|█▉ | 470/2500 [2:29:09<12:43:39, 22.57s/it] {'loss': 0.0002, 'grad_norm': 0.22357372997914998, 'learning_rate': 8.12e-07, 'completion_length': 154.30357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0042877197265625, 'epoch': 0.19} + 19%|█▉ | 470/2500 [2:29:09<12:43:39, 22.57s/it] 19%|█▉ | 471/2500 [2:29:33<12:59:13, 23.04s/it] {'loss': 0.0002, 'grad_norm': 2.3538126258503027, 'learning_rate': 8.116e-07, 'completion_length': 166.6964340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00592041015625, 'epoch': 0.19} + 19%|█▉ | 471/2500 [2:29:33<12:59:13, 23.04s/it] 19%|█▉ | 472/2500 [2:29:55<12:39:32, 22.47s/it] {'loss': 0.0002, 'grad_norm': 0.019827440023319386, 'learning_rate': 8.112e-07, 'completion_length': 151.46429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0042724609375, 'epoch': 0.19} + 19%|█▉ | 472/2500 [2:29:55<12:39:32, 22.47s/it] 19%|█▉ | 473/2500 [2:30:17<12:35:25, 22.36s/it] {'loss': 0.0002, 'grad_norm': 0.19825006485500338, 'learning_rate': 8.107999999999999e-07, 'completion_length': 149.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0045166015625, 'epoch': 0.19} + 19%|█▉ | 473/2500 [2:30:17<12:35:25, 22.36s/it] 19%|█▉ | 474/2500 [2:30:40<12:41:10, 22.54s/it] {'loss': 0.0002, 
'grad_norm': 0.7054054020549971, 'learning_rate': 8.104e-07, 'completion_length': 164.51786041259766, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1181928962469101, 'kl': 0.005157470703125, 'epoch': 0.19} + 19%|█▉ | 474/2500 [2:30:40<12:41:10, 22.54s/it] 19%|█▉ | 475/2500 [2:31:02<12:39:03, 22.49s/it] {'loss': 0.0003, 'grad_norm': 0.9313504101059066, 'learning_rate': 8.1e-07, 'completion_length': 172.71429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.008209228515625, 'epoch': 0.19} + 19%|█▉ | 475/2500 [2:31:02<12:39:03, 22.49s/it] 19%|█▉ | 476/2500 [2:31:23<12:28:13, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.0336933404424004, 'learning_rate': 8.095999999999999e-07, 'completion_length': 151.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053253173828125, 'epoch': 0.19} + 19%|█▉ | 476/2500 [2:31:23<12:28:13, 22.18s/it] 19%|█▉ | 477/2500 [2:31:46<12:33:50, 22.36s/it] {'loss': 0.0002, 'grad_norm': 0.2687728493064335, 'learning_rate': 8.092e-07, 'completion_length': 153.30357360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00469970703125, 'epoch': 0.19} + 19%|█▉ | 477/2500 [2:31:46<12:33:50, 22.36s/it] 19%|█▉ | 478/2500 [2:32:08<12:28:52, 22.22s/it] {'loss': 0.0002, 'grad_norm': 1.069451988058492, 'learning_rate': 8.087999999999999e-07, 'completion_length': 145.87500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004638671875, 'epoch': 0.19} + 19%|█▉ | 478/2500 [2:32:08<12:28:52, 22.22s/it] 19%|█▉ | 479/2500 [2:32:30<12:28:09, 22.21s/it] {'loss': 0.0002, 'grad_norm': 0.07424418367523979, 
'learning_rate': 8.084e-07, 'completion_length': 155.05357360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005035400390625, 'epoch': 0.19} + 19%|█▉ | 479/2500 [2:32:30<12:28:09, 22.21s/it] 19%|█▉ | 480/2500 [2:32:54<12:47:41, 22.80s/it] {'loss': 0.0003, 'grad_norm': 0.3769289297813307, 'learning_rate': 8.08e-07, 'completion_length': 181.5178680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0073699951171875, 'epoch': 0.19} + 19%|█▉ | 480/2500 [2:32:54<12:47:41, 22.80s/it] 19%|█▉ | 481/2500 [2:33:18<12:55:43, 23.05s/it] {'loss': 0.0002, 'grad_norm': 0.29189737218936995, 'learning_rate': 8.075999999999999e-07, 'completion_length': 170.92858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0055694580078125, 'epoch': 0.19} + 19%|█▉ | 481/2500 [2:33:18<12:55:43, 23.05s/it] 19%|█▉ | 482/2500 [2:33:41<12:50:57, 22.92s/it] {'loss': 0.0003, 'grad_norm': 0.7057318346013193, 'learning_rate': 8.072e-07, 'completion_length': 168.6607208251953, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0062713623046875, 'epoch': 0.19} + 19%|█▉ | 482/2500 [2:33:41<12:50:57, 22.92s/it] 19%|█▉ | 483/2500 [2:34:02<12:29:39, 22.30s/it] {'loss': 0.0002, 'grad_norm': 0.019877025038262962, 'learning_rate': 8.067999999999999e-07, 'completion_length': 155.60714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00567626953125, 'epoch': 0.19} + 19%|█▉ | 483/2500 [2:34:02<12:29:39, 22.30s/it] 19%|█▉ | 484/2500 [2:34:24<12:34:40, 22.46s/it] {'loss': 0.0002, 'grad_norm': 0.034016052815379866, 
'learning_rate': 8.064e-07, 'completion_length': 160.0357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005950927734375, 'epoch': 0.19} + 19%|█▉ | 484/2500 [2:34:24<12:34:40, 22.46s/it] 19%|█▉ | 485/2500 [2:34:48<12:43:45, 22.74s/it] {'loss': 0.0002, 'grad_norm': 0.2823375313366844, 'learning_rate': 8.06e-07, 'completion_length': 155.7678680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0062103271484375, 'epoch': 0.19} + 19%|█▉ | 485/2500 [2:34:48<12:43:45, 22.74s/it] 19%|█▉ | 486/2500 [2:35:11<12:48:47, 22.90s/it] {'loss': 0.0002, 'grad_norm': 0.18726000683872565, 'learning_rate': 8.056e-07, 'completion_length': 156.26786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0052337646484375, 'epoch': 0.19} + 19%|█▉ | 486/2500 [2:35:11<12:48:47, 22.90s/it] 19%|█▉ | 487/2500 [2:35:33<12:41:41, 22.70s/it] {'loss': 0.0002, 'grad_norm': 0.41518519034309237, 'learning_rate': 8.052e-07, 'completion_length': 160.58929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005859375, 'epoch': 0.19} + 19%|█▉ | 487/2500 [2:35:33<12:41:41, 22.70s/it] 20%|█▉ | 488/2500 [2:35:57<12:51:54, 23.02s/it] {'loss': 0.0003, 'grad_norm': 0.8190059038938612, 'learning_rate': 8.047999999999999e-07, 'completion_length': 167.26786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.008148193359375, 'epoch': 0.2} + 20%|█▉ | 488/2500 [2:35:57<12:51:54, 23.02s/it] 20%|█▉ | 489/2500 [2:36:20<12:50:07, 22.98s/it] {'loss': 0.0002, 'grad_norm': 0.2695495435495721, 'learning_rate': 
8.044e-07, 'completion_length': 160.83929443359375, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.005096435546875, 'epoch': 0.2} + 20%|█▉ | 489/2500 [2:36:20<12:50:07, 22.98s/it] 20%|█▉ | 490/2500 [2:36:44<13:01:55, 23.34s/it] {'loss': 0.0002, 'grad_norm': 0.4189915024010795, 'learning_rate': 8.04e-07, 'completion_length': 160.35714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00518798828125, 'epoch': 0.2} + 20%|█▉ | 490/2500 [2:36:44<13:01:55, 23.34s/it] 20%|█▉ | 491/2500 [2:37:06<12:49:34, 22.98s/it] {'loss': 0.0002, 'grad_norm': 0.23220850058097312, 'learning_rate': 8.035999999999999e-07, 'completion_length': 144.67858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00414276123046875, 'epoch': 0.2} + 20%|█▉ | 491/2500 [2:37:06<12:49:34, 22.98s/it] 20%|█▉ | 492/2500 [2:37:28<12:36:26, 22.60s/it] {'loss': 0.0002, 'grad_norm': 0.04790861015243064, 'learning_rate': 8.032e-07, 'completion_length': 145.44644165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052337646484375, 'epoch': 0.2} + 20%|█▉ | 492/2500 [2:37:28<12:36:26, 22.60s/it] 20%|█▉ | 493/2500 [2:37:51<12:38:24, 22.67s/it] {'loss': 0.0002, 'grad_norm': 0.01824551220253725, 'learning_rate': 8.028e-07, 'completion_length': 156.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.004180908203125, 'epoch': 0.2} + 20%|█▉ | 493/2500 [2:37:51<12:38:24, 22.67s/it] 20%|█▉ | 494/2500 [2:38:13<12:37:54, 22.67s/it] {'loss': 0.0003, 'grad_norm': 0.3026007732212036, 'learning_rate': 8.023999999999999e-07, 'completion_length': 
164.6428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0073699951171875, 'epoch': 0.2} + 20%|█▉ | 494/2500 [2:38:13<12:37:54, 22.67s/it] 20%|█▉ | 495/2500 [2:38:35<12:26:29, 22.34s/it] {'loss': 0.0002, 'grad_norm': 0.026005164420037207, 'learning_rate': 8.02e-07, 'completion_length': 160.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048065185546875, 'epoch': 0.2} + 20%|█▉ | 495/2500 [2:38:35<12:26:29, 22.34s/it] 20%|█▉ | 496/2500 [2:38:57<12:18:30, 22.11s/it] {'loss': 0.0003, 'grad_norm': 0.6972537476434894, 'learning_rate': 8.016e-07, 'completion_length': 158.2321548461914, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.0081939697265625, 'epoch': 0.2} + 20%|█▉ | 496/2500 [2:38:57<12:18:30, 22.11s/it] 20%|█▉ | 497/2500 [2:39:19<12:17:47, 22.10s/it] {'loss': 0.0002, 'grad_norm': 0.02516963532364619, 'learning_rate': 8.012e-07, 'completion_length': 156.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00469970703125, 'epoch': 0.2} + 20%|█▉ | 497/2500 [2:39:19<12:17:47, 22.10s/it] 20%|█▉ | 498/2500 [2:39:41<12:23:47, 22.29s/it] {'loss': 0.0003, 'grad_norm': 0.42138989910325214, 'learning_rate': 8.007999999999999e-07, 'completion_length': 162.67857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0064544677734375, 'epoch': 0.2} + 20%|█▉ | 498/2500 [2:39:41<12:23:47, 22.29s/it] 20%|█▉ | 499/2500 [2:40:04<12:29:50, 22.48s/it] {'loss': 0.0002, 'grad_norm': 0.1944380922113119, 'learning_rate': 8.004e-07, 'completion_length': 165.89286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 
'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006103515625, 'epoch': 0.2} + 20%|█▉ | 499/2500 [2:40:04<12:29:50, 22.48s/it] 20%|██ | 500/2500 [2:40:27<12:30:29, 22.51s/it] {'loss': 0.0003, 'grad_norm': 0.41718241975550524, 'learning_rate': 8e-07, 'completion_length': 166.3214340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.006256103515625, 'epoch': 0.2} + 20%|██ | 500/2500 [2:40:27<12:30:29, 22.51s/it] 20%|██ | 501/2500 [2:43:50<42:32:58, 76.63s/it] {'loss': 0.0002, 'grad_norm': 0.7126683181086774, 'learning_rate': 7.995999999999999e-07, 'completion_length': 157.60714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005615234375, 'epoch': 0.2} + 20%|██ | 501/2500 [2:43:50<42:32:58, 76.63s/it] 20%|██ | 502/2500 [2:44:12<33:29:12, 60.34s/it] {'loss': 0.0002, 'grad_norm': 0.027144884760843247, 'learning_rate': 7.992e-07, 'completion_length': 154.14286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00518798828125, 'epoch': 0.2} + 20%|██ | 502/2500 [2:44:12<33:29:12, 60.34s/it] 20%|██ | 503/2500 [2:44:36<27:19:44, 49.27s/it] {'loss': 0.0002, 'grad_norm': 0.022106059853863734, 'learning_rate': 7.987999999999999e-07, 'completion_length': 174.08929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0056915283203125, 'epoch': 0.2} + 20%|██ | 503/2500 [2:44:36<27:19:44, 49.27s/it] 20%|██ | 504/2500 [2:44:57<22:38:42, 40.84s/it] {'loss': 0.0002, 'grad_norm': 0.36852785005690686, 'learning_rate': 7.984e-07, 'completion_length': 144.7857208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 
1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.00437164306640625, 'epoch': 0.2} + 20%|██ | 504/2500 [2:44:57<22:38:42, 40.84s/it] 20%|██ | 505/2500 [2:45:20<19:45:39, 35.66s/it] {'loss': 0.0003, 'grad_norm': 0.3498002926500047, 'learning_rate': 7.98e-07, 'completion_length': 179.58929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.006317138671875, 'epoch': 0.2} + 20%|██ | 505/2500 [2:45:20<19:45:39, 35.66s/it] 20%|██ | 506/2500 [2:45:41<17:19:13, 31.27s/it] {'loss': 0.0002, 'grad_norm': 0.019000470723484833, 'learning_rate': 7.975999999999999e-07, 'completion_length': 150.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00457000732421875, 'epoch': 0.2} + 20%|██ | 506/2500 [2:45:41<17:19:13, 31.27s/it] 20%|██ | 507/2500 [2:46:05<16:02:51, 28.99s/it] {'loss': 0.0002, 'grad_norm': 0.23012043042001876, 'learning_rate': 7.972e-07, 'completion_length': 167.17857360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0054779052734375, 'epoch': 0.2} + 20%|██ | 507/2500 [2:46:05<16:02:51, 28.99s/it] 20%|██ | 508/2500 [2:46:27<14:52:15, 26.88s/it] {'loss': 0.0002, 'grad_norm': 0.3256175778054415, 'learning_rate': 7.967999999999999e-07, 'completion_length': 152.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0061492919921875, 'epoch': 0.2} + 20%|██ | 508/2500 [2:46:27<14:52:15, 26.88s/it] 20%|██ | 509/2500 [2:46:50<14:13:38, 25.73s/it] {'loss': 0.0002, 'grad_norm': 0.017455611503975465, 'learning_rate': 7.964e-07, 'completion_length': 147.3928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.003997802734375, 'epoch': 0.2} + 20%|██ | 509/2500 [2:46:50<14:13:38, 25.73s/it] 20%|██ | 510/2500 [2:47:15<14:10:00, 25.63s/it] {'loss': 0.0002, 'grad_norm': 0.4087347085865097, 'learning_rate': 7.96e-07, 'completion_length': 161.8214340209961, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.005096435546875, 'epoch': 0.2} + 20%|██ | 510/2500 [2:47:15<14:10:00, 25.63s/it] 20%|██ | 511/2500 [2:47:38<13:36:05, 24.62s/it] {'loss': 0.0002, 'grad_norm': 0.38843300872064784, 'learning_rate': 7.956e-07, 'completion_length': 163.5178680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.0062408447265625, 'epoch': 0.2} + 20%|██ | 511/2500 [2:47:38<13:36:05, 24.62s/it] 20%|██ | 512/2500 [2:47:59<13:07:29, 23.77s/it] {'loss': 0.0002, 'grad_norm': 0.030967361427241755, 'learning_rate': 7.952e-07, 'completion_length': 162.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004974365234375, 'epoch': 0.2} + 20%|██ | 512/2500 [2:48:00<13:07:29, 23.77s/it] 21%|██ | 513/2500 [2:48:21<12:49:06, 23.22s/it] {'loss': 0.0002, 'grad_norm': 0.24975784212684785, 'learning_rate': 7.947999999999999e-07, 'completion_length': 162.17858123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0061492919921875, 'epoch': 0.21} + 21%|██ | 513/2500 [2:48:21<12:49:06, 23.22s/it] 21%|██ | 514/2500 [2:48:45<12:51:44, 23.32s/it] {'loss': 0.0003, 'grad_norm': 0.3854340231556536, 'learning_rate': 7.944e-07, 'completion_length': 162.23214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.007232666015625, 'epoch': 0.21} + 21%|██ | 
514/2500 [2:48:45<12:51:44, 23.32s/it] 21%|██ | 515/2500 [2:49:08<12:47:37, 23.20s/it] {'loss': 0.0002, 'grad_norm': 0.23973847468980777, 'learning_rate': 7.94e-07, 'completion_length': 156.98214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00446319580078125, 'epoch': 0.21} + 21%|██ | 515/2500 [2:49:08<12:47:37, 23.20s/it] 21%|██ | 516/2500 [2:49:33<13:03:28, 23.69s/it] {'loss': 0.0003, 'grad_norm': 0.6872354334962709, 'learning_rate': 7.935999999999999e-07, 'completion_length': 182.9107208251953, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.12371791899204254, 'kl': 0.0066986083984375, 'epoch': 0.21} + 21%|██ | 516/2500 [2:49:33<13:03:28, 23.69s/it] 21%|██ | 517/2500 [2:49:54<12:36:14, 22.88s/it] {'loss': 0.0002, 'grad_norm': 0.03675404059567345, 'learning_rate': 7.932e-07, 'completion_length': 143.42857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0043792724609375, 'epoch': 0.21} + 21%|██ | 517/2500 [2:49:54<12:36:14, 22.88s/it] 21%|██ | 518/2500 [2:50:16<12:32:51, 22.79s/it] {'loss': 0.0003, 'grad_norm': 0.4557583926104573, 'learning_rate': 7.928e-07, 'completion_length': 156.80357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0064239501953125, 'epoch': 0.21} + 21%|██ | 518/2500 [2:50:16<12:32:51, 22.79s/it] 21%|██ | 519/2500 [2:50:40<12:38:02, 22.96s/it] {'loss': 0.0002, 'grad_norm': 0.023975235634529985, 'learning_rate': 7.923999999999999e-07, 'completion_length': 167.1964340209961, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.005767822265625, 'epoch': 0.21} + 21%|██ | 519/2500 [2:50:40<12:38:02, 22.96s/it] 
21%|██ | 520/2500 [2:51:03<12:41:22, 23.07s/it] {'loss': 0.0002, 'grad_norm': 0.19493290762385418, 'learning_rate': 7.92e-07, 'completion_length': 144.0178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0047607421875, 'epoch': 0.21} + 21%|██ | 520/2500 [2:51:03<12:41:22, 23.07s/it] 21%|██ | 521/2500 [2:51:26<12:42:44, 23.13s/it] {'loss': 0.0003, 'grad_norm': 0.36101536791413824, 'learning_rate': 7.916e-07, 'completion_length': 176.58928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0073089599609375, 'epoch': 0.21} + 21%|██ | 521/2500 [2:51:26<12:42:44, 23.13s/it] 21%|██ | 522/2500 [2:51:48<12:29:49, 22.74s/it] {'loss': 0.0002, 'grad_norm': 0.2381580101488969, 'learning_rate': 7.911999999999999e-07, 'completion_length': 160.92858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0048828125, 'epoch': 0.21} + 21%|██ | 522/2500 [2:51:48<12:29:49, 22.74s/it] 21%|██ | 523/2500 [2:52:11<12:33:48, 22.88s/it] {'loss': 0.0002, 'grad_norm': 0.2086009876783063, 'learning_rate': 7.907999999999999e-07, 'completion_length': 169.7321548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00543212890625, 'epoch': 0.21} + 21%|██ | 523/2500 [2:52:11<12:33:48, 22.88s/it] 21%|██ | 524/2500 [2:52:34<12:34:18, 22.90s/it] {'loss': 0.0003, 'grad_norm': 0.034210957486348836, 'learning_rate': 7.904e-07, 'completion_length': 158.75, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00628662109375, 'epoch': 0.21} + 21%|██ | 524/2500 [2:52:34<12:34:18, 22.90s/it] 21%|██ | 525/2500 [2:52:57<12:35:16, 22.95s/it] 
{'loss': 0.0003, 'grad_norm': 0.6716136492595927, 'learning_rate': 7.9e-07, 'completion_length': 168.08929443359375, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.11266788095235825, 'kl': 0.0068817138671875, 'epoch': 0.21} + 21%|██ | 525/2500 [2:52:57<12:35:16, 22.95s/it] 21%|██ | 526/2500 [2:53:19<12:24:24, 22.63s/it] {'loss': 0.0002, 'grad_norm': 0.15973214304051814, 'learning_rate': 7.895999999999999e-07, 'completion_length': 161.2857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00606536865234375, 'epoch': 0.21} + 21%|██ | 526/2500 [2:53:19<12:24:24, 22.63s/it] 21%|██ | 527/2500 [2:53:43<12:34:34, 22.95s/it] {'loss': 0.0002, 'grad_norm': 0.24644947565945802, 'learning_rate': 7.892e-07, 'completion_length': 153.8571548461914, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0054931640625, 'epoch': 0.21} + 21%|██ | 527/2500 [2:53:43<12:34:34, 22.95s/it] 21%|██ | 528/2500 [2:54:05<12:26:57, 22.73s/it] {'loss': 0.0002, 'grad_norm': 0.3010006768156352, 'learning_rate': 7.887999999999999e-07, 'completion_length': 155.57144165039062, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005218505859375, 'epoch': 0.21} + 21%|██ | 528/2500 [2:54:05<12:26:57, 22.73s/it] 21%|██ | 529/2500 [2:54:28<12:31:29, 22.88s/it] {'loss': 0.0003, 'grad_norm': 0.46077202504762493, 'learning_rate': 7.883999999999999e-07, 'completion_length': 167.3928680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.0063629150390625, 'epoch': 0.21} + 21%|██ | 529/2500 [2:54:28<12:31:29, 22.88s/it] 21%|██ | 530/2500 
[2:54:51<12:26:10, 22.73s/it] {'loss': 0.0002, 'grad_norm': 0.38450226784659763, 'learning_rate': 7.88e-07, 'completion_length': 156.05358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.004852294921875, 'epoch': 0.21} + 21%|██ | 530/2500 [2:54:51<12:26:10, 22.73s/it] 21%|██ | 531/2500 [2:55:11<12:04:33, 22.08s/it] {'loss': 0.0001, 'grad_norm': 0.39155876223720426, 'learning_rate': 7.875999999999999e-07, 'completion_length': 138.57144165039062, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00347900390625, 'epoch': 0.21} + 21%|██ | 531/2500 [2:55:11<12:04:33, 22.08s/it] 21%|██▏ | 532/2500 [2:55:34<12:14:27, 22.39s/it] {'loss': 0.0003, 'grad_norm': 0.24077203903257421, 'learning_rate': 7.872e-07, 'completion_length': 159.17858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0080413818359375, 'epoch': 0.21} + 21%|██▏ | 532/2500 [2:55:34<12:14:27, 22.39s/it] 21%|██▏ | 533/2500 [2:55:57<12:20:45, 22.60s/it] {'loss': 0.0002, 'grad_norm': 0.3620835378674048, 'learning_rate': 7.868e-07, 'completion_length': 158.9464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00431060791015625, 'epoch': 0.21} + 21%|██▏ | 533/2500 [2:55:57<12:20:45, 22.60s/it] 21%|██▏ | 534/2500 [2:56:20<12:20:14, 22.59s/it] {'loss': 0.0002, 'grad_norm': 0.04364615996067358, 'learning_rate': 7.864e-07, 'completion_length': 152.58929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005889892578125, 'epoch': 0.21} + 21%|██▏ | 534/2500 [2:56:20<12:20:14, 22.59s/it] 21%|██▏ | 535/2500 [2:56:42<12:13:20, 22.39s/it] 
{'loss': 0.0003, 'grad_norm': 0.02600162553758705, 'learning_rate': 7.86e-07, 'completion_length': 155.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00775146484375, 'epoch': 0.21} + 21%|██▏ | 535/2500 [2:56:42<12:13:20, 22.39s/it] 21%|██▏ | 536/2500 [2:57:05<12:15:32, 22.47s/it] {'loss': 0.0002, 'grad_norm': 0.6979246169033904, 'learning_rate': 7.855999999999999e-07, 'completion_length': 171.1607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.00555419921875, 'epoch': 0.21} + 21%|██▏ | 536/2500 [2:57:05<12:15:32, 22.47s/it] 21%|██▏ | 537/2500 [2:57:28<12:19:20, 22.60s/it] {'loss': 0.0002, 'grad_norm': 0.14862473382289565, 'learning_rate': 7.852e-07, 'completion_length': 160.33928680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00482177734375, 'epoch': 0.21} + 21%|██▏ | 537/2500 [2:57:28<12:19:20, 22.60s/it] 22%|██▏ | 538/2500 [2:57:50<12:14:38, 22.47s/it] {'loss': 0.0002, 'grad_norm': 0.31515780117045705, 'learning_rate': 7.848e-07, 'completion_length': 148.3214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0057525634765625, 'epoch': 0.22} + 22%|██▏ | 538/2500 [2:57:50<12:14:38, 22.47s/it] 22%|██▏ | 539/2500 [2:58:10<11:57:01, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.8352856329448785, 'learning_rate': 7.844e-07, 'completion_length': 134.42857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00431060791015625, 'epoch': 0.22} + 22%|██▏ | 539/2500 [2:58:10<11:57:01, 21.94s/it] 22%|██▏ | 540/2500 [2:58:31<11:46:16, 21.62s/it] {'loss': 0.0002, 'grad_norm': 
0.5232029452529157, 'learning_rate': 7.84e-07, 'completion_length': 142.10714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0048370361328125, 'epoch': 0.22} + 22%|██▏ | 540/2500 [2:58:31<11:46:16, 21.62s/it] 22%|██▏ | 541/2500 [2:58:54<12:01:24, 22.10s/it] {'loss': 0.0002, 'grad_norm': 0.22883086995116933, 'learning_rate': 7.835999999999999e-07, 'completion_length': 165.6428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006195068359375, 'epoch': 0.22} + 22%|██▏ | 541/2500 [2:58:54<12:01:24, 22.10s/it] 22%|██▏ | 542/2500 [2:59:16<11:55:20, 21.92s/it] {'loss': 0.0002, 'grad_norm': 0.47299774546062945, 'learning_rate': 7.832e-07, 'completion_length': 143.50000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.14838216826319695, 'kl': 0.005889892578125, 'epoch': 0.22} + 22%|██▏ | 542/2500 [2:59:16<11:55:20, 21.92s/it] 22%|██▏ | 543/2500 [2:59:38<12:00:18, 22.08s/it] {'loss': 0.0002, 'grad_norm': 0.5996274842550605, 'learning_rate': 7.828e-07, 'completion_length': 153.8928680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005218505859375, 'epoch': 0.22} + 22%|██▏ | 543/2500 [2:59:38<12:00:18, 22.08s/it] 22%|██▏ | 544/2500 [3:00:01<12:08:15, 22.34s/it] {'loss': 0.0002, 'grad_norm': 0.03346634004156934, 'learning_rate': 7.823999999999999e-07, 'completion_length': 152.8214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0056915283203125, 'epoch': 0.22} + 22%|██▏ | 544/2500 [3:00:01<12:08:15, 22.34s/it] 22%|██▏ | 545/2500 [3:00:23<12:03:37, 22.21s/it] {'loss': 0.0002, 
'grad_norm': 0.07673819635548804, 'learning_rate': 7.82e-07, 'completion_length': 150.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0059814453125, 'epoch': 0.22} + 22%|██▏ | 545/2500 [3:00:23<12:03:37, 22.21s/it] 22%|██▏ | 546/2500 [3:00:48<12:31:31, 23.08s/it] {'loss': 0.0002, 'grad_norm': 0.41277160447928474, 'learning_rate': 7.816e-07, 'completion_length': 174.6428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0052032470703125, 'epoch': 0.22} + 22%|██▏ | 546/2500 [3:00:48<12:31:31, 23.08s/it] 22%|██▏ | 547/2500 [3:01:11<12:31:07, 23.08s/it] {'loss': 0.0002, 'grad_norm': 0.335087160214277, 'learning_rate': 7.811999999999999e-07, 'completion_length': 162.51786041259766, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0058135986328125, 'epoch': 0.22} + 22%|██▏ | 547/2500 [3:01:11<12:31:07, 23.08s/it] 22%|██▏ | 548/2500 [3:01:35<12:36:13, 23.24s/it] {'loss': 0.0002, 'grad_norm': 0.24603588287194272, 'learning_rate': 7.808e-07, 'completion_length': 168.2321548461914, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0059967041015625, 'epoch': 0.22} + 22%|██▏ | 548/2500 [3:01:35<12:36:13, 23.24s/it] 22%|██▏ | 549/2500 [3:01:58<12:34:29, 23.20s/it] {'loss': 0.0002, 'grad_norm': 0.5963079703048696, 'learning_rate': 7.804e-07, 'completion_length': 164.62500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0059661865234375, 'epoch': 0.22} + 22%|██▏ | 549/2500 [3:01:58<12:34:29, 23.20s/it] 22%|██▏ | 550/2500 [3:02:20<12:23:53, 22.89s/it] {'loss': 0.0002, 'grad_norm': 
0.016238957547961255, 'learning_rate': 7.799999999999999e-07, 'completion_length': 150.17858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00482177734375, 'epoch': 0.22} + 22%|██▏ | 550/2500 [3:02:20<12:23:53, 22.89s/it] 22%|██▏ | 551/2500 [3:02:43<12:20:52, 22.81s/it] {'loss': 0.0003, 'grad_norm': 0.820856863428055, 'learning_rate': 7.795999999999999e-07, 'completion_length': 172.3928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.006439208984375, 'epoch': 0.22} + 22%|██▏ | 551/2500 [3:02:43<12:20:52, 22.81s/it] 22%|██▏ | 552/2500 [3:03:05<12:11:54, 22.54s/it] {'loss': 0.0002, 'grad_norm': 0.17195537346312215, 'learning_rate': 7.792e-07, 'completion_length': 159.87500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0057830810546875, 'epoch': 0.22} + 22%|██▏ | 552/2500 [3:03:05<12:11:54, 22.54s/it] 22%|██▏ | 553/2500 [3:03:27<12:08:53, 22.46s/it] {'loss': 0.0002, 'grad_norm': 0.8774324901867598, 'learning_rate': 7.788000000000001e-07, 'completion_length': 152.87500762939453, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.18409645557403564, 'kl': 0.005615234375, 'epoch': 0.22} + 22%|██▏ | 553/2500 [3:03:27<12:08:53, 22.46s/it] 22%|██▏ | 554/2500 [3:03:49<12:01:40, 22.25s/it] {'loss': 0.0002, 'grad_norm': 0.18218617544557827, 'learning_rate': 7.783999999999999e-07, 'completion_length': 169.98214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0056610107421875, 'epoch': 0.22} + 22%|██▏ | 554/2500 [3:03:49<12:01:40, 22.25s/it] 22%|██▏ | 555/2500 [3:04:11<12:00:37, 22.23s/it] {'loss': 0.0002, 
'grad_norm': 0.37667996853797187, 'learning_rate': 7.78e-07, 'completion_length': 167.6607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.00555419921875, 'epoch': 0.22} + 22%|██▏ | 555/2500 [3:04:11<12:00:37, 22.23s/it] 22%|██▏ | 556/2500 [3:04:34<12:03:52, 22.34s/it] {'loss': 0.0003, 'grad_norm': 0.2603038939939058, 'learning_rate': 7.776e-07, 'completion_length': 177.35714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.008209228515625, 'epoch': 0.22} + 22%|██▏ | 556/2500 [3:04:34<12:03:52, 22.34s/it] 22%|██▏ | 557/2500 [3:04:57<12:16:25, 22.74s/it] {'loss': 0.0003, 'grad_norm': 0.8605380566120727, 'learning_rate': 7.771999999999999e-07, 'completion_length': 166.32144165039062, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.1539071798324585, 'kl': 0.008056640625, 'epoch': 0.22} + 22%|██▏ | 557/2500 [3:04:57<12:16:25, 22.74s/it] 22%|██▏ | 558/2500 [3:05:20<12:11:37, 22.60s/it] {'loss': 0.0002, 'grad_norm': 0.022945814370089627, 'learning_rate': 7.768e-07, 'completion_length': 158.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0059661865234375, 'epoch': 0.22} + 22%|██▏ | 558/2500 [3:05:20<12:11:37, 22.60s/it] 22%|██▏ | 559/2500 [3:05:43<12:13:26, 22.67s/it] {'loss': 0.0002, 'grad_norm': 0.023869710269212807, 'learning_rate': 7.764e-07, 'completion_length': 159.0357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005096435546875, 'epoch': 0.22} + 22%|██▏ | 559/2500 [3:05:43<12:13:26, 22.67s/it] 22%|██▏ | 560/2500 [3:06:05<12:10:21, 22.59s/it] {'loss': 0.0003, 'grad_norm': 0.6105620859207698, 'learning_rate': 7.76e-07, 'completion_length': 
151.1607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0066986083984375, 'epoch': 0.22} + 22%|██▏ | 560/2500 [3:06:05<12:10:21, 22.59s/it] 22%|██▏ | 561/2500 [3:06:27<12:03:26, 22.39s/it] {'loss': 0.0002, 'grad_norm': 0.38227366694223736, 'learning_rate': 7.755999999999999e-07, 'completion_length': 140.10714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0052032470703125, 'epoch': 0.22} + 22%|██▏ | 561/2500 [3:06:27<12:03:26, 22.39s/it] 22%|██▏ | 562/2500 [3:06:50<12:09:24, 22.58s/it] {'loss': 0.0003, 'grad_norm': 0.34507534333850554, 'learning_rate': 7.752e-07, 'completion_length': 166.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00738525390625, 'epoch': 0.22} + 22%|██▏ | 562/2500 [3:06:50<12:09:24, 22.58s/it] 23%|██▎ | 563/2500 [3:07:13<12:14:29, 22.75s/it] {'loss': 0.0002, 'grad_norm': 0.40879924477549867, 'learning_rate': 7.748e-07, 'completion_length': 152.8214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0058441162109375, 'epoch': 0.23} + 23%|██▎ | 563/2500 [3:07:13<12:14:29, 22.75s/it] 23%|██▎ | 564/2500 [3:07:35<12:09:17, 22.60s/it] {'loss': 0.0003, 'grad_norm': 0.0385168665274374, 'learning_rate': 7.743999999999999e-07, 'completion_length': 155.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0073394775390625, 'epoch': 0.23} + 23%|██▎ | 564/2500 [3:07:35<12:09:17, 22.60s/it] 23%|██▎ | 565/2500 [3:07:57<11:58:55, 22.29s/it] {'loss': 0.0003, 'grad_norm': 0.8641843361605883, 'learning_rate': 7.74e-07, 'completion_length': 
157.0714340209961, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1181928962469101, 'kl': 0.00860595703125, 'epoch': 0.23} + 23%|██▎ | 565/2500 [3:07:57<11:58:55, 22.29s/it] 23%|██▎ | 566/2500 [3:08:18<11:45:17, 21.88s/it] {'loss': 0.0002, 'grad_norm': 0.2476786305087356, 'learning_rate': 7.735999999999999e-07, 'completion_length': 156.39286422729492, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00598907470703125, 'epoch': 0.23} + 23%|██▎ | 566/2500 [3:08:18<11:45:17, 21.88s/it] 23%|██▎ | 567/2500 [3:08:40<11:50:07, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.06992456719019198, 'learning_rate': 7.732e-07, 'completion_length': 168.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052947998046875, 'epoch': 0.23} + 23%|██▎ | 567/2500 [3:08:40<11:50:07, 22.04s/it] 23%|██▎ | 568/2500 [3:09:02<11:44:55, 21.89s/it] {'loss': 0.0003, 'grad_norm': 0.5802379640188604, 'learning_rate': 7.728e-07, 'completion_length': 153.83928680419922, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1181928999722004, 'kl': 0.007354736328125, 'epoch': 0.23} + 23%|██▎ | 568/2500 [3:09:02<11:44:55, 21.89s/it] 23%|██▎ | 569/2500 [3:09:25<11:58:58, 22.34s/it] {'loss': 0.0002, 'grad_norm': 0.18474675646695984, 'learning_rate': 7.723999999999999e-07, 'completion_length': 165.5714340209961, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0060882568359375, 'epoch': 0.23} + 23%|██▎ | 569/2500 [3:09:25<11:58:58, 22.34s/it] 23%|██▎ | 570/2500 [3:09:47<11:55:16, 22.24s/it] {'loss': 0.0003, 'grad_norm': 0.6512321045590969, 'learning_rate': 7.72e-07, 'completion_length': 
159.6071548461914, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266788095235825, 'kl': 0.0062713623046875, 'epoch': 0.23} + 23%|██▎ | 570/2500 [3:09:47<11:55:16, 22.24s/it] 23%|██▎ | 571/2500 [3:10:10<12:03:07, 22.49s/it] {'loss': 0.0003, 'grad_norm': 0.04230059481808382, 'learning_rate': 7.716e-07, 'completion_length': 160.9464340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0067138671875, 'epoch': 0.23} + 23%|██▎ | 571/2500 [3:10:10<12:03:07, 22.49s/it] 23%|██▎ | 572/2500 [3:10:32<11:55:47, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.8962924086344175, 'learning_rate': 7.711999999999999e-07, 'completion_length': 159.33929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050811767578125, 'epoch': 0.23} + 23%|██▎ | 572/2500 [3:10:32<11:55:47, 22.28s/it] 23%|██▎ | 573/2500 [3:10:54<11:54:02, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.7282302522002283, 'learning_rate': 7.708e-07, 'completion_length': 163.3214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0050201416015625, 'epoch': 0.23} + 23%|██▎ | 573/2500 [3:10:54<11:54:02, 22.23s/it] 23%|██▎ | 574/2500 [3:11:16<11:49:29, 22.10s/it] {'loss': 0.0002, 'grad_norm': 0.34582572926726834, 'learning_rate': 7.704e-07, 'completion_length': 153.00000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050201416015625, 'epoch': 0.23} + 23%|██▎ | 574/2500 [3:11:16<11:49:29, 22.10s/it] 23%|██▎ | 575/2500 [3:11:38<11:52:36, 22.21s/it] {'loss': 0.0003, 'grad_norm': 0.3540462005980446, 'learning_rate': 7.699999999999999e-07, 
'completion_length': 163.75000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0075225830078125, 'epoch': 0.23} + 23%|██▎ | 575/2500 [3:11:38<11:52:36, 22.21s/it] 23%|██▎ | 576/2500 [3:12:01<11:54:41, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.8522920987797311, 'learning_rate': 7.695999999999999e-07, 'completion_length': 151.80357360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.14838216826319695, 'kl': 0.005340576171875, 'epoch': 0.23} + 23%|██▎ | 576/2500 [3:12:01<11:54:41, 22.29s/it] 23%|██▎ | 577/2500 [3:12:23<11:53:05, 22.25s/it] {'loss': 0.0002, 'grad_norm': 0.04732567826386273, 'learning_rate': 7.692e-07, 'completion_length': 153.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005645751953125, 'epoch': 0.23} + 23%|██▎ | 577/2500 [3:12:23<11:53:05, 22.25s/it] 23%|██▎ | 578/2500 [3:12:45<11:52:25, 22.24s/it] {'loss': 0.0003, 'grad_norm': 0.207371032311609, 'learning_rate': 7.688000000000001e-07, 'completion_length': 153.7857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0069427490234375, 'epoch': 0.23} + 23%|██▎ | 578/2500 [3:12:45<11:52:25, 22.24s/it] 23%|██▎ | 579/2500 [3:13:07<11:47:43, 22.10s/it] {'loss': 0.0002, 'grad_norm': 0.7971673692266251, 'learning_rate': 7.683999999999999e-07, 'completion_length': 157.1428680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0059814453125, 'epoch': 0.23} + 23%|██▎ | 579/2500 [3:13:07<11:47:43, 22.10s/it] 23%|██▎ | 580/2500 [3:13:30<11:55:51, 22.37s/it] {'loss': 0.0002, 'grad_norm': 1.1558959633665438, 'learning_rate': 7.68e-07, 
'completion_length': 169.55358123779297, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.11266788095235825, 'kl': 0.005645751953125, 'epoch': 0.23} + 23%|██▎ | 580/2500 [3:13:30<11:55:51, 22.37s/it] 23%|██▎ | 581/2500 [3:13:52<11:48:27, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.2033244322483886, 'learning_rate': 7.676e-07, 'completion_length': 161.50000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0052032470703125, 'epoch': 0.23} + 23%|██▎ | 581/2500 [3:13:52<11:48:27, 22.15s/it] 23%|██▎ | 582/2500 [3:14:14<11:51:25, 22.26s/it] {'loss': 0.0002, 'grad_norm': 0.4067764933584922, 'learning_rate': 7.671999999999999e-07, 'completion_length': 149.30358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005096435546875, 'epoch': 0.23} + 23%|██▎ | 582/2500 [3:14:14<11:51:25, 22.26s/it] 23%|██▎ | 583/2500 [3:14:37<11:57:19, 22.45s/it] {'loss': 0.0002, 'grad_norm': 0.03671736637889763, 'learning_rate': 7.668e-07, 'completion_length': 159.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056610107421875, 'epoch': 0.23} + 23%|██▎ | 583/2500 [3:14:37<11:57:19, 22.45s/it] 23%|██▎ | 584/2500 [3:14:59<11:48:41, 22.19s/it] {'loss': 0.0003, 'grad_norm': 0.5675046149994348, 'learning_rate': 7.664e-07, 'completion_length': 151.0178680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.00653076171875, 'epoch': 0.23} + 23%|██▎ | 584/2500 [3:14:59<11:48:41, 22.19s/it] 23%|██▎ | 585/2500 [3:15:20<11:43:08, 22.03s/it] {'loss': 0.0002, 'grad_norm': 0.23725129376573162, 'learning_rate': 7.66e-07, 'completion_length': 
156.6607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059356689453125, 'epoch': 0.23} + 23%|██▎ | 585/2500 [3:15:20<11:43:08, 22.03s/it] 23%|██▎ | 586/2500 [3:15:43<11:48:37, 22.21s/it] {'loss': 0.0003, 'grad_norm': 0.37330603179891153, 'learning_rate': 7.655999999999999e-07, 'completion_length': 165.58929443359375, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1181928999722004, 'kl': 0.0074005126953125, 'epoch': 0.23} + 23%|██▎ | 586/2500 [3:15:43<11:48:37, 22.21s/it] 23%|██▎ | 587/2500 [3:16:05<11:51:26, 22.31s/it] {'loss': 0.0004, 'grad_norm': 1.3895020490736618, 'learning_rate': 7.652e-07, 'completion_length': 161.12500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0089263916015625, 'epoch': 0.23} + 23%|██▎ | 587/2500 [3:16:05<11:51:26, 22.31s/it] 24%|██▎ | 588/2500 [3:16:27<11:48:25, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.293591957930923, 'learning_rate': 7.648e-07, 'completion_length': 154.87500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00506591796875, 'epoch': 0.24} + 24%|██▎ | 588/2500 [3:16:27<11:48:25, 22.23s/it] 24%|██▎ | 589/2500 [3:16:49<11:45:23, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.1533570425655581, 'learning_rate': 7.643999999999999e-07, 'completion_length': 153.89286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0052947998046875, 'epoch': 0.24} + 24%|██▎ | 589/2500 [3:16:49<11:45:23, 22.15s/it] 24%|██▎ | 590/2500 [3:17:11<11:36:10, 21.87s/it] {'loss': 0.0002, 'grad_norm': 0.02492736189958532, 'learning_rate': 
7.64e-07, 'completion_length': 146.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0044708251953125, 'epoch': 0.24} + 24%|██▎ | 590/2500 [3:17:11<11:36:10, 21.87s/it] 24%|██▎ | 591/2500 [3:17:34<11:46:15, 22.20s/it] {'loss': 0.0003, 'grad_norm': 0.5277625505223524, 'learning_rate': 7.635999999999999e-07, 'completion_length': 165.00000762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1428571492433548, 'kl': 0.0068359375, 'epoch': 0.24} + 24%|██▎ | 591/2500 [3:17:34<11:46:15, 22.20s/it] 24%|██▎ | 592/2500 [3:17:56<11:48:04, 22.27s/it] {'loss': 0.0002, 'grad_norm': 0.30703285121905316, 'learning_rate': 7.632e-07, 'completion_length': 153.87500762939453, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0062103271484375, 'epoch': 0.24} + 24%|██▎ | 592/2500 [3:17:56<11:48:04, 22.27s/it] 24%|██▎ | 593/2500 [3:18:18<11:44:41, 22.17s/it] {'loss': 0.0003, 'grad_norm': 0.3130208259104836, 'learning_rate': 7.628e-07, 'completion_length': 153.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0068511962890625, 'epoch': 0.24} + 24%|██▎ | 593/2500 [3:18:18<11:44:41, 22.17s/it] 24%|██▍ | 594/2500 [3:18:41<11:50:05, 22.35s/it] {'loss': 0.0002, 'grad_norm': 0.12808429180060021, 'learning_rate': 7.623999999999999e-07, 'completion_length': 161.2857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0055999755859375, 'epoch': 0.24} + 24%|██▍ | 594/2500 [3:18:41<11:50:05, 22.35s/it] 24%|██▍ | 595/2500 [3:19:04<11:56:28, 22.57s/it] {'loss': 0.0003, 'grad_norm': 0.4442461428739521, 'learning_rate': 7.62e-07, 
'completion_length': 161.17857360839844, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035714626312256, 'reward_std': 0.14838216453790665, 'kl': 0.007110595703125, 'epoch': 0.24} + 24%|██▍ | 595/2500 [3:19:04<11:56:28, 22.57s/it] 24%|██▍ | 596/2500 [3:19:25<11:44:10, 22.19s/it] {'loss': 0.0002, 'grad_norm': 0.020128245978067238, 'learning_rate': 7.616e-07, 'completion_length': 136.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045318603515625, 'epoch': 0.24} + 24%|██▍ | 596/2500 [3:19:25<11:44:10, 22.19s/it] 24%|██▍ | 597/2500 [3:19:49<12:04:00, 22.83s/it] {'loss': 0.0003, 'grad_norm': 0.38833871120546315, 'learning_rate': 7.611999999999999e-07, 'completion_length': 164.5357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0071563720703125, 'epoch': 0.24} + 24%|██▍ | 597/2500 [3:19:49<12:04:00, 22.83s/it] 24%|██▍ | 598/2500 [3:20:12<11:58:54, 22.68s/it] {'loss': 0.0003, 'grad_norm': 0.23074829161317445, 'learning_rate': 7.608e-07, 'completion_length': 156.10714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0070953369140625, 'epoch': 0.24} + 24%|██▍ | 598/2500 [3:20:12<11:58:54, 22.68s/it] 24%|██▍ | 599/2500 [3:20:35<12:00:57, 22.76s/it] {'loss': 0.0002, 'grad_norm': 0.4071323848093758, 'learning_rate': 7.604e-07, 'completion_length': 161.9107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0058135986328125, 'epoch': 0.24} + 24%|██▍ | 599/2500 [3:20:35<12:00:57, 22.76s/it] 24%|██▍ | 600/2500 [3:20:56<11:49:41, 22.41s/it] {'loss': 0.0003, 'grad_norm': 0.026288071264582803, 'learning_rate': 7.599999999999999e-07, 
'completion_length': 156.875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0076446533203125, 'epoch': 0.24} + 24%|██▍ | 600/2500 [3:20:56<11:49:41, 22.41s/it] 24%|██▍ | 601/2500 [3:24:29<41:53:24, 79.41s/it] {'loss': 0.0002, 'grad_norm': 0.1811402405885257, 'learning_rate': 7.596e-07, 'completion_length': 175.92858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0058441162109375, 'epoch': 0.24} + 24%|██▍ | 601/2500 [3:24:29<41:53:24, 79.41s/it] 24%|██▍ | 602/2500 [3:24:50<32:40:45, 61.98s/it] {'loss': 0.0003, 'grad_norm': 0.5930443693469609, 'learning_rate': 7.592e-07, 'completion_length': 158.23214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0068206787109375, 'epoch': 0.24} + 24%|██▍ | 602/2500 [3:24:50<32:40:45, 61.98s/it] 24%|██▍ | 603/2500 [3:25:11<26:08:41, 49.62s/it] {'loss': 0.0002, 'grad_norm': 0.19449768887130478, 'learning_rate': 7.588e-07, 'completion_length': 145.39286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00508880615234375, 'epoch': 0.24} + 24%|██▍ | 603/2500 [3:25:11<26:08:41, 49.62s/it] 24%|██▍ | 604/2500 [3:25:32<21:42:53, 41.23s/it] {'loss': 0.0002, 'grad_norm': 0.7125650292389722, 'learning_rate': 7.583999999999999e-07, 'completion_length': 141.96429061889648, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00548553466796875, 'epoch': 0.24} + 24%|██▍ | 604/2500 [3:25:33<21:42:53, 41.23s/it] 24%|██▍ | 605/2500 [3:25:54<18:35:49, 35.33s/it] {'loss': 0.0003, 'grad_norm': 0.35237398950888565, 'learning_rate': 7.58e-07, 'completion_length': 
152.21429443359375, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00732421875, 'epoch': 0.24} + 24%|██▍ | 605/2500 [3:25:54<18:35:49, 35.33s/it] 24%|██▍ | 606/2500 [3:26:19<16:59:57, 32.31s/it] {'loss': 0.0003, 'grad_norm': 0.21787298406569505, 'learning_rate': 7.576000000000001e-07, 'completion_length': 179.16072845458984, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9464285969734192, 'reward_std': 0.06838765740394592, 'kl': 0.00738525390625, 'epoch': 0.24} + 24%|██▍ | 606/2500 [3:26:19<16:59:57, 32.31s/it] 24%|██▍ | 607/2500 [3:26:39<15:01:12, 28.56s/it] {'loss': 0.0001, 'grad_norm': 0.01627061334043516, 'learning_rate': 7.571999999999999e-07, 'completion_length': 125.26786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003173828125, 'epoch': 0.24} + 24%|██▍ | 607/2500 [3:26:39<15:01:12, 28.56s/it] 24%|██▍ | 608/2500 [3:27:02<14:02:26, 26.72s/it] {'loss': 0.0003, 'grad_norm': 0.3806363701573427, 'learning_rate': 7.568e-07, 'completion_length': 162.5714340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.00677490234375, 'epoch': 0.24} + 24%|██▍ | 608/2500 [3:27:02<14:02:26, 26.72s/it] 24%|██▍ | 609/2500 [3:27:23<13:12:26, 25.14s/it] {'loss': 0.0004, 'grad_norm': 0.09838718507117021, 'learning_rate': 7.564e-07, 'completion_length': 159.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00958251953125, 'epoch': 0.24} + 24%|██▍ | 609/2500 [3:27:23<13:12:26, 25.14s/it] 24%|██▍ | 610/2500 [3:27:45<12:41:59, 24.19s/it] {'loss': 0.0002, 'grad_norm': 0.39591465167311335, 'learning_rate': 7.559999999999999e-07, 'completion_length': 148.9107208251953, 'rewards/accuracy_reward': 
0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005035400390625, 'epoch': 0.24} + 24%|██▍ | 610/2500 [3:27:45<12:41:59, 24.19s/it] 24%|██▍ | 611/2500 [3:28:07<12:18:33, 23.46s/it] {'loss': 0.0002, 'grad_norm': 0.34059219316749495, 'learning_rate': 7.556e-07, 'completion_length': 146.1964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0059967041015625, 'epoch': 0.24} + 24%|██▍ | 611/2500 [3:28:07<12:18:33, 23.46s/it] 24%|██▍ | 612/2500 [3:28:28<12:00:07, 22.89s/it] {'loss': 0.0003, 'grad_norm': 1.1696516778054267, 'learning_rate': 7.552e-07, 'completion_length': 153.1428680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.006988525390625, 'epoch': 0.24} + 24%|██▍ | 612/2500 [3:28:28<12:00:07, 22.89s/it] 25%|██▍ | 613/2500 [3:28:50<11:49:53, 22.57s/it] {'loss': 0.0002, 'grad_norm': 0.408638396498815, 'learning_rate': 7.548e-07, 'completion_length': 156.98214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00604248046875, 'epoch': 0.25} + 25%|██▍ | 613/2500 [3:28:50<11:49:53, 22.57s/it] 25%|██▍ | 614/2500 [3:29:12<11:46:39, 22.48s/it] {'loss': 0.0002, 'grad_norm': 4.9716137129741265, 'learning_rate': 7.543999999999999e-07, 'completion_length': 157.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00543212890625, 'epoch': 0.25} + 25%|██▍ | 614/2500 [3:29:12<11:46:39, 22.48s/it] 25%|██▍ | 615/2500 [3:29:36<11:57:21, 22.83s/it] {'loss': 0.0003, 'grad_norm': 0.5178598671104943, 'learning_rate': 7.54e-07, 'completion_length': 164.57144165039062, 
'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0068359375, 'epoch': 0.25} + 25%|██▍ | 615/2500 [3:29:36<11:57:21, 22.83s/it] 25%|██▍ | 616/2500 [3:29:59<11:55:30, 22.79s/it] {'loss': 0.0003, 'grad_norm': 0.5911909693251572, 'learning_rate': 7.536e-07, 'completion_length': 165.33929443359375, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.14838215708732605, 'kl': 0.0073394775390625, 'epoch': 0.25} + 25%|██▍ | 616/2500 [3:29:59<11:55:30, 22.79s/it] 25%|██▍ | 617/2500 [3:30:21<11:49:45, 22.62s/it] {'loss': 0.0002, 'grad_norm': 0.5021403559519088, 'learning_rate': 7.531999999999999e-07, 'completion_length': 155.60714721679688, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266788095235825, 'kl': 0.005950927734375, 'epoch': 0.25} + 25%|██▍ | 617/2500 [3:30:21<11:49:45, 22.62s/it] 25%|██▍ | 618/2500 [3:30:45<12:01:42, 23.01s/it] {'loss': 0.0004, 'grad_norm': 0.5308023756934204, 'learning_rate': 7.528e-07, 'completion_length': 176.4821548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.009185791015625, 'epoch': 0.25} + 25%|██▍ | 618/2500 [3:30:45<12:01:42, 23.01s/it] 25%|██▍ | 619/2500 [3:31:06<11:42:40, 22.41s/it] {'loss': 0.0002, 'grad_norm': 1.190506401207762, 'learning_rate': 7.523999999999999e-07, 'completion_length': 149.4107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.006195068359375, 'epoch': 0.25} + 25%|██▍ | 619/2500 [3:31:06<11:42:40, 22.41s/it] 25%|██▍ | 620/2500 [3:31:28<11:36:30, 22.23s/it] {'loss': 0.0003, 'grad_norm': 0.36089751579948964, 'learning_rate': 7.52e-07, 'completion_length': 
156.60714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0085601806640625, 'epoch': 0.25} + 25%|██▍ | 620/2500 [3:31:28<11:36:30, 22.23s/it] 25%|██▍ | 621/2500 [3:31:51<11:41:36, 22.40s/it] {'loss': 0.0003, 'grad_norm': 1.011269793767679, 'learning_rate': 7.516e-07, 'completion_length': 168.6428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0072479248046875, 'epoch': 0.25} + 25%|██▍ | 621/2500 [3:31:51<11:41:36, 22.40s/it] 25%|██▍ | 622/2500 [3:32:12<11:28:35, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.15329384732173576, 'learning_rate': 7.511999999999999e-07, 'completion_length': 140.21428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005828857421875, 'epoch': 0.25} + 25%|██▍ | 622/2500 [3:32:12<11:28:35, 22.00s/it] 25%|██▍ | 623/2500 [3:32:32<11:15:31, 21.59s/it] {'loss': 0.0003, 'grad_norm': 0.8955998760767014, 'learning_rate': 7.508e-07, 'completion_length': 148.87500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.006683349609375, 'epoch': 0.25} + 25%|██▍ | 623/2500 [3:32:32<11:15:31, 21.59s/it] 25%|██▍ | 624/2500 [3:32:55<11:22:58, 21.84s/it] {'loss': 0.0002, 'grad_norm': 0.23317705669705877, 'learning_rate': 7.503999999999999e-07, 'completion_length': 154.23214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005340576171875, 'epoch': 0.25} + 25%|██▍ | 624/2500 [3:32:55<11:22:58, 21.84s/it] 25%|██▌ | 625/2500 [3:33:16<11:18:10, 21.70s/it] {'loss': 0.0002, 'grad_norm': 0.350985390133444, 'learning_rate': 
7.5e-07, 'completion_length': 148.75000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00567626953125, 'epoch': 0.25} + 25%|██▌ | 625/2500 [3:33:16<11:18:10, 21.70s/it] 25%|██▌ | 626/2500 [3:34:13<16:48:33, 32.29s/it] {'loss': 0.0003, 'grad_norm': 0.43382400183448444, 'learning_rate': 7.496e-07, 'completion_length': 150.9464340209961, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.007110595703125, 'epoch': 0.25} + 25%|██▌ | 626/2500 [3:34:13<16:48:33, 32.29s/it] 25%|██▌ | 627/2500 [3:34:34<15:06:35, 29.04s/it] {'loss': 0.0003, 'grad_norm': 0.28836839694758765, 'learning_rate': 7.492e-07, 'completion_length': 153.17858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007232666015625, 'epoch': 0.25} + 25%|██▌ | 627/2500 [3:34:34<15:06:35, 29.04s/it] 25%|██▌ | 628/2500 [3:34:56<13:59:53, 26.92s/it] {'loss': 0.0002, 'grad_norm': 0.3639591722427344, 'learning_rate': 7.488e-07, 'completion_length': 143.10714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.005767822265625, 'epoch': 0.25} + 25%|██▌ | 628/2500 [3:34:56<13:59:53, 26.92s/it] 25%|██▌ | 629/2500 [3:35:18<13:12:46, 25.42s/it] {'loss': 0.0003, 'grad_norm': 2.9935365956495374, 'learning_rate': 7.483999999999999e-07, 'completion_length': 158.5357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0087432861328125, 'epoch': 0.25} + 25%|██▌ | 629/2500 [3:35:18<13:12:46, 25.42s/it] 25%|██▌ | 630/2500 [3:35:40<12:37:06, 24.29s/it] {'loss': 0.0002, 'grad_norm': 0.02266844144613168, 
'learning_rate': 7.48e-07, 'completion_length': 143.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053253173828125, 'epoch': 0.25} + 25%|██▌ | 630/2500 [3:35:40<12:37:06, 24.29s/it] 25%|██▌ | 631/2500 [3:36:02<12:13:07, 23.54s/it] {'loss': 0.0003, 'grad_norm': 1.2220700317757307, 'learning_rate': 7.476e-07, 'completion_length': 159.6071548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0080718994140625, 'epoch': 0.25} + 25%|██▌ | 631/2500 [3:36:02<12:13:07, 23.54s/it] 25%|██▌ | 632/2500 [3:36:24<12:02:04, 23.19s/it] {'loss': 0.0003, 'grad_norm': 0.044398294450283295, 'learning_rate': 7.471999999999999e-07, 'completion_length': 166.07144165039062, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0077972412109375, 'epoch': 0.25} + 25%|██▌ | 632/2500 [3:36:24<12:02:04, 23.19s/it] 25%|██▌ | 633/2500 [3:36:46<11:50:47, 22.84s/it] {'loss': 0.0003, 'grad_norm': 0.31611484246760985, 'learning_rate': 7.468e-07, 'completion_length': 157.8214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.007080078125, 'epoch': 0.25} + 25%|██▌ | 633/2500 [3:36:46<11:50:47, 22.84s/it] 25%|██▌ | 634/2500 [3:37:07<11:32:56, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.032191455833627355, 'learning_rate': 7.464e-07, 'completion_length': 145.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004608154296875, 'epoch': 0.25} + 25%|██▌ | 634/2500 [3:37:07<11:32:56, 22.28s/it] 25%|██▌ | 635/2500 [3:37:29<11:31:08, 22.24s/it] {'loss': 0.0004, 'grad_norm': 0.5707457522179372, 'learning_rate': 7.459999999999999e-07, 'completion_length': 171.50000762939453, 
'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.14838216453790665, 'kl': 0.009124755859375, 'epoch': 0.25} + 25%|██▌ | 635/2500 [3:37:29<11:31:08, 22.24s/it] 25%|██▌ | 636/2500 [3:37:52<11:31:49, 22.27s/it] {'loss': 0.0003, 'grad_norm': 0.25729206939700916, 'learning_rate': 7.456e-07, 'completion_length': 155.21428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007904052734375, 'epoch': 0.25} + 25%|██▌ | 636/2500 [3:37:52<11:31:49, 22.27s/it] 25%|██▌ | 637/2500 [3:38:14<11:29:32, 22.21s/it] {'loss': 0.0003, 'grad_norm': 0.3649435304488047, 'learning_rate': 7.452e-07, 'completion_length': 164.28572845458984, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0082855224609375, 'epoch': 0.25} + 25%|██▌ | 637/2500 [3:38:14<11:29:32, 22.21s/it] 26%|██▌ | 638/2500 [3:38:35<11:23:53, 22.04s/it] {'loss': 0.0003, 'grad_norm': 0.7816999549663172, 'learning_rate': 7.447999999999999e-07, 'completion_length': 148.96429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0075531005859375, 'epoch': 0.26} + 26%|██▌ | 638/2500 [3:38:35<11:23:53, 22.04s/it] 26%|██▌ | 639/2500 [3:38:58<11:26:48, 22.14s/it] {'loss': 0.0004, 'grad_norm': 0.030051319329995354, 'learning_rate': 7.443999999999999e-07, 'completion_length': 167.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.010009765625, 'epoch': 0.26} + 26%|██▌ | 639/2500 [3:38:58<11:26:48, 22.14s/it] 26%|██▌ | 640/2500 [3:39:21<11:33:35, 22.37s/it] {'loss': 0.0004, 'grad_norm': 0.1767433225912517, 'learning_rate': 7.44e-07, 'completion_length': 172.5178680419922, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00885009765625, 'epoch': 0.26} + 26%|██▌ | 640/2500 [3:39:21<11:33:35, 22.37s/it] 26%|██▌ | 641/2500 [3:39:41<11:16:59, 21.85s/it] {'loss': 0.0002, 'grad_norm': 0.028516558703851235, 'learning_rate': 7.436e-07, 'completion_length': 135.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058135986328125, 'epoch': 0.26} + 26%|██▌ | 641/2500 [3:39:41<11:16:59, 21.85s/it] 26%|██▌ | 642/2500 [3:40:04<11:26:21, 22.16s/it] {'loss': 0.0003, 'grad_norm': 0.02328455360177959, 'learning_rate': 7.431999999999999e-07, 'completion_length': 160.42858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0068206787109375, 'epoch': 0.26} + 26%|██▌ | 642/2500 [3:40:04<11:26:21, 22.16s/it] 26%|██▌ | 643/2500 [3:40:25<11:14:55, 21.81s/it] {'loss': 0.0003, 'grad_norm': 0.023859124648131527, 'learning_rate': 7.428e-07, 'completion_length': 148.75, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0063018798828125, 'epoch': 0.26} + 26%|██▌ | 643/2500 [3:40:25<11:14:55, 21.81s/it] 26%|██▌ | 644/2500 [3:40:47<11:16:13, 21.86s/it] {'loss': 0.0004, 'grad_norm': 0.7420898207677187, 'learning_rate': 7.423999999999999e-07, 'completion_length': 149.4464340209961, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.011077880859375, 'epoch': 0.26} + 26%|██▌ | 644/2500 [3:40:47<11:16:13, 21.86s/it] 26%|██▌ | 645/2500 [3:41:09<11:15:44, 21.86s/it] {'loss': 0.0003, 'grad_norm': 0.3516211779094942, 'learning_rate': 7.42e-07, 'completion_length': 151.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 
'reward_std': 0.0357142873108387, 'kl': 0.0065765380859375, 'epoch': 0.26} + 26%|██▌ | 645/2500 [3:41:09<11:15:44, 21.86s/it] 26%|██▌ | 646/2500 [3:41:30<11:09:06, 21.65s/it] {'loss': 0.0004, 'grad_norm': 0.026189878203113356, 'learning_rate': 7.416e-07, 'completion_length': 147.1428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0107574462890625, 'epoch': 0.26} + 26%|██▌ | 646/2500 [3:41:30<11:09:06, 21.65s/it] 26%|██▌ | 647/2500 [3:41:51<11:04:35, 21.52s/it] {'loss': 0.0004, 'grad_norm': 0.32124758131385006, 'learning_rate': 7.411999999999999e-07, 'completion_length': 150.42858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0088958740234375, 'epoch': 0.26} + 26%|██▌ | 647/2500 [3:41:51<11:04:35, 21.52s/it] 26%|██▌ | 648/2500 [3:42:15<11:22:24, 22.11s/it] {'loss': 0.0004, 'grad_norm': 0.7418901666139829, 'learning_rate': 7.408e-07, 'completion_length': 171.89286041259766, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.00909423828125, 'epoch': 0.26} + 26%|██▌ | 648/2500 [3:42:15<11:22:24, 22.11s/it] 26%|██▌ | 649/2500 [3:42:36<11:17:07, 21.95s/it] {'loss': 0.0004, 'grad_norm': 0.6771168275182607, 'learning_rate': 7.403999999999999e-07, 'completion_length': 156.82144165039062, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.04123930633068085, 'kl': 0.0093536376953125, 'epoch': 0.26} + 26%|██▌ | 649/2500 [3:42:36<11:17:07, 21.95s/it] 26%|██▌ | 650/2500 [3:42:59<11:19:40, 22.04s/it] {'loss': 0.0003, 'grad_norm': 0.9263874672929984, 'learning_rate': 7.4e-07, 'completion_length': 158.3928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.007659912109375, 'epoch': 0.26} + 26%|██▌ | 650/2500 [3:42:59<11:19:40, 22.04s/it] 26%|██▌ | 651/2500 [3:43:19<11:03:30, 21.53s/it] {'loss': 0.0002, 'grad_norm': 0.2683581295613577, 'learning_rate': 7.396e-07, 'completion_length': 145.98214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0062408447265625, 'epoch': 0.26} + 26%|██▌ | 651/2500 [3:43:19<11:03:30, 21.53s/it] 26%|██▌ | 652/2500 [3:43:42<11:14:00, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.09465067994502402, 'learning_rate': 7.392e-07, 'completion_length': 156.51786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0071868896484375, 'epoch': 0.26} + 26%|██▌ | 652/2500 [3:43:42<11:14:00, 21.88s/it] 26%|██▌ | 653/2500 [3:44:03<11:10:27, 21.78s/it] {'loss': 0.0004, 'grad_norm': 0.7463207094702893, 'learning_rate': 7.388e-07, 'completion_length': 150.2857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.010406494140625, 'epoch': 0.26} + 26%|██▌ | 653/2500 [3:44:03<11:10:27, 21.78s/it] 26%|██▌ | 654/2500 [3:44:25<11:06:35, 21.67s/it] {'loss': 0.0003, 'grad_norm': 0.4243372054074714, 'learning_rate': 7.383999999999999e-07, 'completion_length': 161.5714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0079803466796875, 'epoch': 0.26} + 26%|██▌ | 654/2500 [3:44:25<11:06:35, 21.67s/it] 26%|██▌ | 655/2500 [3:44:47<11:12:50, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.33100324504606377, 'learning_rate': 7.38e-07, 'completion_length': 163.9821548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 
1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.008331298828125, 'epoch': 0.26} + 26%|██▌ | 655/2500 [3:44:47<11:12:50, 21.88s/it] 26%|██▌ | 656/2500 [3:45:11<11:29:34, 22.44s/it] {'loss': 0.0002, 'grad_norm': 1.2547101297202, 'learning_rate': 7.376e-07, 'completion_length': 159.66072845458984, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0057220458984375, 'epoch': 0.26} + 26%|██▌ | 656/2500 [3:45:11<11:29:34, 22.44s/it] 26%|██▋ | 657/2500 [3:45:33<11:30:14, 22.47s/it] {'loss': 0.0003, 'grad_norm': 0.5109360508581839, 'learning_rate': 7.371999999999999e-07, 'completion_length': 167.98214721679688, 'rewards/accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.1181928999722004, 'kl': 0.008453369140625, 'epoch': 0.26} + 26%|██▋ | 657/2500 [3:45:33<11:30:14, 22.47s/it] 26%|██▋ | 658/2500 [3:45:57<11:39:04, 22.77s/it] {'loss': 0.0004, 'grad_norm': 0.3982783635283046, 'learning_rate': 7.368e-07, 'completion_length': 179.39286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.009246826171875, 'epoch': 0.26} + 26%|██▋ | 658/2500 [3:45:57<11:39:04, 22.77s/it] 26%|██▋ | 659/2500 [3:46:19<11:32:41, 22.58s/it] {'loss': 0.0004, 'grad_norm': 0.05969947336089277, 'learning_rate': 7.364000000000001e-07, 'completion_length': 160.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.010345458984375, 'epoch': 0.26} + 26%|██▋ | 659/2500 [3:46:19<11:32:41, 22.58s/it] 26%|██▋ | 660/2500 [3:46:41<11:28:09, 22.44s/it] {'loss': 0.0004, 'grad_norm': 0.3394486150742412, 'learning_rate': 7.359999999999999e-07, 'completion_length': 170.0357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 
1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.008880615234375, 'epoch': 0.26} + 26%|██▋ | 660/2500 [3:46:41<11:28:09, 22.44s/it] 26%|██▋ | 661/2500 [3:47:02<11:15:43, 22.05s/it] {'loss': 0.0002, 'grad_norm': 0.21981360274133724, 'learning_rate': 7.356e-07, 'completion_length': 142.48214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0057373046875, 'epoch': 0.26} + 26%|██▋ | 661/2500 [3:47:02<11:15:43, 22.05s/it] 26%|██▋ | 662/2500 [3:47:24<11:10:23, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.5042412538339149, 'learning_rate': 7.352e-07, 'completion_length': 144.92857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006439208984375, 'epoch': 0.26} + 26%|██▋ | 662/2500 [3:47:24<11:10:23, 21.88s/it] 27%|██▋ | 663/2500 [3:47:45<11:03:28, 21.67s/it] {'loss': 0.0004, 'grad_norm': 0.029473801490275287, 'learning_rate': 7.347999999999999e-07, 'completion_length': 152.30358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0087890625, 'epoch': 0.27} + 27%|██▋ | 663/2500 [3:47:45<11:03:28, 21.67s/it] 27%|██▋ | 664/2500 [3:48:07<11:05:35, 21.75s/it] {'loss': 0.0003, 'grad_norm': 0.2695033974308562, 'learning_rate': 7.344e-07, 'completion_length': 158.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0066375732421875, 'epoch': 0.27} + 27%|██▋ | 664/2500 [3:48:07<11:05:35, 21.75s/it] 27%|██▋ | 665/2500 [3:48:29<11:07:05, 21.81s/it] {'loss': 0.0003, 'grad_norm': 0.18076197978385258, 'learning_rate': 7.34e-07, 'completion_length': 152.05358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 
1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0073699951171875, 'epoch': 0.27} + 27%|██▋ | 665/2500 [3:48:29<11:07:05, 21.81s/it] 27%|██▋ | 666/2500 [3:48:51<11:07:06, 21.82s/it] {'loss': 0.0005, 'grad_norm': 0.2621373601230528, 'learning_rate': 7.336e-07, 'completion_length': 152.07144165039062, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.011566162109375, 'epoch': 0.27} + 27%|██▋ | 666/2500 [3:48:51<11:07:06, 21.82s/it] 27%|██▋ | 667/2500 [3:49:13<11:11:07, 21.97s/it] {'loss': 0.0003, 'grad_norm': 0.3589582452737861, 'learning_rate': 7.331999999999999e-07, 'completion_length': 150.60714721679688, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0074310302734375, 'epoch': 0.27} + 27%|██▋ | 667/2500 [3:49:13<11:11:07, 21.97s/it] 27%|██▋ | 668/2500 [3:49:36<11:20:25, 22.28s/it] {'loss': 0.0003, 'grad_norm': 0.025312546465008596, 'learning_rate': 7.328e-07, 'completion_length': 164.39286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0074005126953125, 'epoch': 0.27} + 27%|██▋ | 668/2500 [3:49:36<11:20:25, 22.28s/it] 27%|██▋ | 669/2500 [3:49:58<11:21:38, 22.34s/it] {'loss': 0.0003, 'grad_norm': 0.2707189109040492, 'learning_rate': 7.324e-07, 'completion_length': 160.30358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0076446533203125, 'epoch': 0.27} + 27%|██▋ | 669/2500 [3:49:58<11:21:38, 22.34s/it] 27%|██▋ | 670/2500 [3:50:21<11:22:45, 22.39s/it] {'loss': 0.0004, 'grad_norm': 0.841625147872455, 'learning_rate': 7.319999999999999e-07, 'completion_length': 149.4107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 
1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.010589599609375, 'epoch': 0.27} + 27%|██▋ | 670/2500 [3:50:21<11:22:45, 22.39s/it] 27%|██▋ | 671/2500 [3:50:43<11:17:30, 22.23s/it] {'loss': 0.0004, 'grad_norm': 0.05022121709666246, 'learning_rate': 7.316e-07, 'completion_length': 154.0, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.009368896484375, 'epoch': 0.27} + 27%|██▋ | 671/2500 [3:50:43<11:17:30, 22.23s/it] 27%|██▋ | 672/2500 [3:51:04<11:12:03, 22.06s/it] {'loss': 0.0003, 'grad_norm': 0.7889041475656391, 'learning_rate': 7.311999999999999e-07, 'completion_length': 140.05358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00628662109375, 'epoch': 0.27} + 27%|██▋ | 672/2500 [3:51:04<11:12:03, 22.06s/it] 27%|██▋ | 673/2500 [3:51:27<11:18:39, 22.29s/it] {'loss': 0.0004, 'grad_norm': 1.2611794182754728, 'learning_rate': 7.308e-07, 'completion_length': 150.87500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0090789794921875, 'epoch': 0.27} + 27%|██▋ | 673/2500 [3:51:27<11:18:39, 22.29s/it] 27%|██▋ | 674/2500 [3:51:50<11:19:33, 22.33s/it] {'loss': 0.0002, 'grad_norm': 0.02408255104518607, 'learning_rate': 7.304e-07, 'completion_length': 153.8571548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0058441162109375, 'epoch': 0.27} + 27%|██▋ | 674/2500 [3:51:50<11:19:33, 22.33s/it] 27%|██▋ | 675/2500 [3:52:12<11:17:32, 22.28s/it] {'loss': 0.0004, 'grad_norm': 0.8216374216292957, 'learning_rate': 7.3e-07, 'completion_length': 164.8928680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 
0.008758544921875, 'epoch': 0.27} + 27%|██▋ | 675/2500 [3:52:12<11:17:32, 22.28s/it] 27%|██▋ | 676/2500 [3:52:35<11:22:59, 22.47s/it] {'loss': 0.0003, 'grad_norm': 0.03488936340475623, 'learning_rate': 7.296e-07, 'completion_length': 167.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0084686279296875, 'epoch': 0.27} + 27%|██▋ | 676/2500 [3:52:35<11:22:59, 22.47s/it] 27%|██▋ | 677/2500 [3:52:57<11:24:23, 22.53s/it] {'loss': 0.0004, 'grad_norm': 0.6232552829186216, 'learning_rate': 7.291999999999999e-07, 'completion_length': 168.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0096282958984375, 'epoch': 0.27} + 27%|██▋ | 677/2500 [3:52:57<11:24:23, 22.53s/it] 27%|██▋ | 678/2500 [3:53:20<11:21:30, 22.44s/it] {'loss': 0.0002, 'grad_norm': 0.4005057263629562, 'learning_rate': 7.288e-07, 'completion_length': 154.21429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00616455078125, 'epoch': 0.27} + 27%|██▋ | 678/2500 [3:53:20<11:21:30, 22.44s/it] 27%|██▋ | 679/2500 [3:53:41<11:14:49, 22.23s/it] {'loss': 0.0003, 'grad_norm': 0.5467070441510853, 'learning_rate': 7.284e-07, 'completion_length': 160.8214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.1071428656578064, 'kl': 0.0086822509765625, 'epoch': 0.27} + 27%|██▋ | 679/2500 [3:53:41<11:14:49, 22.23s/it] 27%|██▋ | 680/2500 [3:54:04<11:16:15, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.2877852008526631, 'learning_rate': 7.28e-07, 'completion_length': 145.00000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005340576171875, 'epoch': 
0.27} + 27%|██▋ | 680/2500 [3:54:04<11:16:15, 22.29s/it] 27%|██▋ | 681/2500 [3:54:29<11:44:56, 23.25s/it] {'loss': 0.0003, 'grad_norm': 0.32969047374015203, 'learning_rate': 7.276e-07, 'completion_length': 171.71429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0084075927734375, 'epoch': 0.27} + 27%|██▋ | 681/2500 [3:54:29<11:44:56, 23.25s/it] 27%|██▋ | 682/2500 [3:54:51<11:32:12, 22.85s/it] {'loss': 0.0002, 'grad_norm': 0.022216433269952866, 'learning_rate': 7.271999999999999e-07, 'completion_length': 140.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00441741943359375, 'epoch': 0.27} + 27%|██▋ | 682/2500 [3:54:51<11:32:12, 22.85s/it] 27%|██▋ | 683/2500 [3:55:14<11:26:51, 22.68s/it] {'loss': 0.0002, 'grad_norm': 0.048946217936128446, 'learning_rate': 7.268e-07, 'completion_length': 148.08929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054931640625, 'epoch': 0.27} + 27%|██▋ | 683/2500 [3:55:14<11:26:51, 22.68s/it] 27%|██▋ | 684/2500 [3:55:35<11:12:27, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.7901020528389122, 'learning_rate': 7.264e-07, 'completion_length': 149.5357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0060272216796875, 'epoch': 0.27} + 27%|██▋ | 684/2500 [3:55:35<11:12:27, 22.22s/it] 27%|██▋ | 685/2500 [3:55:56<11:07:22, 22.06s/it] {'loss': 0.0003, 'grad_norm': 0.34929084584462944, 'learning_rate': 7.259999999999999e-07, 'completion_length': 154.87500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0075225830078125, 'epoch': 0.27} + 27%|██▋ | 685/2500 [3:55:56<11:07:22, 22.06s/it] 
27%|██▋ | 686/2500 [3:56:18<11:01:15, 21.87s/it] {'loss': 0.0004, 'grad_norm': 0.040186742358644646, 'learning_rate': 7.256e-07, 'completion_length': 147.92858123779297, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.0, 'kl': 0.010101318359375, 'epoch': 0.27} + 27%|██▋ | 686/2500 [3:56:18<11:01:15, 21.87s/it] 27%|██▋ | 687/2500 [3:56:39<10:51:16, 21.55s/it] {'loss': 0.0002, 'grad_norm': 0.8719370621600039, 'learning_rate': 7.252e-07, 'completion_length': 136.60714721679688, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00506591796875, 'epoch': 0.27} + 27%|██▋ | 687/2500 [3:56:39<10:51:16, 21.55s/it] 28%|██▊ | 688/2500 [3:57:01<11:00:42, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.02890301007463311, 'learning_rate': 7.247999999999999e-07, 'completion_length': 147.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0070953369140625, 'epoch': 0.28} + 28%|██▊ | 688/2500 [3:57:01<11:00:42, 21.88s/it] 28%|██▊ | 689/2500 [3:57:24<11:08:10, 22.14s/it] {'loss': 0.0003, 'grad_norm': 0.024260733865779798, 'learning_rate': 7.244e-07, 'completion_length': 171.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0079498291015625, 'epoch': 0.28} + 28%|██▊ | 689/2500 [3:57:24<11:08:10, 22.14s/it] 28%|██▊ | 690/2500 [3:57:45<10:58:32, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.019444839160681907, 'learning_rate': 7.24e-07, 'completion_length': 157.08928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058441162109375, 'epoch': 0.28} + 28%|██▊ | 690/2500 [3:57:45<10:58:32, 21.83s/it] 28%|██▊ | 691/2500 [3:58:06<10:51:03, 21.59s/it] {'loss': 0.0003, 'grad_norm': 0.5041587313920785, 'learning_rate': 
7.235999999999999e-07, 'completion_length': 149.1607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0070343017578125, 'epoch': 0.28} + 28%|██▊ | 691/2500 [3:58:06<10:51:03, 21.59s/it] 28%|██▊ | 692/2500 [3:58:28<10:53:01, 21.67s/it] {'loss': 0.0002, 'grad_norm': 0.5133443628139718, 'learning_rate': 7.231999999999999e-07, 'completion_length': 151.55357360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0062103271484375, 'epoch': 0.28} + 28%|██▊ | 692/2500 [3:58:28<10:53:01, 21.67s/it] 28%|██▊ | 693/2500 [3:58:50<10:55:20, 21.76s/it] {'loss': 0.0004, 'grad_norm': 0.22697531616277905, 'learning_rate': 7.228e-07, 'completion_length': 157.67857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0087890625, 'epoch': 0.28} + 28%|██▊ | 693/2500 [3:58:50<10:55:20, 21.76s/it] 28%|██▊ | 694/2500 [3:59:13<11:03:10, 22.03s/it] {'loss': 0.0003, 'grad_norm': 0.3730978882943004, 'learning_rate': 7.224e-07, 'completion_length': 156.37500762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.008270263671875, 'epoch': 0.28} + 28%|██▊ | 694/2500 [3:59:13<11:03:10, 22.03s/it] 28%|██▊ | 695/2500 [3:59:36<11:15:04, 22.44s/it] {'loss': 0.0004, 'grad_norm': 4.412199277266528, 'learning_rate': 7.219999999999999e-07, 'completion_length': 178.30358123779297, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.18409645557403564, 'kl': 0.009979248046875, 'epoch': 0.28} + 28%|██▊ | 695/2500 [3:59:36<11:15:04, 22.44s/it] 28%|██▊ | 696/2500 [3:59:58<11:09:59, 22.28s/it] {'loss': 0.0003, 'grad_norm': 
0.04603130666505308, 'learning_rate': 7.216e-07, 'completion_length': 145.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0063934326171875, 'epoch': 0.28} + 28%|██▊ | 696/2500 [3:59:58<11:09:59, 22.28s/it] 28%|██▊ | 697/2500 [4:00:19<10:56:56, 21.86s/it] {'loss': 0.0002, 'grad_norm': 0.2895758341443813, 'learning_rate': 7.211999999999999e-07, 'completion_length': 150.4107208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0059051513671875, 'epoch': 0.28} + 28%|██▊ | 697/2500 [4:00:19<10:56:56, 21.86s/it] 28%|██▊ | 698/2500 [4:00:42<11:11:37, 22.36s/it] {'loss': 0.0003, 'grad_norm': 0.516720536937079, 'learning_rate': 7.207999999999999e-07, 'completion_length': 161.6428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00836181640625, 'epoch': 0.28} + 28%|██▊ | 698/2500 [4:00:42<11:11:37, 22.36s/it] 28%|██▊ | 699/2500 [4:01:06<11:20:58, 22.69s/it] {'loss': 0.0003, 'grad_norm': 0.29920518791902456, 'learning_rate': 7.204e-07, 'completion_length': 166.9464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0069580078125, 'epoch': 0.28} + 28%|██▊ | 699/2500 [4:01:06<11:20:58, 22.69s/it] 28%|██▊ | 700/2500 [4:01:27<11:03:29, 22.12s/it] {'loss': 0.0002, 'grad_norm': 0.4950699865142188, 'learning_rate': 7.2e-07, 'completion_length': 153.1964340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0048828125, 'epoch': 0.28} + 28%|██▊ | 700/2500 [4:01:27<11:03:29, 22.12s/it] 28%|██▊ | 701/2500 [4:04:21<33:57:24, 67.95s/it] {'loss': 0.0003, 'grad_norm': 1.54987630996617, 
'learning_rate': 7.196e-07, 'completion_length': 151.17858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0076446533203125, 'epoch': 0.28} + 28%|██▊ | 701/2500 [4:04:21<33:57:24, 67.95s/it] 28%|██▊ | 702/2500 [4:04:37<26:08:54, 52.36s/it] {'loss': 0.0002, 'grad_norm': 0.024205089996286025, 'learning_rate': 7.191999999999999e-07, 'completion_length': 152.5357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052337646484375, 'epoch': 0.28} + 28%|██▊ | 702/2500 [4:04:37<26:08:54, 52.36s/it] 28%|██▊ | 703/2500 [4:04:54<20:44:48, 41.56s/it] {'loss': 0.0002, 'grad_norm': 0.20941126034868737, 'learning_rate': 7.188e-07, 'completion_length': 160.6964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005401611328125, 'epoch': 0.28} + 28%|██▊ | 703/2500 [4:04:54<20:44:48, 41.56s/it] 28%|██▊ | 704/2500 [4:05:10<16:58:22, 34.02s/it] {'loss': 0.0002, 'grad_norm': 0.027415965544402225, 'learning_rate': 7.184e-07, 'completion_length': 148.83929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0059814453125, 'epoch': 0.28} + 28%|██▊ | 704/2500 [4:05:10<16:58:22, 34.02s/it] 28%|██▊ | 705/2500 [4:05:32<15:10:35, 30.44s/it] {'loss': 0.0002, 'grad_norm': 0.3621606438366915, 'learning_rate': 7.179999999999999e-07, 'completion_length': 146.2678680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0054168701171875, 'epoch': 0.28} + 28%|██▊ | 705/2500 [4:05:32<15:10:35, 30.44s/it] 28%|██▊ | 706/2500 [4:05:55<13:59:43, 28.08s/it] {'loss': 0.0003, 'grad_norm': 0.2222153383764273, 'learning_rate': 7.176e-07, 'completion_length': 
160.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006591796875, 'epoch': 0.28} + 28%|██▊ | 706/2500 [4:05:55<13:59:43, 28.08s/it] 28%|██▊ | 707/2500 [4:06:16<12:58:50, 26.06s/it] {'loss': 0.0002, 'grad_norm': 0.020258462097830457, 'learning_rate': 7.171999999999999e-07, 'completion_length': 139.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051422119140625, 'epoch': 0.28} + 28%|██▊ | 707/2500 [4:06:16<12:58:50, 26.06s/it] 28%|██▊ | 708/2500 [4:06:38<12:18:50, 24.74s/it] {'loss': 0.0003, 'grad_norm': 0.2970130368194346, 'learning_rate': 7.168e-07, 'completion_length': 160.5178680419922, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.00714111328125, 'epoch': 0.28} + 28%|██▊ | 708/2500 [4:06:38<12:18:50, 24.74s/it] 28%|██▊ | 709/2500 [4:07:00<11:56:19, 24.00s/it] {'loss': 0.0003, 'grad_norm': 0.4031903251294557, 'learning_rate': 7.164e-07, 'completion_length': 166.4464340209961, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.12371791899204254, 'kl': 0.008636474609375, 'epoch': 0.28} + 28%|██▊ | 709/2500 [4:07:00<11:56:19, 24.00s/it] 28%|██▊ | 710/2500 [4:07:23<11:42:53, 23.56s/it] {'loss': 0.0002, 'grad_norm': 0.17867140003826965, 'learning_rate': 7.159999999999999e-07, 'completion_length': 162.71429443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.006103515625, 'epoch': 0.28} + 28%|██▊ | 710/2500 [4:07:23<11:42:53, 23.56s/it] 28%|██▊ | 711/2500 [4:07:45<11:30:09, 23.15s/it] {'loss': 0.0003, 'grad_norm': 0.21195600450569896, 'learning_rate': 7.156e-07, 'completion_length': 156.75000762939453, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00787353515625, 'epoch': 0.28} + 28%|██▊ | 711/2500 [4:07:45<11:30:09, 23.15s/it] 28%|██▊ | 712/2500 [4:08:08<11:28:53, 23.12s/it] {'loss': 0.0003, 'grad_norm': 0.2276803775589602, 'learning_rate': 7.151999999999999e-07, 'completion_length': 161.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006866455078125, 'epoch': 0.28} + 28%|██▊ | 712/2500 [4:08:08<11:28:53, 23.12s/it] 29%|██▊ | 713/2500 [4:08:30<11:17:41, 22.75s/it] {'loss': 0.0002, 'grad_norm': 0.18790575443770632, 'learning_rate': 7.147999999999999e-07, 'completion_length': 145.01786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0054168701171875, 'epoch': 0.29} + 29%|██▊ | 713/2500 [4:08:30<11:17:41, 22.75s/it] 29%|██▊ | 714/2500 [4:08:52<11:15:52, 22.71s/it] {'loss': 0.0003, 'grad_norm': 0.020390055846075488, 'learning_rate': 7.144e-07, 'completion_length': 157.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0063018798828125, 'epoch': 0.29} + 29%|██▊ | 714/2500 [4:08:52<11:15:52, 22.71s/it] 29%|██▊ | 715/2500 [4:09:16<11:22:32, 22.94s/it] {'loss': 0.0003, 'grad_norm': 0.5463791760987027, 'learning_rate': 7.14e-07, 'completion_length': 170.0178680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0073699951171875, 'epoch': 0.29} + 29%|██▊ | 715/2500 [4:09:16<11:22:32, 22.94s/it] 29%|██▊ | 716/2500 [4:09:40<11:34:50, 23.37s/it] {'loss': 0.0002, 'grad_norm': 0.21436064409884692, 'learning_rate': 7.135999999999999e-07, 'completion_length': 162.14286041259766, 
'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005706787109375, 'epoch': 0.29} + 29%|██▊ | 716/2500 [4:09:40<11:34:50, 23.37s/it] 29%|██▊ | 717/2500 [4:10:03<11:31:36, 23.27s/it] {'loss': 0.0003, 'grad_norm': 0.3613389797037983, 'learning_rate': 7.131999999999999e-07, 'completion_length': 175.23214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.007965087890625, 'epoch': 0.29} + 29%|██▊ | 717/2500 [4:10:03<11:31:36, 23.27s/it] 29%|██▊ | 718/2500 [4:10:25<11:15:41, 22.75s/it] {'loss': 0.0002, 'grad_norm': 0.019219112179583206, 'learning_rate': 7.128e-07, 'completion_length': 153.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052032470703125, 'epoch': 0.29} + 29%|██▊ | 718/2500 [4:10:25<11:15:41, 22.75s/it] 29%|██▉ | 719/2500 [4:10:46<11:03:23, 22.35s/it] {'loss': 0.0002, 'grad_norm': 0.018374025079236016, 'learning_rate': 7.124e-07, 'completion_length': 147.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005828857421875, 'epoch': 0.29} + 29%|██▉ | 719/2500 [4:10:46<11:03:23, 22.35s/it] 29%|██▉ | 720/2500 [4:11:10<11:10:58, 22.62s/it] {'loss': 0.0002, 'grad_norm': 0.03344337568605365, 'learning_rate': 7.119999999999999e-07, 'completion_length': 153.19644165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058746337890625, 'epoch': 0.29} + 29%|██▉ | 720/2500 [4:11:10<11:10:58, 22.62s/it] 29%|██▉ | 721/2500 [4:11:31<11:03:38, 22.38s/it] {'loss': 0.0003, 'grad_norm': 0.02367303055660571, 'learning_rate': 7.116e-07, 'completion_length': 163.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.00775146484375, 'epoch': 0.29} + 29%|██▉ | 721/2500 [4:11:31<11:03:38, 22.38s/it] 29%|██▉ | 722/2500 [4:11:54<11:06:55, 22.51s/it] {'loss': 0.0002, 'grad_norm': 0.21941759880993805, 'learning_rate': 7.112000000000001e-07, 'completion_length': 146.3214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005889892578125, 'epoch': 0.29} + 29%|██▉ | 722/2500 [4:11:54<11:06:55, 22.51s/it] 29%|██▉ | 723/2500 [4:12:16<11:02:51, 22.38s/it] {'loss': 0.0002, 'grad_norm': 0.31150094870993184, 'learning_rate': 7.107999999999999e-07, 'completion_length': 158.5357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00597381591796875, 'epoch': 0.29} + 29%|██▉ | 723/2500 [4:12:16<11:02:51, 22.38s/it] 29%|██▉ | 724/2500 [4:12:39<11:03:41, 22.42s/it] {'loss': 0.0002, 'grad_norm': 0.3084924143337719, 'learning_rate': 7.104e-07, 'completion_length': 154.3928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0055389404296875, 'epoch': 0.29} + 29%|██▉ | 724/2500 [4:12:39<11:03:41, 22.42s/it] 29%|██▉ | 725/2500 [4:13:03<11:19:07, 22.96s/it] {'loss': 0.0003, 'grad_norm': 1.094890091223569, 'learning_rate': 7.1e-07, 'completion_length': 163.42858123779297, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.1428571529686451, 'kl': 0.00811767578125, 'epoch': 0.29} + 29%|██▉ | 725/2500 [4:13:03<11:19:07, 22.96s/it] 29%|██▉ | 726/2500 [4:13:25<11:10:18, 22.67s/it] {'loss': 0.0003, 'grad_norm': 4.062409584698609, 'learning_rate': 7.096e-07, 'completion_length': 145.12500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 
0.0357142873108387, 'kl': 0.006988525390625, 'epoch': 0.29} + 29%|██▉ | 726/2500 [4:13:25<11:10:18, 22.67s/it] 29%|██▉ | 727/2500 [4:13:47<11:00:43, 22.36s/it] {'loss': 0.0003, 'grad_norm': 0.25677931072277427, 'learning_rate': 7.092e-07, 'completion_length': 158.19644165039062, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.008331298828125, 'epoch': 0.29} + 29%|██▉ | 727/2500 [4:13:47<11:00:43, 22.36s/it] 29%|██▉ | 728/2500 [4:14:09<10:59:15, 22.32s/it] {'loss': 0.0003, 'grad_norm': 0.35167064782159163, 'learning_rate': 7.088e-07, 'completion_length': 159.01786041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0069122314453125, 'epoch': 0.29} + 29%|██▉ | 728/2500 [4:14:09<10:59:15, 22.32s/it] 29%|██▉ | 729/2500 [4:14:31<10:54:27, 22.17s/it] {'loss': 0.0002, 'grad_norm': 0.02753598903870605, 'learning_rate': 7.084e-07, 'completion_length': 148.9821548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005157470703125, 'epoch': 0.29} + 29%|██▉ | 729/2500 [4:14:31<10:54:27, 22.17s/it] 29%|██▉ | 730/2500 [4:14:53<10:59:58, 22.37s/it] {'loss': 0.0003, 'grad_norm': 0.39145956170246066, 'learning_rate': 7.079999999999999e-07, 'completion_length': 158.4464340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0063629150390625, 'epoch': 0.29} + 29%|██▉ | 730/2500 [4:14:54<10:59:58, 22.37s/it] 29%|██▉ | 731/2500 [4:15:15<10:51:20, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.0384746178079503, 'learning_rate': 7.076e-07, 'completion_length': 149.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.00434112548828125, 'epoch': 0.29} + 29%|██▉ | 731/2500 [4:15:15<10:51:20, 22.09s/it] 29%|██▉ | 732/2500 [4:15:36<10:44:28, 21.87s/it] {'loss': 0.0003, 'grad_norm': 0.04146949269513062, 'learning_rate': 7.072e-07, 'completion_length': 139.2321548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0068511962890625, 'epoch': 0.29} + 29%|██▉ | 732/2500 [4:15:36<10:44:28, 21.87s/it] 29%|██▉ | 733/2500 [4:15:59<10:50:40, 22.09s/it] {'loss': 0.0003, 'grad_norm': 0.269400133177223, 'learning_rate': 7.068e-07, 'completion_length': 164.26786041259766, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00860595703125, 'epoch': 0.29} + 29%|██▉ | 733/2500 [4:15:59<10:50:40, 22.09s/it] 29%|██▉ | 734/2500 [4:16:21<10:50:17, 22.09s/it] {'loss': 0.0003, 'grad_norm': 0.29425329784915116, 'learning_rate': 7.064e-07, 'completion_length': 146.37500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0078277587890625, 'epoch': 0.29} + 29%|██▉ | 734/2500 [4:16:21<10:50:17, 22.09s/it] 29%|██▉ | 735/2500 [4:16:45<11:08:53, 22.74s/it] {'loss': 0.0003, 'grad_norm': 0.35612467497752504, 'learning_rate': 7.059999999999999e-07, 'completion_length': 153.10714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00689697265625, 'epoch': 0.29} + 29%|██▉ | 735/2500 [4:16:45<11:08:53, 22.74s/it] 29%|██▉ | 736/2500 [4:17:08<11:10:26, 22.80s/it] {'loss': 0.0003, 'grad_norm': 0.32236212545839915, 'learning_rate': 7.056e-07, 'completion_length': 160.23214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0077362060546875, 'epoch': 
0.29} + 29%|██▉ | 736/2500 [4:17:08<11:10:26, 22.80s/it] 29%|██▉ | 737/2500 [4:17:31<11:08:34, 22.75s/it] {'loss': 0.0003, 'grad_norm': 0.2581472057198835, 'learning_rate': 7.052e-07, 'completion_length': 173.7857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.008636474609375, 'epoch': 0.29} + 29%|██▉ | 737/2500 [4:17:31<11:08:34, 22.75s/it] 30%|██▉ | 738/2500 [4:17:54<11:09:12, 22.79s/it] {'loss': 0.0004, 'grad_norm': 0.7149498627041971, 'learning_rate': 7.047999999999999e-07, 'completion_length': 150.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.009552001953125, 'epoch': 0.3} + 30%|██▉ | 738/2500 [4:17:54<11:09:12, 22.79s/it] 30%|██▉ | 739/2500 [4:18:15<10:57:34, 22.40s/it] {'loss': 0.0002, 'grad_norm': 0.024607576356662407, 'learning_rate': 7.044e-07, 'completion_length': 152.30358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00531005859375, 'epoch': 0.3} + 30%|██▉ | 739/2500 [4:18:15<10:57:34, 22.40s/it] 30%|██▉ | 740/2500 [4:18:38<11:04:01, 22.64s/it] {'loss': 0.0003, 'grad_norm': 0.7190459417641911, 'learning_rate': 7.04e-07, 'completion_length': 161.25000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0084075927734375, 'epoch': 0.3} + 30%|██▉ | 740/2500 [4:18:38<11:04:01, 22.64s/it] 30%|██▉ | 741/2500 [4:18:59<10:50:07, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.2544057984539631, 'learning_rate': 7.035999999999999e-07, 'completion_length': 148.0357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059051513671875, 'epoch': 0.3} + 30%|██▉ | 
741/2500 [4:18:59<10:50:07, 22.18s/it] 30%|██▉ | 742/2500 [4:19:22<10:52:23, 22.27s/it] {'loss': 0.0002, 'grad_norm': 0.31032511985611755, 'learning_rate': 7.032e-07, 'completion_length': 151.83929443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0056304931640625, 'epoch': 0.3} + 30%|██▉ | 742/2500 [4:19:22<10:52:23, 22.27s/it] 30%|██▉ | 743/2500 [4:19:46<11:03:20, 22.65s/it] {'loss': 0.0003, 'grad_norm': 0.24064865194075444, 'learning_rate': 7.028e-07, 'completion_length': 172.17858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00714111328125, 'epoch': 0.3} + 30%|██▉ | 743/2500 [4:19:46<11:03:20, 22.65s/it] 30%|██▉ | 744/2500 [4:20:08<11:05:22, 22.73s/it] {'loss': 0.0003, 'grad_norm': 0.2709981268232238, 'learning_rate': 7.024e-07, 'completion_length': 165.62500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0068206787109375, 'epoch': 0.3} + 30%|██▉ | 744/2500 [4:20:08<11:05:22, 22.73s/it] 30%|██▉ | 745/2500 [4:20:30<10:51:38, 22.28s/it] {'loss': 0.0003, 'grad_norm': 0.6011897383056085, 'learning_rate': 7.019999999999999e-07, 'completion_length': 149.96428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0063629150390625, 'epoch': 0.3} + 30%|██▉ | 745/2500 [4:20:30<10:51:38, 22.28s/it] 30%|██▉ | 746/2500 [4:20:52<10:48:40, 22.19s/it] {'loss': 0.0003, 'grad_norm': 0.6259296482099777, 'learning_rate': 7.016e-07, 'completion_length': 149.55357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0071868896484375, 'epoch': 
0.3} + 30%|██▉ | 746/2500 [4:20:52<10:48:40, 22.19s/it] 30%|██▉ | 747/2500 [4:21:13<10:42:19, 21.99s/it] {'loss': 0.0002, 'grad_norm': 0.32538012546679795, 'learning_rate': 7.012000000000001e-07, 'completion_length': 139.87500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0045318603515625, 'epoch': 0.3} + 30%|██▉ | 747/2500 [4:21:13<10:42:19, 21.99s/it] 30%|██▉ | 748/2500 [4:21:37<10:57:18, 22.51s/it] {'loss': 0.0003, 'grad_norm': 0.7349413771217671, 'learning_rate': 7.007999999999999e-07, 'completion_length': 161.51786041259766, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.1428571492433548, 'kl': 0.007476806640625, 'epoch': 0.3} + 30%|██▉ | 748/2500 [4:21:37<10:57:18, 22.51s/it] 30%|██▉ | 749/2500 [4:21:58<10:44:14, 22.08s/it] {'loss': 0.0002, 'grad_norm': 0.05438284885521874, 'learning_rate': 7.004e-07, 'completion_length': 143.25, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048675537109375, 'epoch': 0.3} + 30%|██▉ | 749/2500 [4:21:58<10:44:14, 22.08s/it] 30%|███ | 750/2500 [4:22:19<10:34:12, 21.74s/it] {'loss': 0.0002, 'grad_norm': 0.032036458569806525, 'learning_rate': 7e-07, 'completion_length': 142.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005096435546875, 'epoch': 0.3} + 30%|███ | 750/2500 [4:22:19<10:34:12, 21.74s/it] 30%|███ | 751/2500 [4:22:41<10:40:29, 21.97s/it] {'loss': 0.0003, 'grad_norm': 0.45566257730241316, 'learning_rate': 6.995999999999999e-07, 'completion_length': 168.6428680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0086669921875, 'epoch': 0.3} + 30%|███ | 751/2500 [4:22:41<10:40:29, 21.97s/it] 30%|███ | 
752/2500 [4:23:04<10:44:05, 22.11s/it] {'loss': 0.0004, 'grad_norm': 0.369959136765396, 'learning_rate': 6.992e-07, 'completion_length': 181.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0089874267578125, 'epoch': 0.3} + 30%|███ | 752/2500 [4:23:04<10:44:05, 22.11s/it] 30%|███ | 753/2500 [4:23:26<10:47:17, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.03085492012403293, 'learning_rate': 6.988e-07, 'completion_length': 159.01786041259766, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.00567626953125, 'epoch': 0.3} + 30%|███ | 753/2500 [4:23:26<10:47:17, 22.23s/it] 30%|███ | 754/2500 [4:23:48<10:40:14, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.3256811123439834, 'learning_rate': 6.984e-07, 'completion_length': 155.9821548461914, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.0059661865234375, 'epoch': 0.3} + 30%|███ | 754/2500 [4:23:48<10:40:14, 22.00s/it] 30%|███ | 755/2500 [4:24:10<10:43:04, 22.11s/it] {'loss': 0.0003, 'grad_norm': 0.022673530052490693, 'learning_rate': 6.979999999999999e-07, 'completion_length': 159.21428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00665283203125, 'epoch': 0.3} + 30%|███ | 755/2500 [4:24:10<10:43:04, 22.11s/it] 30%|███ | 756/2500 [4:24:32<10:37:23, 21.93s/it] {'loss': 0.0003, 'grad_norm': 0.7748116921615867, 'learning_rate': 6.976e-07, 'completion_length': 146.89286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006805419921875, 'epoch': 0.3} + 30%|███ | 756/2500 [4:24:32<10:37:23, 21.93s/it] 30%|███ | 757/2500 [4:24:53<10:35:51, 21.89s/it] {'loss': 
0.0002, 'grad_norm': 0.02588775137545087, 'learning_rate': 6.972e-07, 'completion_length': 147.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005462646484375, 'epoch': 0.3} + 30%|███ | 757/2500 [4:24:54<10:35:51, 21.89s/it] 30%|███ | 758/2500 [4:25:15<10:34:51, 21.87s/it] {'loss': 0.0002, 'grad_norm': 0.020419503143979372, 'learning_rate': 6.967999999999999e-07, 'completion_length': 156.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00618743896484375, 'epoch': 0.3} + 30%|███ | 758/2500 [4:25:15<10:34:51, 21.87s/it] 30%|███ | 759/2500 [4:25:37<10:31:24, 21.76s/it] {'loss': 0.0002, 'grad_norm': 0.02855938871394136, 'learning_rate': 6.964e-07, 'completion_length': 145.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056610107421875, 'epoch': 0.3} + 30%|███ | 759/2500 [4:25:37<10:31:24, 21.76s/it] 30%|███ | 760/2500 [4:25:59<10:31:14, 21.77s/it] {'loss': 0.0003, 'grad_norm': 0.041409044086639674, 'learning_rate': 6.959999999999999e-07, 'completion_length': 157.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0072021484375, 'epoch': 0.3} + 30%|███ | 760/2500 [4:25:59<10:31:14, 21.77s/it] 30%|███ | 761/2500 [4:26:21<10:33:59, 21.87s/it] {'loss': 0.0004, 'grad_norm': 0.5329580952495999, 'learning_rate': 6.956e-07, 'completion_length': 177.46428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.008880615234375, 'epoch': 0.3} + 30%|███ | 761/2500 [4:26:21<10:33:59, 21.87s/it] 30%|███ | 762/2500 [4:26:42<10:29:05, 21.72s/it] {'loss': 0.0003, 'grad_norm': 0.23001175073182686, 'learning_rate': 6.952e-07, 'completion_length': 151.8214340209961, 'rewards/accuracy_reward': 0.9642857313156128, 
'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0080108642578125, 'epoch': 0.3} + 30%|███ | 762/2500 [4:26:42<10:29:05, 21.72s/it] 31%|███ | 763/2500 [4:27:03<10:24:17, 21.56s/it] {'loss': 0.0002, 'grad_norm': 0.03424831029475149, 'learning_rate': 6.947999999999999e-07, 'completion_length': 149.67857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00508880615234375, 'epoch': 0.31} + 31%|███ | 763/2500 [4:27:03<10:24:17, 21.56s/it] 31%|███ | 764/2500 [4:27:26<10:30:30, 21.79s/it] {'loss': 0.0003, 'grad_norm': 0.44080457383620375, 'learning_rate': 6.944e-07, 'completion_length': 170.8214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0084991455078125, 'epoch': 0.31} + 31%|███ | 764/2500 [4:27:26<10:30:30, 21.79s/it] 31%|███ | 765/2500 [4:27:48<10:33:33, 21.91s/it] {'loss': 0.0002, 'grad_norm': 0.2629891299668013, 'learning_rate': 6.939999999999999e-07, 'completion_length': 160.12500762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.0053558349609375, 'epoch': 0.31} + 31%|███ | 765/2500 [4:27:48<10:33:33, 21.91s/it] 31%|███ | 766/2500 [4:28:11<10:43:28, 22.27s/it] {'loss': 0.0003, 'grad_norm': 0.29125918883849977, 'learning_rate': 6.935999999999999e-07, 'completion_length': 178.14286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00830078125, 'epoch': 0.31} + 31%|███ | 766/2500 [4:28:11<10:43:28, 22.27s/it] 31%|███ | 767/2500 [4:28:34<10:47:36, 22.42s/it] {'loss': 0.0004, 'grad_norm': 0.8694328617756046, 'learning_rate': 6.932e-07, 'completion_length': 175.0714340209961, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0098724365234375, 'epoch': 0.31} + 31%|███ | 767/2500 [4:28:34<10:47:36, 22.42s/it] 31%|███ | 768/2500 [4:28:56<10:47:58, 22.45s/it] {'loss': 0.0002, 'grad_norm': 0.02722054766179424, 'learning_rate': 6.928e-07, 'completion_length': 158.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006011962890625, 'epoch': 0.31} + 31%|███ | 768/2500 [4:28:56<10:47:58, 22.45s/it] 31%|███ | 769/2500 [4:29:20<10:57:12, 22.78s/it] {'loss': 0.0003, 'grad_norm': 1.1499900902974767, 'learning_rate': 6.924e-07, 'completion_length': 155.4107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0068817138671875, 'epoch': 0.31} + 31%|███ | 769/2500 [4:29:20<10:57:12, 22.78s/it] 31%|███ | 770/2500 [4:29:45<11:13:57, 23.37s/it] {'loss': 0.0003, 'grad_norm': 0.4179198467093073, 'learning_rate': 6.919999999999999e-07, 'completion_length': 179.73214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00830078125, 'epoch': 0.31} + 31%|███ | 770/2500 [4:29:45<11:13:57, 23.37s/it] 31%|███ | 771/2500 [4:30:06<11:00:33, 22.92s/it] {'loss': 0.0003, 'grad_norm': 0.4908900413182292, 'learning_rate': 6.916e-07, 'completion_length': 160.625, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428656578064, 'kl': 0.0069580078125, 'epoch': 0.31} + 31%|███ | 771/2500 [4:30:06<11:00:33, 22.92s/it] 31%|███ | 772/2500 [4:30:29<10:56:01, 22.78s/it] {'loss': 0.0003, 'grad_norm': 0.26469242289083517, 'learning_rate': 6.912e-07, 'completion_length': 160.76786041259766, 'rewards/accuracy_reward': 
0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006683349609375, 'epoch': 0.31} + 31%|███ | 772/2500 [4:30:29<10:56:01, 22.78s/it] 31%|███ | 773/2500 [4:30:51<10:47:53, 22.51s/it] {'loss': 0.0003, 'grad_norm': 0.02396105731679014, 'learning_rate': 6.907999999999999e-07, 'completion_length': 162.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0070953369140625, 'epoch': 0.31} + 31%|███ | 773/2500 [4:30:51<10:47:53, 22.51s/it] 31%|███ | 774/2500 [4:31:13<10:47:05, 22.49s/it] {'loss': 0.0002, 'grad_norm': 1.7565969593302708, 'learning_rate': 6.904e-07, 'completion_length': 167.16072845458984, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0061492919921875, 'epoch': 0.31} + 31%|███ | 774/2500 [4:31:13<10:47:05, 22.49s/it] 31%|███ | 775/2500 [4:31:36<10:50:00, 22.61s/it] {'loss': 0.0003, 'grad_norm': 0.026305902973107944, 'learning_rate': 6.9e-07, 'completion_length': 149.12500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0071563720703125, 'epoch': 0.31} + 31%|███ | 775/2500 [4:31:36<10:50:00, 22.61s/it] 31%|███ | 776/2500 [4:31:57<10:36:49, 22.16s/it] {'loss': 0.0002, 'grad_norm': 0.3763849749765121, 'learning_rate': 6.895999999999999e-07, 'completion_length': 147.7857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.005523681640625, 'epoch': 0.31} + 31%|███ | 776/2500 [4:31:57<10:36:49, 22.16s/it] 31%|███ | 777/2500 [4:32:19<10:32:47, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.023809723144376536, 'learning_rate': 6.892e-07, 'completion_length': 156.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.00543212890625, 'epoch': 0.31} + 31%|███ | 777/2500 [4:32:19<10:32:47, 22.04s/it] 31%|███ | 778/2500 [4:32:42<10:44:28, 22.46s/it] {'loss': 0.0004, 'grad_norm': 0.5278671096942757, 'learning_rate': 6.888e-07, 'completion_length': 174.10714721679688, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.010406494140625, 'epoch': 0.31} + 31%|███ | 778/2500 [4:32:42<10:44:28, 22.46s/it] 31%|███ | 779/2500 [4:33:04<10:37:43, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.25755915090157316, 'learning_rate': 6.883999999999999e-07, 'completion_length': 141.21428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0050811767578125, 'epoch': 0.31} + 31%|███ | 779/2500 [4:33:04<10:37:43, 22.23s/it] 31%|███ | 780/2500 [4:33:26<10:35:36, 22.17s/it] {'loss': 0.0002, 'grad_norm': 0.23927775786908045, 'learning_rate': 6.879999999999999e-07, 'completion_length': 154.00000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00531005859375, 'epoch': 0.31} + 31%|███ | 780/2500 [4:33:26<10:35:36, 22.17s/it] 31%|███ | 781/2500 [4:33:49<10:44:11, 22.48s/it] {'loss': 0.0003, 'grad_norm': 0.028646096321259838, 'learning_rate': 6.876e-07, 'completion_length': 171.4821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0078582763671875, 'epoch': 0.31} + 31%|███ | 781/2500 [4:33:49<10:44:11, 22.48s/it] 31%|███▏ | 782/2500 [4:34:12<10:44:59, 22.53s/it] {'loss': 0.0003, 'grad_norm': 0.5746431414805138, 'learning_rate': 6.872e-07, 'completion_length': 154.2857208251953, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 
0.12371791899204254, 'kl': 0.00701904296875, 'epoch': 0.31} + 31%|███▏ | 782/2500 [4:34:12<10:44:59, 22.53s/it] 31%|███▏ | 783/2500 [4:34:34<10:39:02, 22.33s/it] {'loss': 0.0002, 'grad_norm': 0.02642393421008301, 'learning_rate': 6.867999999999999e-07, 'completion_length': 156.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0059661865234375, 'epoch': 0.31} + 31%|███▏ | 783/2500 [4:34:34<10:39:02, 22.33s/it] 31%|███▏ | 784/2500 [4:34:55<10:28:22, 21.97s/it] {'loss': 0.0002, 'grad_norm': 0.2587210688177064, 'learning_rate': 6.864e-07, 'completion_length': 143.85714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005584716796875, 'epoch': 0.31} + 31%|███▏ | 784/2500 [4:34:55<10:28:22, 21.97s/it] 31%|███▏ | 785/2500 [4:35:16<10:24:26, 21.85s/it] {'loss': 0.0003, 'grad_norm': 0.0253297571167965, 'learning_rate': 6.86e-07, 'completion_length': 144.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0065765380859375, 'epoch': 0.31} + 31%|███▏ | 785/2500 [4:35:16<10:24:26, 21.85s/it] 31%|███▏ | 786/2500 [4:35:38<10:25:04, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.32346106491021726, 'learning_rate': 6.855999999999999e-07, 'completion_length': 158.75000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0073394775390625, 'epoch': 0.31} + 31%|███▏ | 786/2500 [4:35:38<10:25:04, 21.88s/it] 31%|███▏ | 787/2500 [4:36:01<10:30:24, 22.08s/it] {'loss': 0.0002, 'grad_norm': 0.22690935209119276, 'learning_rate': 6.852e-07, 'completion_length': 149.37500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004974365234375, 
'epoch': 0.31} + 31%|███▏ | 787/2500 [4:36:01<10:30:24, 22.08s/it] 32%|███▏ | 788/2500 [4:36:23<10:26:46, 21.97s/it] {'loss': 0.0003, 'grad_norm': 0.2318872568720806, 'learning_rate': 6.847999999999999e-07, 'completion_length': 150.85714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0070953369140625, 'epoch': 0.32} + 32%|███▏ | 788/2500 [4:36:23<10:26:46, 21.97s/it] 32%|███▏ | 789/2500 [4:36:45<10:30:51, 22.12s/it] {'loss': 0.0003, 'grad_norm': 0.7340896880339318, 'learning_rate': 6.844e-07, 'completion_length': 155.35714721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0074462890625, 'epoch': 0.32} + 32%|███▏ | 789/2500 [4:36:45<10:30:51, 22.12s/it] 32%|███▏ | 790/2500 [4:37:07<10:26:49, 21.99s/it] {'loss': 0.0002, 'grad_norm': 0.540298179551596, 'learning_rate': 6.84e-07, 'completion_length': 140.33929061889648, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.00522613525390625, 'epoch': 0.32} + 32%|███▏ | 790/2500 [4:37:07<10:26:49, 21.99s/it] 32%|███▏ | 791/2500 [4:37:29<10:28:25, 22.06s/it] {'loss': 0.0003, 'grad_norm': 0.3144038756397842, 'learning_rate': 6.836e-07, 'completion_length': 155.51786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00757598876953125, 'epoch': 0.32} + 32%|███▏ | 791/2500 [4:37:29<10:28:25, 22.06s/it] 32%|███▏ | 792/2500 [4:37:51<10:22:40, 21.87s/it] {'loss': 0.0002, 'grad_norm': 0.21387018533348384, 'learning_rate': 6.832e-07, 'completion_length': 145.4107208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 
'kl': 0.005279541015625, 'epoch': 0.32} + 32%|███▏ | 792/2500 [4:37:51<10:22:40, 21.87s/it] 32%|███▏ | 793/2500 [4:38:12<10:17:48, 21.72s/it] {'loss': 0.0002, 'grad_norm': 0.031119418637227687, 'learning_rate': 6.827999999999999e-07, 'completion_length': 148.96428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00603485107421875, 'epoch': 0.32} + 32%|███▏ | 793/2500 [4:38:12<10:17:48, 21.72s/it] 32%|███▏ | 794/2500 [4:38:34<10:18:07, 21.74s/it] {'loss': 0.0003, 'grad_norm': 0.052603775544800585, 'learning_rate': 6.824e-07, 'completion_length': 168.0357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0078887939453125, 'epoch': 0.32} + 32%|███▏ | 794/2500 [4:38:34<10:18:07, 21.74s/it] 32%|███▏ | 795/2500 [4:38:56<10:21:12, 21.86s/it] {'loss': 0.0003, 'grad_norm': 0.27338727137079066, 'learning_rate': 6.82e-07, 'completion_length': 149.82144165039062, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0071868896484375, 'epoch': 0.32} + 32%|███▏ | 795/2500 [4:38:56<10:21:12, 21.86s/it] 32%|███▏ | 796/2500 [4:39:18<10:24:43, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.2611196259929166, 'learning_rate': 6.816e-07, 'completion_length': 152.7857208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00531005859375, 'epoch': 0.32} + 32%|███▏ | 796/2500 [4:39:18<10:24:43, 22.00s/it] 32%|███▏ | 797/2500 [4:39:41<10:30:47, 22.22s/it] {'loss': 0.0004, 'grad_norm': 0.5589491981337168, 'learning_rate': 6.812e-07, 'completion_length': 166.21429443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.009033203125, 'epoch': 0.32} + 32%|███▏ | 797/2500 
[4:39:41<10:30:47, 22.22s/it] 32%|███▏ | 798/2500 [4:40:08<11:11:20, 23.67s/it] {'loss': 0.0003, 'grad_norm': 0.17298032348322306, 'learning_rate': 6.807999999999999e-07, 'completion_length': 174.4107208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8750000596046448, 'reward_std': 0.06838765740394592, 'kl': 0.0085601806640625, 'epoch': 0.32} + 32%|███▏ | 798/2500 [4:40:08<11:11:20, 23.67s/it] 32%|███▏ | 799/2500 [4:40:30<11:00:59, 23.32s/it] {'loss': 0.0002, 'grad_norm': 0.4157665962551669, 'learning_rate': 6.804e-07, 'completion_length': 147.0714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006195068359375, 'epoch': 0.32} + 32%|███▏ | 799/2500 [4:40:30<11:00:59, 23.32s/it] 32%|███▏ | 800/2500 [4:40:53<10:51:03, 22.98s/it] {'loss': 0.0003, 'grad_norm': 0.3895926751488319, 'learning_rate': 6.800000000000001e-07, 'completion_length': 146.2857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.007781982421875, 'epoch': 0.32} + 32%|███▏ | 800/2500 [4:40:53<10:51:03, 22.98s/it] 32%|███▏ | 801/2500 [4:44:23<37:19:00, 79.07s/it] {'loss': 0.0003, 'grad_norm': 0.21902840482784086, 'learning_rate': 6.795999999999999e-07, 'completion_length': 172.7321548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0085296630859375, 'epoch': 0.32} + 32%|███▏ | 801/2500 [4:44:23<37:19:00, 79.07s/it] 32%|███▏ | 802/2500 [4:44:45<29:20:09, 62.20s/it] {'loss': 0.0004, 'grad_norm': 0.5521762583265475, 'learning_rate': 6.792e-07, 'completion_length': 172.7678680419922, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.1428571492433548, 
'kl': 0.00994873046875, 'epoch': 0.32} + 32%|███▏ | 802/2500 [4:44:45<29:20:09, 62.20s/it] 32%|███▏ | 803/2500 [4:45:07<23:34:17, 50.00s/it] {'loss': 0.0003, 'grad_norm': 0.24422916265034206, 'learning_rate': 6.788e-07, 'completion_length': 145.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006439208984375, 'epoch': 0.32} + 32%|███▏ | 803/2500 [4:45:07<23:34:17, 50.00s/it] 32%|███▏ | 804/2500 [4:45:30<19:41:31, 41.80s/it] {'loss': 0.0003, 'grad_norm': 0.4483353321841351, 'learning_rate': 6.783999999999999e-07, 'completion_length': 158.3214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.006378173828125, 'epoch': 0.32} + 32%|███▏ | 804/2500 [4:45:30<19:41:31, 41.80s/it] 32%|███▏ | 805/2500 [4:45:51<16:51:18, 35.80s/it] {'loss': 0.0003, 'grad_norm': 0.7964576867779457, 'learning_rate': 6.78e-07, 'completion_length': 157.51786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0066070556640625, 'epoch': 0.32} + 32%|███▏ | 805/2500 [4:45:51<16:51:18, 35.80s/it] 32%|███▏ | 806/2500 [4:46:12<14:46:21, 31.39s/it] {'loss': 0.0003, 'grad_norm': 0.3866076307021467, 'learning_rate': 6.776e-07, 'completion_length': 131.46429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00634765625, 'epoch': 0.32} + 32%|███▏ | 806/2500 [4:46:13<14:46:21, 31.39s/it] 32%|███▏ | 807/2500 [4:46:35<13:32:42, 28.80s/it] {'loss': 0.0003, 'grad_norm': 0.05956563700598305, 'learning_rate': 6.772e-07, 'completion_length': 161.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006927490234375, 
'epoch': 0.32} + 32%|███▏ | 807/2500 [4:46:35<13:32:42, 28.80s/it] 32%|███▏ | 808/2500 [4:46:57<12:35:31, 26.79s/it] {'loss': 0.0002, 'grad_norm': 0.2068567384950924, 'learning_rate': 6.767999999999999e-07, 'completion_length': 152.89286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0061187744140625, 'epoch': 0.32} + 32%|███▏ | 808/2500 [4:46:57<12:35:31, 26.79s/it] 32%|███▏ | 809/2500 [4:47:19<11:55:18, 25.38s/it] {'loss': 0.0003, 'grad_norm': 0.03721503679987784, 'learning_rate': 6.764e-07, 'completion_length': 158.69644165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0064239501953125, 'epoch': 0.32} + 32%|███▏ | 809/2500 [4:47:19<11:55:18, 25.38s/it] 32%|███▏ | 810/2500 [4:47:44<11:48:48, 25.17s/it] {'loss': 0.0004, 'grad_norm': 0.5207444916995683, 'learning_rate': 6.76e-07, 'completion_length': 169.3214340209961, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.1181928962469101, 'kl': 0.010894775390625, 'epoch': 0.32} + 32%|███▏ | 810/2500 [4:47:44<11:48:48, 25.17s/it] 32%|███▏ | 811/2500 [4:48:07<11:32:49, 24.61s/it] {'loss': 0.0003, 'grad_norm': 3.91197182908085, 'learning_rate': 6.755999999999999e-07, 'completion_length': 168.3928680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0063934326171875, 'epoch': 0.32} + 32%|███▏ | 811/2500 [4:48:07<11:32:49, 24.61s/it] 32%|███▏ | 812/2500 [4:48:29<11:08:34, 23.76s/it] {'loss': 0.0003, 'grad_norm': 0.5985602339524463, 'learning_rate': 6.752e-07, 'completion_length': 152.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0070037841796875, 'epoch': 
0.32} + 32%|███▏ | 812/2500 [4:48:29<11:08:34, 23.76s/it] 33%|███▎ | 813/2500 [4:48:52<11:03:48, 23.61s/it] {'loss': 0.0003, 'grad_norm': 0.3005148318330127, 'learning_rate': 6.747999999999999e-07, 'completion_length': 163.2857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0069122314453125, 'epoch': 0.33} + 33%|███▎ | 813/2500 [4:48:52<11:03:48, 23.61s/it] 33%|███▎ | 814/2500 [4:49:15<10:55:12, 23.32s/it] {'loss': 0.0002, 'grad_norm': 0.16941548052558683, 'learning_rate': 6.744e-07, 'completion_length': 160.46429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005279541015625, 'epoch': 0.33} + 33%|███▎ | 814/2500 [4:49:15<10:55:12, 23.32s/it] 33%|███▎ | 815/2500 [4:49:36<10:37:49, 22.71s/it] {'loss': 0.0002, 'grad_norm': 0.014818380806197016, 'learning_rate': 6.74e-07, 'completion_length': 152.8928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00467681884765625, 'epoch': 0.33} + 33%|███▎ | 815/2500 [4:49:36<10:37:49, 22.71s/it] 33%|███▎ | 816/2500 [4:49:58<10:28:35, 22.40s/it] {'loss': 0.0003, 'grad_norm': 0.19197838191646627, 'learning_rate': 6.736e-07, 'completion_length': 156.87500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0079345703125, 'epoch': 0.33} + 33%|███▎ | 816/2500 [4:49:58<10:28:35, 22.40s/it] 33%|███▎ | 817/2500 [4:50:21<10:28:53, 22.42s/it] {'loss': 0.0003, 'grad_norm': 0.0361176055543982, 'learning_rate': 6.732e-07, 'completion_length': 148.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007415771484375, 'epoch': 0.33} + 33%|███▎ | 817/2500 [4:50:21<10:28:53, 22.42s/it] 
33%|███▎ | 818/2500 [4:50:42<10:24:04, 22.26s/it] {'loss': 0.0002, 'grad_norm': 0.042529216643196345, 'learning_rate': 6.727999999999999e-07, 'completion_length': 154.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0055389404296875, 'epoch': 0.33} + 33%|███▎ | 818/2500 [4:50:42<10:24:04, 22.26s/it] 33%|███▎ | 819/2500 [4:51:05<10:26:52, 22.38s/it] {'loss': 0.0003, 'grad_norm': 0.13433446710878816, 'learning_rate': 6.724e-07, 'completion_length': 148.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0069122314453125, 'epoch': 0.33} + 33%|███▎ | 819/2500 [4:51:05<10:26:52, 22.38s/it] 33%|███▎ | 820/2500 [4:51:28<10:29:21, 22.48s/it] {'loss': 0.0003, 'grad_norm': 0.2519530160077956, 'learning_rate': 6.72e-07, 'completion_length': 170.4464340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0084075927734375, 'epoch': 0.33} + 33%|███▎ | 820/2500 [4:51:28<10:29:21, 22.48s/it] 33%|███▎ | 821/2500 [4:51:50<10:30:02, 22.51s/it] {'loss': 0.0003, 'grad_norm': 0.17347453150393108, 'learning_rate': 6.716e-07, 'completion_length': 147.62500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0066070556640625, 'epoch': 0.33} + 33%|███▎ | 821/2500 [4:51:50<10:30:02, 22.51s/it] 33%|███▎ | 822/2500 [4:52:13<10:27:43, 22.45s/it] {'loss': 0.0003, 'grad_norm': 0.2590032826315198, 'learning_rate': 6.712e-07, 'completion_length': 147.25000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0062713623046875, 'epoch': 0.33} + 33%|███▎ | 822/2500 [4:52:13<10:27:43, 22.45s/it] 33%|███▎ | 823/2500 [4:52:35<10:26:17, 22.41s/it] {'loss': 
0.0003, 'grad_norm': 0.023228027737242016, 'learning_rate': 6.707999999999999e-07, 'completion_length': 155.42857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006500244140625, 'epoch': 0.33} + 33%|███▎ | 823/2500 [4:52:35<10:26:17, 22.41s/it] 33%|███▎ | 824/2500 [4:52:57<10:21:16, 22.24s/it] {'loss': 0.0003, 'grad_norm': 0.04341088413837325, 'learning_rate': 6.704e-07, 'completion_length': 150.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0081329345703125, 'epoch': 0.33} + 33%|███▎ | 824/2500 [4:52:57<10:21:16, 22.24s/it] 33%|███▎ | 825/2500 [4:53:19<10:20:19, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.30270969186873153, 'learning_rate': 6.7e-07, 'completion_length': 152.50000762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0055084228515625, 'epoch': 0.33} + 33%|███▎ | 825/2500 [4:53:19<10:20:19, 22.22s/it] 33%|███▎ | 826/2500 [4:53:42<10:25:17, 22.41s/it] {'loss': 0.0003, 'grad_norm': 0.2630033485732801, 'learning_rate': 6.695999999999999e-07, 'completion_length': 152.7321548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0073699951171875, 'epoch': 0.33} + 33%|███▎ | 826/2500 [4:53:42<10:25:17, 22.41s/it] 33%|███▎ | 827/2500 [4:54:04<10:25:09, 22.42s/it] {'loss': 0.0003, 'grad_norm': 0.732094127888496, 'learning_rate': 6.692e-07, 'completion_length': 155.4107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0083465576171875, 'epoch': 0.33} + 33%|███▎ | 827/2500 [4:54:04<10:25:09, 22.42s/it] 33%|███▎ | 828/2500 [4:54:28<10:35:50, 22.82s/it] {'loss': 0.0005, 'grad_norm': 0.3111424356347401, 
'learning_rate': 6.688e-07, 'completion_length': 166.21429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.011383056640625, 'epoch': 0.33} + 33%|███▎ | 828/2500 [4:54:28<10:35:50, 22.82s/it] 33%|███▎ | 829/2500 [4:54:50<10:25:37, 22.46s/it] {'loss': 0.0003, 'grad_norm': 0.03034871749469383, 'learning_rate': 6.683999999999999e-07, 'completion_length': 155.7321548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0063018798828125, 'epoch': 0.33} + 33%|███▎ | 829/2500 [4:54:50<10:25:37, 22.46s/it] 33%|███▎ | 830/2500 [4:55:12<10:23:29, 22.40s/it] {'loss': 0.0003, 'grad_norm': 0.02896669686105724, 'learning_rate': 6.68e-07, 'completion_length': 167.2678680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00750732421875, 'epoch': 0.33} + 33%|███▎ | 830/2500 [4:55:12<10:23:29, 22.40s/it] 33%|███▎ | 831/2500 [4:55:33<10:13:58, 22.07s/it] {'loss': 0.0002, 'grad_norm': 0.37191314773183115, 'learning_rate': 6.676e-07, 'completion_length': 140.1607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005645751953125, 'epoch': 0.33} + 33%|███▎ | 831/2500 [4:55:33<10:13:58, 22.07s/it] 33%|███▎ | 832/2500 [4:55:56<10:18:26, 22.25s/it] {'loss': 0.0003, 'grad_norm': 0.5108766194945349, 'learning_rate': 6.671999999999999e-07, 'completion_length': 152.2857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0067138671875, 'epoch': 0.33} + 33%|███▎ | 832/2500 [4:55:56<10:18:26, 22.25s/it] 33%|███▎ | 833/2500 [4:56:18<10:17:19, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.02944201712421564, 'learning_rate': 
6.667999999999999e-07, 'completion_length': 156.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005523681640625, 'epoch': 0.33} + 33%|███▎ | 833/2500 [4:56:18<10:17:19, 22.22s/it] 33%|███▎ | 834/2500 [4:56:40<10:16:39, 22.21s/it] {'loss': 0.0002, 'grad_norm': 1.3194284839293626, 'learning_rate': 6.664e-07, 'completion_length': 156.00000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0059967041015625, 'epoch': 0.33} + 33%|███▎ | 834/2500 [4:56:40<10:16:39, 22.21s/it] 33%|███▎ | 835/2500 [4:57:03<10:16:54, 22.23s/it] {'loss': 0.0003, 'grad_norm': 0.026330060898362553, 'learning_rate': 6.66e-07, 'completion_length': 143.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0083770751953125, 'epoch': 0.33} + 33%|███▎ | 835/2500 [4:57:03<10:16:54, 22.23s/it] 33%|███▎ | 836/2500 [4:57:26<10:26:59, 22.61s/it] {'loss': 0.0003, 'grad_norm': 0.636151412300923, 'learning_rate': 6.655999999999999e-07, 'completion_length': 153.0357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.006805419921875, 'epoch': 0.33} + 33%|███▎ | 836/2500 [4:57:26<10:26:59, 22.61s/it] 33%|███▎ | 837/2500 [4:57:49<10:31:11, 22.77s/it] {'loss': 0.0002, 'grad_norm': 0.020771328927120216, 'learning_rate': 6.652e-07, 'completion_length': 151.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00506591796875, 'epoch': 0.33} + 33%|███▎ | 837/2500 [4:57:49<10:31:11, 22.77s/it] 34%|███▎ | 838/2500 [4:58:12<10:30:03, 22.75s/it] {'loss': 0.0002, 'grad_norm': 0.30571679755067516, 'learning_rate': 6.647999999999999e-07, 'completion_length': 159.42858123779297, 'rewards/accuracy_reward': 
0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.005279541015625, 'epoch': 0.34} + 34%|███▎ | 838/2500 [4:58:12<10:30:03, 22.75s/it] 34%|███▎ | 839/2500 [4:58:35<10:31:09, 22.80s/it] {'loss': 0.0003, 'grad_norm': 0.29960391773671813, 'learning_rate': 6.643999999999999e-07, 'completion_length': 164.8214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0087432861328125, 'epoch': 0.34} + 34%|███▎ | 839/2500 [4:58:35<10:31:09, 22.80s/it] 34%|███▎ | 840/2500 [4:58:56<10:21:30, 22.46s/it] {'loss': 0.0002, 'grad_norm': 0.020967301989123537, 'learning_rate': 6.64e-07, 'completion_length': 147.42857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056610107421875, 'epoch': 0.34} + 34%|███▎ | 840/2500 [4:58:56<10:21:30, 22.46s/it] 34%|███▎ | 841/2500 [4:59:19<10:24:53, 22.60s/it] {'loss': 0.0003, 'grad_norm': 0.5520733673191389, 'learning_rate': 6.636e-07, 'completion_length': 161.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00799560546875, 'epoch': 0.34} + 34%|███▎ | 841/2500 [4:59:19<10:24:53, 22.60s/it] 34%|███▎ | 842/2500 [4:59:43<10:31:04, 22.84s/it] {'loss': 0.0003, 'grad_norm': 0.43459802544099035, 'learning_rate': 6.632e-07, 'completion_length': 166.25000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0071258544921875, 'epoch': 0.34} + 34%|███▎ | 842/2500 [4:59:43<10:31:04, 22.84s/it] 34%|███▎ | 843/2500 [5:00:05<10:29:12, 22.78s/it] {'loss': 0.0003, 'grad_norm': 0.027654592088729724, 'learning_rate': 6.627999999999999e-07, 'completion_length': 166.16072845458984, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0071258544921875, 'epoch': 0.34} + 34%|███▎ | 843/2500 [5:00:05<10:29:12, 22.78s/it] 34%|███▍ | 844/2500 [5:00:29<10:38:56, 23.15s/it] {'loss': 0.0003, 'grad_norm': 0.43970256223165577, 'learning_rate': 6.624e-07, 'completion_length': 162.26786041259766, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0070343017578125, 'epoch': 0.34} + 34%|███▍ | 844/2500 [5:00:29<10:38:56, 23.15s/it] 34%|███▍ | 845/2500 [5:00:52<10:34:51, 23.02s/it] {'loss': 0.0002, 'grad_norm': 0.3449988540959146, 'learning_rate': 6.62e-07, 'completion_length': 148.00000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004913330078125, 'epoch': 0.34} + 34%|███▍ | 845/2500 [5:00:52<10:34:51, 23.02s/it] 34%|███▍ | 846/2500 [5:01:15<10:35:50, 23.07s/it] {'loss': 0.0002, 'grad_norm': 0.02509316899620472, 'learning_rate': 6.615999999999999e-07, 'completion_length': 145.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006011962890625, 'epoch': 0.34} + 34%|███▍ | 846/2500 [5:01:15<10:35:50, 23.07s/it] 34%|███▍ | 847/2500 [5:01:39<10:44:07, 23.38s/it] {'loss': 0.0003, 'grad_norm': 0.21126908120487994, 'learning_rate': 6.612e-07, 'completion_length': 158.92858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00701904296875, 'epoch': 0.34} + 34%|███▍ | 847/2500 [5:01:39<10:44:07, 23.38s/it] 34%|███▍ | 848/2500 [5:02:04<10:49:58, 23.61s/it] {'loss': 0.0002, 'grad_norm': 0.03284769785291716, 'learning_rate': 6.608e-07, 'completion_length': 160.1428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 
'kl': 0.00592041015625, 'epoch': 0.34} + 34%|███▍ | 848/2500 [5:02:04<10:49:58, 23.61s/it] 34%|███▍ | 849/2500 [5:02:28<10:56:39, 23.86s/it] {'loss': 0.0003, 'grad_norm': 0.024648153338978896, 'learning_rate': 6.604e-07, 'completion_length': 156.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006927490234375, 'epoch': 0.34} + 34%|███▍ | 849/2500 [5:02:28<10:56:39, 23.86s/it] 34%|███▍ | 850/2500 [5:02:51<10:46:11, 23.50s/it] {'loss': 0.0002, 'grad_norm': 0.4471247257009924, 'learning_rate': 6.6e-07, 'completion_length': 144.1071548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00438690185546875, 'epoch': 0.34} + 34%|███▍ | 850/2500 [5:02:51<10:46:11, 23.50s/it] 34%|███▍ | 851/2500 [5:03:15<10:53:13, 23.77s/it] {'loss': 0.0003, 'grad_norm': 0.32250997805659587, 'learning_rate': 6.595999999999999e-07, 'completion_length': 157.35714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0079498291015625, 'epoch': 0.34} + 34%|███▍ | 851/2500 [5:03:15<10:53:13, 23.77s/it] 34%|███▍ | 852/2500 [5:03:38<10:43:18, 23.42s/it] {'loss': 0.0003, 'grad_norm': 0.2402391553012465, 'learning_rate': 6.592e-07, 'completion_length': 159.14286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0068817138671875, 'epoch': 0.34} + 34%|███▍ | 852/2500 [5:03:38<10:43:18, 23.42s/it] 34%|███▍ | 853/2500 [5:04:00<10:30:21, 22.96s/it] {'loss': 0.0002, 'grad_norm': 0.017957632223672342, 'learning_rate': 6.588e-07, 'completion_length': 146.33928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045928955078125, 'epoch': 0.34} + 34%|███▍ | 853/2500 
[5:04:00<10:30:21, 22.96s/it] 34%|███▍ | 854/2500 [5:04:24<10:42:43, 23.43s/it] {'loss': 0.0003, 'grad_norm': 0.16495659444905766, 'learning_rate': 6.583999999999999e-07, 'completion_length': 178.8571548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0074615478515625, 'epoch': 0.34} + 34%|███▍ | 854/2500 [5:04:24<10:42:43, 23.43s/it] 34%|███▍ | 855/2500 [5:04:46<10:30:22, 22.99s/it] {'loss': 0.0003, 'grad_norm': 0.024837046219961842, 'learning_rate': 6.58e-07, 'completion_length': 156.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00689697265625, 'epoch': 0.34} + 34%|███▍ | 855/2500 [5:04:46<10:30:22, 22.99s/it] 34%|███▍ | 856/2500 [5:05:07<10:15:57, 22.48s/it] {'loss': 0.0002, 'grad_norm': 0.032917076047550356, 'learning_rate': 6.576e-07, 'completion_length': 144.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00531005859375, 'epoch': 0.34} + 34%|███▍ | 856/2500 [5:05:07<10:15:57, 22.48s/it] 34%|███▍ | 857/2500 [5:05:29<10:09:40, 22.26s/it] {'loss': 0.0003, 'grad_norm': 0.3380334570811244, 'learning_rate': 6.571999999999999e-07, 'completion_length': 140.71428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.00714111328125, 'epoch': 0.34} + 34%|███▍ | 857/2500 [5:05:29<10:09:40, 22.26s/it] 34%|███▍ | 858/2500 [5:05:51<10:06:26, 22.16s/it] {'loss': 0.0002, 'grad_norm': 0.22214564511512833, 'learning_rate': 6.568e-07, 'completion_length': 144.80358123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00464630126953125, 'epoch': 0.34} + 34%|███▍ | 858/2500 [5:05:51<10:06:26, 
22.16s/it] 34%|███▍ | 859/2500 [5:06:14<10:11:06, 22.34s/it] {'loss': 0.0002, 'grad_norm': 0.31205264623820134, 'learning_rate': 6.564e-07, 'completion_length': 148.64286041259766, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00424957275390625, 'epoch': 0.34} + 34%|███▍ | 859/2500 [5:06:14<10:11:06, 22.34s/it] 34%|███▍ | 860/2500 [5:06:37<10:14:41, 22.49s/it] {'loss': 0.0002, 'grad_norm': 0.4558042067417913, 'learning_rate': 6.56e-07, 'completion_length': 161.08928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00499725341796875, 'epoch': 0.34} + 34%|███▍ | 860/2500 [5:06:37<10:14:41, 22.49s/it] 34%|███▍ | 861/2500 [5:06:58<10:04:44, 22.14s/it] {'loss': 0.0002, 'grad_norm': 0.018838102253567097, 'learning_rate': 6.555999999999999e-07, 'completion_length': 146.1428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005126953125, 'epoch': 0.34} + 34%|███▍ | 861/2500 [5:06:58<10:04:44, 22.14s/it] 34%|███▍ | 862/2500 [5:07:20<10:01:12, 22.02s/it] {'loss': 0.0004, 'grad_norm': 0.2662816221972931, 'learning_rate': 6.552e-07, 'completion_length': 169.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.009124755859375, 'epoch': 0.34} + 34%|███▍ | 862/2500 [5:07:20<10:01:12, 22.02s/it] 35%|███▍ | 863/2500 [5:07:42<10:03:41, 22.13s/it] {'loss': 0.0003, 'grad_norm': 0.3011606289024102, 'learning_rate': 6.548000000000001e-07, 'completion_length': 160.50000762939453, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0824786126613617, 'kl': 0.008544921875, 'epoch': 0.35} + 35%|███▍ | 863/2500 [5:07:42<10:03:41, 22.13s/it] 
35%|███▍ | 864/2500 [5:08:05<10:11:43, 22.43s/it] {'loss': 0.0002, 'grad_norm': 0.21147830246923335, 'learning_rate': 6.543999999999999e-07, 'completion_length': 172.6071548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0060577392578125, 'epoch': 0.35} + 35%|███▍ | 864/2500 [5:08:05<10:11:43, 22.43s/it] 35%|███▍ | 865/2500 [5:08:26<9:59:35, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.7552453576479191, 'learning_rate': 6.54e-07, 'completion_length': 136.21429443359375, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.00530242919921875, 'epoch': 0.35} + 35%|███▍ | 865/2500 [5:08:26<9:59:35, 22.00s/it] 35%|███▍ | 866/2500 [5:08:47<9:51:33, 21.72s/it] {'loss': 0.0003, 'grad_norm': 0.26094399268017804, 'learning_rate': 6.536e-07, 'completion_length': 136.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006500244140625, 'epoch': 0.35} + 35%|███▍ | 866/2500 [5:08:47<9:51:33, 21.72s/it] 35%|███▍ | 867/2500 [5:09:09<9:54:08, 21.83s/it] {'loss': 0.0002, 'grad_norm': 1.9867719011946288, 'learning_rate': 6.531999999999999e-07, 'completion_length': 142.26786041259766, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266788095235825, 'kl': 0.004852294921875, 'epoch': 0.35} + 35%|███▍ | 867/2500 [5:09:09<9:54:08, 21.83s/it] 35%|███▍ | 868/2500 [5:09:32<9:56:44, 21.94s/it] {'loss': 0.0004, 'grad_norm': 0.019799964982559044, 'learning_rate': 6.528e-07, 'completion_length': 148.3214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.009735107421875, 'epoch': 0.35} + 35%|███▍ | 868/2500 
[5:09:32<9:56:44, 21.94s/it] 35%|███▍ | 869/2500 [5:09:53<9:54:38, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.41322829794621396, 'learning_rate': 6.524e-07, 'completion_length': 139.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00738525390625, 'epoch': 0.35} + 35%|███▍ | 869/2500 [5:09:53<9:54:38, 21.88s/it] 35%|███▍ | 870/2500 [5:10:16<9:58:06, 22.02s/it] {'loss': 0.0002, 'grad_norm': 0.8612166257546835, 'learning_rate': 6.52e-07, 'completion_length': 147.46429061889648, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005950927734375, 'epoch': 0.35} + 35%|███▍ | 870/2500 [5:10:16<9:58:06, 22.02s/it] 35%|███▍ | 871/2500 [5:10:38<10:03:13, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.04010601998313645, 'learning_rate': 6.515999999999999e-07, 'completion_length': 144.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005523681640625, 'epoch': 0.35} + 35%|███▍ | 871/2500 [5:10:38<10:03:13, 22.22s/it] 35%|███▍ | 872/2500 [5:10:59<9:54:09, 21.90s/it] {'loss': 0.0002, 'grad_norm': 0.02340554008833702, 'learning_rate': 6.512e-07, 'completion_length': 142.55358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0037841796875, 'epoch': 0.35} + 35%|███▍ | 872/2500 [5:10:59<9:54:09, 21.90s/it] 35%|███▍ | 873/2500 [5:11:22<9:58:35, 22.07s/it] {'loss': 0.0002, 'grad_norm': 0.026297849495578587, 'learning_rate': 6.508e-07, 'completion_length': 155.17858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0062255859375, 'epoch': 0.35} + 35%|███▍ | 873/2500 [5:11:22<9:58:35, 22.07s/it] 35%|███▍ | 874/2500 [5:11:44<9:58:54, 22.10s/it] {'loss': 0.0002, 'grad_norm': 
0.16887513203065274, 'learning_rate': 6.504e-07, 'completion_length': 156.39286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006011962890625, 'epoch': 0.35} + 35%|███▍ | 874/2500 [5:11:44<9:58:54, 22.10s/it] 35%|███▌ | 875/2500 [5:12:07<10:03:57, 22.30s/it] {'loss': 0.0004, 'grad_norm': 0.31938434452212, 'learning_rate': 6.5e-07, 'completion_length': 156.5714340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0088653564453125, 'epoch': 0.35} + 35%|███▌ | 875/2500 [5:12:07<10:03:57, 22.30s/it] 35%|███▌ | 876/2500 [5:12:30<10:09:16, 22.51s/it] {'loss': 0.0002, 'grad_norm': 0.4378457819379037, 'learning_rate': 6.495999999999999e-07, 'completion_length': 164.96429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0061798095703125, 'epoch': 0.35} + 35%|███▌ | 876/2500 [5:12:30<10:09:16, 22.51s/it] 35%|███▌ | 877/2500 [5:12:53<10:11:43, 22.61s/it] {'loss': 0.0004, 'grad_norm': 1.0213758156198385, 'learning_rate': 6.492e-07, 'completion_length': 171.87500762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.00927734375, 'epoch': 0.35} + 35%|███▌ | 877/2500 [5:12:53<10:11:43, 22.61s/it] 35%|███▌ | 878/2500 [5:13:15<10:12:20, 22.65s/it] {'loss': 0.0003, 'grad_norm': 0.8141883441378551, 'learning_rate': 6.488e-07, 'completion_length': 153.6607208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0073089599609375, 'epoch': 0.35} + 35%|███▌ | 878/2500 [5:13:15<10:12:20, 22.65s/it] 35%|███▌ | 879/2500 [5:13:40<10:23:32, 23.08s/it] {'loss': 
0.0003, 'grad_norm': 0.9805144174360313, 'learning_rate': 6.483999999999999e-07, 'completion_length': 161.67857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007781982421875, 'epoch': 0.35} + 35%|███▌ | 879/2500 [5:13:40<10:23:32, 23.08s/it] 35%|███▌ | 880/2500 [5:14:01<10:11:48, 22.66s/it] {'loss': 0.0002, 'grad_norm': 0.019828954669608156, 'learning_rate': 6.48e-07, 'completion_length': 151.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00394439697265625, 'epoch': 0.35} + 35%|███▌ | 880/2500 [5:14:01<10:11:48, 22.66s/it] 35%|███▌ | 881/2500 [5:14:25<10:16:23, 22.84s/it] {'loss': 0.0003, 'grad_norm': 0.04227980296687892, 'learning_rate': 6.476e-07, 'completion_length': 168.1607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0068817138671875, 'epoch': 0.35} + 35%|███▌ | 881/2500 [5:14:25<10:16:23, 22.84s/it] 35%|███▌ | 882/2500 [5:14:46<10:05:41, 22.46s/it] {'loss': 0.0002, 'grad_norm': 0.20187310344582743, 'learning_rate': 6.471999999999999e-07, 'completion_length': 145.28571701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00533294677734375, 'epoch': 0.35} + 35%|███▌ | 882/2500 [5:14:46<10:05:41, 22.46s/it] 35%|███▌ | 883/2500 [5:15:09<10:06:19, 22.50s/it] {'loss': 0.0003, 'grad_norm': 0.46548397222879695, 'learning_rate': 6.468e-07, 'completion_length': 163.89286041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0064239501953125, 'epoch': 0.35} + 35%|███▌ | 883/2500 [5:15:09<10:06:19, 22.50s/it] 35%|███▌ | 884/2500 [5:15:32<10:15:42, 22.86s/it] {'loss': 0.0003, 
'grad_norm': 0.3893187631996004, 'learning_rate': 6.464e-07, 'completion_length': 165.6607208251953, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.00628662109375, 'epoch': 0.35} + 35%|███▌ | 884/2500 [5:15:32<10:15:42, 22.86s/it] 35%|███▌ | 885/2500 [5:15:53<9:57:19, 22.19s/it] {'loss': 0.0002, 'grad_norm': 0.019585941943100038, 'learning_rate': 6.46e-07, 'completion_length': 131.00000381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00388336181640625, 'epoch': 0.35} + 35%|███▌ | 885/2500 [5:15:53<9:57:19, 22.19s/it] 35%|███▌ | 886/2500 [5:16:16<10:01:31, 22.36s/it] {'loss': 0.0003, 'grad_norm': 0.4293484017656582, 'learning_rate': 6.455999999999999e-07, 'completion_length': 160.83929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0085906982421875, 'epoch': 0.35} + 35%|███▌ | 886/2500 [5:16:16<10:01:31, 22.36s/it] 35%|███▌ | 887/2500 [5:16:40<10:13:55, 22.84s/it] {'loss': 0.0002, 'grad_norm': 0.4465638453767699, 'learning_rate': 6.452e-07, 'completion_length': 141.6964340209961, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.1071428619325161, 'kl': 0.0058441162109375, 'epoch': 0.35} + 35%|███▌ | 887/2500 [5:16:40<10:13:55, 22.84s/it] 36%|███▌ | 888/2500 [5:17:04<10:22:53, 23.18s/it] {'loss': 0.0003, 'grad_norm': 1.1659707403821669, 'learning_rate': 6.448000000000001e-07, 'completion_length': 170.37500762939453, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.007965087890625, 'epoch': 0.36} + 36%|███▌ | 888/2500 [5:17:04<10:22:53, 23.18s/it] 36%|███▌ | 889/2500 [5:17:26<10:14:31, 22.89s/it] {'loss': 0.0003, 
'grad_norm': 0.02618625775177218, 'learning_rate': 6.443999999999999e-07, 'completion_length': 157.26786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0071868896484375, 'epoch': 0.36} + 36%|███▌ | 889/2500 [5:17:26<10:14:31, 22.89s/it] 36%|███▌ | 890/2500 [5:17:48<10:04:16, 22.52s/it] {'loss': 0.0003, 'grad_norm': 0.024029474452615874, 'learning_rate': 6.44e-07, 'completion_length': 148.1607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0067138671875, 'epoch': 0.36} + 36%|███▌ | 890/2500 [5:17:48<10:04:16, 22.52s/it] 36%|███▌ | 891/2500 [5:18:10<9:59:53, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.33489588731413356, 'learning_rate': 6.436e-07, 'completion_length': 146.5357208251953, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0052337646484375, 'epoch': 0.36} + 36%|███▌ | 891/2500 [5:18:10<9:59:53, 22.37s/it] 36%|███▌ | 892/2500 [5:18:33<10:04:05, 22.54s/it] {'loss': 0.0003, 'grad_norm': 1.3118614796907464, 'learning_rate': 6.431999999999999e-07, 'completion_length': 172.25000762939453, 'rewards/accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.1071428619325161, 'kl': 0.0086212158203125, 'epoch': 0.36} + 36%|███▌ | 892/2500 [5:18:33<10:04:05, 22.54s/it] 36%|███▌ | 893/2500 [5:18:54<9:56:56, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.06343717788157643, 'learning_rate': 6.428e-07, 'completion_length': 145.92857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00481414794921875, 'epoch': 0.36} + 36%|███▌ | 893/2500 [5:18:54<9:56:56, 22.29s/it] 36%|███▌ | 894/2500 [5:19:16<9:53:06, 22.16s/it] {'loss': 0.0003, 'grad_norm': 0.03416642392020168, 
'learning_rate': 6.424e-07, 'completion_length': 164.5357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00634765625, 'epoch': 0.36} + 36%|███▌ | 894/2500 [5:19:16<9:53:06, 22.16s/it] 36%|███▌ | 895/2500 [5:19:39<9:54:59, 22.24s/it] {'loss': 0.0003, 'grad_norm': 0.06619813888638962, 'learning_rate': 6.42e-07, 'completion_length': 150.17858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0064849853515625, 'epoch': 0.36} + 36%|███▌ | 895/2500 [5:19:39<9:54:59, 22.24s/it] 36%|███▌ | 896/2500 [5:20:00<9:52:30, 22.16s/it] {'loss': 0.0003, 'grad_norm': 0.4267712957337566, 'learning_rate': 6.415999999999999e-07, 'completion_length': 149.80358123779297, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.00872802734375, 'epoch': 0.36} + 36%|███▌ | 896/2500 [5:20:00<9:52:30, 22.16s/it] 36%|███▌ | 897/2500 [5:20:23<9:51:49, 22.15s/it] {'loss': 0.0003, 'grad_norm': 0.2917481849107935, 'learning_rate': 6.412e-07, 'completion_length': 164.19644165039062, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0067596435546875, 'epoch': 0.36} + 36%|███▌ | 897/2500 [5:20:23<9:51:49, 22.15s/it] 36%|███▌ | 898/2500 [5:20:44<9:44:58, 21.91s/it] {'loss': 0.0002, 'grad_norm': 0.01746762929415969, 'learning_rate': 6.408e-07, 'completion_length': 146.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0043792724609375, 'epoch': 0.36} + 36%|███▌ | 898/2500 [5:20:44<9:44:58, 21.91s/it] 36%|███▌ | 899/2500 [5:21:07<9:51:07, 22.15s/it] {'loss': 0.0003, 'grad_norm': 3.4863850663080784, 'learning_rate': 6.403999999999999e-07, 'completion_length': 157.5357208251953, 'rewards/accuracy_reward': 
0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.007537841796875, 'epoch': 0.36} + 36%|███▌ | 899/2500 [5:21:07<9:51:07, 22.15s/it] 36%|███▌ | 900/2500 [5:21:30<9:58:36, 22.45s/it] {'loss': 0.0003, 'grad_norm': 0.02755197024749369, 'learning_rate': 6.4e-07, 'completion_length': 157.73214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.007354736328125, 'epoch': 0.36} + 36%|███▌ | 900/2500 [5:21:30<9:58:36, 22.45s/it] 36%|███▌ | 901/2500 [5:25:14<36:52:24, 83.02s/it] {'loss': 0.0004, 'grad_norm': 0.8884496911228619, 'learning_rate': 6.395999999999999e-07, 'completion_length': 158.85714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.1428571529686451, 'kl': 0.00970458984375, 'epoch': 0.36} + 36%|███▌ | 901/2500 [5:25:14<36:52:24, 83.02s/it] 36%|███▌ | 902/2500 [5:25:37<28:52:27, 65.05s/it] {'loss': 0.0002, 'grad_norm': 0.23852391114401075, 'learning_rate': 6.392e-07, 'completion_length': 154.8214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059661865234375, 'epoch': 0.36} + 36%|███▌ | 902/2500 [5:25:37<28:52:27, 65.05s/it] 36%|███▌ | 903/2500 [5:25:59<23:08:37, 52.17s/it] {'loss': 0.0002, 'grad_norm': 0.023982585504992267, 'learning_rate': 6.388e-07, 'completion_length': 151.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004791259765625, 'epoch': 0.36} + 36%|███▌ | 903/2500 [5:25:59<23:08:37, 52.17s/it] 36%|███▌ | 904/2500 [5:26:21<19:02:57, 42.97s/it] {'loss': 0.0002, 'grad_norm': 0.05611271034105753, 'learning_rate': 6.383999999999999e-07, 'completion_length': 146.2857208251953, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00495147705078125, 'epoch': 0.36} + 36%|███▌ | 904/2500 [5:26:21<19:02:57, 42.97s/it] 36%|███▌ | 905/2500 [5:26:44<16:20:54, 36.90s/it] {'loss': 0.0003, 'grad_norm': 0.5551835762670138, 'learning_rate': 6.38e-07, 'completion_length': 164.82144165039062, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.00799560546875, 'epoch': 0.36} + 36%|███▌ | 905/2500 [5:26:44<16:20:54, 36.90s/it] 36%|███▌ | 906/2500 [5:27:07<14:32:04, 32.83s/it] {'loss': 0.0003, 'grad_norm': 0.36241340714604253, 'learning_rate': 6.375999999999999e-07, 'completion_length': 153.67858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0067596435546875, 'epoch': 0.36} + 36%|███▌ | 906/2500 [5:27:07<14:32:04, 32.83s/it] 36%|███▋ | 907/2500 [5:27:31<13:23:32, 30.27s/it] {'loss': 0.0003, 'grad_norm': 0.4080662425961056, 'learning_rate': 6.371999999999999e-07, 'completion_length': 164.37500762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.00799560546875, 'epoch': 0.36} + 36%|███▋ | 907/2500 [5:27:31<13:23:32, 30.27s/it] 36%|███▋ | 908/2500 [5:27:54<12:19:40, 27.88s/it] {'loss': 0.0003, 'grad_norm': 0.02615347260690987, 'learning_rate': 6.368e-07, 'completion_length': 151.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007110595703125, 'epoch': 0.36} + 36%|███▋ | 908/2500 [5:27:54<12:19:40, 27.88s/it] 36%|███▋ | 909/2500 [5:28:17<11:47:52, 26.70s/it] {'loss': 0.0003, 'grad_norm': 0.4376440270906636, 'learning_rate': 6.364e-07, 'completion_length': 164.5357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.00665283203125, 'epoch': 0.36} + 36%|███▋ | 909/2500 [5:28:18<11:47:52, 26.70s/it] 36%|███▋ | 910/2500 [5:28:40<11:16:27, 25.53s/it] {'loss': 0.0002, 'grad_norm': 0.4223272290312742, 'learning_rate': 6.36e-07, 'completion_length': 160.55358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0058135986328125, 'epoch': 0.36} + 36%|███▋ | 910/2500 [5:28:40<11:16:27, 25.53s/it] 36%|███▋ | 911/2500 [5:29:03<10:53:38, 24.68s/it] {'loss': 0.0002, 'grad_norm': 0.2308136906452744, 'learning_rate': 6.356e-07, 'completion_length': 153.17858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0052642822265625, 'epoch': 0.36} + 36%|███▋ | 911/2500 [5:29:03<10:53:38, 24.68s/it] 36%|███▋ | 912/2500 [5:29:26<10:37:33, 24.09s/it] {'loss': 0.0002, 'grad_norm': 0.019608273491928264, 'learning_rate': 6.352e-07, 'completion_length': 159.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00450897216796875, 'epoch': 0.36} + 36%|███▋ | 912/2500 [5:29:26<10:37:33, 24.09s/it] 37%|███▋ | 913/2500 [5:29:49<10:27:34, 23.73s/it] {'loss': 0.0003, 'grad_norm': 0.5966688510164138, 'learning_rate': 6.348e-07, 'completion_length': 160.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.008148193359375, 'epoch': 0.37} + 37%|███▋ | 913/2500 [5:29:49<10:27:34, 23.73s/it] 37%|███▋ | 914/2500 [5:30:12<10:23:04, 23.57s/it] {'loss': 0.0003, 'grad_norm': 0.18645854317085744, 'learning_rate': 6.343999999999999e-07, 'completion_length': 169.33929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 
1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006866455078125, 'epoch': 0.37} + 37%|███▋ | 914/2500 [5:30:12<10:23:04, 23.57s/it] 37%|███▋ | 915/2500 [5:30:34<10:12:13, 23.18s/it] {'loss': 0.0003, 'grad_norm': 0.2765990046569151, 'learning_rate': 6.34e-07, 'completion_length': 161.96428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0070343017578125, 'epoch': 0.37} + 37%|███▋ | 915/2500 [5:30:34<10:12:13, 23.18s/it] 37%|███▋ | 916/2500 [5:30:58<10:16:55, 23.37s/it] {'loss': 0.0002, 'grad_norm': 0.48327071481974626, 'learning_rate': 6.336000000000001e-07, 'completion_length': 159.92858123779297, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.0052490234375, 'epoch': 0.37} + 37%|███▋ | 916/2500 [5:30:58<10:16:55, 23.37s/it] 37%|███▋ | 917/2500 [5:31:21<10:11:57, 23.20s/it] {'loss': 0.0003, 'grad_norm': 0.027701685396987023, 'learning_rate': 6.331999999999999e-07, 'completion_length': 145.94644165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00628662109375, 'epoch': 0.37} + 37%|███▋ | 917/2500 [5:31:21<10:11:57, 23.20s/it] 37%|███▋ | 918/2500 [5:31:42<10:00:38, 22.78s/it] {'loss': 0.0003, 'grad_norm': 0.42676449956521956, 'learning_rate': 6.328e-07, 'completion_length': 166.67857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0069580078125, 'epoch': 0.37} + 37%|███▋ | 918/2500 [5:31:42<10:00:38, 22.78s/it] 37%|███▋ | 919/2500 [5:32:04<9:50:47, 22.42s/it] {'loss': 0.0002, 'grad_norm': 0.01901539309732931, 'learning_rate': 6.324e-07, 'completion_length': 149.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 
'kl': 0.0057220458984375, 'epoch': 0.37} + 37%|███▋ | 919/2500 [5:32:04<9:50:47, 22.42s/it] 37%|███▋ | 920/2500 [5:32:27<9:52:36, 22.50s/it] {'loss': 0.0003, 'grad_norm': 0.03906386477882607, 'learning_rate': 6.319999999999999e-07, 'completion_length': 163.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00848388671875, 'epoch': 0.37} + 37%|███▋ | 920/2500 [5:32:27<9:52:36, 22.50s/it] 37%|███▋ | 921/2500 [5:32:51<10:07:00, 23.07s/it] {'loss': 0.0003, 'grad_norm': 0.0668044407589542, 'learning_rate': 6.316e-07, 'completion_length': 152.6964340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00677490234375, 'epoch': 0.37} + 37%|███▋ | 921/2500 [5:32:51<10:07:00, 23.07s/it] 37%|███▋ | 922/2500 [5:33:13<9:54:35, 22.61s/it] {'loss': 0.0003, 'grad_norm': 0.7509965598856448, 'learning_rate': 6.312e-07, 'completion_length': 150.14286422729492, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007598876953125, 'epoch': 0.37} + 37%|███▋ | 922/2500 [5:33:13<9:54:35, 22.61s/it] 37%|███▋ | 923/2500 [5:33:35<9:50:53, 22.48s/it] {'loss': 0.0004, 'grad_norm': 0.24337806858150995, 'learning_rate': 6.308e-07, 'completion_length': 157.71429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00946044921875, 'epoch': 0.37} + 37%|███▋ | 923/2500 [5:33:35<9:50:53, 22.48s/it] 37%|███▋ | 924/2500 [5:33:58<9:54:45, 22.64s/it] {'loss': 0.0002, 'grad_norm': 0.3758259500371858, 'learning_rate': 6.303999999999999e-07, 'completion_length': 158.01786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.006134033203125, 'epoch': 
0.37} + 37%|███▋ | 924/2500 [5:33:58<9:54:45, 22.64s/it] 37%|███▋ | 925/2500 [5:34:20<9:47:19, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.023813813492512978, 'learning_rate': 6.3e-07, 'completion_length': 148.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046539306640625, 'epoch': 0.37} + 37%|███▋ | 925/2500 [5:34:20<9:47:19, 22.37s/it] 37%|███▋ | 926/2500 [5:34:43<9:55:23, 22.70s/it] {'loss': 0.0003, 'grad_norm': 0.32693225762912126, 'learning_rate': 6.296e-07, 'completion_length': 174.08929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0072021484375, 'epoch': 0.37} + 37%|███▋ | 926/2500 [5:34:43<9:55:23, 22.70s/it] 37%|███▋ | 927/2500 [5:35:06<9:57:46, 22.80s/it] {'loss': 0.0002, 'grad_norm': 0.041041485996247504, 'learning_rate': 6.291999999999999e-07, 'completion_length': 158.58929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058746337890625, 'epoch': 0.37} + 37%|███▋ | 927/2500 [5:35:06<9:57:46, 22.80s/it] 37%|███▋ | 928/2500 [5:35:29<9:56:20, 22.76s/it] {'loss': 0.0003, 'grad_norm': 0.3823738058602852, 'learning_rate': 6.288e-07, 'completion_length': 147.62500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0078277587890625, 'epoch': 0.37} + 37%|███▋ | 928/2500 [5:35:29<9:56:20, 22.76s/it] 37%|███▋ | 929/2500 [5:35:51<9:48:47, 22.49s/it] {'loss': 0.0002, 'grad_norm': 0.30152457152818823, 'learning_rate': 6.283999999999999e-07, 'completion_length': 143.4107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00479888916015625, 'epoch': 0.37} + 37%|███▋ | 929/2500 [5:35:51<9:48:47, 22.49s/it] 
37%|███▋ | 930/2500 [5:36:15<10:05:47, 23.15s/it] {'loss': 0.0003, 'grad_norm': 0.36278681707004434, 'learning_rate': 6.28e-07, 'completion_length': 171.4107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0085601806640625, 'epoch': 0.37} + 37%|███▋ | 930/2500 [5:36:15<10:05:47, 23.15s/it] 37%|███▋ | 931/2500 [5:36:38<9:59:09, 22.91s/it] {'loss': 0.0003, 'grad_norm': 0.030910428928705897, 'learning_rate': 6.276e-07, 'completion_length': 158.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.008270263671875, 'epoch': 0.37} + 37%|███▋ | 931/2500 [5:36:38<9:59:09, 22.91s/it] 37%|███▋ | 932/2500 [5:36:59<9:45:23, 22.40s/it] {'loss': 0.0002, 'grad_norm': 0.5126149831091602, 'learning_rate': 6.271999999999999e-07, 'completion_length': 137.16072463989258, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00543212890625, 'epoch': 0.37} + 37%|███▋ | 932/2500 [5:36:59<9:45:23, 22.40s/it] 37%|███▋ | 933/2500 [5:37:21<9:39:20, 22.18s/it] {'loss': 0.0003, 'grad_norm': 0.5907053744117255, 'learning_rate': 6.268e-07, 'completion_length': 152.71429443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0072021484375, 'epoch': 0.37} + 37%|███▋ | 933/2500 [5:37:21<9:39:20, 22.18s/it] 37%|███▋ | 934/2500 [5:37:43<9:43:22, 22.35s/it] {'loss': 0.0002, 'grad_norm': 0.6863450952667449, 'learning_rate': 6.263999999999999e-07, 'completion_length': 149.98214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0052490234375, 'epoch': 0.37} + 37%|███▋ | 934/2500 [5:37:43<9:43:22, 22.35s/it] 37%|███▋ | 935/2500 
[5:38:05<9:36:37, 22.11s/it] {'loss': 0.0002, 'grad_norm': 0.03699230957194251, 'learning_rate': 6.26e-07, 'completion_length': 157.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0061798095703125, 'epoch': 0.37} + 37%|███▋ | 935/2500 [5:38:05<9:36:37, 22.11s/it] 37%|███▋ | 936/2500 [5:38:28<9:42:47, 22.36s/it] {'loss': 0.0004, 'grad_norm': 1.8679391018645228, 'learning_rate': 6.256e-07, 'completion_length': 161.7678680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1181928999722004, 'kl': 0.0087890625, 'epoch': 0.37} + 37%|███▋ | 936/2500 [5:38:28<9:42:47, 22.36s/it] 37%|███▋ | 937/2500 [5:38:50<9:40:19, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.6972974129773645, 'learning_rate': 6.252e-07, 'completion_length': 145.75000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.006134033203125, 'epoch': 0.37} + 37%|███▋ | 937/2500 [5:38:50<9:40:19, 22.28s/it] 38%|███▊ | 938/2500 [5:39:12<9:35:28, 22.11s/it] {'loss': 0.0002, 'grad_norm': 0.5100502237789637, 'learning_rate': 6.248e-07, 'completion_length': 141.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048065185546875, 'epoch': 0.38} + 38%|███▊ | 938/2500 [5:39:12<9:35:28, 22.11s/it] 38%|███▊ | 939/2500 [5:39:34<9:41:01, 22.33s/it] {'loss': 0.0003, 'grad_norm': 0.21687831165971086, 'learning_rate': 6.243999999999999e-07, 'completion_length': 157.75000762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0064239501953125, 'epoch': 0.38} + 38%|███▊ | 939/2500 [5:39:34<9:41:01, 22.33s/it] 38%|███▊ | 940/2500 [5:39:58<9:46:31, 22.56s/it] 
{'loss': 0.0003, 'grad_norm': 0.3871853295665238, 'learning_rate': 6.24e-07, 'completion_length': 165.9107208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.007568359375, 'epoch': 0.38} + 38%|███▊ | 940/2500 [5:39:58<9:46:31, 22.56s/it] 38%|███▊ | 941/2500 [5:40:20<9:45:57, 22.55s/it] {'loss': 0.0002, 'grad_norm': 0.5044903185665068, 'learning_rate': 6.236e-07, 'completion_length': 157.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00528717041015625, 'epoch': 0.38} + 38%|███▊ | 941/2500 [5:40:20<9:45:57, 22.55s/it] 38%|███▊ | 942/2500 [5:40:42<9:39:50, 22.33s/it] {'loss': 0.0002, 'grad_norm': 0.015215303484394831, 'learning_rate': 6.231999999999999e-07, 'completion_length': 149.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004791259765625, 'epoch': 0.38} + 38%|███▊ | 942/2500 [5:40:42<9:39:50, 22.33s/it] 38%|███▊ | 943/2500 [5:41:04<9:37:38, 22.26s/it] {'loss': 0.0003, 'grad_norm': 0.6574649143312097, 'learning_rate': 6.228e-07, 'completion_length': 154.01786041259766, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0078887939453125, 'epoch': 0.38} + 38%|███▊ | 943/2500 [5:41:04<9:37:38, 22.26s/it] 38%|███▊ | 944/2500 [5:41:26<9:32:49, 22.09s/it] {'loss': 0.0003, 'grad_norm': 0.17191246124358922, 'learning_rate': 6.224e-07, 'completion_length': 161.4464340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.007080078125, 'epoch': 0.38} + 38%|███▊ | 944/2500 [5:41:26<9:32:49, 22.09s/it] 38%|███▊ | 945/2500 [5:41:47<9:29:07, 21.96s/it] {'loss': 0.0002, 'grad_norm': 
0.24215819390218057, 'learning_rate': 6.219999999999999e-07, 'completion_length': 135.67858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005462646484375, 'epoch': 0.38} + 38%|███▊ | 945/2500 [5:41:47<9:29:07, 21.96s/it] 38%|███▊ | 946/2500 [5:42:10<9:37:05, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.47566601655676477, 'learning_rate': 6.216e-07, 'completion_length': 149.76786041259766, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.00470733642578125, 'epoch': 0.38} + 38%|███▊ | 946/2500 [5:42:10<9:37:05, 22.28s/it] 38%|███▊ | 947/2500 [5:42:34<9:46:08, 22.65s/it] {'loss': 0.0004, 'grad_norm': 0.8664198341143863, 'learning_rate': 6.212e-07, 'completion_length': 173.71428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.010772705078125, 'epoch': 0.38} + 38%|███▊ | 947/2500 [5:42:34<9:46:08, 22.65s/it] 38%|███▊ | 948/2500 [5:42:57<9:50:38, 22.83s/it] {'loss': 0.0002, 'grad_norm': 0.4231574983593713, 'learning_rate': 6.208e-07, 'completion_length': 167.08929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00579833984375, 'epoch': 0.38} + 38%|███▊ | 948/2500 [5:42:57<9:50:38, 22.83s/it] 38%|███▊ | 949/2500 [5:43:19<9:45:07, 22.64s/it] {'loss': 0.0002, 'grad_norm': 0.026552428464933178, 'learning_rate': 6.203999999999999e-07, 'completion_length': 148.75000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005126953125, 'epoch': 0.38} + 38%|███▊ | 949/2500 [5:43:19<9:45:07, 22.64s/it] 38%|███▊ | 950/2500 [5:43:42<9:46:18, 22.70s/it] {'loss': 0.0002, 
'grad_norm': 0.379188010269468, 'learning_rate': 6.2e-07, 'completion_length': 149.42858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00615692138671875, 'epoch': 0.38} + 38%|███▊ | 950/2500 [5:43:42<9:46:18, 22.70s/it] 38%|███▊ | 951/2500 [5:44:04<9:40:43, 22.49s/it] {'loss': 0.0003, 'grad_norm': 0.02981746134054584, 'learning_rate': 6.196e-07, 'completion_length': 162.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006683349609375, 'epoch': 0.38} + 38%|███▊ | 951/2500 [5:44:04<9:40:43, 22.49s/it] 38%|███▊ | 952/2500 [5:44:27<9:40:40, 22.51s/it] {'loss': 0.0002, 'grad_norm': 0.24090958169209753, 'learning_rate': 6.191999999999999e-07, 'completion_length': 160.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005615234375, 'epoch': 0.38} + 38%|███▊ | 952/2500 [5:44:27<9:40:40, 22.51s/it] 38%|███▊ | 953/2500 [5:44:48<9:33:29, 22.24s/it] {'loss': 0.0002, 'grad_norm': 0.03709063090268077, 'learning_rate': 6.188e-07, 'completion_length': 141.6964340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0059661865234375, 'epoch': 0.38} + 38%|███▊ | 953/2500 [5:44:48<9:33:29, 22.24s/it] 38%|███▊ | 954/2500 [5:45:10<9:25:20, 21.94s/it] {'loss': 0.0003, 'grad_norm': 0.7380252981423921, 'learning_rate': 6.183999999999999e-07, 'completion_length': 153.44644165039062, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006317138671875, 'epoch': 0.38} + 38%|███▊ | 954/2500 [5:45:10<9:25:20, 21.94s/it] 38%|███▊ | 955/2500 [5:45:31<9:20:46, 21.78s/it] {'loss': 0.0002, 'grad_norm': 0.6990290587720696, 
'learning_rate': 6.18e-07, 'completion_length': 133.21429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00458526611328125, 'epoch': 0.38} + 38%|███▊ | 955/2500 [5:45:31<9:20:46, 21.78s/it] 38%|███▊ | 956/2500 [5:45:52<9:14:36, 21.55s/it] {'loss': 0.0002, 'grad_norm': 0.021038633644658553, 'learning_rate': 6.176e-07, 'completion_length': 143.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058135986328125, 'epoch': 0.38} + 38%|███▊ | 956/2500 [5:45:52<9:14:36, 21.55s/it] 38%|███▊ | 957/2500 [5:46:15<9:24:04, 21.93s/it] {'loss': 0.0003, 'grad_norm': 0.7325777719073886, 'learning_rate': 6.172e-07, 'completion_length': 167.80358123779297, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.006805419921875, 'epoch': 0.38} + 38%|███▊ | 957/2500 [5:46:15<9:24:04, 21.93s/it] 38%|███▊ | 958/2500 [5:46:36<9:19:31, 21.77s/it] {'loss': 0.0002, 'grad_norm': 0.04535581410941563, 'learning_rate': 6.168e-07, 'completion_length': 151.25, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052490234375, 'epoch': 0.38} + 38%|███▊ | 958/2500 [5:46:36<9:19:31, 21.77s/it] 38%|███▊ | 959/2500 [5:46:59<9:25:39, 22.02s/it] {'loss': 0.0003, 'grad_norm': 1.2599130064675497, 'learning_rate': 6.163999999999999e-07, 'completion_length': 172.55358123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00634765625, 'epoch': 0.38} + 38%|███▊ | 959/2500 [5:46:59<9:25:39, 22.02s/it] 38%|███▊ | 960/2500 [5:47:21<9:27:22, 22.11s/it] {'loss': 0.0003, 'grad_norm': 0.5712573844906791, 'learning_rate': 6.16e-07, 'completion_length': 158.0357208251953, 
'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.006927490234375, 'epoch': 0.38} + 38%|███▊ | 960/2500 [5:47:21<9:27:22, 22.11s/it] 38%|███▊ | 961/2500 [5:47:43<9:28:50, 22.18s/it] {'loss': 0.0004, 'grad_norm': 0.20350526474612957, 'learning_rate': 6.156e-07, 'completion_length': 164.92857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00958251953125, 'epoch': 0.38} + 38%|███▊ | 961/2500 [5:47:43<9:28:50, 22.18s/it] 38%|███▊ | 962/2500 [5:48:05<9:26:31, 22.10s/it] {'loss': 0.0002, 'grad_norm': 0.03426585989140815, 'learning_rate': 6.152e-07, 'completion_length': 143.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00579833984375, 'epoch': 0.38} + 38%|███▊ | 962/2500 [5:48:05<9:26:31, 22.10s/it] 39%|███▊ | 963/2500 [5:48:27<9:25:16, 22.07s/it] {'loss': 0.0002, 'grad_norm': 1.941657568799895, 'learning_rate': 6.148e-07, 'completion_length': 159.0, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.00616455078125, 'epoch': 0.39} + 39%|███▊ | 963/2500 [5:48:27<9:25:16, 22.07s/it] 39%|███▊ | 964/2500 [5:48:51<9:35:43, 22.49s/it] {'loss': 0.0003, 'grad_norm': 0.43959931687107334, 'learning_rate': 6.143999999999999e-07, 'completion_length': 179.03572845458984, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0069732666015625, 'epoch': 0.39} + 39%|███▊ | 964/2500 [5:48:51<9:35:43, 22.49s/it] 39%|███▊ | 965/2500 [5:49:15<9:51:00, 23.10s/it] {'loss': 0.0002, 'grad_norm': 0.5804982148336199, 'learning_rate': 6.14e-07, 'completion_length': 175.19644165039062, 'rewards/accuracy_reward': 0.9107142984867096, 
'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0062408447265625, 'epoch': 0.39} + 39%|███▊ | 965/2500 [5:49:15<9:51:00, 23.10s/it] 39%|███▊ | 966/2500 [5:49:39<9:51:04, 23.12s/it] {'loss': 0.0003, 'grad_norm': 0.5886633122426917, 'learning_rate': 6.136e-07, 'completion_length': 155.3214340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0064849853515625, 'epoch': 0.39} + 39%|███▊ | 966/2500 [5:49:39<9:51:04, 23.12s/it] 39%|███▊ | 967/2500 [5:50:01<9:45:56, 22.93s/it] {'loss': 0.0003, 'grad_norm': 0.057160461144246076, 'learning_rate': 6.131999999999999e-07, 'completion_length': 172.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.008636474609375, 'epoch': 0.39} + 39%|███▊ | 967/2500 [5:50:01<9:45:56, 22.93s/it] 39%|███▊ | 968/2500 [5:50:24<9:43:49, 22.87s/it] {'loss': 0.0003, 'grad_norm': 0.4745953181588086, 'learning_rate': 6.128e-07, 'completion_length': 163.67858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0065765380859375, 'epoch': 0.39} + 39%|███▊ | 968/2500 [5:50:24<9:43:49, 22.87s/it] 39%|███▉ | 969/2500 [5:50:46<9:41:08, 22.77s/it] {'loss': 0.0004, 'grad_norm': 1.1128179606114954, 'learning_rate': 6.124000000000001e-07, 'completion_length': 171.60714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.009185791015625, 'epoch': 0.39} + 39%|███▉ | 969/2500 [5:50:46<9:41:08, 22.77s/it] 39%|███▉ | 970/2500 [5:51:10<9:44:25, 22.92s/it] {'loss': 0.0003, 'grad_norm': 0.4661422693134536, 'learning_rate': 6.119999999999999e-07, 'completion_length': 166.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0073089599609375, 'epoch': 0.39} + 39%|███▉ | 970/2500 [5:51:10<9:44:25, 22.92s/it] 39%|███▉ | 971/2500 [5:51:33<9:46:37, 23.02s/it] {'loss': 0.0002, 'grad_norm': 0.02203774871911074, 'learning_rate': 6.116e-07, 'completion_length': 163.80358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00579833984375, 'epoch': 0.39} + 39%|███▉ | 971/2500 [5:51:33<9:46:37, 23.02s/it] 39%|███▉ | 972/2500 [5:51:54<9:30:50, 22.42s/it] {'loss': 0.0002, 'grad_norm': 0.017663377714241316, 'learning_rate': 6.112e-07, 'completion_length': 146.8928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045013427734375, 'epoch': 0.39} + 39%|███▉ | 972/2500 [5:51:54<9:30:50, 22.42s/it] 39%|███▉ | 973/2500 [5:52:16<9:27:03, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.18850422354061483, 'learning_rate': 6.107999999999999e-07, 'completion_length': 148.21429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00545501708984375, 'epoch': 0.39} + 39%|███▉ | 973/2500 [5:52:16<9:27:03, 22.28s/it] 39%|███▉ | 974/2500 [5:52:39<9:33:34, 22.55s/it] {'loss': 0.0003, 'grad_norm': 0.9784950423781187, 'learning_rate': 6.104e-07, 'completion_length': 162.05358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.1428571492433548, 'kl': 0.0067901611328125, 'epoch': 0.39} + 39%|███▉ | 974/2500 [5:52:39<9:33:34, 22.55s/it] 39%|███▉ | 975/2500 [5:53:02<9:35:09, 22.63s/it] {'loss': 0.0002, 'grad_norm': 0.01661992239071432, 'learning_rate': 6.1e-07, 'completion_length': 148.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.0045623779296875, 'epoch': 0.39} + 39%|███▉ | 975/2500 [5:53:02<9:35:09, 22.63s/it] 39%|███▉ | 976/2500 [5:53:25<9:39:23, 22.81s/it] {'loss': 0.0002, 'grad_norm': 0.020963158163929494, 'learning_rate': 6.096e-07, 'completion_length': 159.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00441741943359375, 'epoch': 0.39} + 39%|███▉ | 976/2500 [5:53:25<9:39:23, 22.81s/it] 39%|███▉ | 977/2500 [5:53:47<9:32:28, 22.55s/it] {'loss': 0.0003, 'grad_norm': 0.0269801288583228, 'learning_rate': 6.091999999999999e-07, 'completion_length': 150.83929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.008087158203125, 'epoch': 0.39} + 39%|███▉ | 977/2500 [5:53:47<9:32:28, 22.55s/it] 39%|███▉ | 978/2500 [5:54:11<9:43:49, 23.02s/it] {'loss': 0.0003, 'grad_norm': 0.2218110024458763, 'learning_rate': 6.088e-07, 'completion_length': 179.71429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0068206787109375, 'epoch': 0.39} + 39%|███▉ | 978/2500 [5:54:11<9:43:49, 23.02s/it] 39%|███▉ | 979/2500 [5:54:34<9:41:28, 22.94s/it] {'loss': 0.0003, 'grad_norm': 1.935018979317115, 'learning_rate': 6.084000000000001e-07, 'completion_length': 145.42858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00689697265625, 'epoch': 0.39} + 39%|███▉ | 979/2500 [5:54:34<9:41:28, 22.94s/it] 39%|███▉ | 980/2500 [5:54:57<9:44:53, 23.09s/it] {'loss': 0.0003, 'grad_norm': 0.5139734405955252, 'learning_rate': 6.079999999999999e-07, 'completion_length': 162.0357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 
'kl': 0.006683349609375, 'epoch': 0.39} + 39%|███▉ | 980/2500 [5:54:57<9:44:53, 23.09s/it] 39%|███▉ | 981/2500 [5:55:19<9:34:05, 22.68s/it] {'loss': 0.0002, 'grad_norm': 0.29647226119343223, 'learning_rate': 6.076e-07, 'completion_length': 150.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00531005859375, 'epoch': 0.39} + 39%|███▉ | 981/2500 [5:55:19<9:34:05, 22.68s/it] 39%|███▉ | 982/2500 [5:55:40<9:24:27, 22.31s/it] {'loss': 0.0002, 'grad_norm': 0.0580392870044856, 'learning_rate': 6.072e-07, 'completion_length': 152.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005828857421875, 'epoch': 0.39} + 39%|███▉ | 982/2500 [5:55:40<9:24:27, 22.31s/it] 39%|███▉ | 983/2500 [5:56:02<9:21:50, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.01922813617391873, 'learning_rate': 6.068e-07, 'completion_length': 153.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004608154296875, 'epoch': 0.39} + 39%|███▉ | 983/2500 [5:56:02<9:21:50, 22.22s/it] 39%|███▉ | 984/2500 [5:56:25<9:23:09, 22.29s/it] {'loss': 0.0003, 'grad_norm': 0.42884329291799583, 'learning_rate': 6.064e-07, 'completion_length': 155.46429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007110595703125, 'epoch': 0.39} + 39%|███▉ | 984/2500 [5:56:25<9:23:09, 22.29s/it] 39%|███▉ | 985/2500 [5:56:47<9:23:43, 22.33s/it] {'loss': 0.0003, 'grad_norm': 0.2876607704246685, 'learning_rate': 6.06e-07, 'completion_length': 165.1071548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007537841796875, 'epoch': 0.39} + 39%|███▉ | 985/2500 [5:56:47<9:23:43, 22.33s/it] 
39%|███▉ | 986/2500 [5:57:10<9:23:06, 22.32s/it] {'loss': 0.0002, 'grad_norm': 0.4373474547791952, 'learning_rate': 6.056e-07, 'completion_length': 164.42857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00555419921875, 'epoch': 0.39} + 39%|███▉ | 986/2500 [5:57:10<9:23:06, 22.32s/it] 39%|███▉ | 987/2500 [5:57:33<9:34:08, 22.77s/it] {'loss': 0.0003, 'grad_norm': 0.2698921031224709, 'learning_rate': 6.051999999999999e-07, 'completion_length': 167.57144165039062, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0070343017578125, 'epoch': 0.39} + 39%|███▉ | 987/2500 [5:57:33<9:34:08, 22.77s/it] 40%|███▉ | 988/2500 [5:57:57<9:37:33, 22.92s/it] {'loss': 0.0004, 'grad_norm': 0.23263821322539, 'learning_rate': 6.048e-07, 'completion_length': 172.76786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.009918212890625, 'epoch': 0.4} + 40%|███▉ | 988/2500 [5:57:57<9:37:33, 22.92s/it] 40%|███▉ | 989/2500 [5:58:20<9:41:35, 23.09s/it] {'loss': 0.0003, 'grad_norm': 0.6813002929246279, 'learning_rate': 6.044e-07, 'completion_length': 172.46429443359375, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.008636474609375, 'epoch': 0.4} + 40%|███▉ | 989/2500 [5:58:20<9:41:35, 23.09s/it] 40%|███▉ | 990/2500 [5:58:42<9:32:09, 22.73s/it] {'loss': 0.0002, 'grad_norm': 0.04095635268471329, 'learning_rate': 6.04e-07, 'completion_length': 158.12500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006072998046875, 'epoch': 0.4} + 40%|███▉ | 990/2500 [5:58:42<9:32:09, 22.73s/it] 40%|███▉ | 991/2500 [5:59:05<9:31:21, 
22.72s/it] {'loss': 0.0002, 'grad_norm': 0.025177424887971377, 'learning_rate': 6.036e-07, 'completion_length': 162.5178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0061798095703125, 'epoch': 0.4} + 40%|███▉ | 991/2500 [5:59:05<9:31:21, 22.72s/it] 40%|███▉ | 992/2500 [5:59:27<9:29:13, 22.65s/it] {'loss': 0.0002, 'grad_norm': 0.06400397237904598, 'learning_rate': 6.031999999999999e-07, 'completion_length': 150.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050048828125, 'epoch': 0.4} + 40%|███▉ | 992/2500 [5:59:27<9:29:13, 22.65s/it] 40%|███▉ | 993/2500 [5:59:50<9:31:26, 22.75s/it] {'loss': 0.0002, 'grad_norm': 0.022964045589677042, 'learning_rate': 6.028e-07, 'completion_length': 146.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054931640625, 'epoch': 0.4} + 40%|███▉ | 993/2500 [5:59:50<9:31:26, 22.75s/it] 40%|███▉ | 994/2500 [6:00:14<9:36:13, 22.96s/it] {'loss': 0.0003, 'grad_norm': 0.042683799282412435, 'learning_rate': 6.024e-07, 'completion_length': 173.33928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.006591796875, 'epoch': 0.4} + 40%|███▉ | 994/2500 [6:00:14<9:36:13, 22.96s/it] 40%|███▉ | 995/2500 [6:00:36<9:33:20, 22.86s/it] {'loss': 0.0003, 'grad_norm': 0.1798502360684131, 'learning_rate': 6.019999999999999e-07, 'completion_length': 164.62500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006927490234375, 'epoch': 0.4} + 40%|███▉ | 995/2500 [6:00:36<9:33:20, 22.86s/it] 40%|███▉ | 996/2500 [6:00:58<9:24:49, 22.53s/it] {'loss': 0.0002, 'grad_norm': 0.40333588904744144, 'learning_rate': 6.016e-07, 'completion_length': 147.5178680419922, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.004638671875, 'epoch': 0.4} + 40%|███▉ | 996/2500 [6:00:58<9:24:49, 22.53s/it] 40%|███▉ | 997/2500 [6:01:20<9:20:33, 22.38s/it] {'loss': 0.0003, 'grad_norm': 4.005195728311249, 'learning_rate': 6.012e-07, 'completion_length': 166.28572845458984, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.15943220257759094, 'kl': 0.0075531005859375, 'epoch': 0.4} + 40%|███▉ | 997/2500 [6:01:20<9:20:33, 22.38s/it] 40%|███▉ | 998/2500 [6:01:42<9:17:34, 22.27s/it] {'loss': 0.0003, 'grad_norm': 0.2745768918907917, 'learning_rate': 6.007999999999999e-07, 'completion_length': 145.08928680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0063629150390625, 'epoch': 0.4} + 40%|███▉ | 998/2500 [6:01:42<9:17:34, 22.27s/it] 40%|███▉ | 999/2500 [6:02:04<9:14:36, 22.17s/it] {'loss': 0.0003, 'grad_norm': 1.1957033363659535, 'learning_rate': 6.004e-07, 'completion_length': 161.62500762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.00771331787109375, 'epoch': 0.4} + 40%|███▉ | 999/2500 [6:02:04<9:14:36, 22.17s/it] 40%|████ | 1000/2500 [6:02:28<9:27:10, 22.69s/it] {'loss': 0.0002, 'grad_norm': 0.27634228956157564, 'learning_rate': 6e-07, 'completion_length': 155.76786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0062408447265625, 'epoch': 0.4} + 40%|████ | 1000/2500 [6:02:28<9:27:10, 22.69s/it] 40%|████ | 1001/2500 [6:05:58<32:50:03, 78.85s/it] {'loss': 0.0004, 'grad_norm': 0.41725294859726336, 'learning_rate': 5.995999999999999e-07, 
'completion_length': 179.41072845458984, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.1181928962469101, 'kl': 0.0087738037109375, 'epoch': 0.4} + 40%|████ | 1001/2500 [6:05:58<32:50:03, 78.85s/it] 40%|████ | 1002/2500 [6:06:22<25:55:30, 62.30s/it] {'loss': 0.0003, 'grad_norm': 0.15365003815347683, 'learning_rate': 5.991999999999999e-07, 'completion_length': 159.92858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0070037841796875, 'epoch': 0.4} + 40%|████ | 1002/2500 [6:06:22<25:55:30, 62.30s/it] 40%|████ | 1003/2500 [6:06:43<20:49:41, 50.09s/it] {'loss': 0.0002, 'grad_norm': 0.7758584258870711, 'learning_rate': 5.988e-07, 'completion_length': 154.1964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0062103271484375, 'epoch': 0.4} + 40%|████ | 1003/2500 [6:06:43<20:49:41, 50.09s/it] 40%|████ | 1004/2500 [6:07:05<17:21:23, 41.77s/it] {'loss': 0.0003, 'grad_norm': 0.07546415509374156, 'learning_rate': 5.984000000000001e-07, 'completion_length': 165.58929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.008697509765625, 'epoch': 0.4} + 40%|████ | 1004/2500 [6:07:05<17:21:23, 41.77s/it] 40%|████ | 1005/2500 [6:07:29<15:01:18, 36.17s/it] {'loss': 0.0002, 'grad_norm': 0.4416578908613791, 'learning_rate': 5.979999999999999e-07, 'completion_length': 149.3928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004180908203125, 'epoch': 0.4} + 40%|████ | 1005/2500 [6:07:29<15:01:18, 36.17s/it] 40%|████ | 1006/2500 [6:07:51<13:15:45, 31.96s/it] {'loss': 0.0003, 'grad_norm': 0.23106948313598652, 'learning_rate': 
5.976e-07, 'completion_length': 168.3928680419922, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0072021484375, 'epoch': 0.4} + 40%|████ | 1006/2500 [6:07:51<13:15:45, 31.96s/it] 40%|████ | 1007/2500 [6:08:13<12:06:15, 29.19s/it] {'loss': 0.0002, 'grad_norm': 0.3587055732945805, 'learning_rate': 5.972e-07, 'completion_length': 155.6071548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00579833984375, 'epoch': 0.4} + 40%|████ | 1007/2500 [6:08:13<12:06:15, 29.19s/it] 40%|████ | 1008/2500 [6:08:37<11:24:45, 27.54s/it] {'loss': 0.0003, 'grad_norm': 0.26638698131019584, 'learning_rate': 5.967999999999999e-07, 'completion_length': 165.10714721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0075836181640625, 'epoch': 0.4} + 40%|████ | 1008/2500 [6:08:37<11:24:45, 27.54s/it] 40%|████ | 1009/2500 [6:09:01<10:54:37, 26.34s/it] {'loss': 0.0003, 'grad_norm': 0.34784365668384276, 'learning_rate': 5.964e-07, 'completion_length': 169.83929443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.007354736328125, 'epoch': 0.4} + 40%|████ | 1009/2500 [6:09:01<10:54:37, 26.34s/it] 40%|████ | 1010/2500 [6:09:22<10:16:58, 24.84s/it] {'loss': 0.0002, 'grad_norm': 0.022173595814055914, 'learning_rate': 5.96e-07, 'completion_length': 147.92858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004302978515625, 'epoch': 0.4} + 40%|████ | 1010/2500 [6:09:22<10:16:58, 24.84s/it] 40%|████ | 1011/2500 [6:09:43<9:49:28, 23.75s/it] {'loss': 0.0002, 'grad_norm': 0.02722421834481714, 'learning_rate': 5.956e-07, 
'completion_length': 143.30358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00555419921875, 'epoch': 0.4} + 40%|████ | 1011/2500 [6:09:43<9:49:28, 23.75s/it] 40%|████ | 1012/2500 [6:10:05<9:37:29, 23.29s/it] {'loss': 0.0003, 'grad_norm': 0.2900039014305858, 'learning_rate': 5.951999999999999e-07, 'completion_length': 160.8214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.007537841796875, 'epoch': 0.4} + 40%|████ | 1012/2500 [6:10:05<9:37:29, 23.29s/it] 41%|████ | 1013/2500 [6:10:28<9:33:52, 23.16s/it] {'loss': 0.0003, 'grad_norm': 0.3891554111856143, 'learning_rate': 5.948e-07, 'completion_length': 163.67858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0071563720703125, 'epoch': 0.41} + 41%|████ | 1013/2500 [6:10:28<9:33:52, 23.16s/it] 41%|████ | 1014/2500 [6:10:51<9:31:17, 23.07s/it] {'loss': 0.0003, 'grad_norm': 0.023510672288079158, 'learning_rate': 5.944e-07, 'completion_length': 159.8214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.006439208984375, 'epoch': 0.41} + 41%|████ | 1014/2500 [6:10:51<9:31:17, 23.07s/it] 41%|████ | 1015/2500 [6:11:13<9:23:03, 22.75s/it] {'loss': 0.0003, 'grad_norm': 1.0569710078280636, 'learning_rate': 5.939999999999999e-07, 'completion_length': 158.10714721679688, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.00677490234375, 'epoch': 0.41} + 41%|████ | 1015/2500 [6:11:13<9:23:03, 22.75s/it] 41%|████ | 1016/2500 [6:11:36<9:20:00, 22.64s/it] {'loss': 0.0003, 'grad_norm': 0.036210256035175824, 'learning_rate': 5.936e-07, 'completion_length': 
161.08929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0070953369140625, 'epoch': 0.41} + 41%|████ | 1016/2500 [6:11:36<9:20:00, 22.64s/it] 41%|████ | 1017/2500 [6:11:58<9:20:44, 22.69s/it] {'loss': 0.0004, 'grad_norm': 0.5432732767735219, 'learning_rate': 5.931999999999999e-07, 'completion_length': 174.01786041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.009002685546875, 'epoch': 0.41} + 41%|████ | 1017/2500 [6:11:58<9:20:44, 22.69s/it] 41%|████ | 1018/2500 [6:12:22<9:29:50, 23.07s/it] {'loss': 0.0003, 'grad_norm': 0.7683551295618154, 'learning_rate': 5.928e-07, 'completion_length': 163.62500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0074005126953125, 'epoch': 0.41} + 41%|████ | 1018/2500 [6:12:22<9:29:50, 23.07s/it] 41%|████ | 1019/2500 [6:12:44<9:19:04, 22.65s/it] {'loss': 0.0003, 'grad_norm': 0.34518682283515323, 'learning_rate': 5.924e-07, 'completion_length': 166.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.008514404296875, 'epoch': 0.41} + 41%|████ | 1019/2500 [6:12:44<9:19:04, 22.65s/it] 41%|████ | 1020/2500 [6:13:06<9:15:13, 22.51s/it] {'loss': 0.0003, 'grad_norm': 0.03138047463851904, 'learning_rate': 5.919999999999999e-07, 'completion_length': 169.62500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0071563720703125, 'epoch': 0.41} + 41%|████ | 1020/2500 [6:13:06<9:15:13, 22.51s/it] 41%|████ | 1021/2500 [6:13:28<9:07:59, 22.23s/it] {'loss': 0.0003, 'grad_norm': 0.4156276317104405, 'learning_rate': 5.916e-07, 
'completion_length': 148.60714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0066070556640625, 'epoch': 0.41} + 41%|████ | 1021/2500 [6:13:28<9:07:59, 22.23s/it] 41%|████ | 1022/2500 [6:13:50<9:08:31, 22.27s/it] {'loss': 0.0002, 'grad_norm': 0.4692542788979825, 'learning_rate': 5.911999999999999e-07, 'completion_length': 155.37500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0058441162109375, 'epoch': 0.41} + 41%|████ | 1022/2500 [6:13:50<9:08:31, 22.27s/it] 41%|████ | 1023/2500 [6:14:11<8:59:46, 21.93s/it] {'loss': 0.0002, 'grad_norm': 0.8385029708433411, 'learning_rate': 5.907999999999999e-07, 'completion_length': 143.7857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004180908203125, 'epoch': 0.41} + 41%|████ | 1023/2500 [6:14:11<8:59:46, 21.93s/it] 41%|████ | 1024/2500 [6:14:32<8:53:15, 21.68s/it] {'loss': 0.0002, 'grad_norm': 0.024318600995834425, 'learning_rate': 5.904e-07, 'completion_length': 159.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005645751953125, 'epoch': 0.41} + 41%|████ | 1024/2500 [6:14:32<8:53:15, 21.68s/it] 41%|████ | 1025/2500 [6:14:54<8:50:22, 21.57s/it] {'loss': 0.0003, 'grad_norm': 0.03122919568164347, 'learning_rate': 5.9e-07, 'completion_length': 163.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007598876953125, 'epoch': 0.41} + 41%|████ | 1025/2500 [6:14:54<8:50:22, 21.57s/it] 41%|████ | 1026/2500 [6:15:15<8:51:40, 21.64s/it] {'loss': 0.0002, 'grad_norm': 0.03337205630257408, 'learning_rate': 5.896e-07, 'completion_length': 158.0357208251953, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054931640625, 'epoch': 0.41} + 41%|████ | 1026/2500 [6:15:15<8:51:40, 21.64s/it] 41%|████ | 1027/2500 [6:15:36<8:45:55, 21.42s/it] {'loss': 0.0002, 'grad_norm': 0.24219083764288182, 'learning_rate': 5.891999999999999e-07, 'completion_length': 140.6607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0057220458984375, 'epoch': 0.41} + 41%|████ | 1027/2500 [6:15:36<8:45:55, 21.42s/it] 41%|████ | 1028/2500 [6:15:58<8:44:37, 21.38s/it] {'loss': 0.0003, 'grad_norm': 0.35194389214474486, 'learning_rate': 5.888e-07, 'completion_length': 159.6428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0066070556640625, 'epoch': 0.41} + 41%|████ | 1028/2500 [6:15:58<8:44:37, 21.38s/it] 41%|████ | 1029/2500 [6:16:18<8:37:17, 21.10s/it] {'loss': 0.0002, 'grad_norm': 0.01849987093267711, 'learning_rate': 5.884000000000001e-07, 'completion_length': 145.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00467681884765625, 'epoch': 0.41} + 41%|████ | 1029/2500 [6:16:18<8:37:17, 21.10s/it] 41%|████ | 1030/2500 [6:16:41<8:49:53, 21.63s/it] {'loss': 0.0003, 'grad_norm': 1.0890154837404928, 'learning_rate': 5.879999999999999e-07, 'completion_length': 163.7857208251953, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0067901611328125, 'epoch': 0.41} + 41%|████ | 1030/2500 [6:16:41<8:49:53, 21.63s/it] 41%|████ | 1031/2500 [6:17:03<8:54:25, 21.83s/it] {'loss': 0.0003, 'grad_norm': 0.02469331243211302, 'learning_rate': 5.876e-07, 'completion_length': 173.53572845458984, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00775146484375, 'epoch': 0.41} + 41%|████ | 1031/2500 [6:17:03<8:54:25, 21.83s/it] 41%|████▏ | 1032/2500 [6:17:25<8:52:50, 21.78s/it] {'loss': 0.0002, 'grad_norm': 0.7213457354243459, 'learning_rate': 5.872000000000001e-07, 'completion_length': 160.33929443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0055084228515625, 'epoch': 0.41} + 41%|████▏ | 1032/2500 [6:17:25<8:52:50, 21.78s/it] 41%|████▏ | 1033/2500 [6:17:46<8:48:59, 21.64s/it] {'loss': 0.0001, 'grad_norm': 0.025150640095072847, 'learning_rate': 5.867999999999999e-07, 'completion_length': 147.0178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00301361083984375, 'epoch': 0.41} + 41%|████▏ | 1033/2500 [6:17:46<8:48:59, 21.64s/it] 41%|████▏ | 1034/2500 [6:18:08<8:50:07, 21.70s/it] {'loss': 0.0003, 'grad_norm': 1.4655230249152822, 'learning_rate': 5.864e-07, 'completion_length': 171.33928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0077972412109375, 'epoch': 0.41} + 41%|████▏ | 1034/2500 [6:18:08<8:50:07, 21.70s/it] 41%|████▏ | 1035/2500 [6:18:30<8:53:04, 21.83s/it] {'loss': 0.0003, 'grad_norm': 1.2915001017333294, 'learning_rate': 5.86e-07, 'completion_length': 163.37500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0082550048828125, 'epoch': 0.41} + 41%|████▏ | 1035/2500 [6:18:30<8:53:04, 21.83s/it] 41%|████▏ | 1036/2500 [6:18:51<8:43:57, 21.47s/it] {'loss': 0.0002, 'grad_norm': 0.16016080723556783, 'learning_rate': 5.856e-07, 'completion_length': 146.80358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 
1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005340576171875, 'epoch': 0.41} + 41%|████▏ | 1036/2500 [6:18:51<8:43:57, 21.47s/it] 41%|████▏ | 1037/2500 [6:19:13<8:49:37, 21.72s/it] {'loss': 0.0002, 'grad_norm': 0.1820244042242979, 'learning_rate': 5.852e-07, 'completion_length': 152.73214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005462646484375, 'epoch': 0.41} + 41%|████▏ | 1037/2500 [6:19:13<8:49:37, 21.72s/it] 42%|████▏ | 1038/2500 [6:19:34<8:41:20, 21.40s/it] {'loss': 0.0002, 'grad_norm': 0.24338665138901888, 'learning_rate': 5.848e-07, 'completion_length': 143.4107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00614166259765625, 'epoch': 0.42} + 42%|████▏ | 1038/2500 [6:19:34<8:41:20, 21.40s/it] 42%|████▏ | 1039/2500 [6:19:56<8:46:51, 21.64s/it] {'loss': 0.0002, 'grad_norm': 0.05512359376591369, 'learning_rate': 5.844e-07, 'completion_length': 166.6428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006134033203125, 'epoch': 0.42} + 42%|████▏ | 1039/2500 [6:19:56<8:46:51, 21.64s/it] 42%|████▏ | 1040/2500 [6:20:17<8:44:19, 21.55s/it] {'loss': 0.0003, 'grad_norm': 0.027104700177714897, 'learning_rate': 5.839999999999999e-07, 'completion_length': 150.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0063323974609375, 'epoch': 0.42} + 42%|████▏ | 1040/2500 [6:20:17<8:44:19, 21.55s/it] 42%|████▏ | 1041/2500 [6:20:38<8:38:21, 21.32s/it] {'loss': 0.0002, 'grad_norm': 0.28997792263988764, 'learning_rate': 5.836e-07, 'completion_length': 151.76786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 
0.0357142873108387, 'kl': 0.00457763671875, 'epoch': 0.42} + 42%|████▏ | 1041/2500 [6:20:38<8:38:21, 21.32s/it] 42%|████▏ | 1042/2500 [6:20:58<8:30:07, 20.99s/it] {'loss': 0.0002, 'grad_norm': 0.03311842777721322, 'learning_rate': 5.832e-07, 'completion_length': 151.0357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056304931640625, 'epoch': 0.42} + 42%|████▏ | 1042/2500 [6:20:58<8:30:07, 20.99s/it] 42%|████▏ | 1043/2500 [6:21:20<8:32:10, 21.09s/it] {'loss': 0.0003, 'grad_norm': 0.6448758308737195, 'learning_rate': 5.828e-07, 'completion_length': 163.05357360839844, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0073699951171875, 'epoch': 0.42} + 42%|████▏ | 1043/2500 [6:21:20<8:32:10, 21.09s/it] 42%|████▏ | 1044/2500 [6:21:42<8:44:27, 21.61s/it] {'loss': 0.0003, 'grad_norm': 0.40015478288369294, 'learning_rate': 5.824e-07, 'completion_length': 170.25000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.008331298828125, 'epoch': 0.42} + 42%|████▏ | 1044/2500 [6:21:42<8:44:27, 21.61s/it] 42%|████▏ | 1045/2500 [6:22:06<8:57:38, 22.17s/it] {'loss': 0.0003, 'grad_norm': 0.03517617175954337, 'learning_rate': 5.819999999999999e-07, 'completion_length': 159.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0066070556640625, 'epoch': 0.42} + 42%|████▏ | 1045/2500 [6:22:06<8:57:38, 22.17s/it] 42%|████▏ | 1046/2500 [6:22:27<8:48:48, 21.82s/it] {'loss': 0.0003, 'grad_norm': 0.024821278678619843, 'learning_rate': 5.816e-07, 'completion_length': 151.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00848388671875, 'epoch': 0.42} + 
42%|████▏ | 1046/2500 [6:22:27<8:48:48, 21.82s/it] 42%|████▏ | 1047/2500 [6:22:47<8:37:40, 21.38s/it] {'loss': 0.0001, 'grad_norm': 0.35079740269028786, 'learning_rate': 5.812e-07, 'completion_length': 139.1964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00298309326171875, 'epoch': 0.42} + 42%|████▏ | 1047/2500 [6:22:47<8:37:40, 21.38s/it] 42%|████▏ | 1048/2500 [6:23:08<8:34:18, 21.25s/it] {'loss': 0.0002, 'grad_norm': 0.3047655703000553, 'learning_rate': 5.807999999999999e-07, 'completion_length': 151.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0058135986328125, 'epoch': 0.42} + 42%|████▏ | 1048/2500 [6:23:08<8:34:18, 21.25s/it] 42%|████▏ | 1049/2500 [6:23:29<8:32:23, 21.19s/it] {'loss': 0.0001, 'grad_norm': 0.018715858015241185, 'learning_rate': 5.804e-07, 'completion_length': 142.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00354766845703125, 'epoch': 0.42} + 42%|████▏ | 1049/2500 [6:23:29<8:32:23, 21.19s/it] 42%|████▏ | 1050/2500 [6:23:50<8:28:20, 21.04s/it] {'loss': 0.0004, 'grad_norm': 0.34688803949452035, 'learning_rate': 5.8e-07, 'completion_length': 142.8214340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.009124755859375, 'epoch': 0.42} + 42%|████▏ | 1050/2500 [6:23:50<8:28:20, 21.04s/it] 42%|████▏ | 1051/2500 [6:24:11<8:25:59, 20.95s/it] {'loss': 0.0002, 'grad_norm': 0.47567173384256495, 'learning_rate': 5.796e-07, 'completion_length': 145.05357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00388336181640625, 'epoch': 0.42} + 
42%|████▏ | 1051/2500 [6:24:11<8:25:59, 20.95s/it] 42%|████▏ | 1052/2500 [6:24:32<8:27:39, 21.04s/it] {'loss': 0.0002, 'grad_norm': 0.18297924069253985, 'learning_rate': 5.792e-07, 'completion_length': 150.2321548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050048828125, 'epoch': 0.42} + 42%|████▏ | 1052/2500 [6:24:32<8:27:39, 21.04s/it] 42%|████▏ | 1053/2500 [6:24:53<8:30:09, 21.15s/it] {'loss': 0.0003, 'grad_norm': 0.960912949966012, 'learning_rate': 5.788e-07, 'completion_length': 157.8928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0074462890625, 'epoch': 0.42} + 42%|████▏ | 1053/2500 [6:24:53<8:30:09, 21.15s/it] 42%|████▏ | 1054/2500 [6:25:15<8:36:17, 21.42s/it] {'loss': 0.0004, 'grad_norm': 0.4966635932152544, 'learning_rate': 5.784e-07, 'completion_length': 159.71429443359375, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0091094970703125, 'epoch': 0.42} + 42%|████▏ | 1054/2500 [6:25:15<8:36:17, 21.42s/it] 42%|████▏ | 1055/2500 [6:25:37<8:34:55, 21.38s/it] {'loss': 0.0003, 'grad_norm': 0.250738887893247, 'learning_rate': 5.779999999999999e-07, 'completion_length': 158.12500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0065460205078125, 'epoch': 0.42} + 42%|████▏ | 1055/2500 [6:25:37<8:34:55, 21.38s/it] 42%|████▏ | 1056/2500 [6:25:59<8:39:10, 21.57s/it] {'loss': 0.0002, 'grad_norm': 0.1877844288955961, 'learning_rate': 5.776e-07, 'completion_length': 152.2857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 
0.00386810302734375, 'epoch': 0.42} + 42%|████▏ | 1056/2500 [6:25:59<8:39:10, 21.57s/it] 42%|████▏ | 1057/2500 [6:26:20<8:39:09, 21.59s/it] {'loss': 0.0003, 'grad_norm': 0.5869787737488433, 'learning_rate': 5.772000000000001e-07, 'completion_length': 162.73214721679688, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.0824786126613617, 'kl': 0.00732421875, 'epoch': 0.42} + 42%|████▏ | 1057/2500 [6:26:20<8:39:09, 21.59s/it] 42%|████▏ | 1058/2500 [6:26:42<8:40:04, 21.64s/it] {'loss': 0.0003, 'grad_norm': 0.32086474398196385, 'learning_rate': 5.767999999999999e-07, 'completion_length': 165.89286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.00750732421875, 'epoch': 0.42} + 42%|████▏ | 1058/2500 [6:26:42<8:40:04, 21.64s/it] 42%|████▏ | 1059/2500 [6:27:04<8:40:10, 21.66s/it] {'loss': 0.0003, 'grad_norm': 0.36922564354689824, 'learning_rate': 5.764e-07, 'completion_length': 149.1607208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00665283203125, 'epoch': 0.42} + 42%|████▏ | 1059/2500 [6:27:04<8:40:10, 21.66s/it] 42%|████▏ | 1060/2500 [6:27:25<8:38:07, 21.59s/it] {'loss': 0.0002, 'grad_norm': 0.054191790393987714, 'learning_rate': 5.76e-07, 'completion_length': 152.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045166015625, 'epoch': 0.42} + 42%|████▏ | 1060/2500 [6:27:25<8:38:07, 21.59s/it] 42%|████▏ | 1061/2500 [6:27:46<8:34:04, 21.43s/it] {'loss': 0.0002, 'grad_norm': 0.1926052108599862, 'learning_rate': 5.755999999999999e-07, 'completion_length': 144.71429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 
'kl': 0.0050201416015625, 'epoch': 0.42} + 42%|████▏ | 1061/2500 [6:27:46<8:34:04, 21.43s/it] 42%|████▏ | 1062/2500 [6:28:08<8:34:13, 21.46s/it] {'loss': 0.0003, 'grad_norm': 0.14905888252202554, 'learning_rate': 5.752e-07, 'completion_length': 171.10714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00872802734375, 'epoch': 0.42} + 42%|████▏ | 1062/2500 [6:28:08<8:34:13, 21.46s/it] 43%|████▎ | 1063/2500 [6:28:29<8:33:59, 21.46s/it] {'loss': 0.0002, 'grad_norm': 0.8185845498039307, 'learning_rate': 5.748e-07, 'completion_length': 157.30358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0047760009765625, 'epoch': 0.43} + 43%|████▎ | 1063/2500 [6:28:29<8:33:59, 21.46s/it] 43%|████▎ | 1064/2500 [6:28:50<8:25:51, 21.14s/it] {'loss': 0.0001, 'grad_norm': 0.02220805676044651, 'learning_rate': 5.744e-07, 'completion_length': 136.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00366973876953125, 'epoch': 0.43} + 43%|████▎ | 1064/2500 [6:28:50<8:25:51, 21.14s/it] 43%|████▎ | 1065/2500 [6:29:13<8:38:49, 21.69s/it] {'loss': 0.0002, 'grad_norm': 0.015784855472535837, 'learning_rate': 5.739999999999999e-07, 'completion_length': 162.26786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00504302978515625, 'epoch': 0.43} + 43%|████▎ | 1065/2500 [6:29:13<8:38:49, 21.69s/it] 43%|████▎ | 1066/2500 [6:29:35<8:45:01, 21.97s/it] {'loss': 0.0003, 'grad_norm': 0.5569348362279595, 'learning_rate': 5.736e-07, 'completion_length': 156.98214721679688, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0066375732421875, 'epoch': 0.43} + 
43%|████▎ | 1066/2500 [6:29:35<8:45:01, 21.97s/it] 43%|████▎ | 1067/2500 [6:29:58<8:48:55, 22.15s/it] {'loss': 0.0003, 'grad_norm': 0.45891451964919583, 'learning_rate': 5.732e-07, 'completion_length': 157.3571548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0068206787109375, 'epoch': 0.43} + 43%|████▎ | 1067/2500 [6:29:58<8:48:55, 22.15s/it] 43%|████▎ | 1068/2500 [6:30:24<9:16:11, 23.30s/it] {'loss': 0.0005, 'grad_norm': 0.5260511538072855, 'learning_rate': 5.727999999999999e-07, 'completion_length': 163.50000762939453, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.012054443359375, 'epoch': 0.43} + 43%|████▎ | 1068/2500 [6:30:24<9:16:11, 23.30s/it] 43%|████▎ | 1069/2500 [6:30:45<8:59:58, 22.64s/it] {'loss': 0.0002, 'grad_norm': 0.1185967760591183, 'learning_rate': 5.724e-07, 'completion_length': 144.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004547119140625, 'epoch': 0.43} + 43%|████▎ | 1069/2500 [6:30:45<8:59:58, 22.64s/it] 43%|████▎ | 1070/2500 [6:31:08<8:59:07, 22.62s/it] {'loss': 0.0004, 'grad_norm': 0.12448895470122098, 'learning_rate': 5.719999999999999e-07, 'completion_length': 163.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00982666015625, 'epoch': 0.43} + 43%|████▎ | 1070/2500 [6:31:08<8:59:07, 22.62s/it] 43%|████▎ | 1071/2500 [6:31:29<8:50:27, 22.27s/it] {'loss': 0.0003, 'grad_norm': 0.07795004944474221, 'learning_rate': 5.716e-07, 'completion_length': 154.35714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.008209228515625, 'epoch': 0.43} + 43%|████▎ | 1071/2500 [6:31:29<8:50:27, 22.27s/it] 
43%|████▎ | 1072/2500 [6:31:50<8:39:38, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.02850185221657196, 'learning_rate': 5.712e-07, 'completion_length': 147.08929443359375, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.004974365234375, 'epoch': 0.43} + 43%|████▎ | 1072/2500 [6:31:50<8:39:38, 21.83s/it] 43%|████▎ | 1073/2500 [6:32:13<8:46:49, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.02252833710880395, 'learning_rate': 5.707999999999999e-07, 'completion_length': 173.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052032470703125, 'epoch': 0.43} + 43%|████▎ | 1073/2500 [6:32:13<8:46:49, 22.15s/it] 43%|████▎ | 1074/2500 [6:32:35<8:45:24, 22.11s/it] {'loss': 0.0003, 'grad_norm': 0.036049673838905595, 'learning_rate': 5.704e-07, 'completion_length': 158.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00653076171875, 'epoch': 0.43} + 43%|████▎ | 1074/2500 [6:32:35<8:45:24, 22.11s/it] 43%|████▎ | 1075/2500 [6:32:55<8:33:11, 21.61s/it] {'loss': 0.0002, 'grad_norm': 0.023100935966027362, 'learning_rate': 5.699999999999999e-07, 'completion_length': 154.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004852294921875, 'epoch': 0.43} + 43%|████▎ | 1075/2500 [6:32:55<8:33:11, 21.61s/it] 43%|████▎ | 1076/2500 [6:33:20<8:53:40, 22.49s/it] {'loss': 0.0002, 'grad_norm': 0.024299773160046848, 'learning_rate': 5.696e-07, 'completion_length': 148.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004974365234375, 'epoch': 0.43} + 43%|████▎ | 1076/2500 [6:33:20<8:53:40, 22.49s/it] 43%|████▎ | 1077/2500 [6:33:42<8:50:46, 22.38s/it] {'loss': 0.0002, 'grad_norm': 0.36120500823274393, 'learning_rate': 5.692e-07, 
'completion_length': 162.71429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0048370361328125, 'epoch': 0.43} + 43%|████▎ | 1077/2500 [6:33:42<8:50:46, 22.38s/it] 43%|████▎ | 1078/2500 [6:34:03<8:45:16, 22.16s/it] {'loss': 0.0003, 'grad_norm': 1.07471734727369, 'learning_rate': 5.688e-07, 'completion_length': 149.625, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0072021484375, 'epoch': 0.43} + 43%|████▎ | 1078/2500 [6:34:03<8:45:16, 22.16s/it] 43%|████▎ | 1079/2500 [6:34:26<8:48:56, 22.33s/it] {'loss': 0.0004, 'grad_norm': 0.029758921661172187, 'learning_rate': 5.684e-07, 'completion_length': 163.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0093994140625, 'epoch': 0.43} + 43%|████▎ | 1079/2500 [6:34:26<8:48:56, 22.33s/it] 43%|████▎ | 1080/2500 [6:34:48<8:42:58, 22.10s/it] {'loss': 0.0003, 'grad_norm': 0.2983871873808204, 'learning_rate': 5.679999999999999e-07, 'completion_length': 153.875, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0074462890625, 'epoch': 0.43} + 43%|████▎ | 1080/2500 [6:34:48<8:42:58, 22.10s/it] 43%|████▎ | 1081/2500 [6:35:09<8:37:35, 21.89s/it] {'loss': 0.0002, 'grad_norm': 0.021750991436031344, 'learning_rate': 5.676e-07, 'completion_length': 158.42858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0047149658203125, 'epoch': 0.43} + 43%|████▎ | 1081/2500 [6:35:09<8:37:35, 21.89s/it] 43%|████▎ | 1082/2500 [6:35:31<8:40:36, 22.03s/it] {'loss': 0.0004, 'grad_norm': 0.2668349429654097, 'learning_rate': 5.672e-07, 'completion_length': 181.55357360839844, 'rewards/accuracy_reward': 
0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.009033203125, 'epoch': 0.43} + 43%|████▎ | 1082/2500 [6:35:31<8:40:36, 22.03s/it] 43%|████▎ | 1083/2500 [6:35:53<8:36:53, 21.89s/it] {'loss': 0.0002, 'grad_norm': 0.026226954683185335, 'learning_rate': 5.667999999999999e-07, 'completion_length': 163.2678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054779052734375, 'epoch': 0.43} + 43%|████▎ | 1083/2500 [6:35:53<8:36:53, 21.89s/it] 43%|████▎ | 1084/2500 [6:36:15<8:34:11, 21.79s/it] {'loss': 0.0003, 'grad_norm': 0.23758655729192507, 'learning_rate': 5.664e-07, 'completion_length': 168.48214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00750732421875, 'epoch': 0.43} + 43%|████▎ | 1084/2500 [6:36:15<8:34:11, 21.79s/it] 43%|████▎ | 1085/2500 [6:36:36<8:28:56, 21.58s/it] {'loss': 0.0003, 'grad_norm': 0.4390467890960987, 'learning_rate': 5.66e-07, 'completion_length': 158.48214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00762939453125, 'epoch': 0.43} + 43%|████▎ | 1085/2500 [6:36:36<8:28:56, 21.58s/it] 43%|████▎ | 1086/2500 [6:36:58<8:32:27, 21.75s/it] {'loss': 0.0003, 'grad_norm': 0.171951992536183, 'learning_rate': 5.655999999999999e-07, 'completion_length': 148.69644165039062, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.006439208984375, 'epoch': 0.43} + 43%|████▎ | 1086/2500 [6:36:58<8:32:27, 21.75s/it] 43%|████▎ | 1087/2500 [6:37:20<8:34:06, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.36070879594607513, 'learning_rate': 5.652e-07, 'completion_length': 160.7678680419922, 'rewards/accuracy_reward': 
0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005767822265625, 'epoch': 0.43} + 43%|████▎ | 1087/2500 [6:37:20<8:34:06, 21.83s/it] 44%|████▎ | 1088/2500 [6:37:42<8:36:08, 21.93s/it] {'loss': 0.0002, 'grad_norm': 0.02773005480298085, 'learning_rate': 5.648e-07, 'completion_length': 161.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006134033203125, 'epoch': 0.44} + 44%|████▎ | 1088/2500 [6:37:42<8:36:08, 21.93s/it] 44%|████▎ | 1089/2500 [6:38:03<8:30:38, 21.71s/it] {'loss': 0.0002, 'grad_norm': 0.29784194850713147, 'learning_rate': 5.643999999999999e-07, 'completion_length': 159.6964340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00592041015625, 'epoch': 0.44} + 44%|████▎ | 1089/2500 [6:38:03<8:30:38, 21.71s/it] 44%|████▎ | 1090/2500 [6:38:25<8:33:09, 21.84s/it] {'loss': 0.0003, 'grad_norm': 0.11847652979954698, 'learning_rate': 5.639999999999999e-07, 'completion_length': 160.37500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0077667236328125, 'epoch': 0.44} + 44%|████▎ | 1090/2500 [6:38:25<8:33:09, 21.84s/it] 44%|████▎ | 1091/2500 [6:38:51<8:56:07, 22.83s/it] {'loss': 0.0002, 'grad_norm': 0.5216436746143877, 'learning_rate': 5.636e-07, 'completion_length': 161.87500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0053253173828125, 'epoch': 0.44} + 44%|████▎ | 1091/2500 [6:38:51<8:56:07, 22.83s/it] 44%|████▎ | 1092/2500 [6:39:12<8:45:13, 22.38s/it] {'loss': 0.0004, 'grad_norm': 1.0424731828782423, 'learning_rate': 5.632e-07, 'completion_length': 152.35714721679688, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00909423828125, 'epoch': 0.44} + 44%|████▎ | 1092/2500 [6:39:12<8:45:13, 22.38s/it] 44%|████▎ | 1093/2500 [6:39:34<8:44:33, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.030877799607246972, 'learning_rate': 5.627999999999999e-07, 'completion_length': 160.71429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054473876953125, 'epoch': 0.44} + 44%|████▎ | 1093/2500 [6:39:34<8:44:33, 22.37s/it] 44%|████▍ | 1094/2500 [6:39:56<8:38:13, 22.11s/it] {'loss': 0.0003, 'grad_norm': 0.45065555632693116, 'learning_rate': 5.624e-07, 'completion_length': 155.6071548461914, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.006591796875, 'epoch': 0.44} + 44%|████▍ | 1094/2500 [6:39:56<8:38:13, 22.11s/it] 44%|████▍ | 1095/2500 [6:40:18<8:37:17, 22.09s/it] {'loss': 0.0003, 'grad_norm': 0.02739279640265032, 'learning_rate': 5.620000000000001e-07, 'completion_length': 164.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00653076171875, 'epoch': 0.44} + 44%|████▍ | 1095/2500 [6:40:18<8:37:17, 22.09s/it] 44%|████▍ | 1096/2500 [6:40:39<8:33:22, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.036238983078238106, 'learning_rate': 5.615999999999999e-07, 'completion_length': 147.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046539306640625, 'epoch': 0.44} + 44%|████▍ | 1096/2500 [6:40:39<8:33:22, 21.94s/it] 44%|████▍ | 1097/2500 [6:41:01<8:30:55, 21.85s/it] {'loss': 0.0002, 'grad_norm': 0.018833862653534275, 'learning_rate': 5.612e-07, 'completion_length': 154.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.005035400390625, 'epoch': 0.44} + 44%|████▍ | 1097/2500 [6:41:01<8:30:55, 21.85s/it] 44%|████▍ | 1098/2500 [6:41:23<8:34:52, 22.03s/it] {'loss': 0.0003, 'grad_norm': 0.2174135398067462, 'learning_rate': 5.608e-07, 'completion_length': 165.26786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0079803466796875, 'epoch': 0.44} + 44%|████▍ | 1098/2500 [6:41:23<8:34:52, 22.03s/it] 44%|████▍ | 1099/2500 [6:41:46<8:36:43, 22.13s/it] {'loss': 0.0003, 'grad_norm': 0.026329492760467917, 'learning_rate': 5.604e-07, 'completion_length': 157.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00836181640625, 'epoch': 0.44} + 44%|████▍ | 1099/2500 [6:41:46<8:36:43, 22.13s/it] 44%|████▍ | 1100/2500 [6:42:08<8:34:57, 22.07s/it] {'loss': 0.0003, 'grad_norm': 0.4021478968078017, 'learning_rate': 5.6e-07, 'completion_length': 150.62500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0081329345703125, 'epoch': 0.44} + 44%|████▍ | 1100/2500 [6:42:08<8:34:57, 22.07s/it] 44%|████▍ | 1101/2500 [6:45:29<29:27:44, 75.81s/it] {'loss': 0.0003, 'grad_norm': 0.756707460874586, 'learning_rate': 5.596e-07, 'completion_length': 159.33929443359375, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.007904052734375, 'epoch': 0.44} + 44%|████▍ | 1101/2500 [6:45:29<29:27:44, 75.81s/it] 44%|████▍ | 1102/2500 [6:45:51<23:08:20, 59.59s/it] {'loss': 0.0004, 'grad_norm': 0.19706542617832135, 'learning_rate': 5.592e-07, 'completion_length': 171.55358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 
'kl': 0.009796142578125, 'epoch': 0.44} + 44%|████▍ | 1102/2500 [6:45:51<23:08:20, 59.59s/it] 44%|████▍ | 1103/2500 [6:46:13<18:47:52, 48.44s/it] {'loss': 0.0002, 'grad_norm': 0.17027339249270834, 'learning_rate': 5.588e-07, 'completion_length': 163.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0052490234375, 'epoch': 0.44} + 44%|████▍ | 1103/2500 [6:46:13<18:47:52, 48.44s/it] 44%|████▍ | 1104/2500 [6:46:36<15:47:13, 40.71s/it] {'loss': 0.0003, 'grad_norm': 0.03763431265964878, 'learning_rate': 5.584e-07, 'completion_length': 162.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0081787109375, 'epoch': 0.44} + 44%|████▍ | 1104/2500 [6:46:36<15:47:13, 40.71s/it] 44%|████▍ | 1105/2500 [6:46:58<13:37:07, 35.15s/it] {'loss': 0.0002, 'grad_norm': 0.9822791770377121, 'learning_rate': 5.58e-07, 'completion_length': 164.71429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00592041015625, 'epoch': 0.44} + 44%|████▍ | 1105/2500 [6:46:58<13:37:07, 35.15s/it] 44%|████▍ | 1106/2500 [6:47:20<12:03:18, 31.13s/it] {'loss': 0.0003, 'grad_norm': 2.3090601443927623, 'learning_rate': 5.576e-07, 'completion_length': 161.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.00714111328125, 'epoch': 0.44} + 44%|████▍ | 1106/2500 [6:47:20<12:03:18, 31.13s/it] 44%|████▍ | 1107/2500 [6:47:40<10:46:32, 27.85s/it] {'loss': 0.0002, 'grad_norm': 0.2538072551984406, 'learning_rate': 5.572e-07, 'completion_length': 136.30357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 
0.004547119140625, 'epoch': 0.44} + 44%|████▍ | 1107/2500 [6:47:40<10:46:32, 27.85s/it] 44%|████▍ | 1108/2500 [6:48:01<10:01:27, 25.93s/it] {'loss': 0.0004, 'grad_norm': 0.24729221686713315, 'learning_rate': 5.567999999999999e-07, 'completion_length': 159.75000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0091094970703125, 'epoch': 0.44} + 44%|████▍ | 1108/2500 [6:48:01<10:01:27, 25.93s/it] 44%|████▍ | 1109/2500 [6:48:23<9:30:27, 24.61s/it] {'loss': 0.0003, 'grad_norm': 0.05202009491707832, 'learning_rate': 5.564e-07, 'completion_length': 148.08928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0084686279296875, 'epoch': 0.44} + 44%|████▍ | 1109/2500 [6:48:23<9:30:27, 24.61s/it] 44%|████▍ | 1110/2500 [6:48:44<9:09:05, 23.70s/it] {'loss': 0.0003, 'grad_norm': 0.7192819960252284, 'learning_rate': 5.560000000000001e-07, 'completion_length': 168.12500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0068206787109375, 'epoch': 0.44} + 44%|████▍ | 1110/2500 [6:48:44<9:09:05, 23.70s/it] 44%|████▍ | 1111/2500 [6:49:07<9:01:32, 23.39s/it] {'loss': 0.0002, 'grad_norm': 0.47136456088607154, 'learning_rate': 5.555999999999999e-07, 'completion_length': 153.14286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0060882568359375, 'epoch': 0.44} + 44%|████▍ | 1111/2500 [6:49:07<9:01:32, 23.39s/it] 44%|████▍ | 1112/2500 [6:49:28<8:42:52, 22.60s/it] {'loss': 0.0002, 'grad_norm': 0.5302938387780989, 'learning_rate': 5.552e-07, 'completion_length': 143.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 
1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0057525634765625, 'epoch': 0.44} + 44%|████▍ | 1112/2500 [6:49:28<8:42:52, 22.60s/it] 45%|████▍ | 1113/2500 [6:49:50<8:42:27, 22.60s/it] {'loss': 0.0002, 'grad_norm': 0.2193500274064597, 'learning_rate': 5.548e-07, 'completion_length': 158.00000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0062255859375, 'epoch': 0.45} + 45%|████▍ | 1113/2500 [6:49:50<8:42:27, 22.60s/it] 45%|████▍ | 1114/2500 [6:50:11<8:27:09, 21.95s/it] {'loss': 0.0002, 'grad_norm': 0.5117892188108009, 'learning_rate': 5.543999999999999e-07, 'completion_length': 156.1607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0061187744140625, 'epoch': 0.45} + 45%|████▍ | 1114/2500 [6:50:11<8:27:09, 21.95s/it] 45%|████▍ | 1115/2500 [6:50:32<8:22:20, 21.76s/it] {'loss': 0.0003, 'grad_norm': 0.39806566538239496, 'learning_rate': 5.54e-07, 'completion_length': 146.0714340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0070953369140625, 'epoch': 0.45} + 45%|████▍ | 1115/2500 [6:50:32<8:22:20, 21.76s/it] 45%|████▍ | 1116/2500 [6:50:53<8:13:31, 21.40s/it] {'loss': 0.0002, 'grad_norm': 0.2997651729620232, 'learning_rate': 5.536e-07, 'completion_length': 147.0178680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.006103515625, 'epoch': 0.45} + 45%|████▍ | 1116/2500 [6:50:53<8:13:31, 21.40s/it] 45%|████▍ | 1117/2500 [6:51:14<8:11:30, 21.32s/it] {'loss': 0.0003, 'grad_norm': 1.0600867089533281, 'learning_rate': 5.532e-07, 'completion_length': 142.3214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 
'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.006561279296875, 'epoch': 0.45} + 45%|████▍ | 1117/2500 [6:51:14<8:11:30, 21.32s/it] 45%|████▍ | 1118/2500 [6:51:35<8:10:23, 21.29s/it] {'loss': 0.0002, 'grad_norm': 0.1961369236942744, 'learning_rate': 5.527999999999999e-07, 'completion_length': 159.1607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005157470703125, 'epoch': 0.45} + 45%|████▍ | 1118/2500 [6:51:35<8:10:23, 21.29s/it] 45%|████▍ | 1119/2500 [6:51:56<8:03:52, 21.02s/it] {'loss': 0.0002, 'grad_norm': 0.5624366259857327, 'learning_rate': 5.524e-07, 'completion_length': 136.67857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00592041015625, 'epoch': 0.45} + 45%|████▍ | 1119/2500 [6:51:56<8:03:52, 21.02s/it] 45%|████▍ | 1120/2500 [6:52:19<8:21:44, 21.82s/it] {'loss': 0.0004, 'grad_norm': 0.16983652331410276, 'learning_rate': 5.520000000000001e-07, 'completion_length': 189.4464340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.009674072265625, 'epoch': 0.45} + 45%|████▍ | 1120/2500 [6:52:19<8:21:44, 21.82s/it] 45%|████▍ | 1121/2500 [6:52:42<8:26:57, 22.06s/it] {'loss': 0.0003, 'grad_norm': 0.39992491579735945, 'learning_rate': 5.515999999999999e-07, 'completion_length': 163.80358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0079193115234375, 'epoch': 0.45} + 45%|████▍ | 1121/2500 [6:52:42<8:26:57, 22.06s/it] 45%|████▍ | 1122/2500 [6:53:05<8:30:58, 22.25s/it] {'loss': 0.0003, 'grad_norm': 0.39558522827246617, 'learning_rate': 5.512e-07, 'completion_length': 
149.4821548461914, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0074005126953125, 'epoch': 0.45} + 45%|████▍ | 1122/2500 [6:53:05<8:30:58, 22.25s/it] 45%|████▍ | 1123/2500 [6:53:26<8:25:46, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.28717260115187304, 'learning_rate': 5.508e-07, 'completion_length': 145.7857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0051422119140625, 'epoch': 0.45} + 45%|████▍ | 1123/2500 [6:53:26<8:25:46, 22.04s/it] 45%|████▍ | 1124/2500 [6:53:48<8:26:09, 22.07s/it] {'loss': 0.0003, 'grad_norm': 1.2016401113524284, 'learning_rate': 5.504e-07, 'completion_length': 157.67857360839844, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.007415771484375, 'epoch': 0.45} + 45%|████▍ | 1124/2500 [6:53:48<8:26:09, 22.07s/it] 45%|████▌ | 1125/2500 [6:54:09<8:18:32, 21.75s/it] {'loss': 0.0002, 'grad_norm': 0.019066984285882913, 'learning_rate': 5.5e-07, 'completion_length': 160.9821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00567626953125, 'epoch': 0.45} + 45%|████▌ | 1125/2500 [6:54:09<8:18:32, 21.75s/it] 45%|████▌ | 1126/2500 [6:54:32<8:23:08, 21.97s/it] {'loss': 0.0003, 'grad_norm': 0.24553571246834394, 'learning_rate': 5.496e-07, 'completion_length': 171.5357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0076751708984375, 'epoch': 0.45} + 45%|████▌ | 1126/2500 [6:54:32<8:23:08, 21.97s/it] 45%|████▌ | 1127/2500 [6:54:54<8:23:04, 21.98s/it] {'loss': 0.0002, 'grad_norm': 0.44522433276958806, 'learning_rate': 5.492e-07, 'completion_length': 175.2678680419922, 
'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.005584716796875, 'epoch': 0.45} + 45%|████▌ | 1127/2500 [6:54:54<8:23:04, 21.98s/it] 45%|████▌ | 1128/2500 [6:55:15<8:18:32, 21.80s/it] {'loss': 0.0002, 'grad_norm': 0.16255791187957225, 'learning_rate': 5.487999999999999e-07, 'completion_length': 145.3928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048980712890625, 'epoch': 0.45} + 45%|████▌ | 1128/2500 [6:55:15<8:18:32, 21.80s/it] 45%|████▌ | 1129/2500 [6:55:37<8:21:24, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.02964057865020067, 'learning_rate': 5.484e-07, 'completion_length': 168.7321548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053558349609375, 'epoch': 0.45} + 45%|████▌ | 1129/2500 [6:55:37<8:21:24, 21.94s/it] 45%|████▌ | 1130/2500 [6:55:58<8:09:32, 21.44s/it] {'loss': 0.0002, 'grad_norm': 0.2819072995110797, 'learning_rate': 5.48e-07, 'completion_length': 142.98214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.003875732421875, 'epoch': 0.45} + 45%|████▌ | 1130/2500 [6:55:58<8:09:32, 21.44s/it] 45%|████▌ | 1131/2500 [6:56:19<8:07:05, 21.35s/it] {'loss': 0.0002, 'grad_norm': 0.30303956272186794, 'learning_rate': 5.476e-07, 'completion_length': 147.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005035400390625, 'epoch': 0.45} + 45%|████▌ | 1131/2500 [6:56:19<8:07:05, 21.35s/it] 45%|████▌ | 1132/2500 [6:56:40<8:03:07, 21.19s/it] {'loss': 0.0003, 'grad_norm': 0.48152129688578693, 'learning_rate': 5.472e-07, 'completion_length': 147.12500762939453, 
'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.006439208984375, 'epoch': 0.45} + 45%|████▌ | 1132/2500 [6:56:40<8:03:07, 21.19s/it] 45%|████▌ | 1133/2500 [6:57:02<8:08:57, 21.46s/it] {'loss': 0.0004, 'grad_norm': 0.0507846976265019, 'learning_rate': 5.467999999999999e-07, 'completion_length': 162.0714340209961, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0094757080078125, 'epoch': 0.45} + 45%|████▌ | 1133/2500 [6:57:02<8:08:57, 21.46s/it] 45%|████▌ | 1134/2500 [6:57:25<8:21:28, 22.03s/it] {'loss': 0.0004, 'grad_norm': 0.38228003132694083, 'learning_rate': 5.464e-07, 'completion_length': 160.00000762939453, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.8214285969734192, 'reward_std': 0.0714285746216774, 'kl': 0.009185791015625, 'epoch': 0.45} + 45%|████▌ | 1134/2500 [6:57:25<8:21:28, 22.03s/it] 45%|████▌ | 1135/2500 [6:57:46<8:14:54, 21.75s/it] {'loss': 0.0003, 'grad_norm': 0.31209324179066134, 'learning_rate': 5.46e-07, 'completion_length': 154.1607208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.007293701171875, 'epoch': 0.45} + 45%|████▌ | 1135/2500 [6:57:46<8:14:54, 21.75s/it] 45%|████▌ | 1136/2500 [6:58:08<8:15:28, 21.80s/it] {'loss': 0.0003, 'grad_norm': 0.28608363508900264, 'learning_rate': 5.455999999999999e-07, 'completion_length': 158.1964340209961, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0063934326171875, 'epoch': 0.45} + 45%|████▌ | 1136/2500 [6:58:08<8:15:28, 21.80s/it] 45%|████▌ | 1137/2500 [6:58:29<8:08:48, 21.52s/it] {'loss': 0.0004, 'grad_norm': 0.4593230743690149, 'learning_rate': 5.452e-07, 
'completion_length': 149.8928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0087738037109375, 'epoch': 0.45} + 45%|████▌ | 1137/2500 [6:58:29<8:08:48, 21.52s/it] 46%|████▌ | 1138/2500 [6:58:51<8:13:06, 21.72s/it] {'loss': 0.0002, 'grad_norm': 0.026774988706192243, 'learning_rate': 5.448e-07, 'completion_length': 160.87500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00496673583984375, 'epoch': 0.46} + 46%|████▌ | 1138/2500 [6:58:51<8:13:06, 21.72s/it] 46%|████▌ | 1139/2500 [6:59:12<8:06:01, 21.43s/it] {'loss': 0.0002, 'grad_norm': 0.2770834274766734, 'learning_rate': 5.443999999999999e-07, 'completion_length': 142.73214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0051422119140625, 'epoch': 0.46} + 46%|████▌ | 1139/2500 [6:59:12<8:06:01, 21.43s/it] 46%|████▌ | 1140/2500 [6:59:33<8:02:22, 21.28s/it] {'loss': 0.0003, 'grad_norm': 0.04482453407323049, 'learning_rate': 5.44e-07, 'completion_length': 144.0357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00634765625, 'epoch': 0.46} + 46%|████▌ | 1140/2500 [6:59:33<8:02:22, 21.28s/it] 46%|████▌ | 1141/2500 [6:59:55<8:06:05, 21.46s/it] {'loss': 0.0002, 'grad_norm': 0.544287953444357, 'learning_rate': 5.436e-07, 'completion_length': 150.17858123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0051116943359375, 'epoch': 0.46} + 46%|████▌ | 1141/2500 [6:59:55<8:06:05, 21.46s/it] 46%|████▌ | 1142/2500 [7:00:16<8:07:25, 21.54s/it] {'loss': 0.0003, 'grad_norm': 0.3291449978903929, 'learning_rate': 5.431999999999999e-07, 
'completion_length': 167.62500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.008514404296875, 'epoch': 0.46} + 46%|████▌ | 1142/2500 [7:00:16<8:07:25, 21.54s/it] 46%|████▌ | 1143/2500 [7:00:38<8:06:51, 21.53s/it] {'loss': 0.0002, 'grad_norm': 0.01729334809811784, 'learning_rate': 5.427999999999999e-07, 'completion_length': 149.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00391387939453125, 'epoch': 0.46} + 46%|████▌ | 1143/2500 [7:00:38<8:06:51, 21.53s/it] 46%|████▌ | 1144/2500 [7:00:59<8:03:12, 21.38s/it] {'loss': 0.0002, 'grad_norm': 0.5541982280708007, 'learning_rate': 5.424e-07, 'completion_length': 150.0357208251953, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266788095235825, 'kl': 0.0062255859375, 'epoch': 0.46} + 46%|████▌ | 1144/2500 [7:00:59<8:03:12, 21.38s/it] 46%|████▌ | 1145/2500 [7:01:21<8:06:44, 21.55s/it] {'loss': 0.0002, 'grad_norm': 0.19091670095856103, 'learning_rate': 5.420000000000001e-07, 'completion_length': 155.0714340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0043792724609375, 'epoch': 0.46} + 46%|████▌ | 1145/2500 [7:01:21<8:06:44, 21.55s/it] 46%|████▌ | 1146/2500 [7:01:43<8:09:06, 21.67s/it] {'loss': 0.0002, 'grad_norm': 0.025300285642119218, 'learning_rate': 5.415999999999999e-07, 'completion_length': 156.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00420379638671875, 'epoch': 0.46} + 46%|████▌ | 1146/2500 [7:01:43<8:09:06, 21.67s/it] 46%|████▌ | 1147/2500 [7:02:04<8:04:03, 21.47s/it] {'loss': 0.0002, 'grad_norm': 1.0269614209723203, 'learning_rate': 5.412e-07, 'completion_length': 
149.1607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006256103515625, 'epoch': 0.46} + 46%|████▌ | 1147/2500 [7:02:04<8:04:03, 21.47s/it] 46%|████▌ | 1148/2500 [7:02:27<8:14:31, 21.95s/it] {'loss': 0.0003, 'grad_norm': 0.724663166990775, 'learning_rate': 5.408e-07, 'completion_length': 170.8928680419922, 'rewards/accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.8392857313156128, 'reward_std': 0.07695358991622925, 'kl': 0.0062713623046875, 'epoch': 0.46} + 46%|████▌ | 1148/2500 [7:02:27<8:14:31, 21.95s/it] 46%|████▌ | 1149/2500 [7:02:49<8:18:19, 22.13s/it] {'loss': 0.0003, 'grad_norm': 0.2715329090556733, 'learning_rate': 5.403999999999999e-07, 'completion_length': 156.75000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.007080078125, 'epoch': 0.46} + 46%|████▌ | 1149/2500 [7:02:49<8:18:19, 22.13s/it] 46%|████▌ | 1150/2500 [7:03:10<8:06:23, 21.62s/it] {'loss': 0.0002, 'grad_norm': 0.02618169263715104, 'learning_rate': 5.4e-07, 'completion_length': 147.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00494384765625, 'epoch': 0.46} + 46%|████▌ | 1150/2500 [7:03:10<8:06:23, 21.62s/it] 46%|████▌ | 1151/2500 [7:03:31<8:04:56, 21.57s/it] {'loss': 0.0002, 'grad_norm': 0.036895024269004574, 'learning_rate': 5.396e-07, 'completion_length': 149.00000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056610107421875, 'epoch': 0.46} + 46%|████▌ | 1151/2500 [7:03:31<8:04:56, 21.57s/it] 46%|████▌ | 1152/2500 [7:03:53<8:08:46, 21.76s/it] {'loss': 0.0002, 'grad_norm': 0.023966928202961382, 'learning_rate': 5.392e-07, 'completion_length': 148.50000762939453, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050811767578125, 'epoch': 0.46} + 46%|████▌ | 1152/2500 [7:03:54<8:08:46, 21.76s/it] 46%|████▌ | 1153/2500 [7:04:15<8:03:25, 21.53s/it] {'loss': 0.0004, 'grad_norm': 0.8518464617732735, 'learning_rate': 5.387999999999999e-07, 'completion_length': 153.25000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00933837890625, 'epoch': 0.46} + 46%|████▌ | 1153/2500 [7:04:15<8:03:25, 21.53s/it] 46%|████▌ | 1154/2500 [7:04:36<8:01:24, 21.46s/it] {'loss': 0.0002, 'grad_norm': 0.03763581403892708, 'learning_rate': 5.384e-07, 'completion_length': 147.46429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005035400390625, 'epoch': 0.46} + 46%|████▌ | 1154/2500 [7:04:36<8:01:24, 21.46s/it] 46%|████▌ | 1155/2500 [7:04:57<7:59:53, 21.41s/it] {'loss': 0.0002, 'grad_norm': 0.5519024339193254, 'learning_rate': 5.38e-07, 'completion_length': 145.48214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0051116943359375, 'epoch': 0.46} + 46%|████▌ | 1155/2500 [7:04:57<7:59:53, 21.41s/it] 46%|████▌ | 1156/2500 [7:05:20<8:10:14, 21.89s/it] {'loss': 0.0002, 'grad_norm': 0.9457861581172401, 'learning_rate': 5.375999999999999e-07, 'completion_length': 158.50000762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0053253173828125, 'epoch': 0.46} + 46%|████▌ | 1156/2500 [7:05:20<8:10:14, 21.89s/it] 46%|████▋ | 1157/2500 [7:05:42<8:10:44, 21.92s/it] {'loss': 0.0003, 'grad_norm': 1.803183318927623, 'learning_rate': 5.372e-07, 'completion_length': 149.4107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 
'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0065155029296875, 'epoch': 0.46} + 46%|████▋ | 1157/2500 [7:05:42<8:10:44, 21.92s/it] 46%|████▋ | 1158/2500 [7:06:05<8:15:57, 22.17s/it] {'loss': 0.0003, 'grad_norm': 0.9141537307373719, 'learning_rate': 5.368e-07, 'completion_length': 163.23214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0066375732421875, 'epoch': 0.46} + 46%|████▋ | 1158/2500 [7:06:05<8:15:57, 22.17s/it] 46%|████▋ | 1159/2500 [7:06:27<8:13:12, 22.07s/it] {'loss': 0.0002, 'grad_norm': 0.2960490865155484, 'learning_rate': 5.364e-07, 'completion_length': 156.12500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0056610107421875, 'epoch': 0.46} + 46%|████▋ | 1159/2500 [7:06:27<8:13:12, 22.07s/it] 46%|████▋ | 1160/2500 [7:06:49<8:14:55, 22.16s/it] {'loss': 0.0003, 'grad_norm': 0.03172291456932264, 'learning_rate': 5.36e-07, 'completion_length': 156.71429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006317138671875, 'epoch': 0.46} + 46%|████▋ | 1160/2500 [7:06:49<8:14:55, 22.16s/it] 46%|████▋ | 1161/2500 [7:07:11<8:11:56, 22.04s/it] {'loss': 0.0004, 'grad_norm': 0.35444506417319505, 'learning_rate': 5.355999999999999e-07, 'completion_length': 164.0714340209961, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.00897216796875, 'epoch': 0.46} + 46%|████▋ | 1161/2500 [7:07:11<8:11:56, 22.04s/it] 46%|████▋ | 1162/2500 [7:07:32<8:06:35, 21.82s/it] {'loss': 0.0002, 'grad_norm': 1.4068867635169313, 'learning_rate': 5.352e-07, 'completion_length': 143.1071548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 
1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0053253173828125, 'epoch': 0.46} + 46%|████▋ | 1162/2500 [7:07:32<8:06:35, 21.82s/it] 47%|████▋ | 1163/2500 [7:07:54<8:05:38, 21.79s/it] {'loss': 0.0002, 'grad_norm': 1.3663636695252641, 'learning_rate': 5.348e-07, 'completion_length': 162.6964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005828857421875, 'epoch': 0.47} + 47%|████▋ | 1163/2500 [7:07:54<8:05:38, 21.79s/it] 47%|████▋ | 1164/2500 [7:08:17<8:13:52, 22.18s/it] {'loss': 0.0004, 'grad_norm': 1.369293414606023, 'learning_rate': 5.343999999999999e-07, 'completion_length': 166.1428680419922, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.07695359364151955, 'kl': 0.0093994140625, 'epoch': 0.47} + 47%|████▋ | 1164/2500 [7:08:17<8:13:52, 22.18s/it] 47%|████▋ | 1165/2500 [7:08:38<8:04:41, 21.78s/it] {'loss': 0.0001, 'grad_norm': 0.015806904767176406, 'learning_rate': 5.34e-07, 'completion_length': 143.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00350189208984375, 'epoch': 0.47} + 47%|████▋ | 1165/2500 [7:08:38<8:04:41, 21.78s/it] 47%|████▋ | 1166/2500 [7:09:00<8:05:06, 21.82s/it] {'loss': 0.0002, 'grad_norm': 0.5370449041426623, 'learning_rate': 5.336e-07, 'completion_length': 150.42858123779297, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1071428656578064, 'kl': 0.005859375, 'epoch': 0.47} + 47%|████▋ | 1166/2500 [7:09:00<8:05:06, 21.82s/it] 47%|████▋ | 1167/2500 [7:09:24<8:20:05, 22.51s/it] {'loss': 0.0002, 'grad_norm': 0.021762630291015838, 'learning_rate': 5.331999999999999e-07, 'completion_length': 156.39286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 
0.0, 'kl': 0.00386810302734375, 'epoch': 0.47} + 47%|████▋ | 1167/2500 [7:09:24<8:20:05, 22.51s/it] 47%|████▋ | 1168/2500 [7:09:45<8:11:07, 22.12s/it] {'loss': 0.0004, 'grad_norm': 0.9877141355406847, 'learning_rate': 5.328e-07, 'completion_length': 155.17858123779297, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.0714285746216774, 'kl': 0.0099334716796875, 'epoch': 0.47} + 47%|████▋ | 1168/2500 [7:09:45<8:11:07, 22.12s/it] 47%|████▋ | 1169/2500 [7:10:08<8:13:10, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.017320339639500876, 'learning_rate': 5.324e-07, 'completion_length': 144.6428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004486083984375, 'epoch': 0.47} + 47%|████▋ | 1169/2500 [7:10:08<8:13:10, 22.23s/it] 47%|████▋ | 1170/2500 [7:10:29<8:07:48, 22.01s/it] {'loss': 0.0003, 'grad_norm': 0.3666470654295856, 'learning_rate': 5.32e-07, 'completion_length': 164.0714340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0063323974609375, 'epoch': 0.47} + 47%|████▋ | 1170/2500 [7:10:29<8:07:48, 22.01s/it] 47%|████▋ | 1171/2500 [7:10:50<8:01:03, 21.72s/it] {'loss': 0.0002, 'grad_norm': 0.027065748472037823, 'learning_rate': 5.315999999999999e-07, 'completion_length': 151.5178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003875732421875, 'epoch': 0.47} + 47%|████▋ | 1171/2500 [7:10:50<8:01:03, 21.72s/it] 47%|████▋ | 1172/2500 [7:11:12<8:00:14, 21.70s/it] {'loss': 0.0002, 'grad_norm': 0.03333198122630887, 'learning_rate': 5.312e-07, 'completion_length': 152.71429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046844482421875, 'epoch': 0.47} + 47%|████▋ | 1172/2500 [7:11:12<8:00:14, 
21.70s/it] 47%|████▋ | 1173/2500 [7:11:34<8:01:14, 21.76s/it] {'loss': 0.0003, 'grad_norm': 0.3047904931555096, 'learning_rate': 5.308000000000001e-07, 'completion_length': 167.0178680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.007843017578125, 'epoch': 0.47} + 47%|████▋ | 1173/2500 [7:11:34<8:01:14, 21.76s/it] 47%|████▋ | 1174/2500 [7:11:56<8:02:44, 21.84s/it] {'loss': 0.0004, 'grad_norm': 0.025826619659736372, 'learning_rate': 5.303999999999999e-07, 'completion_length': 158.39286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.010009765625, 'epoch': 0.47} + 47%|████▋ | 1174/2500 [7:11:56<8:02:44, 21.84s/it] 47%|████▋ | 1175/2500 [7:12:18<8:07:03, 22.06s/it] {'loss': 0.0002, 'grad_norm': 0.3876281802866087, 'learning_rate': 5.3e-07, 'completion_length': 149.6071548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0052947998046875, 'epoch': 0.47} + 47%|████▋ | 1175/2500 [7:12:18<8:07:03, 22.06s/it] 47%|████▋ | 1176/2500 [7:12:40<8:05:14, 21.99s/it] {'loss': 0.0002, 'grad_norm': 0.04680234505966349, 'learning_rate': 5.296e-07, 'completion_length': 158.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005157470703125, 'epoch': 0.47} + 47%|████▋ | 1176/2500 [7:12:40<8:05:14, 21.99s/it] 47%|████▋ | 1177/2500 [7:13:01<8:00:29, 21.79s/it] {'loss': 0.0002, 'grad_norm': 0.25146895144071896, 'learning_rate': 5.292e-07, 'completion_length': 154.75000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005828857421875, 'epoch': 0.47} + 47%|████▋ | 1177/2500 [7:13:01<8:00:29, 21.79s/it] 
47%|████▋ | 1178/2500 [7:13:24<8:06:49, 22.10s/it] {'loss': 0.0003, 'grad_norm': 0.9323240622549761, 'learning_rate': 5.288e-07, 'completion_length': 158.9107208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.008026123046875, 'epoch': 0.47} + 47%|████▋ | 1178/2500 [7:13:24<8:06:49, 22.10s/it] 47%|████▋ | 1179/2500 [7:13:46<8:07:19, 22.13s/it] {'loss': 0.0002, 'grad_norm': 0.49612602651481646, 'learning_rate': 5.284e-07, 'completion_length': 157.76786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005767822265625, 'epoch': 0.47} + 47%|████▋ | 1179/2500 [7:13:46<8:07:19, 22.13s/it] 47%|████▋ | 1180/2500 [7:14:10<8:15:21, 22.52s/it] {'loss': 0.0003, 'grad_norm': 2.179986417438214, 'learning_rate': 5.28e-07, 'completion_length': 173.83929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0074920654296875, 'epoch': 0.47} + 47%|████▋ | 1180/2500 [7:14:10<8:15:21, 22.52s/it] 47%|████▋ | 1181/2500 [7:14:33<8:18:09, 22.66s/it] {'loss': 0.0002, 'grad_norm': 0.01870516397673207, 'learning_rate': 5.275999999999999e-07, 'completion_length': 159.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048370361328125, 'epoch': 0.47} + 47%|████▋ | 1181/2500 [7:14:33<8:18:09, 22.66s/it] 47%|████▋ | 1182/2500 [7:14:55<8:18:00, 22.67s/it] {'loss': 0.0003, 'grad_norm': 0.04242333164927607, 'learning_rate': 5.272e-07, 'completion_length': 161.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00635528564453125, 'epoch': 0.47} + 47%|████▋ | 1182/2500 [7:14:56<8:18:00, 22.67s/it] 47%|████▋ | 1183/2500 [7:15:18<8:17:58, 22.69s/it] 
{'loss': 0.0003, 'grad_norm': 0.23816469831438225, 'learning_rate': 5.268e-07, 'completion_length': 163.8214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00673675537109375, 'epoch': 0.47} + 47%|████▋ | 1183/2500 [7:15:18<8:17:58, 22.69s/it] 47%|████▋ | 1184/2500 [7:15:40<8:14:32, 22.55s/it] {'loss': 0.0003, 'grad_norm': 0.030878836510074135, 'learning_rate': 5.264e-07, 'completion_length': 161.30358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0064849853515625, 'epoch': 0.47} + 47%|████▋ | 1184/2500 [7:15:40<8:14:32, 22.55s/it] 47%|████▋ | 1185/2500 [7:16:03<8:12:52, 22.49s/it] {'loss': 0.0002, 'grad_norm': 0.47026541010631107, 'learning_rate': 5.26e-07, 'completion_length': 152.89286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00604248046875, 'epoch': 0.47} + 47%|████▋ | 1185/2500 [7:16:03<8:12:52, 22.49s/it] 47%|████▋ | 1186/2500 [7:16:24<8:05:38, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.028889705788819775, 'learning_rate': 5.255999999999999e-07, 'completion_length': 148.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005645751953125, 'epoch': 0.47} + 47%|████▋ | 1186/2500 [7:16:24<8:05:38, 22.18s/it] 47%|████▋ | 1187/2500 [7:16:45<7:58:14, 21.85s/it] {'loss': 0.0003, 'grad_norm': 0.029303351499512784, 'learning_rate': 5.252e-07, 'completion_length': 155.55358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00640869140625, 'epoch': 0.47} + 47%|████▋ | 1187/2500 [7:16:45<7:58:14, 21.85s/it] 48%|████▊ | 1188/2500 [7:17:07<7:59:42, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.267734347611394, 'learning_rate': 5.248e-07, 
'completion_length': 162.08929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0046539306640625, 'epoch': 0.48} + 48%|████▊ | 1188/2500 [7:17:07<7:59:42, 21.94s/it] 48%|████▊ | 1189/2500 [7:17:30<8:05:20, 22.21s/it] {'loss': 0.0002, 'grad_norm': 0.37316467323629143, 'learning_rate': 5.243999999999999e-07, 'completion_length': 167.08929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0055999755859375, 'epoch': 0.48} + 48%|████▊ | 1189/2500 [7:17:30<8:05:20, 22.21s/it] 48%|████▊ | 1190/2500 [7:17:52<8:02:20, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.23236080247189822, 'learning_rate': 5.24e-07, 'completion_length': 143.21428680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0041961669921875, 'epoch': 0.48} + 48%|████▊ | 1190/2500 [7:17:52<8:02:20, 22.09s/it] 48%|████▊ | 1191/2500 [7:18:14<8:02:45, 22.13s/it] {'loss': 0.0003, 'grad_norm': 0.38666449662182445, 'learning_rate': 5.236e-07, 'completion_length': 161.42857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0069122314453125, 'epoch': 0.48} + 48%|████▊ | 1191/2500 [7:18:14<8:02:45, 22.13s/it] 48%|████▊ | 1192/2500 [7:18:36<7:57:12, 21.89s/it] {'loss': 0.0002, 'grad_norm': 0.022358634599289513, 'learning_rate': 5.232e-07, 'completion_length': 152.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052032470703125, 'epoch': 0.48} + 48%|████▊ | 1192/2500 [7:18:36<7:57:12, 21.89s/it] 48%|████▊ | 1193/2500 [7:18:58<7:56:23, 21.87s/it] {'loss': 0.0002, 'grad_norm': 0.23932515588253936, 'learning_rate': 5.228e-07, 
'completion_length': 161.51786041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0045013427734375, 'epoch': 0.48} + 48%|████▊ | 1193/2500 [7:18:58<7:56:23, 21.87s/it] 48%|████▊ | 1194/2500 [7:19:19<7:52:29, 21.71s/it] {'loss': 0.0003, 'grad_norm': 0.6405459661608794, 'learning_rate': 5.224e-07, 'completion_length': 138.96428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0068359375, 'epoch': 0.48} + 48%|████▊ | 1194/2500 [7:19:19<7:52:29, 21.71s/it] 48%|████▊ | 1195/2500 [7:19:40<7:49:37, 21.59s/it] {'loss': 0.0002, 'grad_norm': 0.01861420214961302, 'learning_rate': 5.22e-07, 'completion_length': 139.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0042572021484375, 'epoch': 0.48} + 48%|████▊ | 1195/2500 [7:19:40<7:49:37, 21.59s/it] 48%|████▊ | 1196/2500 [7:20:03<7:59:16, 22.05s/it] {'loss': 0.0003, 'grad_norm': 0.2704148920541655, 'learning_rate': 5.215999999999999e-07, 'completion_length': 167.5, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0078887939453125, 'epoch': 0.48} + 48%|████▊ | 1196/2500 [7:20:03<7:59:16, 22.05s/it] 48%|████▊ | 1197/2500 [7:20:25<7:53:32, 21.81s/it] {'loss': 0.0002, 'grad_norm': 0.7624678731684243, 'learning_rate': 5.212e-07, 'completion_length': 146.00000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050048828125, 'epoch': 0.48} + 48%|████▊ | 1197/2500 [7:20:25<7:53:32, 21.81s/it] 48%|████▊ | 1198/2500 [7:20:46<7:53:38, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.25915197400053214, 'learning_rate': 5.208000000000001e-07, 
'completion_length': 153.92857360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005645751953125, 'epoch': 0.48} + 48%|████▊ | 1198/2500 [7:20:46<7:53:38, 21.83s/it] 48%|████▊ | 1199/2500 [7:21:09<7:57:34, 22.02s/it] {'loss': 0.0003, 'grad_norm': 0.276201186688378, 'learning_rate': 5.203999999999999e-07, 'completion_length': 170.05358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0071868896484375, 'epoch': 0.48} + 48%|████▊ | 1199/2500 [7:21:09<7:57:34, 22.02s/it] 48%|████▊ | 1200/2500 [7:21:33<8:11:07, 22.67s/it] {'loss': 0.0003, 'grad_norm': 0.25010906154841933, 'learning_rate': 5.2e-07, 'completion_length': 157.67858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006622314453125, 'epoch': 0.48} + 48%|████▊ | 1200/2500 [7:21:33<8:11:07, 22.67s/it] 48%|████▊ | 1201/2500 [7:25:30<31:21:01, 86.88s/it] {'loss': 0.0002, 'grad_norm': 0.03796356478185557, 'learning_rate': 5.196e-07, 'completion_length': 150.55358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00531768798828125, 'epoch': 0.48} + 48%|████▊ | 1201/2500 [7:25:30<31:21:01, 86.88s/it] 48%|████▊ | 1202/2500 [7:25:53<24:24:13, 67.68s/it] {'loss': 0.0002, 'grad_norm': 0.026783451476104357, 'learning_rate': 5.191999999999999e-07, 'completion_length': 153.4107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005828857421875, 'epoch': 0.48} + 48%|████▊ | 1202/2500 [7:25:53<24:24:13, 67.68s/it] 48%|████▊ | 1203/2500 [7:26:14<19:23:58, 53.85s/it] {'loss': 0.0002, 'grad_norm': 0.23278487951380938, 'learning_rate': 5.188e-07, 
'completion_length': 146.98214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0040130615234375, 'epoch': 0.48} + 48%|████▊ | 1203/2500 [7:26:14<19:23:58, 53.85s/it] 48%|████▊ | 1204/2500 [7:26:36<15:55:31, 44.24s/it] {'loss': 0.0002, 'grad_norm': 0.025023846676919464, 'learning_rate': 5.184e-07, 'completion_length': 151.92857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058746337890625, 'epoch': 0.48} + 48%|████▊ | 1204/2500 [7:26:36<15:55:31, 44.24s/it] 48%|████▊ | 1205/2500 [7:26:58<13:27:51, 37.43s/it] {'loss': 0.0003, 'grad_norm': 0.03934535826120743, 'learning_rate': 5.18e-07, 'completion_length': 150.3571548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0063934326171875, 'epoch': 0.48} + 48%|████▊ | 1205/2500 [7:26:58<13:27:51, 37.43s/it] 48%|████▊ | 1206/2500 [7:27:19<11:44:11, 32.65s/it] {'loss': 0.0002, 'grad_norm': 0.024628725436261458, 'learning_rate': 5.175999999999999e-07, 'completion_length': 152.73214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00511932373046875, 'epoch': 0.48} + 48%|████▊ | 1206/2500 [7:27:19<11:44:11, 32.65s/it] 48%|████▊ | 1207/2500 [7:27:42<10:40:20, 29.71s/it] {'loss': 0.0003, 'grad_norm': 0.4053162588819567, 'learning_rate': 5.172e-07, 'completion_length': 156.5714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00634765625, 'epoch': 0.48} + 48%|████▊ | 1207/2500 [7:27:42<10:40:20, 29.71s/it] 48%|████▊ | 1208/2500 [7:28:04<9:50:57, 27.44s/it] {'loss': 0.0002, 'grad_norm': 0.4662304049311799, 'learning_rate': 5.168e-07, 'completion_length': 163.98214721679688, 
'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00616455078125, 'epoch': 0.48} + 48%|████▊ | 1208/2500 [7:28:04<9:50:57, 27.44s/it] 48%|████▊ | 1209/2500 [7:28:28<9:30:45, 26.53s/it] {'loss': 0.0004, 'grad_norm': 1.1304815662252297, 'learning_rate': 5.163999999999999e-07, 'completion_length': 187.92858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0088348388671875, 'epoch': 0.48} + 48%|████▊ | 1209/2500 [7:28:28<9:30:45, 26.53s/it] 48%|████▊ | 1210/2500 [7:28:51<9:01:27, 25.18s/it] {'loss': 0.0002, 'grad_norm': 0.03846159983061448, 'learning_rate': 5.16e-07, 'completion_length': 151.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005828857421875, 'epoch': 0.48} + 48%|████▊ | 1210/2500 [7:28:51<9:01:27, 25.18s/it] 48%|████▊ | 1211/2500 [7:29:12<8:39:19, 24.17s/it] {'loss': 0.0004, 'grad_norm': 0.5524326654190139, 'learning_rate': 5.155999999999999e-07, 'completion_length': 165.75, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0093231201171875, 'epoch': 0.48} + 48%|████▊ | 1211/2500 [7:29:12<8:39:19, 24.17s/it] 48%|████▊ | 1212/2500 [7:29:33<8:19:31, 23.27s/it] {'loss': 0.0002, 'grad_norm': 0.03656477191846029, 'learning_rate': 5.152e-07, 'completion_length': 141.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051116943359375, 'epoch': 0.48} + 48%|████▊ | 1212/2500 [7:29:34<8:19:31, 23.27s/it] 49%|████▊ | 1213/2500 [7:29:54<8:03:42, 22.55s/it] {'loss': 0.0003, 'grad_norm': 0.2448826749373785, 'learning_rate': 5.148e-07, 'completion_length': 153.23214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 
'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.006805419921875, 'epoch': 0.49} + 49%|████▊ | 1213/2500 [7:29:54<8:03:42, 22.55s/it] 49%|████▊ | 1214/2500 [7:30:16<7:59:08, 22.36s/it] {'loss': 0.0004, 'grad_norm': 0.3655106141828756, 'learning_rate': 5.143999999999999e-07, 'completion_length': 164.75000762939453, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.0112457275390625, 'epoch': 0.49} + 49%|████▊ | 1214/2500 [7:30:16<7:59:08, 22.36s/it] 49%|████▊ | 1215/2500 [7:30:39<8:00:02, 22.41s/it] {'loss': 0.0002, 'grad_norm': 0.3019426395350761, 'learning_rate': 5.14e-07, 'completion_length': 170.5357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0058135986328125, 'epoch': 0.49} + 49%|████▊ | 1215/2500 [7:30:39<8:00:02, 22.41s/it] 49%|████▊ | 1216/2500 [7:31:01<7:58:30, 22.36s/it] {'loss': 0.0003, 'grad_norm': 0.4909937359199473, 'learning_rate': 5.135999999999999e-07, 'completion_length': 145.30358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0071258544921875, 'epoch': 0.49} + 49%|████▊ | 1216/2500 [7:31:01<7:58:30, 22.36s/it] 49%|████▊ | 1217/2500 [7:31:24<7:59:25, 22.42s/it] {'loss': 0.0003, 'grad_norm': 1.1647734160803793, 'learning_rate': 5.132e-07, 'completion_length': 159.62500762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0067138671875, 'epoch': 0.49} + 49%|████▊ | 1217/2500 [7:31:24<7:59:25, 22.42s/it] 49%|████▊ | 1218/2500 [7:31:46<7:55:44, 22.27s/it] {'loss': 0.0003, 'grad_norm': 0.5992792607380838, 'learning_rate': 5.128e-07, 'completion_length': 146.1607208251953, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0081024169921875, 'epoch': 0.49} + 49%|████▊ | 1218/2500 [7:31:46<7:55:44, 22.27s/it] 49%|████▉ | 1219/2500 [7:32:07<7:47:30, 21.90s/it] {'loss': 0.0002, 'grad_norm': 1.0113688799375133, 'learning_rate': 5.124e-07, 'completion_length': 137.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0058746337890625, 'epoch': 0.49} + 49%|████▉ | 1219/2500 [7:32:07<7:47:30, 21.90s/it] 49%|████▉ | 1220/2500 [7:32:28<7:46:50, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.27137662987177685, 'learning_rate': 5.12e-07, 'completion_length': 156.25000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0068206787109375, 'epoch': 0.49} + 49%|████▉ | 1220/2500 [7:32:28<7:46:50, 21.88s/it] 49%|████▉ | 1221/2500 [7:32:50<7:46:10, 21.87s/it] {'loss': 0.0002, 'grad_norm': 0.22583717200491837, 'learning_rate': 5.116e-07, 'completion_length': 140.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0060882568359375, 'epoch': 0.49} + 49%|████▉ | 1221/2500 [7:32:50<7:46:10, 21.87s/it] 49%|████▉ | 1222/2500 [7:33:11<7:37:37, 21.48s/it] {'loss': 0.0002, 'grad_norm': 0.3435986559823958, 'learning_rate': 5.112e-07, 'completion_length': 145.26786041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00438690185546875, 'epoch': 0.49} + 49%|████▉ | 1222/2500 [7:33:11<7:37:37, 21.48s/it] 49%|████▉ | 1223/2500 [7:33:33<7:41:58, 21.71s/it] {'loss': 0.0002, 'grad_norm': 0.7500702575909797, 'learning_rate': 5.108e-07, 
'completion_length': 157.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0059814453125, 'epoch': 0.49} + 49%|████▉ | 1223/2500 [7:33:33<7:41:58, 21.71s/it] 49%|████▉ | 1224/2500 [7:33:57<7:53:54, 22.28s/it] {'loss': 0.0004, 'grad_norm': 1.7969914741362627, 'learning_rate': 5.103999999999999e-07, 'completion_length': 174.4107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00897216796875, 'epoch': 0.49} + 49%|████▉ | 1224/2500 [7:33:57<7:53:54, 22.28s/it] 49%|████▉ | 1225/2500 [7:34:19<7:51:47, 22.20s/it] {'loss': 0.0003, 'grad_norm': 0.40897922752522137, 'learning_rate': 5.1e-07, 'completion_length': 145.46428680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00726318359375, 'epoch': 0.49} + 49%|████▉ | 1225/2500 [7:34:19<7:51:47, 22.20s/it] 49%|████▉ | 1226/2500 [7:34:41<7:50:48, 22.17s/it] {'loss': 0.0003, 'grad_norm': 0.024753384522277946, 'learning_rate': 5.096000000000001e-07, 'completion_length': 169.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007232666015625, 'epoch': 0.49} + 49%|████▉ | 1226/2500 [7:34:41<7:50:48, 22.17s/it] 49%|████▉ | 1227/2500 [7:35:03<7:49:50, 22.14s/it] {'loss': 0.0002, 'grad_norm': 0.7058026738217034, 'learning_rate': 5.091999999999999e-07, 'completion_length': 159.5357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.006103515625, 'epoch': 0.49} + 49%|████▉ | 1227/2500 [7:35:03<7:49:50, 22.14s/it] 49%|████▉ | 1228/2500 [7:35:25<7:51:31, 22.24s/it] {'loss': 0.0003, 'grad_norm': 0.0323729459735602, 'learning_rate': 
5.088e-07, 'completion_length': 167.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0070648193359375, 'epoch': 0.49} + 49%|████▉ | 1228/2500 [7:35:25<7:51:31, 22.24s/it] 49%|████▉ | 1229/2500 [7:35:46<7:38:49, 21.66s/it] {'loss': 0.0002, 'grad_norm': 0.03250195029759873, 'learning_rate': 5.084e-07, 'completion_length': 153.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005035400390625, 'epoch': 0.49} + 49%|████▉ | 1229/2500 [7:35:46<7:38:49, 21.66s/it] 49%|████▉ | 1230/2500 [7:36:08<7:40:40, 21.76s/it] {'loss': 0.0002, 'grad_norm': 0.5602764834216039, 'learning_rate': 5.079999999999999e-07, 'completion_length': 158.3928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0058135986328125, 'epoch': 0.49} + 49%|████▉ | 1230/2500 [7:36:08<7:40:40, 21.76s/it] 49%|████▉ | 1231/2500 [7:36:30<7:43:51, 21.93s/it] {'loss': 0.0002, 'grad_norm': 0.03380281267225305, 'learning_rate': 5.076e-07, 'completion_length': 155.23214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00543212890625, 'epoch': 0.49} + 49%|████▉ | 1231/2500 [7:36:30<7:43:51, 21.93s/it] 49%|████▉ | 1232/2500 [7:36:51<7:37:51, 21.67s/it] {'loss': 0.0003, 'grad_norm': 0.3534667214177742, 'learning_rate': 5.072e-07, 'completion_length': 146.05357360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.007232666015625, 'epoch': 0.49} + 49%|████▉ | 1232/2500 [7:36:51<7:37:51, 21.67s/it] 49%|████▉ | 1233/2500 [7:37:13<7:41:20, 21.85s/it] {'loss': 0.0002, 'grad_norm': 0.1871977921984694, 'learning_rate': 5.068e-07, 'completion_length': 159.58929443359375, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059356689453125, 'epoch': 0.49} + 49%|████▉ | 1233/2500 [7:37:13<7:41:20, 21.85s/it] 49%|████▉ | 1234/2500 [7:37:34<7:35:14, 21.58s/it] {'loss': 0.0002, 'grad_norm': 0.1979629032751049, 'learning_rate': 5.063999999999999e-07, 'completion_length': 152.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00415802001953125, 'epoch': 0.49} + 49%|████▉ | 1234/2500 [7:37:34<7:35:14, 21.58s/it] 49%|████▉ | 1235/2500 [7:37:56<7:38:12, 21.73s/it] {'loss': 0.0003, 'grad_norm': 0.41765343512731984, 'learning_rate': 5.06e-07, 'completion_length': 157.87500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0064849853515625, 'epoch': 0.49} + 49%|████▉ | 1235/2500 [7:37:56<7:38:12, 21.73s/it] 49%|████▉ | 1236/2500 [7:38:17<7:31:49, 21.45s/it] {'loss': 0.0001, 'grad_norm': 0.016856270883515812, 'learning_rate': 5.056e-07, 'completion_length': 131.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0033416748046875, 'epoch': 0.49} + 49%|████▉ | 1236/2500 [7:38:17<7:31:49, 21.45s/it] 49%|████▉ | 1237/2500 [7:38:39<7:36:43, 21.70s/it] {'loss': 0.0002, 'grad_norm': 0.30582117280250487, 'learning_rate': 5.051999999999999e-07, 'completion_length': 158.75000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00579833984375, 'epoch': 0.49} + 49%|████▉ | 1237/2500 [7:38:39<7:36:43, 21.70s/it] 50%|████▉ | 1238/2500 [7:39:01<7:37:44, 21.76s/it] {'loss': 0.0003, 'grad_norm': 0.09215853194548132, 'learning_rate': 5.048e-07, 'completion_length': 145.42857360839844, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0067138671875, 'epoch': 0.5} + 50%|████▉ | 1238/2500 [7:39:01<7:37:44, 21.76s/it] 50%|████▉ | 1239/2500 [7:39:23<7:39:34, 21.87s/it] {'loss': 0.0003, 'grad_norm': 0.056916871504278044, 'learning_rate': 5.043999999999999e-07, 'completion_length': 160.92857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00860595703125, 'epoch': 0.5} + 50%|████▉ | 1239/2500 [7:39:23<7:39:34, 21.87s/it] 50%|████▉ | 1240/2500 [7:39:46<7:45:57, 22.19s/it] {'loss': 0.0002, 'grad_norm': 0.30737477073456615, 'learning_rate': 5.04e-07, 'completion_length': 150.25000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00537109375, 'epoch': 0.5} + 50%|████▉ | 1240/2500 [7:39:46<7:45:57, 22.19s/it] 50%|████▉ | 1241/2500 [7:40:09<7:47:44, 22.29s/it] {'loss': 0.0004, 'grad_norm': 0.40939733737902534, 'learning_rate': 5.036e-07, 'completion_length': 176.39286041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0088348388671875, 'epoch': 0.5} + 50%|████▉ | 1241/2500 [7:40:09<7:47:44, 22.29s/it] 50%|████▉ | 1242/2500 [7:40:31<7:46:47, 22.26s/it] {'loss': 0.0003, 'grad_norm': 0.0345102066124481, 'learning_rate': 5.032e-07, 'completion_length': 162.37500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0067596435546875, 'epoch': 0.5} + 50%|████▉ | 1242/2500 [7:40:31<7:46:47, 22.26s/it] 50%|████▉ | 1243/2500 [7:40:52<7:40:05, 21.96s/it] {'loss': 0.0002, 'grad_norm': 0.1880658181746327, 'learning_rate': 5.028e-07, 'completion_length': 148.26786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 
'reward_std': 0.0357142873108387, 'kl': 0.00494384765625, 'epoch': 0.5} + 50%|████▉ | 1243/2500 [7:40:52<7:40:05, 21.96s/it] 50%|████▉ | 1244/2500 [7:41:14<7:40:17, 21.99s/it] {'loss': 0.0002, 'grad_norm': 0.033315697777984946, 'learning_rate': 5.023999999999999e-07, 'completion_length': 153.85714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0058135986328125, 'epoch': 0.5} + 50%|████▉ | 1244/2500 [7:41:14<7:40:17, 21.99s/it] 50%|████▉ | 1245/2500 [7:41:36<7:37:06, 21.85s/it] {'loss': 0.0003, 'grad_norm': 0.4512717765326058, 'learning_rate': 5.02e-07, 'completion_length': 153.5, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.0066986083984375, 'epoch': 0.5} + 50%|████▉ | 1245/2500 [7:41:36<7:37:06, 21.85s/it] 50%|████▉ | 1246/2500 [7:41:58<7:35:00, 21.77s/it] {'loss': 0.0002, 'grad_norm': 0.04400419140909477, 'learning_rate': 5.016e-07, 'completion_length': 165.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00457763671875, 'epoch': 0.5} + 50%|████▉ | 1246/2500 [7:41:58<7:35:00, 21.77s/it] 50%|████▉ | 1247/2500 [7:42:19<7:35:57, 21.83s/it] {'loss': 0.0003, 'grad_norm': 0.027967205801910866, 'learning_rate': 5.012e-07, 'completion_length': 156.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0067596435546875, 'epoch': 0.5} + 50%|████▉ | 1247/2500 [7:42:20<7:35:57, 21.83s/it] 50%|████▉ | 1248/2500 [7:42:43<7:44:15, 22.25s/it] {'loss': 0.0003, 'grad_norm': 0.3755503618254619, 'learning_rate': 5.008e-07, 'completion_length': 182.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0085296630859375, 'epoch': 0.5} + 50%|████▉ | 
1248/2500 [7:42:43<7:44:15, 22.25s/it] 50%|████▉ | 1249/2500 [7:43:05<7:46:54, 22.39s/it] {'loss': 0.0003, 'grad_norm': 0.7466011017413896, 'learning_rate': 5.003999999999999e-07, 'completion_length': 176.55358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.1428571492433548, 'kl': 0.0067291259765625, 'epoch': 0.5} + 50%|████▉ | 1249/2500 [7:43:05<7:46:54, 22.39s/it] 50%|█████ | 1250/2500 [7:43:28<7:45:54, 22.36s/it] {'loss': 0.0003, 'grad_norm': 0.28602399865803807, 'learning_rate': 5e-07, 'completion_length': 160.1607208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0079498291015625, 'epoch': 0.5} + 50%|█████ | 1250/2500 [7:43:28<7:45:54, 22.36s/it] 50%|█████ | 1251/2500 [7:43:50<7:44:06, 22.30s/it] {'loss': 0.0001, 'grad_norm': 0.01879093629760245, 'learning_rate': 4.996e-07, 'completion_length': 146.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0035858154296875, 'epoch': 0.5} + 50%|█████ | 1251/2500 [7:43:50<7:44:06, 22.30s/it] 50%|█████ | 1252/2500 [7:44:13<7:47:03, 22.45s/it] {'loss': 0.0004, 'grad_norm': 0.315519547501985, 'learning_rate': 4.991999999999999e-07, 'completion_length': 173.96429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0093536376953125, 'epoch': 0.5} + 50%|█████ | 1252/2500 [7:44:13<7:47:03, 22.45s/it] 50%|█████ | 1253/2500 [7:44:34<7:41:48, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.1806251395410865, 'learning_rate': 4.988e-07, 'completion_length': 151.14286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004638671875, 'epoch': 0.5} + 50%|█████ | 
1253/2500 [7:44:34<7:41:48, 22.22s/it] 50%|█████ | 1254/2500 [7:44:56<7:34:45, 21.90s/it] {'loss': 0.0002, 'grad_norm': 0.06292739506665458, 'learning_rate': 4.984e-07, 'completion_length': 164.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005645751953125, 'epoch': 0.5} + 50%|█████ | 1254/2500 [7:44:56<7:34:45, 21.90s/it] 50%|█████ | 1255/2500 [7:45:16<7:23:09, 21.36s/it] {'loss': 0.0002, 'grad_norm': 0.04389399413883796, 'learning_rate': 4.979999999999999e-07, 'completion_length': 137.23215103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00439453125, 'epoch': 0.5} + 50%|█████ | 1255/2500 [7:45:16<7:23:09, 21.36s/it] 50%|█████ | 1256/2500 [7:45:37<7:24:07, 21.42s/it] {'loss': 0.0002, 'grad_norm': 0.2605744313342057, 'learning_rate': 4.976e-07, 'completion_length': 154.3214340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.005218505859375, 'epoch': 0.5} + 50%|█████ | 1256/2500 [7:45:37<7:24:07, 21.42s/it] 50%|█████ | 1257/2500 [7:45:58<7:20:15, 21.25s/it] {'loss': 0.0003, 'grad_norm': 0.4345381517478198, 'learning_rate': 4.972e-07, 'completion_length': 141.14286041259766, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.0070037841796875, 'epoch': 0.5} + 50%|█████ | 1257/2500 [7:45:58<7:20:15, 21.25s/it] 50%|█████ | 1258/2500 [7:46:19<7:20:45, 21.29s/it] {'loss': 0.0002, 'grad_norm': 0.017328875119529682, 'learning_rate': 4.968e-07, 'completion_length': 146.83929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00514984130859375, 'epoch': 0.5} + 50%|█████ | 1258/2500 [7:46:19<7:20:45, 21.29s/it] 50%|█████ | 1259/2500 [7:46:42<7:31:06, 21.81s/it] {'loss': 
0.0003, 'grad_norm': 0.3398518705280519, 'learning_rate': 4.964e-07, 'completion_length': 168.46428680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0080718994140625, 'epoch': 0.5} + 50%|█████ | 1259/2500 [7:46:42<7:31:06, 21.81s/it] 50%|█████ | 1260/2500 [7:47:04<7:27:43, 21.66s/it] {'loss': 0.0002, 'grad_norm': 1.2007446106085875, 'learning_rate': 4.96e-07, 'completion_length': 149.5714340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.00536346435546875, 'epoch': 0.5} + 50%|█████ | 1260/2500 [7:47:04<7:27:43, 21.66s/it] 50%|█████ | 1261/2500 [7:47:29<7:48:55, 22.71s/it] {'loss': 0.0002, 'grad_norm': 0.17490919268066238, 'learning_rate': 4.956e-07, 'completion_length': 160.19644165039062, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9285714626312256, 'reward_std': 0.058321189135313034, 'kl': 0.00569915771484375, 'epoch': 0.5} + 50%|█████ | 1261/2500 [7:47:29<7:48:55, 22.71s/it] 50%|█████ | 1262/2500 [7:47:51<7:47:05, 22.64s/it] {'loss': 0.0002, 'grad_norm': 0.02768510254608112, 'learning_rate': 4.951999999999999e-07, 'completion_length': 154.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00409698486328125, 'epoch': 0.5} + 50%|█████ | 1262/2500 [7:47:51<7:47:05, 22.64s/it] 51%|█████ | 1263/2500 [7:48:16<7:55:53, 23.08s/it] {'loss': 0.0003, 'grad_norm': 0.01844125551035379, 'learning_rate': 4.948e-07, 'completion_length': 166.08928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0069732666015625, 'epoch': 0.51} + 51%|█████ | 1263/2500 [7:48:16<7:55:53, 23.08s/it] 51%|█████ | 1264/2500 [7:48:38<7:54:05, 23.01s/it] {'loss': 
0.0003, 'grad_norm': 0.018284508069013796, 'learning_rate': 4.944e-07, 'completion_length': 150.73214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.007598876953125, 'epoch': 0.51} + 51%|█████ | 1264/2500 [7:48:38<7:54:05, 23.01s/it] 51%|█████ | 1265/2500 [7:49:01<7:48:36, 22.77s/it] {'loss': 0.0002, 'grad_norm': 0.31113481594778364, 'learning_rate': 4.94e-07, 'completion_length': 158.92858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0056915283203125, 'epoch': 0.51} + 51%|█████ | 1265/2500 [7:49:01<7:48:36, 22.77s/it] 51%|█████ | 1266/2500 [7:49:25<7:57:13, 23.20s/it] {'loss': 0.0002, 'grad_norm': 0.5394661686121852, 'learning_rate': 4.935999999999999e-07, 'completion_length': 171.17858123779297, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.0053558349609375, 'epoch': 0.51} + 51%|█████ | 1266/2500 [7:49:25<7:57:13, 23.20s/it] 51%|█████ | 1267/2500 [7:49:47<7:52:47, 23.01s/it] {'loss': 0.0003, 'grad_norm': 0.4216231884929912, 'learning_rate': 4.932e-07, 'completion_length': 165.7857208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.008544921875, 'epoch': 0.51} + 51%|█████ | 1267/2500 [7:49:47<7:52:47, 23.01s/it] 51%|█████ | 1268/2500 [7:50:09<7:43:19, 22.56s/it] {'loss': 0.0002, 'grad_norm': 0.0320359511276914, 'learning_rate': 4.928e-07, 'completion_length': 155.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004486083984375, 'epoch': 0.51} + 51%|█████ | 1268/2500 [7:50:09<7:43:19, 22.56s/it] 51%|█████ | 1269/2500 [7:50:32<7:47:48, 22.80s/it] {'loss': 0.0001, 'grad_norm': 
0.04490949330270182, 'learning_rate': 4.923999999999999e-07, 'completion_length': 148.2678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00366973876953125, 'epoch': 0.51} + 51%|█████ | 1269/2500 [7:50:32<7:47:48, 22.80s/it] 51%|█████ | 1270/2500 [7:50:55<7:44:44, 22.67s/it] {'loss': 0.0004, 'grad_norm': 0.4057367657627022, 'learning_rate': 4.92e-07, 'completion_length': 171.37500762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0112457275390625, 'epoch': 0.51} + 51%|█████ | 1270/2500 [7:50:55<7:44:44, 22.67s/it] 51%|█████ | 1271/2500 [7:51:17<7:41:15, 22.52s/it] {'loss': 0.0002, 'grad_norm': 0.7632439190052973, 'learning_rate': 4.916e-07, 'completion_length': 155.87500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005859375, 'epoch': 0.51} + 51%|█████ | 1271/2500 [7:51:17<7:41:15, 22.52s/it] 51%|█████ | 1272/2500 [7:51:39<7:38:50, 22.42s/it] {'loss': 0.0002, 'grad_norm': 0.048176656876067, 'learning_rate': 4.912e-07, 'completion_length': 163.08929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0053558349609375, 'epoch': 0.51} + 51%|█████ | 1272/2500 [7:51:39<7:38:50, 22.42s/it] 51%|█████ | 1273/2500 [7:52:01<7:34:35, 22.23s/it] {'loss': 0.0002, 'grad_norm': 1.409718144969254, 'learning_rate': 4.908e-07, 'completion_length': 144.2857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00521087646484375, 'epoch': 0.51} + 51%|█████ | 1273/2500 [7:52:01<7:34:35, 22.23s/it] 51%|█████ | 1274/2500 [7:52:23<7:33:22, 22.19s/it] {'loss': 0.0004, 'grad_norm': 0.8098931563197467, 
'learning_rate': 4.904e-07, 'completion_length': 158.42857360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0088958740234375, 'epoch': 0.51} + 51%|█████ | 1274/2500 [7:52:23<7:33:22, 22.19s/it] 51%|█████ | 1275/2500 [7:52:46<7:36:51, 22.38s/it] {'loss': 0.0003, 'grad_norm': 0.3645613642445515, 'learning_rate': 4.9e-07, 'completion_length': 156.12500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0067901611328125, 'epoch': 0.51} + 51%|█████ | 1275/2500 [7:52:46<7:36:51, 22.38s/it] 51%|█████ | 1276/2500 [7:53:07<7:32:29, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.02727780654003128, 'learning_rate': 4.895999999999999e-07, 'completion_length': 154.07144165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00469970703125, 'epoch': 0.51} + 51%|█████ | 1276/2500 [7:53:07<7:32:29, 22.18s/it] 51%|█████ | 1277/2500 [7:53:30<7:35:30, 22.35s/it] {'loss': 0.0002, 'grad_norm': 0.01828465608222406, 'learning_rate': 4.892e-07, 'completion_length': 153.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0059661865234375, 'epoch': 0.51} + 51%|█████ | 1277/2500 [7:53:30<7:35:30, 22.35s/it] 51%|█████ | 1278/2500 [7:53:52<7:34:27, 22.31s/it] {'loss': 0.0002, 'grad_norm': 0.18408738388256343, 'learning_rate': 4.888e-07, 'completion_length': 146.00000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0047454833984375, 'epoch': 0.51} + 51%|█████ | 1278/2500 [7:53:52<7:34:27, 22.31s/it] 51%|█████ | 1279/2500 [7:54:15<7:33:22, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.06557896895597091, 'learning_rate': 
4.884e-07, 'completion_length': 157.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0042877197265625, 'epoch': 0.51} + 51%|█████ | 1279/2500 [7:54:15<7:33:22, 22.28s/it] 51%|█████ | 1280/2500 [7:54:35<7:21:24, 21.71s/it] {'loss': 0.0002, 'grad_norm': 0.05270270765221782, 'learning_rate': 4.879999999999999e-07, 'completion_length': 139.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056915283203125, 'epoch': 0.51} + 51%|█████ | 1280/2500 [7:54:35<7:21:24, 21.71s/it] 51%|█████ | 1281/2500 [7:54:57<7:20:30, 21.68s/it] {'loss': 0.0003, 'grad_norm': 0.2292982923690461, 'learning_rate': 4.876e-07, 'completion_length': 166.58929443359375, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.04123930633068085, 'kl': 0.00665283203125, 'epoch': 0.51} + 51%|█████ | 1281/2500 [7:54:57<7:20:30, 21.68s/it] 51%|█████▏ | 1282/2500 [7:55:19<7:23:39, 21.86s/it] {'loss': 0.0003, 'grad_norm': 0.03785191217076262, 'learning_rate': 4.872e-07, 'completion_length': 161.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0068206787109375, 'epoch': 0.51} + 51%|█████▏ | 1282/2500 [7:55:19<7:23:39, 21.86s/it] 51%|█████▏ | 1283/2500 [7:55:41<7:24:03, 21.89s/it] {'loss': 0.0003, 'grad_norm': 0.2746715655082128, 'learning_rate': 4.867999999999999e-07, 'completion_length': 154.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006378173828125, 'epoch': 0.51} + 51%|█████▏ | 1283/2500 [7:55:41<7:24:03, 21.89s/it] 51%|█████▏ | 1284/2500 [7:56:02<7:21:50, 21.80s/it] {'loss': 0.0003, 'grad_norm': 1.389296990535091, 'learning_rate': 4.864e-07, 'completion_length': 156.73214721679688, 'rewards/accuracy_reward': 
0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.006622314453125, 'epoch': 0.51} + 51%|█████▏ | 1284/2500 [7:56:02<7:21:50, 21.80s/it] 51%|█████▏ | 1285/2500 [7:56:23<7:16:54, 21.58s/it] {'loss': 0.0002, 'grad_norm': 0.3398858298178209, 'learning_rate': 4.86e-07, 'completion_length': 134.8928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00444793701171875, 'epoch': 0.51} + 51%|█████▏ | 1285/2500 [7:56:23<7:16:54, 21.58s/it] 51%|█████▏ | 1286/2500 [7:56:48<7:33:18, 22.40s/it] {'loss': 0.0002, 'grad_norm': 0.25628412070270445, 'learning_rate': 4.856e-07, 'completion_length': 158.17857360839844, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.00579833984375, 'epoch': 0.51} + 51%|█████▏ | 1286/2500 [7:56:48<7:33:18, 22.40s/it] 51%|█████▏ | 1287/2500 [7:57:12<7:43:04, 22.91s/it] {'loss': 0.0003, 'grad_norm': 1.6576346786471392, 'learning_rate': 4.852e-07, 'completion_length': 167.3928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0073699951171875, 'epoch': 0.51} + 51%|█████▏ | 1287/2500 [7:57:12<7:43:04, 22.91s/it] 52%|█████▏ | 1288/2500 [7:57:34<7:35:45, 22.56s/it] {'loss': 0.0004, 'grad_norm': 0.4895424744109912, 'learning_rate': 4.848e-07, 'completion_length': 156.1071548461914, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.010284423828125, 'epoch': 0.52} + 52%|█████▏ | 1288/2500 [7:57:34<7:35:45, 22.56s/it] 52%|█████▏ | 1289/2500 [7:57:55<7:30:34, 22.32s/it] {'loss': 0.0003, 'grad_norm': 0.058985451553267614, 'learning_rate': 4.844e-07, 'completion_length': 
159.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0084075927734375, 'epoch': 0.52} + 52%|█████▏ | 1289/2500 [7:57:55<7:30:34, 22.32s/it] 52%|█████▏ | 1290/2500 [7:58:19<7:35:21, 22.58s/it] {'loss': 0.0002, 'grad_norm': 0.4638636291504564, 'learning_rate': 4.839999999999999e-07, 'completion_length': 162.30358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.004974365234375, 'epoch': 0.52} + 52%|█████▏ | 1290/2500 [7:58:19<7:35:21, 22.58s/it] 52%|█████▏ | 1291/2500 [7:58:40<7:30:37, 22.36s/it] {'loss': 0.0002, 'grad_norm': 0.9337752066517256, 'learning_rate': 4.835999999999999e-07, 'completion_length': 141.0714340209961, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.00616455078125, 'epoch': 0.52} + 52%|█████▏ | 1291/2500 [7:58:40<7:30:37, 22.36s/it] 52%|█████▏ | 1292/2500 [7:59:02<7:27:05, 22.21s/it] {'loss': 0.0003, 'grad_norm': 0.03209053650481263, 'learning_rate': 4.832e-07, 'completion_length': 167.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007537841796875, 'epoch': 0.52} + 52%|█████▏ | 1292/2500 [7:59:02<7:27:05, 22.21s/it] 52%|█████▏ | 1293/2500 [7:59:24<7:23:47, 22.06s/it] {'loss': 0.0002, 'grad_norm': 0.2521068637122728, 'learning_rate': 4.828e-07, 'completion_length': 151.5, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00583648681640625, 'epoch': 0.52} + 52%|█████▏ | 1293/2500 [7:59:24<7:23:47, 22.06s/it] 52%|█████▏ | 1294/2500 [7:59:46<7:26:07, 22.20s/it] {'loss': 0.0002, 'grad_norm': 0.0757762698648577, 'learning_rate': 4.823999999999999e-07, 'completion_length': 148.8214340209961, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00567626953125, 'epoch': 0.52} + 52%|█████▏ | 1294/2500 [7:59:46<7:26:07, 22.20s/it] 52%|█████▏ | 1295/2500 [8:00:10<7:31:22, 22.47s/it] {'loss': 0.0003, 'grad_norm': 0.3083367253076218, 'learning_rate': 4.82e-07, 'completion_length': 158.35714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0066070556640625, 'epoch': 0.52} + 52%|█████▏ | 1295/2500 [8:00:10<7:31:22, 22.47s/it] 52%|█████▏ | 1296/2500 [8:00:33<7:36:46, 22.76s/it] {'loss': 0.0003, 'grad_norm': 1.1904017127064919, 'learning_rate': 4.816e-07, 'completion_length': 167.08929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0067596435546875, 'epoch': 0.52} + 52%|█████▏ | 1296/2500 [8:00:33<7:36:46, 22.76s/it] 52%|█████▏ | 1297/2500 [8:00:55<7:33:16, 22.61s/it] {'loss': 0.0002, 'grad_norm': 0.03158466327679572, 'learning_rate': 4.812e-07, 'completion_length': 149.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045013427734375, 'epoch': 0.52} + 52%|█████▏ | 1297/2500 [8:00:55<7:33:16, 22.61s/it] 52%|█████▏ | 1298/2500 [8:01:16<7:20:38, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.026868135863873918, 'learning_rate': 4.808e-07, 'completion_length': 147.05357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005035400390625, 'epoch': 0.52} + 52%|█████▏ | 1298/2500 [8:01:16<7:20:38, 22.00s/it] 52%|█████▏ | 1299/2500 [8:01:37<7:15:41, 21.77s/it] {'loss': 0.0003, 'grad_norm': 0.5740775827206783, 'learning_rate': 4.804e-07, 'completion_length': 163.875, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 
'reward_std': 0.07695359364151955, 'kl': 0.0066070556640625, 'epoch': 0.52} + 52%|█████▏ | 1299/2500 [8:01:37<7:15:41, 21.77s/it] 52%|█████▏ | 1300/2500 [8:01:58<7:12:18, 21.62s/it] {'loss': 0.0002, 'grad_norm': 0.022238300828831208, 'learning_rate': 4.8e-07, 'completion_length': 157.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0062255859375, 'epoch': 0.52} + 52%|█████▏ | 1300/2500 [8:01:58<7:12:18, 21.62s/it] 52%|█████▏ | 1301/2500 [8:05:03<23:28:26, 70.48s/it] {'loss': 0.0003, 'grad_norm': 0.3471002152250261, 'learning_rate': 4.796e-07, 'completion_length': 165.73214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00738525390625, 'epoch': 0.52} + 52%|█████▏ | 1301/2500 [8:05:03<23:28:26, 70.48s/it] 52%|█████▏ | 1302/2500 [8:05:25<18:39:54, 56.09s/it] {'loss': 0.0002, 'grad_norm': 0.24620607214142, 'learning_rate': 4.792e-07, 'completion_length': 158.55357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0045166015625, 'epoch': 0.52} + 52%|█████▏ | 1302/2500 [8:05:25<18:39:54, 56.09s/it] 52%|█████▏ | 1303/2500 [8:05:48<15:17:26, 45.99s/it] {'loss': 0.0003, 'grad_norm': 0.4054880881170516, 'learning_rate': 4.788e-07, 'completion_length': 150.08929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00775146484375, 'epoch': 0.52} + 52%|█████▏ | 1303/2500 [8:05:48<15:17:26, 45.99s/it] 52%|█████▏ | 1304/2500 [8:06:09<12:51:07, 38.69s/it] {'loss': 0.0003, 'grad_norm': 0.5562773611837657, 'learning_rate': 4.783999999999999e-07, 'completion_length': 145.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 
'reward_std': 0.0357142873108387, 'kl': 0.0064849853515625, 'epoch': 0.52} + 52%|█████▏ | 1304/2500 [8:06:09<12:51:07, 38.69s/it] 52%|█████▏ | 1305/2500 [8:06:31<11:10:30, 33.67s/it] {'loss': 0.0003, 'grad_norm': 0.20189052577106542, 'learning_rate': 4.779999999999999e-07, 'completion_length': 163.33929443359375, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.007232666015625, 'epoch': 0.52} + 52%|█████▏ | 1305/2500 [8:06:31<11:10:30, 33.67s/it] 52%|█████▏ | 1306/2500 [8:06:54<10:04:52, 30.40s/it] {'loss': 0.0002, 'grad_norm': 0.26651120779633103, 'learning_rate': 4.776e-07, 'completion_length': 163.10714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00501251220703125, 'epoch': 0.52} + 52%|█████▏ | 1306/2500 [8:06:54<10:04:52, 30.40s/it] 52%|█████▏ | 1307/2500 [8:07:16<9:12:56, 27.81s/it] {'loss': 0.0004, 'grad_norm': 0.02709214677638716, 'learning_rate': 4.772e-07, 'completion_length': 162.0714340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0092926025390625, 'epoch': 0.52} + 52%|█████▏ | 1307/2500 [8:07:16<9:12:56, 27.81s/it] 52%|█████▏ | 1308/2500 [8:07:38<8:36:36, 26.00s/it] {'loss': 0.0002, 'grad_norm': 0.2548785835683245, 'learning_rate': 4.768e-07, 'completion_length': 143.89286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00501251220703125, 'epoch': 0.52} + 52%|█████▏ | 1308/2500 [8:07:38<8:36:36, 26.00s/it] 52%|█████▏ | 1309/2500 [8:08:01<8:21:56, 25.29s/it] {'loss': 0.0003, 'grad_norm': 0.15188153681539812, 'learning_rate': 4.7639999999999995e-07, 'completion_length': 166.2857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 
'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00640869140625, 'epoch': 0.52} + 52%|█████▏ | 1309/2500 [8:08:01<8:21:56, 25.29s/it] 52%|█████▏ | 1310/2500 [8:08:24<8:08:30, 24.63s/it] {'loss': 0.0002, 'grad_norm': 0.28038048955600686, 'learning_rate': 4.76e-07, 'completion_length': 153.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00383758544921875, 'epoch': 0.52} + 52%|█████▏ | 1310/2500 [8:08:24<8:08:30, 24.63s/it] 52%|█████▏ | 1311/2500 [8:08:46<7:52:08, 23.83s/it] {'loss': 0.0003, 'grad_norm': 0.38394486385904614, 'learning_rate': 4.756e-07, 'completion_length': 164.83929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0062713623046875, 'epoch': 0.52} + 52%|█████▏ | 1311/2500 [8:08:46<7:52:08, 23.83s/it] 52%|█████▏ | 1312/2500 [8:09:10<7:50:32, 23.76s/it] {'loss': 0.0002, 'grad_norm': 0.034179498173602094, 'learning_rate': 4.7519999999999997e-07, 'completion_length': 138.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005706787109375, 'epoch': 0.52} + 52%|█████▏ | 1312/2500 [8:09:10<7:50:32, 23.76s/it] 53%|█████▎ | 1313/2500 [8:09:32<7:40:15, 23.26s/it] {'loss': 0.0002, 'grad_norm': 0.4403560738394743, 'learning_rate': 4.748e-07, 'completion_length': 169.0178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0057220458984375, 'epoch': 0.53} + 53%|█████▎ | 1313/2500 [8:09:32<7:40:15, 23.26s/it] 53%|█████▎ | 1314/2500 [8:09:54<7:32:24, 22.89s/it] {'loss': 0.0003, 'grad_norm': 0.02334006198763294, 'learning_rate': 4.7439999999999996e-07, 'completion_length': 169.33928680419922, 'rewards/accuracy_reward': 
0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00689697265625, 'epoch': 0.53} + 53%|█████▎ | 1314/2500 [8:09:54<7:32:24, 22.89s/it] 53%|█████▎ | 1315/2500 [8:10:17<7:33:42, 22.97s/it] {'loss': 0.0003, 'grad_norm': 0.8969576275277171, 'learning_rate': 4.7399999999999993e-07, 'completion_length': 167.00000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.007904052734375, 'epoch': 0.53} + 53%|█████▎ | 1315/2500 [8:10:17<7:33:42, 22.97s/it] 53%|█████▎ | 1316/2500 [8:10:39<7:28:45, 22.74s/it] {'loss': 0.0002, 'grad_norm': 0.01837747658698651, 'learning_rate': 4.736e-07, 'completion_length': 173.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056915283203125, 'epoch': 0.53} + 53%|█████▎ | 1316/2500 [8:10:39<7:28:45, 22.74s/it] 53%|█████▎ | 1317/2500 [8:11:02<7:25:34, 22.60s/it] {'loss': 0.0003, 'grad_norm': 0.623520862826943, 'learning_rate': 4.732e-07, 'completion_length': 164.26786041259766, 'rewards/accuracy_reward': 0.75, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.18409644067287445, 'kl': 0.00647735595703125, 'epoch': 0.53} + 53%|█████▎ | 1317/2500 [8:11:02<7:25:34, 22.60s/it] 53%|█████▎ | 1318/2500 [8:11:23<7:18:46, 22.27s/it] {'loss': 0.0002, 'grad_norm': 0.2968923179901085, 'learning_rate': 4.728e-07, 'completion_length': 158.55357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00614166259765625, 'epoch': 0.53} + 53%|█████▎ | 1318/2500 [8:11:23<7:18:46, 22.27s/it] 53%|█████▎ | 1319/2500 [8:11:45<7:16:58, 22.20s/it] {'loss': 0.0003, 'grad_norm': 0.5201868371418019, 'learning_rate': 4.7239999999999997e-07, 'completion_length': 157.50000762939453, 'rewards/accuracy_reward': 
0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0068359375, 'epoch': 0.53} + 53%|█████▎ | 1319/2500 [8:11:45<7:16:58, 22.20s/it] 53%|█████▎ | 1320/2500 [8:12:10<7:31:23, 22.95s/it] {'loss': 0.0002, 'grad_norm': 0.31492174383601645, 'learning_rate': 4.7199999999999994e-07, 'completion_length': 161.55358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0062103271484375, 'epoch': 0.53} + 53%|█████▎ | 1320/2500 [8:12:10<7:31:23, 22.95s/it] 53%|█████▎ | 1321/2500 [8:12:31<7:19:40, 22.38s/it] {'loss': 0.0002, 'grad_norm': 0.02892669712141297, 'learning_rate': 4.716e-07, 'completion_length': 150.375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00579833984375, 'epoch': 0.53} + 53%|█████▎ | 1321/2500 [8:12:31<7:19:40, 22.38s/it] 53%|█████▎ | 1322/2500 [8:12:53<7:17:30, 22.28s/it] {'loss': 0.0003, 'grad_norm': 0.2942735914065892, 'learning_rate': 4.712e-07, 'completion_length': 159.1428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006744384765625, 'epoch': 0.53} + 53%|█████▎ | 1322/2500 [8:12:53<7:17:30, 22.28s/it] 53%|█████▎ | 1323/2500 [8:13:16<7:19:52, 22.42s/it] {'loss': 0.0004, 'grad_norm': 0.2291007594715714, 'learning_rate': 4.7079999999999995e-07, 'completion_length': 161.67857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0090789794921875, 'epoch': 0.53} + 53%|█████▎ | 1323/2500 [8:13:16<7:19:52, 22.42s/it] 53%|█████▎ | 1324/2500 [8:13:37<7:10:16, 21.95s/it] {'loss': 0.0002, 'grad_norm': 0.018007350843546967, 'learning_rate': 4.704e-07, 'completion_length': 150.05357360839844, 'rewards/accuracy_reward': 
1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00452423095703125, 'epoch': 0.53} + 53%|█████▎ | 1324/2500 [8:13:37<7:10:16, 21.95s/it] 53%|█████▎ | 1325/2500 [8:13:58<7:08:04, 21.86s/it] {'loss': 0.0003, 'grad_norm': 0.20642593386881114, 'learning_rate': 4.6999999999999995e-07, 'completion_length': 152.55357360839844, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.008270263671875, 'epoch': 0.53} + 53%|█████▎ | 1325/2500 [8:13:58<7:08:04, 21.86s/it] 53%|█████▎ | 1326/2500 [8:14:21<7:14:32, 22.21s/it] {'loss': 0.0001, 'grad_norm': 0.023359807923027342, 'learning_rate': 4.6959999999999997e-07, 'completion_length': 160.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0033721923828125, 'epoch': 0.53} + 53%|█████▎ | 1326/2500 [8:14:21<7:14:32, 22.21s/it] 53%|█████▎ | 1327/2500 [8:14:43<7:09:25, 21.97s/it] {'loss': 0.0003, 'grad_norm': 0.4225555460763001, 'learning_rate': 4.692e-07, 'completion_length': 174.50000762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.00726318359375, 'epoch': 0.53} + 53%|█████▎ | 1327/2500 [8:14:43<7:09:25, 21.97s/it] 53%|█████▎ | 1328/2500 [8:15:07<7:21:17, 22.59s/it] {'loss': 0.0002, 'grad_norm': 0.4057920850388906, 'learning_rate': 4.6879999999999996e-07, 'completion_length': 172.28572845458984, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0060272216796875, 'epoch': 0.53} + 53%|█████▎ | 1328/2500 [8:15:07<7:21:17, 22.59s/it] 53%|█████▎ | 1329/2500 [8:15:28<7:14:37, 22.27s/it] {'loss': 0.0002, 'grad_norm': 0.3517468008461251, 'learning_rate': 4.684e-07, 'completion_length': 144.50000762939453, 'rewards/accuracy_reward': 
0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.003997802734375, 'epoch': 0.53} + 53%|█████▎ | 1329/2500 [8:15:28<7:14:37, 22.27s/it] 53%|█████▎ | 1330/2500 [8:15:52<7:20:37, 22.60s/it] {'loss': 0.0003, 'grad_norm': 0.3655731510480858, 'learning_rate': 4.68e-07, 'completion_length': 171.2678680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.006378173828125, 'epoch': 0.53} + 53%|█████▎ | 1330/2500 [8:15:52<7:20:37, 22.60s/it] 53%|█████▎ | 1331/2500 [8:16:13<7:13:47, 22.26s/it] {'loss': 0.0002, 'grad_norm': 0.7274416487678548, 'learning_rate': 4.676e-07, 'completion_length': 159.75000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.005615234375, 'epoch': 0.53} + 53%|█████▎ | 1331/2500 [8:16:13<7:13:47, 22.26s/it] 53%|█████▎ | 1332/2500 [8:16:35<7:12:33, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.2608695320476718, 'learning_rate': 4.672e-07, 'completion_length': 156.12500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.004913330078125, 'epoch': 0.53} + 53%|█████▎ | 1332/2500 [8:16:35<7:12:33, 22.22s/it] 53%|█████▎ | 1333/2500 [8:16:58<7:18:11, 22.53s/it] {'loss': 0.0002, 'grad_norm': 0.02645680432794953, 'learning_rate': 4.6679999999999997e-07, 'completion_length': 169.3214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0057373046875, 'epoch': 0.53} + 53%|█████▎ | 1333/2500 [8:16:59<7:18:11, 22.53s/it] 53%|█████▎ | 1334/2500 [8:17:20<7:14:44, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.3875541983099859, 'learning_rate': 4.6639999999999994e-07, 'completion_length': 
150.6607208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.005157470703125, 'epoch': 0.53} + 53%|█████▎ | 1334/2500 [8:17:20<7:14:44, 22.37s/it] 53%|█████▎ | 1335/2500 [8:17:43<7:17:14, 22.52s/it] {'loss': 0.0002, 'grad_norm': 0.55663462405172, 'learning_rate': 4.66e-07, 'completion_length': 155.55358123779297, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.00553131103515625, 'epoch': 0.53} + 53%|█████▎ | 1335/2500 [8:17:43<7:17:14, 22.52s/it] 53%|█████▎ | 1336/2500 [8:18:06<7:16:10, 22.48s/it] {'loss': 0.0002, 'grad_norm': 0.2661908388909955, 'learning_rate': 4.656e-07, 'completion_length': 152.9464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048065185546875, 'epoch': 0.53} + 53%|█████▎ | 1336/2500 [8:18:06<7:16:10, 22.48s/it] 53%|█████▎ | 1337/2500 [8:18:29<7:17:34, 22.57s/it] {'loss': 0.0003, 'grad_norm': 0.05494928635378884, 'learning_rate': 4.6519999999999996e-07, 'completion_length': 167.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0074005126953125, 'epoch': 0.53} + 53%|█████▎ | 1337/2500 [8:18:29<7:17:34, 22.57s/it] 54%|█████▎ | 1338/2500 [8:18:51<7:16:31, 22.54s/it] {'loss': 0.0002, 'grad_norm': 0.24741888586993996, 'learning_rate': 4.648e-07, 'completion_length': 149.2857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0047454833984375, 'epoch': 0.54} + 54%|█████▎ | 1338/2500 [8:18:51<7:16:31, 22.54s/it] 54%|█████▎ | 1339/2500 [8:19:13<7:14:58, 22.48s/it] {'loss': 0.0003, 'grad_norm': 0.02521005668232019, 'learning_rate': 4.6439999999999995e-07, 
'completion_length': 155.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006439208984375, 'epoch': 0.54} + 54%|█████▎ | 1339/2500 [8:19:13<7:14:58, 22.48s/it] 54%|█████▎ | 1340/2500 [8:19:35<7:09:26, 22.21s/it] {'loss': 0.0002, 'grad_norm': 0.019344824301242625, 'learning_rate': 4.64e-07, 'completion_length': 144.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00569915771484375, 'epoch': 0.54} + 54%|█████▎ | 1340/2500 [8:19:35<7:09:26, 22.21s/it] 54%|█████▎ | 1341/2500 [8:19:58<7:11:54, 22.36s/it] {'loss': 0.0004, 'grad_norm': 0.2345860557842387, 'learning_rate': 4.636e-07, 'completion_length': 167.8928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0091705322265625, 'epoch': 0.54} + 54%|█████▎ | 1341/2500 [8:19:58<7:11:54, 22.36s/it] 54%|█████▎ | 1342/2500 [8:20:19<7:07:04, 22.13s/it] {'loss': 0.0002, 'grad_norm': 0.02001171668960781, 'learning_rate': 4.6319999999999997e-07, 'completion_length': 144.83929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00595855712890625, 'epoch': 0.54} + 54%|█████▎ | 1342/2500 [8:20:19<7:07:04, 22.13s/it] 54%|█████▎ | 1343/2500 [8:20:42<7:10:33, 22.33s/it] {'loss': 0.0002, 'grad_norm': 0.0161562927241772, 'learning_rate': 4.628e-07, 'completion_length': 147.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0047607421875, 'epoch': 0.54} + 54%|█████▎ | 1343/2500 [8:20:42<7:10:33, 22.33s/it] 54%|█████▍ | 1344/2500 [8:21:04<7:07:23, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.03182767716620051, 'learning_rate': 4.6239999999999996e-07, 'completion_length': 155.55358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.006072998046875, 'epoch': 0.54} + 54%|█████▍ | 1344/2500 [8:21:04<7:07:23, 22.18s/it] 54%|█████▍ | 1345/2500 [8:21:27<7:14:42, 22.58s/it] {'loss': 0.0003, 'grad_norm': 0.8046430341972584, 'learning_rate': 4.62e-07, 'completion_length': 167.78572845458984, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0073089599609375, 'epoch': 0.54} + 54%|█████▍ | 1345/2500 [8:21:27<7:14:42, 22.58s/it] 54%|█████▍ | 1346/2500 [8:21:49<7:06:32, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.46016883010542925, 'learning_rate': 4.616e-07, 'completion_length': 139.60714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0044403076171875, 'epoch': 0.54} + 54%|█████▍ | 1346/2500 [8:21:49<7:06:32, 22.18s/it] 54%|█████▍ | 1347/2500 [8:22:11<7:09:55, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.027846116130702793, 'learning_rate': 4.612e-07, 'completion_length': 170.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053253173828125, 'epoch': 0.54} + 54%|█████▍ | 1347/2500 [8:22:11<7:09:55, 22.37s/it] 54%|█████▍ | 1348/2500 [8:22:34<7:10:58, 22.45s/it] {'loss': 0.0002, 'grad_norm': 0.01944039097091892, 'learning_rate': 4.6079999999999994e-07, 'completion_length': 154.5178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005706787109375, 'epoch': 0.54} + 54%|█████▍ | 1348/2500 [8:22:34<7:10:58, 22.45s/it] 54%|█████▍ | 1349/2500 [8:22:55<6:59:44, 21.88s/it] {'loss': 0.0002, 'grad_norm': 0.018779836599094518, 'learning_rate': 4.6039999999999997e-07, 'completion_length': 140.46429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046844482421875, 'epoch': 0.54} + 
54%|█████▍ | 1349/2500 [8:22:55<6:59:44, 21.88s/it] 54%|█████▍ | 1350/2500 [8:23:16<6:56:45, 21.74s/it] {'loss': 0.0002, 'grad_norm': 0.49110428330704775, 'learning_rate': 4.6e-07, 'completion_length': 162.6964340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0055084228515625, 'epoch': 0.54} + 54%|█████▍ | 1350/2500 [8:23:16<6:56:45, 21.74s/it] 54%|█████▍ | 1351/2500 [8:23:38<6:58:23, 21.85s/it] {'loss': 0.0002, 'grad_norm': 0.021055920099893332, 'learning_rate': 4.596e-07, 'completion_length': 150.08928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052490234375, 'epoch': 0.54} + 54%|█████▍ | 1351/2500 [8:23:38<6:58:23, 21.85s/it] 54%|█████▍ | 1352/2500 [8:24:01<7:02:19, 22.07s/it] {'loss': 0.0004, 'grad_norm': 0.02976748696875612, 'learning_rate': 4.592e-07, 'completion_length': 162.08928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.008880615234375, 'epoch': 0.54} + 54%|█████▍ | 1352/2500 [8:24:01<7:02:19, 22.07s/it] 54%|█████▍ | 1353/2500 [8:24:23<7:03:09, 22.14s/it] {'loss': 0.0002, 'grad_norm': 0.019218728722878334, 'learning_rate': 4.5879999999999995e-07, 'completion_length': 161.92858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0054473876953125, 'epoch': 0.54} + 54%|█████▍ | 1353/2500 [8:24:23<7:03:09, 22.14s/it] 54%|█████▍ | 1354/2500 [8:24:45<6:59:29, 21.96s/it] {'loss': 0.0002, 'grad_norm': 0.022441634536704485, 'learning_rate': 4.584e-07, 'completion_length': 146.5357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004730224609375, 'epoch': 0.54} + 54%|█████▍ | 1354/2500 [8:24:45<6:59:29, 21.96s/it] 54%|█████▍ | 1355/2500 [8:25:08<7:06:10, 
22.33s/it] {'loss': 0.0002, 'grad_norm': 0.3216543892579761, 'learning_rate': 4.58e-07, 'completion_length': 172.60714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.005950927734375, 'epoch': 0.54} + 54%|█████▍ | 1355/2500 [8:25:08<7:06:10, 22.33s/it] 54%|█████▍ | 1356/2500 [8:25:29<7:02:15, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.34263695533921507, 'learning_rate': 4.5759999999999997e-07, 'completion_length': 150.30358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050048828125, 'epoch': 0.54} + 54%|█████▍ | 1356/2500 [8:25:29<7:02:15, 22.15s/it] 54%|█████▍ | 1357/2500 [8:25:51<6:59:26, 22.02s/it] {'loss': 0.0002, 'grad_norm': 1.209674692712502, 'learning_rate': 4.572e-07, 'completion_length': 147.92857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0053863525390625, 'epoch': 0.54} + 54%|█████▍ | 1357/2500 [8:25:51<6:59:26, 22.02s/it] 54%|█████▍ | 1358/2500 [8:26:12<6:54:53, 21.80s/it] {'loss': 0.0002, 'grad_norm': 0.020524042625003402, 'learning_rate': 4.5679999999999996e-07, 'completion_length': 149.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00494384765625, 'epoch': 0.54} + 54%|█████▍ | 1358/2500 [8:26:12<6:54:53, 21.80s/it] 54%|█████▍ | 1359/2500 [8:26:35<6:58:08, 21.99s/it] {'loss': 0.0002, 'grad_norm': 0.25800412598820427, 'learning_rate': 4.5639999999999993e-07, 'completion_length': 164.00000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00457763671875, 'epoch': 0.54} + 54%|█████▍ | 1359/2500 [8:26:35<6:58:08, 21.99s/it] 54%|█████▍ | 
1360/2500 [8:26:58<7:01:17, 22.17s/it] {'loss': 0.0002, 'grad_norm': 0.015418789832762634, 'learning_rate': 4.56e-07, 'completion_length': 153.12500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00507354736328125, 'epoch': 0.54} + 54%|█████▍ | 1360/2500 [8:26:58<7:01:17, 22.17s/it] 54%|█████▍ | 1361/2500 [8:27:26<7:34:43, 23.95s/it] {'loss': 0.0002, 'grad_norm': 0.02115296106114843, 'learning_rate': 4.556e-07, 'completion_length': 155.9821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00617218017578125, 'epoch': 0.54} + 54%|█████▍ | 1361/2500 [8:27:26<7:34:43, 23.95s/it] 54%|█████▍ | 1362/2500 [8:27:48<7:24:47, 23.45s/it] {'loss': 0.0003, 'grad_norm': 0.024720851995379503, 'learning_rate': 4.5519999999999995e-07, 'completion_length': 162.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006988525390625, 'epoch': 0.54} + 54%|█████▍ | 1362/2500 [8:27:48<7:24:47, 23.45s/it] 55%|█████▍ | 1363/2500 [8:28:12<7:26:59, 23.59s/it] {'loss': 0.0003, 'grad_norm': 0.025035351066241125, 'learning_rate': 4.5479999999999997e-07, 'completion_length': 169.9821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0080718994140625, 'epoch': 0.55} + 55%|█████▍ | 1363/2500 [8:28:12<7:26:59, 23.59s/it] 55%|█████▍ | 1364/2500 [8:28:33<7:12:47, 22.86s/it] {'loss': 0.0002, 'grad_norm': 0.0360072939985506, 'learning_rate': 4.544e-07, 'completion_length': 146.4821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048370361328125, 'epoch': 0.55} + 55%|█████▍ | 1364/2500 [8:28:33<7:12:47, 22.86s/it] 55%|█████▍ | 1365/2500 [8:28:55<7:09:31, 22.71s/it] {'loss': 0.0002, 'grad_norm': 0.020097969040968475, 'learning_rate': 4.54e-07, 
'completion_length': 151.58929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0047760009765625, 'epoch': 0.55} + 55%|█████▍ | 1365/2500 [8:28:55<7:09:31, 22.71s/it] 55%|█████▍ | 1366/2500 [8:29:15<6:53:10, 21.86s/it] {'loss': 0.0002, 'grad_norm': 2.888177955901827, 'learning_rate': 4.536e-07, 'completion_length': 140.8214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00412750244140625, 'epoch': 0.55} + 55%|█████▍ | 1366/2500 [8:29:15<6:53:10, 21.86s/it] 55%|█████▍ | 1367/2500 [8:29:36<6:49:34, 21.69s/it] {'loss': 0.0003, 'grad_norm': 0.3127841720079088, 'learning_rate': 4.5319999999999996e-07, 'completion_length': 151.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.006805419921875, 'epoch': 0.55} + 55%|█████▍ | 1367/2500 [8:29:37<6:49:34, 21.69s/it] 55%|█████▍ | 1368/2500 [8:29:59<6:55:58, 22.05s/it] {'loss': 0.0003, 'grad_norm': 0.4006557040964145, 'learning_rate': 4.528e-07, 'completion_length': 165.39286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00860595703125, 'epoch': 0.55} + 55%|█████▍ | 1368/2500 [8:29:59<6:55:58, 22.05s/it] 55%|█████▍ | 1369/2500 [8:30:22<6:58:49, 22.22s/it] {'loss': 0.0003, 'grad_norm': 0.1706561905539888, 'learning_rate': 4.524e-07, 'completion_length': 150.87500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0071258544921875, 'epoch': 0.55} + 55%|█████▍ | 1369/2500 [8:30:22<6:58:49, 22.22s/it] 55%|█████▍ | 1370/2500 [8:30:45<7:01:39, 22.39s/it] {'loss': 0.0002, 'grad_norm': 0.019812646102393822, 'learning_rate': 
4.5199999999999997e-07, 'completion_length': 159.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045166015625, 'epoch': 0.55} + 55%|█████▍ | 1370/2500 [8:30:45<7:01:39, 22.39s/it] 55%|█████▍ | 1371/2500 [8:31:08<7:03:39, 22.52s/it] {'loss': 0.0002, 'grad_norm': 0.029964501683755678, 'learning_rate': 4.516e-07, 'completion_length': 163.92858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0061798095703125, 'epoch': 0.55} + 55%|█████▍ | 1371/2500 [8:31:08<7:03:39, 22.52s/it] 55%|█████▍ | 1372/2500 [8:31:29<6:57:05, 22.19s/it] {'loss': 0.0003, 'grad_norm': 0.366535469558594, 'learning_rate': 4.5119999999999996e-07, 'completion_length': 149.6964340209961, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.006805419921875, 'epoch': 0.55} + 55%|█████▍ | 1372/2500 [8:31:29<6:57:05, 22.19s/it] 55%|█████▍ | 1373/2500 [8:31:51<6:57:57, 22.25s/it] {'loss': 0.0003, 'grad_norm': 0.9663458404161416, 'learning_rate': 4.5079999999999993e-07, 'completion_length': 164.4821548461914, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.008026123046875, 'epoch': 0.55} + 55%|█████▍ | 1373/2500 [8:31:51<6:57:57, 22.25s/it] 55%|█████▍ | 1374/2500 [8:32:16<7:11:55, 23.02s/it] {'loss': 0.0003, 'grad_norm': 0.2139349791214599, 'learning_rate': 4.504e-07, 'completion_length': 164.83929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0070343017578125, 'epoch': 0.55} + 55%|█████▍ | 1374/2500 [8:32:16<7:11:55, 23.02s/it] 55%|█████▌ | 1375/2500 [8:32:38<7:07:15, 22.79s/it] {'loss': 0.0002, 'grad_norm': 0.33638698320280874, 'learning_rate': 4.5e-07, 
'completion_length': 150.39286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0046539306640625, 'epoch': 0.55} + 55%|█████▌ | 1375/2500 [8:32:38<7:07:15, 22.79s/it] 55%|█████▌ | 1376/2500 [8:33:00<6:57:50, 22.30s/it] {'loss': 0.0002, 'grad_norm': 0.013774234258036868, 'learning_rate': 4.496e-07, 'completion_length': 140.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0038909912109375, 'epoch': 0.55} + 55%|█████▌ | 1376/2500 [8:33:00<6:57:50, 22.30s/it] 55%|█████▌ | 1377/2500 [8:33:23<7:01:17, 22.51s/it] {'loss': 0.0002, 'grad_norm': 0.03029337025226487, 'learning_rate': 4.4919999999999997e-07, 'completion_length': 158.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0055389404296875, 'epoch': 0.55} + 55%|█████▌ | 1377/2500 [8:33:23<7:01:17, 22.51s/it] 55%|█████▌ | 1378/2500 [8:33:47<7:13:48, 23.20s/it] {'loss': 0.0002, 'grad_norm': 0.737366411132075, 'learning_rate': 4.4879999999999994e-07, 'completion_length': 150.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059356689453125, 'epoch': 0.55} + 55%|█████▌ | 1378/2500 [8:33:47<7:13:48, 23.20s/it] 55%|█████▌ | 1379/2500 [8:34:09<7:02:58, 22.64s/it] {'loss': 0.0003, 'grad_norm': 0.03467854604649952, 'learning_rate': 4.484e-07, 'completion_length': 150.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007720947265625, 'epoch': 0.55} + 55%|█████▌ | 1379/2500 [8:34:09<7:02:58, 22.64s/it] 55%|█████▌ | 1380/2500 [8:34:31<6:59:11, 22.46s/it] {'loss': 0.0003, 'grad_norm': 0.38085128335513835, 'learning_rate': 4.48e-07, 'completion_length': 153.5357208251953, 'rewards/accuracy_reward': 
0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006256103515625, 'epoch': 0.55} + 55%|█████▌ | 1380/2500 [8:34:31<6:59:11, 22.46s/it] 55%|█████▌ | 1381/2500 [8:34:52<6:52:54, 22.14s/it] {'loss': 0.0001, 'grad_norm': 0.22042980385528504, 'learning_rate': 4.4759999999999996e-07, 'completion_length': 147.7857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00354766845703125, 'epoch': 0.55} + 55%|█████▌ | 1381/2500 [8:34:52<6:52:54, 22.14s/it] 55%|█████▌ | 1382/2500 [8:35:14<6:52:03, 22.11s/it] {'loss': 0.0002, 'grad_norm': 0.023680908130964493, 'learning_rate': 4.472e-07, 'completion_length': 157.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005706787109375, 'epoch': 0.55} + 55%|█████▌ | 1382/2500 [8:35:14<6:52:03, 22.11s/it] 55%|█████▌ | 1383/2500 [8:35:37<6:53:37, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.23297722852812847, 'learning_rate': 4.4679999999999995e-07, 'completion_length': 150.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00421905517578125, 'epoch': 0.55} + 55%|█████▌ | 1383/2500 [8:35:37<6:53:37, 22.22s/it] 55%|█████▌ | 1384/2500 [8:36:00<6:59:11, 22.54s/it] {'loss': 0.0002, 'grad_norm': 0.2635472072346104, 'learning_rate': 4.464e-07, 'completion_length': 158.5178680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.004913330078125, 'epoch': 0.55} + 55%|█████▌ | 1384/2500 [8:36:00<6:59:11, 22.54s/it] 55%|█████▌ | 1385/2500 [8:36:22<6:55:52, 22.38s/it] {'loss': 0.0002, 'grad_norm': 0.41446558047717336, 'learning_rate': 4.46e-07, 'completion_length': 172.00000762939453, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0056610107421875, 'epoch': 0.55} + 55%|█████▌ | 1385/2500 [8:36:22<6:55:52, 22.38s/it] 55%|█████▌ | 1386/2500 [8:36:43<6:49:04, 22.03s/it] {'loss': 0.0003, 'grad_norm': 0.33566487737221645, 'learning_rate': 4.4559999999999997e-07, 'completion_length': 157.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006561279296875, 'epoch': 0.55} + 55%|█████▌ | 1386/2500 [8:36:43<6:49:04, 22.03s/it] 55%|█████▌ | 1387/2500 [8:37:05<6:49:48, 22.09s/it] {'loss': 0.0003, 'grad_norm': 0.02077609076965833, 'learning_rate': 4.452e-07, 'completion_length': 154.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006500244140625, 'epoch': 0.55} + 55%|█████▌ | 1387/2500 [8:37:05<6:49:48, 22.09s/it] 56%|█████▌ | 1388/2500 [8:37:28<6:50:41, 22.16s/it] {'loss': 0.0004, 'grad_norm': 0.2206207259776257, 'learning_rate': 4.4479999999999996e-07, 'completion_length': 163.17858123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00909423828125, 'epoch': 0.56} + 56%|█████▌ | 1388/2500 [8:37:28<6:50:41, 22.16s/it] 56%|█████▌ | 1389/2500 [8:37:49<6:44:47, 21.86s/it] {'loss': 0.0002, 'grad_norm': 0.7907815603192772, 'learning_rate': 4.444e-07, 'completion_length': 152.3214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0058746337890625, 'epoch': 0.56} + 56%|█████▌ | 1389/2500 [8:37:49<6:44:47, 21.86s/it] 56%|█████▌ | 1390/2500 [8:38:11<6:45:28, 21.92s/it] {'loss': 0.0002, 'grad_norm': 0.02146257627438081, 'learning_rate': 4.44e-07, 'completion_length': 
162.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046539306640625, 'epoch': 0.56} + 56%|█████▌ | 1390/2500 [8:38:11<6:45:28, 21.92s/it] 56%|█████▌ | 1391/2500 [8:38:33<6:45:42, 21.95s/it] {'loss': 0.0002, 'grad_norm': 0.022684740581311372, 'learning_rate': 4.436e-07, 'completion_length': 147.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00414276123046875, 'epoch': 0.56} + 56%|█████▌ | 1391/2500 [8:38:33<6:45:42, 21.95s/it] 56%|█████▌ | 1392/2500 [8:38:56<6:53:44, 22.40s/it] {'loss': 0.0002, 'grad_norm': 0.01433063966983531, 'learning_rate': 4.4319999999999995e-07, 'completion_length': 165.26786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00494384765625, 'epoch': 0.56} + 56%|█████▌ | 1392/2500 [8:38:56<6:53:44, 22.40s/it] 56%|█████▌ | 1393/2500 [8:39:19<6:55:03, 22.50s/it] {'loss': 0.0002, 'grad_norm': 0.02660009959937115, 'learning_rate': 4.428e-07, 'completion_length': 145.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00457763671875, 'epoch': 0.56} + 56%|█████▌ | 1393/2500 [8:39:19<6:55:03, 22.50s/it] 56%|█████▌ | 1394/2500 [8:39:43<7:01:59, 22.89s/it] {'loss': 0.0003, 'grad_norm': 0.4417709418355036, 'learning_rate': 4.424e-07, 'completion_length': 176.87500762939453, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.006866455078125, 'epoch': 0.56} + 56%|█████▌ | 1394/2500 [8:39:43<7:01:59, 22.89s/it] 56%|█████▌ | 1395/2500 [8:40:05<6:55:25, 22.56s/it] {'loss': 0.0002, 'grad_norm': 0.03018250465984452, 'learning_rate': 4.4199999999999996e-07, 'completion_length': 139.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 
'kl': 0.0049591064453125, 'epoch': 0.56} + 56%|█████▌ | 1395/2500 [8:40:05<6:55:25, 22.56s/it] 56%|█████▌ | 1396/2500 [8:40:27<6:54:02, 22.50s/it] {'loss': 0.0002, 'grad_norm': 0.18851270277718993, 'learning_rate': 4.416e-07, 'completion_length': 150.10714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0044403076171875, 'epoch': 0.56} + 56%|█████▌ | 1396/2500 [8:40:27<6:54:02, 22.50s/it] 56%|█████▌ | 1397/2500 [8:40:50<6:55:06, 22.58s/it] {'loss': 0.0003, 'grad_norm': 0.3028059061399268, 'learning_rate': 4.4119999999999995e-07, 'completion_length': 172.1428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0064849853515625, 'epoch': 0.56} + 56%|█████▌ | 1397/2500 [8:40:50<6:55:06, 22.58s/it] 56%|█████▌ | 1398/2500 [8:41:13<6:58:33, 22.79s/it] {'loss': 0.0003, 'grad_norm': 0.26434819233266205, 'learning_rate': 4.4080000000000003e-07, 'completion_length': 159.9107208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00732421875, 'epoch': 0.56} + 56%|█████▌ | 1398/2500 [8:41:13<6:58:33, 22.79s/it] 56%|█████▌ | 1399/2500 [8:41:35<6:53:13, 22.52s/it] {'loss': 0.0003, 'grad_norm': 0.02823786343425553, 'learning_rate': 4.404e-07, 'completion_length': 156.6428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0069122314453125, 'epoch': 0.56} + 56%|█████▌ | 1399/2500 [8:41:35<6:53:13, 22.52s/it] 56%|█████▌ | 1400/2500 [8:41:56<6:46:09, 22.15s/it] {'loss': 0.0003, 'grad_norm': 0.3018341268465122, 'learning_rate': 4.3999999999999997e-07, 'completion_length': 151.21429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 
'reward_std': 0.04123930633068085, 'kl': 0.0066375732421875, 'epoch': 0.56} + 56%|█████▌ | 1400/2500 [8:41:56<6:46:09, 22.15s/it] 56%|█████▌ | 1401/2500 [8:45:06<22:06:33, 72.42s/it] {'loss': 0.0003, 'grad_norm': 0.03204294166305377, 'learning_rate': 4.396e-07, 'completion_length': 150.375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0064544677734375, 'epoch': 0.56} + 56%|█████▌ | 1401/2500 [8:45:06<22:06:33, 72.42s/it] 56%|█████▌ | 1402/2500 [8:45:26<17:19:15, 56.79s/it] {'loss': 0.0003, 'grad_norm': 0.033873941476621344, 'learning_rate': 4.3919999999999996e-07, 'completion_length': 138.7321548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0068511962890625, 'epoch': 0.56} + 56%|█████▌ | 1402/2500 [8:45:26<17:19:15, 56.79s/it] 56%|█████▌ | 1403/2500 [8:45:48<14:03:02, 46.11s/it] {'loss': 0.0002, 'grad_norm': 0.01851389672934845, 'learning_rate': 4.388e-07, 'completion_length': 139.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0043792724609375, 'epoch': 0.56} + 56%|█████▌ | 1403/2500 [8:45:48<14:03:02, 46.11s/it] 56%|█████▌ | 1404/2500 [8:46:10<11:54:48, 39.13s/it] {'loss': 0.0002, 'grad_norm': 0.28957800659055943, 'learning_rate': 4.384e-07, 'completion_length': 160.0357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.004669189453125, 'epoch': 0.56} + 56%|█████▌ | 1404/2500 [8:46:10<11:54:48, 39.13s/it] 56%|█████▌ | 1405/2500 [8:46:34<10:26:26, 34.33s/it] {'loss': 0.0002, 'grad_norm': 0.23876277497723986, 'learning_rate': 4.38e-07, 'completion_length': 169.00000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 
0.0051727294921875, 'epoch': 0.56} + 56%|█████▌ | 1405/2500 [8:46:34<10:26:26, 34.33s/it] 56%|█████▌ | 1406/2500 [8:46:55<9:16:01, 30.49s/it] {'loss': 0.0002, 'grad_norm': 0.04963920157109847, 'learning_rate': 4.3759999999999995e-07, 'completion_length': 156.58929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005340576171875, 'epoch': 0.56} + 56%|█████▌ | 1406/2500 [8:46:55<9:16:01, 30.49s/it] 56%|█████▋ | 1407/2500 [8:47:17<8:27:26, 27.86s/it] {'loss': 0.0002, 'grad_norm': 0.443688092147691, 'learning_rate': 4.3719999999999997e-07, 'completion_length': 154.08929443359375, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.00420379638671875, 'epoch': 0.56} + 56%|█████▋ | 1407/2500 [8:47:17<8:27:26, 27.86s/it] 56%|█████▋ | 1408/2500 [8:47:38<7:51:33, 25.91s/it] {'loss': 0.0003, 'grad_norm': 1.4027176256401843, 'learning_rate': 4.368e-07, 'completion_length': 157.30358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0065460205078125, 'epoch': 0.56} + 56%|█████▋ | 1408/2500 [8:47:38<7:51:33, 25.91s/it] 56%|█████▋ | 1409/2500 [8:47:59<7:25:47, 24.52s/it] {'loss': 0.0002, 'grad_norm': 0.3390158913808264, 'learning_rate': 4.364e-07, 'completion_length': 138.60714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0060577392578125, 'epoch': 0.56} + 56%|█████▋ | 1409/2500 [8:47:59<7:25:47, 24.52s/it] 56%|█████▋ | 1410/2500 [8:48:21<7:11:22, 23.75s/it] {'loss': 0.0002, 'grad_norm': 0.029420801315451148, 'learning_rate': 4.36e-07, 'completion_length': 153.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006195068359375, 
'epoch': 0.56} + 56%|█████▋ | 1410/2500 [8:48:21<7:11:22, 23.75s/it] 56%|█████▋ | 1411/2500 [8:48:43<7:01:37, 23.23s/it] {'loss': 0.0002, 'grad_norm': 0.18581348710692722, 'learning_rate': 4.3559999999999996e-07, 'completion_length': 166.1071548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005889892578125, 'epoch': 0.56} + 56%|█████▋ | 1411/2500 [8:48:43<7:01:37, 23.23s/it] 56%|█████▋ | 1412/2500 [8:49:07<7:03:34, 23.36s/it] {'loss': 0.0003, 'grad_norm': 0.2804641843802098, 'learning_rate': 4.352e-07, 'completion_length': 163.23214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0062713623046875, 'epoch': 0.56} + 56%|█████▋ | 1412/2500 [8:49:07<7:03:34, 23.36s/it] 57%|█████▋ | 1413/2500 [8:49:29<6:54:24, 22.87s/it] {'loss': 0.0002, 'grad_norm': 0.49187148339646636, 'learning_rate': 4.348e-07, 'completion_length': 161.33929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.005279541015625, 'epoch': 0.57} + 57%|█████▋ | 1413/2500 [8:49:29<6:54:24, 22.87s/it] 57%|█████▋ | 1414/2500 [8:49:50<6:44:02, 22.32s/it] {'loss': 0.0002, 'grad_norm': 0.4724652056841099, 'learning_rate': 4.3439999999999997e-07, 'completion_length': 138.55358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0060272216796875, 'epoch': 0.57} + 57%|█████▋ | 1414/2500 [8:49:50<6:44:02, 22.32s/it] 57%|█████▋ | 1415/2500 [8:50:13<6:48:02, 22.56s/it] {'loss': 0.0002, 'grad_norm': 0.018672042233330476, 'learning_rate': 4.34e-07, 'completion_length': 165.46429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0061798095703125, 'epoch': 0.57} + 57%|█████▋ | 1415/2500 [8:50:13<6:48:02, 22.56s/it] 57%|█████▋ | 1416/2500 [8:50:35<6:43:01, 22.31s/it] {'loss': 0.0002, 'grad_norm': 0.02226482907436444, 'learning_rate': 4.3359999999999997e-07, 'completion_length': 148.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00400543212890625, 'epoch': 0.57} + 57%|█████▋ | 1416/2500 [8:50:35<6:43:01, 22.31s/it] 57%|█████▋ | 1417/2500 [8:50:57<6:44:37, 22.42s/it] {'loss': 0.0002, 'grad_norm': 0.29024177712124816, 'learning_rate': 4.3319999999999994e-07, 'completion_length': 148.25000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0051727294921875, 'epoch': 0.57} + 57%|█████▋ | 1417/2500 [8:50:57<6:44:37, 22.42s/it] 57%|█████▋ | 1418/2500 [8:51:19<6:41:10, 22.25s/it] {'loss': 0.0002, 'grad_norm': 0.03126175839130514, 'learning_rate': 4.328e-07, 'completion_length': 148.5357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00472259521484375, 'epoch': 0.57} + 57%|█████▋ | 1418/2500 [8:51:19<6:41:10, 22.25s/it] 57%|█████▋ | 1419/2500 [8:51:40<6:35:14, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.9039803266561944, 'learning_rate': 4.324e-07, 'completion_length': 136.01786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00543212890625, 'epoch': 0.57} + 57%|█████▋ | 1419/2500 [8:51:40<6:35:14, 21.94s/it] 57%|█████▋ | 1420/2500 [8:52:02<6:33:13, 21.85s/it] {'loss': 0.0002, 'grad_norm': 1.4284424599484082, 'learning_rate': 4.3199999999999995e-07, 'completion_length': 147.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.006195068359375, 'epoch': 0.57} + 57%|█████▋ | 1420/2500 [8:52:02<6:33:13, 21.85s/it] 57%|█████▋ | 1421/2500 [8:52:24<6:33:32, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.4620464591493278, 'learning_rate': 4.316e-07, 'completion_length': 160.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0069427490234375, 'epoch': 0.57} + 57%|█████▋ | 1421/2500 [8:52:24<6:33:32, 21.88s/it] 57%|█████▋ | 1422/2500 [8:52:45<6:30:16, 21.72s/it] {'loss': 0.0002, 'grad_norm': 0.01941374739813255, 'learning_rate': 4.312e-07, 'completion_length': 147.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00421142578125, 'epoch': 0.57} + 57%|█████▋ | 1422/2500 [8:52:45<6:30:16, 21.72s/it] 57%|█████▋ | 1423/2500 [8:53:07<6:29:23, 21.69s/it] {'loss': 0.0003, 'grad_norm': 0.28732067561991564, 'learning_rate': 4.308e-07, 'completion_length': 155.37500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0066680908203125, 'epoch': 0.57} + 57%|█████▋ | 1423/2500 [8:53:07<6:29:23, 21.69s/it] 57%|█████▋ | 1424/2500 [8:53:29<6:33:08, 21.92s/it] {'loss': 0.0002, 'grad_norm': 0.4453534289805664, 'learning_rate': 4.304e-07, 'completion_length': 154.55358123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00441741943359375, 'epoch': 0.57} + 57%|█████▋ | 1424/2500 [8:53:29<6:33:08, 21.92s/it] 57%|█████▋ | 1425/2500 [8:53:52<6:33:43, 21.98s/it] {'loss': 0.0003, 'grad_norm': 0.026415133266703753, 'learning_rate': 4.2999999999999996e-07, 'completion_length': 141.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 
'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.007049560546875, 'epoch': 0.57} + 57%|█████▋ | 1425/2500 [8:53:52<6:33:43, 21.98s/it] 57%|█████▋ | 1426/2500 [8:54:14<6:36:59, 22.18s/it] {'loss': 0.0003, 'grad_norm': 0.025037954450839884, 'learning_rate': 4.296e-07, 'completion_length': 182.00000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00823974609375, 'epoch': 0.57} + 57%|█████▋ | 1426/2500 [8:54:14<6:36:59, 22.18s/it] 57%|█████▋ | 1427/2500 [8:54:36<6:34:07, 22.04s/it] {'loss': 0.0003, 'grad_norm': 0.576229752218064, 'learning_rate': 4.292e-07, 'completion_length': 152.60714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0068511962890625, 'epoch': 0.57} + 57%|█████▋ | 1427/2500 [8:54:36<6:34:07, 22.04s/it] 57%|█████▋ | 1428/2500 [8:54:58<6:31:14, 21.90s/it] {'loss': 0.0002, 'grad_norm': 0.019474293634158577, 'learning_rate': 4.288e-07, 'completion_length': 159.39286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0038909912109375, 'epoch': 0.57} + 57%|█████▋ | 1428/2500 [8:54:58<6:31:14, 21.90s/it] 57%|█████▋ | 1429/2500 [8:55:20<6:32:29, 21.99s/it] {'loss': 0.0002, 'grad_norm': 0.22385842284554128, 'learning_rate': 4.284e-07, 'completion_length': 158.7321548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0051727294921875, 'epoch': 0.57} + 57%|█████▋ | 1429/2500 [8:55:20<6:32:29, 21.99s/it] 57%|█████▋ | 1430/2500 [8:55:42<6:31:06, 21.93s/it] {'loss': 0.0002, 'grad_norm': 0.018428440760309146, 'learning_rate': 4.2799999999999997e-07, 'completion_length': 160.08929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 
0.00417327880859375, 'epoch': 0.57} + 57%|█████▋ | 1430/2500 [8:55:42<6:31:06, 21.93s/it] 57%|█████▋ | 1431/2500 [8:56:04<6:34:33, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.20006561228972786, 'learning_rate': 4.2759999999999994e-07, 'completion_length': 153.21429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0053558349609375, 'epoch': 0.57} + 57%|█████▋ | 1431/2500 [8:56:04<6:34:33, 22.15s/it] 57%|█████▋ | 1432/2500 [8:56:27<6:36:00, 22.25s/it] {'loss': 0.0002, 'grad_norm': 0.02886502522424563, 'learning_rate': 4.272e-07, 'completion_length': 165.5357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0045623779296875, 'epoch': 0.57} + 57%|█████▋ | 1432/2500 [8:56:27<6:36:00, 22.25s/it] 57%|█████▋ | 1433/2500 [8:56:48<6:33:24, 22.12s/it] {'loss': 0.0003, 'grad_norm': 0.4132974590410161, 'learning_rate': 4.268e-07, 'completion_length': 155.87500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.006561279296875, 'epoch': 0.57} + 57%|█████▋ | 1433/2500 [8:56:48<6:33:24, 22.12s/it] 57%|█████▋ | 1434/2500 [8:57:10<6:31:37, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.020698026631924802, 'learning_rate': 4.264e-07, 'completion_length': 153.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00518798828125, 'epoch': 0.57} + 57%|█████▋ | 1434/2500 [8:57:10<6:31:37, 22.04s/it] 57%|█████▋ | 1435/2500 [8:57:32<6:29:23, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.03357535646874694, 'learning_rate': 4.26e-07, 'completion_length': 169.39286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0050048828125, 
'epoch': 0.57} + 57%|█████▋ | 1435/2500 [8:57:32<6:29:23, 21.94s/it] 57%|█████▋ | 1436/2500 [8:57:55<6:32:41, 22.14s/it] {'loss': 0.0003, 'grad_norm': 0.022209022122393993, 'learning_rate': 4.2559999999999995e-07, 'completion_length': 162.1071548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0067291259765625, 'epoch': 0.57} + 57%|█████▋ | 1436/2500 [8:57:55<6:32:41, 22.14s/it] 57%|█████▋ | 1437/2500 [8:58:17<6:30:49, 22.06s/it] {'loss': 0.0002, 'grad_norm': 0.22215299015177709, 'learning_rate': 4.252e-07, 'completion_length': 152.96429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0049896240234375, 'epoch': 0.57} + 57%|█████▋ | 1437/2500 [8:58:17<6:30:49, 22.06s/it] 58%|█████▊ | 1438/2500 [8:58:40<6:36:14, 22.39s/it] {'loss': 0.0002, 'grad_norm': 0.2069360116581349, 'learning_rate': 4.248e-07, 'completion_length': 159.2321548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006256103515625, 'epoch': 0.58} + 58%|█████▊ | 1438/2500 [8:58:40<6:36:14, 22.39s/it] 58%|█████▊ | 1439/2500 [8:59:02<6:34:10, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.3282312374969017, 'learning_rate': 4.2439999999999996e-07, 'completion_length': 155.80357360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0055389404296875, 'epoch': 0.58} + 58%|█████▊ | 1439/2500 [8:59:02<6:34:10, 22.29s/it] 58%|█████▊ | 1440/2500 [8:59:25<6:41:26, 22.72s/it] {'loss': 0.0003, 'grad_norm': 0.43541070410198224, 'learning_rate': 4.24e-07, 'completion_length': 157.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 
'reward_std': 0.0357142873108387, 'kl': 0.0063018798828125, 'epoch': 0.58} + 58%|█████▊ | 1440/2500 [8:59:25<6:41:26, 22.72s/it] 58%|█████▊ | 1441/2500 [8:59:47<6:33:16, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.02456893940700869, 'learning_rate': 4.2359999999999995e-07, 'completion_length': 154.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004638671875, 'epoch': 0.58} + 58%|█████▊ | 1441/2500 [8:59:47<6:33:16, 22.28s/it] 58%|█████▊ | 1442/2500 [9:00:09<6:34:22, 22.37s/it] {'loss': 0.0001, 'grad_norm': 0.371641361760697, 'learning_rate': 4.232e-07, 'completion_length': 153.6428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.002960205078125, 'epoch': 0.58} + 58%|█████▊ | 1442/2500 [9:00:09<6:34:22, 22.37s/it] 58%|█████▊ | 1443/2500 [9:00:31<6:30:49, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.0386587521796354, 'learning_rate': 4.228e-07, 'completion_length': 149.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00445556640625, 'epoch': 0.58} + 58%|█████▊ | 1443/2500 [9:00:31<6:30:49, 22.18s/it] 58%|█████▊ | 1444/2500 [9:00:52<6:26:34, 21.96s/it] {'loss': 0.0001, 'grad_norm': 0.016037920290829524, 'learning_rate': 4.2239999999999997e-07, 'completion_length': 145.375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0034027099609375, 'epoch': 0.58} + 58%|█████▊ | 1444/2500 [9:00:52<6:26:34, 21.96s/it] 58%|█████▊ | 1445/2500 [9:01:15<6:29:31, 22.15s/it] {'loss': 0.0004, 'grad_norm': 0.021212899867010186, 'learning_rate': 4.2199999999999994e-07, 'completion_length': 161.96429443359375, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0091552734375, 'epoch': 0.58} + 58%|█████▊ | 
1445/2500 [9:01:15<6:29:31, 22.15s/it] 58%|█████▊ | 1446/2500 [9:01:37<6:28:03, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.836250785528648, 'learning_rate': 4.2159999999999996e-07, 'completion_length': 156.4107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00616455078125, 'epoch': 0.58} + 58%|█████▊ | 1446/2500 [9:01:37<6:28:03, 22.09s/it] 58%|█████▊ | 1447/2500 [9:01:59<6:29:39, 22.20s/it] {'loss': 0.0002, 'grad_norm': 0.03917350083995192, 'learning_rate': 4.212e-07, 'completion_length': 156.25, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0042724609375, 'epoch': 0.58} + 58%|█████▊ | 1447/2500 [9:01:59<6:29:39, 22.20s/it] 58%|█████▊ | 1448/2500 [9:02:22<6:33:10, 22.42s/it] {'loss': 0.0002, 'grad_norm': 0.02312888656681158, 'learning_rate': 4.208e-07, 'completion_length': 153.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00464630126953125, 'epoch': 0.58} + 58%|█████▊ | 1448/2500 [9:02:22<6:33:10, 22.42s/it] 58%|█████▊ | 1449/2500 [9:02:45<6:33:18, 22.45s/it] {'loss': 0.0002, 'grad_norm': 0.20332296644358563, 'learning_rate': 4.204e-07, 'completion_length': 168.62500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0060272216796875, 'epoch': 0.58} + 58%|█████▊ | 1449/2500 [9:02:45<6:33:18, 22.45s/it] 58%|█████▊ | 1450/2500 [9:03:08<6:34:56, 22.57s/it] {'loss': 0.0003, 'grad_norm': 0.272912370249747, 'learning_rate': 4.1999999999999995e-07, 'completion_length': 171.5357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0072784423828125, 'epoch': 0.58} + 58%|█████▊ | 1450/2500 [9:03:08<6:34:56, 22.57s/it] 
58%|█████▊ | 1451/2500 [9:03:30<6:34:58, 22.59s/it] {'loss': 0.0002, 'grad_norm': 0.01826533253574427, 'learning_rate': 4.1959999999999997e-07, 'completion_length': 140.6607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00389862060546875, 'epoch': 0.58} + 58%|█████▊ | 1451/2500 [9:03:30<6:34:58, 22.59s/it] 58%|█████▊ | 1452/2500 [9:03:51<6:26:19, 22.12s/it] {'loss': 0.0003, 'grad_norm': 0.4401397751420804, 'learning_rate': 4.192e-07, 'completion_length': 148.6964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0071258544921875, 'epoch': 0.58} + 58%|█████▊ | 1452/2500 [9:03:51<6:26:19, 22.12s/it] 58%|█████▊ | 1453/2500 [9:04:14<6:28:30, 22.26s/it] {'loss': 0.0003, 'grad_norm': 0.17962244236113975, 'learning_rate': 4.1879999999999996e-07, 'completion_length': 173.0357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00750732421875, 'epoch': 0.58} + 58%|█████▊ | 1453/2500 [9:04:14<6:28:30, 22.26s/it] 58%|█████▊ | 1454/2500 [9:04:36<6:28:16, 22.27s/it] {'loss': 0.0002, 'grad_norm': 0.8565075674824996, 'learning_rate': 4.184e-07, 'completion_length': 167.2678680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.006195068359375, 'epoch': 0.58} + 58%|█████▊ | 1454/2500 [9:04:36<6:28:16, 22.27s/it] 58%|█████▊ | 1455/2500 [9:04:57<6:18:14, 21.72s/it] {'loss': 0.0002, 'grad_norm': 0.3716648727096669, 'learning_rate': 4.1799999999999996e-07, 'completion_length': 144.2857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005615234375, 'epoch': 0.58} + 
58%|█████▊ | 1455/2500 [9:04:57<6:18:14, 21.72s/it] 58%|█████▊ | 1456/2500 [9:05:19<6:19:50, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.06327237186109502, 'learning_rate': 4.1760000000000003e-07, 'completion_length': 162.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00473785400390625, 'epoch': 0.58} + 58%|█████▊ | 1456/2500 [9:05:19<6:19:50, 21.83s/it] 58%|█████▊ | 1457/2500 [9:05:41<6:22:49, 22.02s/it] {'loss': 0.0002, 'grad_norm': 0.024769488100273945, 'learning_rate': 4.172e-07, 'completion_length': 159.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00604248046875, 'epoch': 0.58} + 58%|█████▊ | 1457/2500 [9:05:41<6:22:49, 22.02s/it] 58%|█████▊ | 1458/2500 [9:06:02<6:16:50, 21.70s/it] {'loss': 0.0002, 'grad_norm': 0.02493062447083011, 'learning_rate': 4.1679999999999997e-07, 'completion_length': 145.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048065185546875, 'epoch': 0.58} + 58%|█████▊ | 1458/2500 [9:06:02<6:16:50, 21.70s/it] 58%|█████▊ | 1459/2500 [9:06:25<6:21:59, 22.02s/it] {'loss': 0.0003, 'grad_norm': 0.4284434710076825, 'learning_rate': 4.164e-07, 'completion_length': 171.80358123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.006439208984375, 'epoch': 0.58} + 58%|█████▊ | 1459/2500 [9:06:25<6:21:59, 22.02s/it] 58%|█████▊ | 1460/2500 [9:06:47<6:23:54, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.21703731417220168, 'learning_rate': 4.1599999999999997e-07, 'completion_length': 151.9464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005523681640625, 'epoch': 0.58} + 58%|█████▊ | 1460/2500 [9:06:48<6:23:54, 22.15s/it] 
58%|█████▊ | 1461/2500 [9:07:09<6:18:22, 21.85s/it] {'loss': 0.0003, 'grad_norm': 0.7553816403566344, 'learning_rate': 4.156e-07, 'completion_length': 154.3214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.007080078125, 'epoch': 0.58} + 58%|█████▊ | 1461/2500 [9:07:09<6:18:22, 21.85s/it] 58%|█████▊ | 1462/2500 [9:07:30<6:13:38, 21.60s/it] {'loss': 0.0003, 'grad_norm': 0.026721437042214184, 'learning_rate': 4.152e-07, 'completion_length': 149.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0070343017578125, 'epoch': 0.58} + 58%|█████▊ | 1462/2500 [9:07:30<6:13:38, 21.60s/it] 59%|█████▊ | 1463/2500 [9:07:52<6:15:30, 21.73s/it] {'loss': 0.0002, 'grad_norm': 0.026847958069119632, 'learning_rate': 4.148e-07, 'completion_length': 157.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052642822265625, 'epoch': 0.59} + 59%|█████▊ | 1463/2500 [9:07:52<6:15:30, 21.73s/it] 59%|█████▊ | 1464/2500 [9:08:13<6:13:24, 21.63s/it] {'loss': 0.0003, 'grad_norm': 0.5026934205415031, 'learning_rate': 4.1439999999999995e-07, 'completion_length': 154.83929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0068206787109375, 'epoch': 0.59} + 59%|█████▊ | 1464/2500 [9:08:13<6:13:24, 21.63s/it] 59%|█████▊ | 1465/2500 [9:08:34<6:09:46, 21.44s/it] {'loss': 0.0002, 'grad_norm': 0.026535409651589036, 'learning_rate': 4.14e-07, 'completion_length': 143.08928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00469970703125, 'epoch': 0.59} + 59%|█████▊ | 1465/2500 [9:08:34<6:09:46, 21.44s/it] 59%|█████▊ | 1466/2500 [9:08:56<6:12:51, 21.64s/it] {'loss': 0.0002, 'grad_norm': 
1.4652773492562763, 'learning_rate': 4.136e-07, 'completion_length': 145.67857360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00478363037109375, 'epoch': 0.59} + 59%|█████▊ | 1466/2500 [9:08:56<6:12:51, 21.64s/it] 59%|█████▊ | 1467/2500 [9:09:17<6:08:43, 21.42s/it] {'loss': 0.0002, 'grad_norm': 0.05630406789773322, 'learning_rate': 4.1319999999999997e-07, 'completion_length': 147.96428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00405120849609375, 'epoch': 0.59} + 59%|█████▊ | 1467/2500 [9:09:17<6:08:43, 21.42s/it] 59%|█████▊ | 1468/2500 [9:09:40<6:14:19, 21.76s/it] {'loss': 0.0002, 'grad_norm': 0.17235239875382452, 'learning_rate': 4.128e-07, 'completion_length': 164.92858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0042266845703125, 'epoch': 0.59} + 59%|█████▊ | 1468/2500 [9:09:40<6:14:19, 21.76s/it] 59%|█████▉ | 1469/2500 [9:10:02<6:15:36, 21.86s/it] {'loss': 0.0002, 'grad_norm': 0.034398727478067874, 'learning_rate': 4.1239999999999996e-07, 'completion_length': 174.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00563812255859375, 'epoch': 0.59} + 59%|█████▉ | 1469/2500 [9:10:02<6:15:36, 21.86s/it] 59%|█████▉ | 1470/2500 [9:10:23<6:13:40, 21.77s/it] {'loss': 0.0003, 'grad_norm': 0.018004510535907795, 'learning_rate': 4.12e-07, 'completion_length': 166.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00628662109375, 'epoch': 0.59} + 59%|█████▉ | 1470/2500 [9:10:23<6:13:40, 21.77s/it] 59%|█████▉ | 1471/2500 [9:10:45<6:11:55, 21.69s/it] {'loss': 0.0002, 'grad_norm': 0.041944518925645005, 'learning_rate': 4.116e-07, 
'completion_length': 152.87500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00437164306640625, 'epoch': 0.59} + 59%|█████▉ | 1471/2500 [9:10:45<6:11:55, 21.69s/it] 59%|█████▉ | 1472/2500 [9:11:08<6:19:13, 22.13s/it] {'loss': 0.0002, 'grad_norm': 0.26103709935444586, 'learning_rate': 4.112e-07, 'completion_length': 160.21429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059051513671875, 'epoch': 0.59} + 59%|█████▉ | 1472/2500 [9:11:08<6:19:13, 22.13s/it] 59%|█████▉ | 1473/2500 [9:11:30<6:20:22, 22.22s/it] {'loss': 0.0003, 'grad_norm': 0.35383379351357874, 'learning_rate': 4.108e-07, 'completion_length': 151.5178680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.006439208984375, 'epoch': 0.59} + 59%|█████▉ | 1473/2500 [9:11:30<6:20:22, 22.22s/it] 59%|█████▉ | 1474/2500 [9:11:54<6:25:42, 22.56s/it] {'loss': 0.0003, 'grad_norm': 1.2014158096372445, 'learning_rate': 4.1039999999999997e-07, 'completion_length': 160.00000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006866455078125, 'epoch': 0.59} + 59%|█████▉ | 1474/2500 [9:11:54<6:25:42, 22.56s/it] 59%|█████▉ | 1475/2500 [9:12:16<6:25:52, 22.59s/it] {'loss': 0.0002, 'grad_norm': 0.6642154115351047, 'learning_rate': 4.0999999999999994e-07, 'completion_length': 157.33929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0049285888671875, 'epoch': 0.59} + 59%|█████▉ | 1475/2500 [9:12:16<6:25:52, 22.59s/it] 59%|█████▉ | 1476/2500 [9:12:40<6:28:35, 22.77s/it] {'loss': 0.0002, 'grad_norm': 0.19815332570643363, 
'learning_rate': 4.096e-07, 'completion_length': 161.26786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0058746337890625, 'epoch': 0.59} + 59%|█████▉ | 1476/2500 [9:12:40<6:28:35, 22.77s/it] 59%|█████▉ | 1477/2500 [9:13:01<6:23:01, 22.46s/it] {'loss': 0.0001, 'grad_norm': 0.016846077507085493, 'learning_rate': 4.092e-07, 'completion_length': 150.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0032196044921875, 'epoch': 0.59} + 59%|█████▉ | 1477/2500 [9:13:01<6:23:01, 22.46s/it] 59%|█████▉ | 1478/2500 [9:13:25<6:29:25, 22.86s/it] {'loss': 0.0002, 'grad_norm': 0.02015653763350023, 'learning_rate': 4.0879999999999995e-07, 'completion_length': 156.8214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0053558349609375, 'epoch': 0.59} + 59%|█████▉ | 1478/2500 [9:13:25<6:29:25, 22.86s/it] 59%|█████▉ | 1479/2500 [9:13:47<6:25:56, 22.68s/it] {'loss': 0.0002, 'grad_norm': 0.22232978669689174, 'learning_rate': 4.084e-07, 'completion_length': 170.25000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005462646484375, 'epoch': 0.59} + 59%|█████▉ | 1479/2500 [9:13:47<6:25:56, 22.68s/it] 59%|█████▉ | 1480/2500 [9:14:09<6:22:37, 22.51s/it] {'loss': 0.0002, 'grad_norm': 0.6347940701798407, 'learning_rate': 4.0799999999999995e-07, 'completion_length': 148.32144165039062, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00543212890625, 'epoch': 0.59} + 59%|█████▉ | 1480/2500 [9:14:09<6:22:37, 22.51s/it] 59%|█████▉ | 1481/2500 [9:14:32<6:20:53, 22.43s/it] {'loss': 0.0002, 'grad_norm': 
0.14030338437186615, 'learning_rate': 4.076e-07, 'completion_length': 147.75000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0057525634765625, 'epoch': 0.59} + 59%|█████▉ | 1481/2500 [9:14:32<6:20:53, 22.43s/it] 59%|█████▉ | 1482/2500 [9:14:55<6:26:33, 22.78s/it] {'loss': 0.0003, 'grad_norm': 0.48468449432240274, 'learning_rate': 4.072e-07, 'completion_length': 164.60714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0070037841796875, 'epoch': 0.59} + 59%|█████▉ | 1482/2500 [9:14:55<6:26:33, 22.78s/it] 59%|█████▉ | 1483/2500 [9:15:19<6:31:43, 23.11s/it] {'loss': 0.0002, 'grad_norm': 0.014864153345801932, 'learning_rate': 4.0679999999999996e-07, 'completion_length': 148.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045928955078125, 'epoch': 0.59} + 59%|█████▉ | 1483/2500 [9:15:19<6:31:43, 23.11s/it] 59%|█████▉ | 1484/2500 [9:15:41<6:25:05, 22.74s/it] {'loss': 0.0003, 'grad_norm': 0.4718835686185006, 'learning_rate': 4.064e-07, 'completion_length': 155.62500762939453, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0085296630859375, 'epoch': 0.59} + 59%|█████▉ | 1484/2500 [9:15:41<6:25:05, 22.74s/it] 59%|█████▉ | 1485/2500 [9:16:04<6:23:33, 22.67s/it] {'loss': 0.0002, 'grad_norm': 0.02429231088259965, 'learning_rate': 4.06e-07, 'completion_length': 142.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0062103271484375, 'epoch': 0.59} + 59%|█████▉ | 1485/2500 [9:16:04<6:23:33, 22.67s/it] 59%|█████▉ | 1486/2500 [9:16:26<6:21:12, 22.56s/it] {'loss': 0.0004, 'grad_norm': 0.03152408386335799, 'learning_rate': 
4.056e-07, 'completion_length': 153.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0093994140625, 'epoch': 0.59} + 59%|█████▉ | 1486/2500 [9:16:26<6:21:12, 22.56s/it] 59%|█████▉ | 1487/2500 [9:16:47<6:14:41, 22.19s/it] {'loss': 0.0002, 'grad_norm': 0.01704992004049752, 'learning_rate': 4.052e-07, 'completion_length': 153.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054931640625, 'epoch': 0.59} + 59%|█████▉ | 1487/2500 [9:16:47<6:14:41, 22.19s/it] 60%|█████▉ | 1488/2500 [9:17:09<6:13:15, 22.13s/it] {'loss': 0.0002, 'grad_norm': 0.018225814638297262, 'learning_rate': 4.0479999999999997e-07, 'completion_length': 156.55358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00440216064453125, 'epoch': 0.6} + 60%|█████▉ | 1488/2500 [9:17:09<6:13:15, 22.13s/it] 60%|█████▉ | 1489/2500 [9:17:32<6:17:33, 22.41s/it] {'loss': 0.0002, 'grad_norm': 0.22616077702003073, 'learning_rate': 4.0439999999999994e-07, 'completion_length': 151.83929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0045928955078125, 'epoch': 0.6} + 60%|█████▉ | 1489/2500 [9:17:32<6:17:33, 22.41s/it] 60%|█████▉ | 1490/2500 [9:17:54<6:14:51, 22.27s/it] {'loss': 0.0002, 'grad_norm': 0.22606118766379885, 'learning_rate': 4.04e-07, 'completion_length': 160.33929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050048828125, 'epoch': 0.6} + 60%|█████▉ | 1490/2500 [9:17:54<6:14:51, 22.27s/it] 60%|█████▉ | 1491/2500 [9:18:17<6:17:04, 22.42s/it] {'loss': 0.0002, 'grad_norm': 1.0141077964777045, 'learning_rate': 4.036e-07, 'completion_length': 
156.48214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00408935546875, 'epoch': 0.6} + 60%|█████▉ | 1491/2500 [9:18:17<6:17:04, 22.42s/it] 60%|█████▉ | 1492/2500 [9:18:39<6:13:03, 22.21s/it] {'loss': 0.0002, 'grad_norm': 0.018629803093851725, 'learning_rate': 4.032e-07, 'completion_length': 157.6964340209961, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0060882568359375, 'epoch': 0.6} + 60%|█████▉ | 1492/2500 [9:18:39<6:13:03, 22.21s/it] 60%|█████▉ | 1493/2500 [9:19:01<6:10:41, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.024350376881052665, 'learning_rate': 4.028e-07, 'completion_length': 149.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006134033203125, 'epoch': 0.6} + 60%|█████▉ | 1493/2500 [9:19:01<6:10:41, 22.09s/it] 60%|█████▉ | 1494/2500 [9:19:24<6:15:05, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.4888821150359301, 'learning_rate': 4.0239999999999995e-07, 'completion_length': 157.2857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00579071044921875, 'epoch': 0.6} + 60%|█████▉ | 1494/2500 [9:19:24<6:15:05, 22.37s/it] 60%|█████▉ | 1495/2500 [9:19:44<6:05:57, 21.85s/it] {'loss': 0.0002, 'grad_norm': 0.03279370922857068, 'learning_rate': 4.02e-07, 'completion_length': 140.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00460052490234375, 'epoch': 0.6} + 60%|█████▉ | 1495/2500 [9:19:44<6:05:57, 21.85s/it] 60%|█████▉ | 1496/2500 [9:20:06<6:05:17, 21.83s/it] {'loss': 0.0002, 'grad_norm': 1.2619077180176208, 'learning_rate': 4.016e-07, 'completion_length': 150.17857360839844, 'rewards/accuracy_reward': 
0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00385284423828125, 'epoch': 0.6} + 60%|█████▉ | 1496/2500 [9:20:06<6:05:17, 21.83s/it] 60%|█████▉ | 1497/2500 [9:20:28<6:04:14, 21.79s/it] {'loss': 0.0001, 'grad_norm': 0.8900808740838232, 'learning_rate': 4.0119999999999997e-07, 'completion_length': 146.01786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00318145751953125, 'epoch': 0.6} + 60%|█████▉ | 1497/2500 [9:20:28<6:04:14, 21.79s/it] 60%|█████▉ | 1498/2500 [9:20:49<6:01:37, 21.65s/it] {'loss': 0.0003, 'grad_norm': 0.5258724774513144, 'learning_rate': 4.008e-07, 'completion_length': 159.46429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.00775146484375, 'epoch': 0.6} + 60%|█████▉ | 1498/2500 [9:20:49<6:01:37, 21.65s/it] 60%|█████▉ | 1499/2500 [9:21:11<6:04:25, 21.84s/it] {'loss': 0.0003, 'grad_norm': 0.03093014240267437, 'learning_rate': 4.0039999999999996e-07, 'completion_length': 168.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0073394775390625, 'epoch': 0.6} + 60%|█████▉ | 1499/2500 [9:21:11<6:04:25, 21.84s/it] 60%|██████ | 1500/2500 [9:21:34<6:06:46, 22.01s/it] {'loss': 0.0002, 'grad_norm': 0.018184215847293718, 'learning_rate': 4e-07, 'completion_length': 156.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050506591796875, 'epoch': 0.6} + 60%|██████ | 1500/2500 [9:21:34<6:06:46, 22.01s/it] 60%|██████ | 1501/2500 [9:24:25<18:30:14, 66.68s/it] {'loss': 0.0002, 'grad_norm': 0.01676041545981702, 'learning_rate': 3.996e-07, 'completion_length': 149.92858123779297, 'rewards/accuracy_reward': 0.8571428656578064, 
'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0041656494140625, 'epoch': 0.6} + 60%|██████ | 1501/2500 [9:24:25<18:30:14, 66.68s/it] 60%|██████ | 1502/2500 [9:24:46<14:41:20, 52.99s/it] {'loss': 0.0002, 'grad_norm': 0.7116642905467323, 'learning_rate': 3.992e-07, 'completion_length': 152.0714340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.005218505859375, 'epoch': 0.6} + 60%|██████ | 1502/2500 [9:24:46<14:41:20, 52.99s/it] 60%|██████ | 1503/2500 [9:25:08<12:09:28, 43.90s/it] {'loss': 0.0004, 'grad_norm': 0.5051165488043311, 'learning_rate': 3.9879999999999994e-07, 'completion_length': 167.9464340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.01068115234375, 'epoch': 0.6} + 60%|██████ | 1503/2500 [9:25:08<12:09:28, 43.90s/it] 60%|██████ | 1504/2500 [9:25:29<10:15:01, 37.05s/it] {'loss': 0.0002, 'grad_norm': 0.5797342452287115, 'learning_rate': 3.9839999999999997e-07, 'completion_length': 135.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048065185546875, 'epoch': 0.6} + 60%|██████ | 1504/2500 [9:25:29<10:15:01, 37.05s/it] 60%|██████ | 1505/2500 [9:25:51<8:59:41, 32.54s/it] {'loss': 0.0003, 'grad_norm': 0.8799697363235844, 'learning_rate': 3.98e-07, 'completion_length': 162.6071548461914, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.1181928962469101, 'kl': 0.008331298828125, 'epoch': 0.6} + 60%|██████ | 1505/2500 [9:25:51<8:59:41, 32.54s/it] 60%|██████ | 1506/2500 [9:26:15<8:14:43, 29.86s/it] {'loss': 0.0002, 'grad_norm': 0.06625586097003705, 'learning_rate': 3.976e-07, 'completion_length': 164.3571548461914, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0059967041015625, 'epoch': 0.6} + 60%|██████ | 1506/2500 [9:26:15<8:14:43, 29.86s/it] 60%|██████ | 1507/2500 [9:26:38<7:39:04, 27.74s/it] {'loss': 0.0001, 'grad_norm': 0.015090111683621489, 'learning_rate': 3.972e-07, 'completion_length': 147.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00351715087890625, 'epoch': 0.6} + 60%|██████ | 1507/2500 [9:26:38<7:39:04, 27.74s/it] 60%|██████ | 1508/2500 [9:26:59<7:08:21, 25.91s/it] {'loss': 0.0003, 'grad_norm': 0.034051078843815405, 'learning_rate': 3.9679999999999995e-07, 'completion_length': 159.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006866455078125, 'epoch': 0.6} + 60%|██████ | 1508/2500 [9:26:59<7:08:21, 25.91s/it] 60%|██████ | 1509/2500 [9:27:21<6:47:36, 24.68s/it] {'loss': 0.0002, 'grad_norm': 0.04960639314489433, 'learning_rate': 3.964e-07, 'completion_length': 151.12500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046844482421875, 'epoch': 0.6} + 60%|██████ | 1509/2500 [9:27:21<6:47:36, 24.68s/it] 60%|██████ | 1510/2500 [9:27:44<6:36:02, 24.00s/it] {'loss': 0.0002, 'grad_norm': 0.16571160944801575, 'learning_rate': 3.96e-07, 'completion_length': 157.8571548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00592803955078125, 'epoch': 0.6} + 60%|██████ | 1510/2500 [9:27:44<6:36:02, 24.00s/it] 60%|██████ | 1511/2500 [9:28:05<6:22:35, 23.21s/it] {'loss': 0.0001, 'grad_norm': 0.03030837296384698, 'learning_rate': 3.9559999999999997e-07, 'completion_length': 139.98215103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.0027008056640625, 'epoch': 0.6} + 60%|██████ | 1511/2500 [9:28:05<6:22:35, 23.21s/it] 60%|██████ | 1512/2500 [9:28:27<6:16:07, 22.84s/it] {'loss': 0.0003, 'grad_norm': 0.030906087820662793, 'learning_rate': 3.952e-07, 'completion_length': 158.98214721679688, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.008544921875, 'epoch': 0.6} + 60%|██████ | 1512/2500 [9:28:27<6:16:07, 22.84s/it] 61%|██████ | 1513/2500 [9:28:48<6:07:46, 22.36s/it] {'loss': 0.0002, 'grad_norm': 0.04126234383693304, 'learning_rate': 3.9479999999999996e-07, 'completion_length': 147.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00390625, 'epoch': 0.61} + 61%|██████ | 1513/2500 [9:28:48<6:07:46, 22.36s/it] 61%|██████ | 1514/2500 [9:29:10<6:04:10, 22.16s/it] {'loss': 0.0003, 'grad_norm': 0.32800504811225945, 'learning_rate': 3.9439999999999993e-07, 'completion_length': 158.57144165039062, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.007568359375, 'epoch': 0.61} + 61%|██████ | 1514/2500 [9:29:10<6:04:10, 22.16s/it] 61%|██████ | 1515/2500 [9:29:31<6:00:41, 21.97s/it] {'loss': 0.0003, 'grad_norm': 0.33288473458010237, 'learning_rate': 3.94e-07, 'completion_length': 144.4107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0062713623046875, 'epoch': 0.61} + 61%|██████ | 1515/2500 [9:29:31<6:00:41, 21.97s/it] 61%|██████ | 1516/2500 [9:29:53<6:00:22, 21.97s/it] {'loss': 0.0002, 'grad_norm': 0.03732426135760503, 'learning_rate': 3.936e-07, 'completion_length': 157.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0037841796875, 'epoch': 0.61} + 61%|██████ | 1516/2500 
[9:29:53<6:00:22, 21.97s/it] 61%|██████ | 1517/2500 [9:30:15<5:59:10, 21.92s/it] {'loss': 0.0003, 'grad_norm': 0.04674025177596526, 'learning_rate': 3.932e-07, 'completion_length': 162.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0071258544921875, 'epoch': 0.61} + 61%|██████ | 1517/2500 [9:30:15<5:59:10, 21.92s/it] 61%|██████ | 1518/2500 [9:30:36<5:55:20, 21.71s/it] {'loss': 0.0001, 'grad_norm': 0.2504942636619503, 'learning_rate': 3.9279999999999997e-07, 'completion_length': 149.4107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00335693359375, 'epoch': 0.61} + 61%|██████ | 1518/2500 [9:30:36<5:55:20, 21.71s/it] 61%|██████ | 1519/2500 [9:30:57<5:50:55, 21.46s/it] {'loss': 0.0001, 'grad_norm': 0.020046120252644403, 'learning_rate': 3.924e-07, 'completion_length': 147.2678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00365447998046875, 'epoch': 0.61} + 61%|██████ | 1519/2500 [9:30:57<5:50:55, 21.46s/it] 61%|██████ | 1520/2500 [9:31:19<5:50:17, 21.45s/it] {'loss': 0.0002, 'grad_norm': 0.016451258788058063, 'learning_rate': 3.92e-07, 'completion_length': 150.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0037841796875, 'epoch': 0.61} + 61%|██████ | 1520/2500 [9:31:19<5:50:17, 21.45s/it] 61%|██████ | 1521/2500 [9:31:41<5:55:06, 21.76s/it] {'loss': 0.0003, 'grad_norm': 0.016661366712071937, 'learning_rate': 3.916e-07, 'completion_length': 147.55358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006317138671875, 'epoch': 0.61} + 61%|██████ | 1521/2500 [9:31:41<5:55:06, 21.76s/it] 61%|██████ | 1522/2500 [9:32:02<5:49:03, 21.41s/it] {'loss': 0.0002, 'grad_norm': 0.023134233552647713, 
'learning_rate': 3.9119999999999996e-07, 'completion_length': 148.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00616455078125, 'epoch': 0.61} + 61%|██████ | 1522/2500 [9:32:02<5:49:03, 21.41s/it] 61%|██████ | 1523/2500 [9:32:23<5:47:38, 21.35s/it] {'loss': 0.0002, 'grad_norm': 0.019332210270272588, 'learning_rate': 3.908e-07, 'completion_length': 140.5714340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0040130615234375, 'epoch': 0.61} + 61%|██████ | 1523/2500 [9:32:23<5:47:38, 21.35s/it] 61%|██████ | 1524/2500 [9:32:45<5:48:25, 21.42s/it] {'loss': 0.0004, 'grad_norm': 0.2363825780197733, 'learning_rate': 3.904e-07, 'completion_length': 158.1607208251953, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0095367431640625, 'epoch': 0.61} + 61%|██████ | 1524/2500 [9:32:45<5:48:25, 21.42s/it] 61%|██████ | 1525/2500 [9:33:06<5:47:48, 21.40s/it] {'loss': 0.0001, 'grad_norm': 0.5894922893927221, 'learning_rate': 3.8999999999999997e-07, 'completion_length': 148.62500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00327301025390625, 'epoch': 0.61} + 61%|██████ | 1525/2500 [9:33:06<5:47:48, 21.40s/it] 61%|██████ | 1526/2500 [9:33:28<5:51:25, 21.65s/it] {'loss': 0.0002, 'grad_norm': 0.46923749146852994, 'learning_rate': 3.896e-07, 'completion_length': 155.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00469970703125, 'epoch': 0.61} + 61%|██████ | 1526/2500 [9:33:28<5:51:25, 21.65s/it] 61%|██████ | 1527/2500 [9:33:50<5:53:46, 21.82s/it] {'loss': 0.0002, 'grad_norm': 0.3234885187546106, 
'learning_rate': 3.8919999999999996e-07, 'completion_length': 151.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005340576171875, 'epoch': 0.61} + 61%|██████ | 1527/2500 [9:33:50<5:53:46, 21.82s/it] 61%|██████ | 1528/2500 [9:34:13<5:57:15, 22.05s/it] {'loss': 0.0003, 'grad_norm': 0.31217209758636477, 'learning_rate': 3.888e-07, 'completion_length': 153.7678680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0075836181640625, 'epoch': 0.61} + 61%|██████ | 1528/2500 [9:34:13<5:57:15, 22.05s/it] 61%|██████ | 1529/2500 [9:34:35<5:56:31, 22.03s/it] {'loss': 0.0002, 'grad_norm': 0.26094891805534415, 'learning_rate': 3.884e-07, 'completion_length': 157.2678680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059661865234375, 'epoch': 0.61} + 61%|██████ | 1529/2500 [9:34:35<5:56:31, 22.03s/it] 61%|██████ | 1530/2500 [9:34:56<5:51:18, 21.73s/it] {'loss': 0.0002, 'grad_norm': 0.06343214597609961, 'learning_rate': 3.88e-07, 'completion_length': 138.37500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00485992431640625, 'epoch': 0.61} + 61%|██████ | 1530/2500 [9:34:56<5:51:18, 21.73s/it] 61%|██████ | 1531/2500 [9:35:19<5:59:01, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.2401984150339232, 'learning_rate': 3.876e-07, 'completion_length': 165.00000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005035400390625, 'epoch': 0.61} + 61%|██████ | 1531/2500 [9:35:19<5:59:01, 22.23s/it] 61%|██████▏ | 1532/2500 [9:35:41<5:53:27, 21.91s/it] {'loss': 0.0002, 
'grad_norm': 0.248634254263835, 'learning_rate': 3.8719999999999997e-07, 'completion_length': 145.1964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0051727294921875, 'epoch': 0.61} + 61%|██████▏ | 1532/2500 [9:35:41<5:53:27, 21.91s/it] 61%|██████▏ | 1533/2500 [9:36:03<5:54:10, 21.98s/it] {'loss': 0.0003, 'grad_norm': 0.02103690279952468, 'learning_rate': 3.8679999999999994e-07, 'completion_length': 167.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007659912109375, 'epoch': 0.61} + 61%|██████▏ | 1533/2500 [9:36:03<5:54:10, 21.98s/it] 61%|██████▏ | 1534/2500 [9:36:24<5:50:42, 21.78s/it] {'loss': 0.0001, 'grad_norm': 0.0652939004193407, 'learning_rate': 3.864e-07, 'completion_length': 141.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00337982177734375, 'epoch': 0.61} + 61%|██████▏ | 1534/2500 [9:36:24<5:50:42, 21.78s/it] 61%|██████▏ | 1535/2500 [9:36:46<5:51:37, 21.86s/it] {'loss': 0.0003, 'grad_norm': 0.023410873046768445, 'learning_rate': 3.86e-07, 'completion_length': 159.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0075836181640625, 'epoch': 0.61} + 61%|██████▏ | 1535/2500 [9:36:46<5:51:37, 21.86s/it] 61%|██████▏ | 1536/2500 [9:37:07<5:48:42, 21.70s/it] {'loss': 0.0002, 'grad_norm': 0.026278104799288656, 'learning_rate': 3.8559999999999996e-07, 'completion_length': 150.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003936767578125, 'epoch': 0.61} + 61%|██████▏ | 1536/2500 [9:37:07<5:48:42, 21.70s/it] 61%|██████▏ | 1537/2500 [9:37:31<5:54:54, 22.11s/it] {'loss': 0.0003, 'grad_norm': 0.4724028812312216, 'learning_rate': 3.852e-07, 'completion_length': 
172.6428680419922, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.00665283203125, 'epoch': 0.61} + 61%|██████▏ | 1537/2500 [9:37:31<5:54:54, 22.11s/it] 62%|██████▏ | 1538/2500 [9:37:52<5:53:25, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.0196218816918935, 'learning_rate': 3.8479999999999995e-07, 'completion_length': 157.08929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004791259765625, 'epoch': 0.62} + 62%|██████▏ | 1538/2500 [9:37:52<5:53:25, 22.04s/it] 62%|██████▏ | 1539/2500 [9:38:13<5:46:40, 21.64s/it] {'loss': 0.0002, 'grad_norm': 0.023874043470277825, 'learning_rate': 3.8440000000000003e-07, 'completion_length': 138.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004566192626953125, 'epoch': 0.62} + 62%|██████▏ | 1539/2500 [9:38:13<5:46:40, 21.64s/it] 62%|██████▏ | 1540/2500 [9:38:35<5:45:41, 21.61s/it] {'loss': 0.0002, 'grad_norm': 0.3445744686401845, 'learning_rate': 3.84e-07, 'completion_length': 158.17858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005523681640625, 'epoch': 0.62} + 62%|██████▏ | 1540/2500 [9:38:35<5:45:41, 21.61s/it] 62%|██████▏ | 1541/2500 [9:38:56<5:43:24, 21.49s/it] {'loss': 0.0002, 'grad_norm': 0.19656181835172287, 'learning_rate': 3.8359999999999997e-07, 'completion_length': 146.9107208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00537109375, 'epoch': 0.62} + 62%|██████▏ | 1541/2500 [9:38:56<5:43:24, 21.49s/it] 62%|██████▏ | 1542/2500 [9:39:18<5:45:31, 21.64s/it] {'loss': 0.0002, 'grad_norm': 0.029344936349199713, 'learning_rate': 3.832e-07, 'completion_length': 
151.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00403594970703125, 'epoch': 0.62} + 62%|██████▏ | 1542/2500 [9:39:18<5:45:31, 21.64s/it] 62%|██████▏ | 1543/2500 [9:39:40<5:46:18, 21.71s/it] {'loss': 0.0002, 'grad_norm': 0.01709809701560625, 'learning_rate': 3.8279999999999996e-07, 'completion_length': 151.0714340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0057373046875, 'epoch': 0.62} + 62%|██████▏ | 1543/2500 [9:39:40<5:46:18, 21.71s/it] 62%|██████▏ | 1544/2500 [9:40:00<5:41:11, 21.41s/it] {'loss': 0.0002, 'grad_norm': 3.013620173129407, 'learning_rate': 3.824e-07, 'completion_length': 151.71429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0055389404296875, 'epoch': 0.62} + 62%|██████▏ | 1544/2500 [9:40:00<5:41:11, 21.41s/it] 62%|██████▏ | 1545/2500 [9:40:23<5:48:28, 21.89s/it] {'loss': 0.0003, 'grad_norm': 0.43485564433027035, 'learning_rate': 3.82e-07, 'completion_length': 166.9464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006988525390625, 'epoch': 0.62} + 62%|██████▏ | 1545/2500 [9:40:23<5:48:28, 21.89s/it] 62%|██████▏ | 1546/2500 [9:40:46<5:48:55, 21.94s/it] {'loss': 0.0003, 'grad_norm': 0.3734326907574791, 'learning_rate': 3.816e-07, 'completion_length': 160.6607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00750732421875, 'epoch': 0.62} + 62%|██████▏ | 1546/2500 [9:40:46<5:48:55, 21.94s/it] 62%|██████▏ | 1547/2500 [9:41:06<5:41:42, 21.51s/it] {'loss': 0.0003, 'grad_norm': 1.562240885639401, 'learning_rate': 3.8119999999999995e-07, 
'completion_length': 144.375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0085906982421875, 'epoch': 0.62} + 62%|██████▏ | 1547/2500 [9:41:06<5:41:42, 21.51s/it] 62%|██████▏ | 1548/2500 [9:41:26<5:34:36, 21.09s/it] {'loss': 0.0001, 'grad_norm': 0.022212543259399194, 'learning_rate': 3.808e-07, 'completion_length': 135.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00373077392578125, 'epoch': 0.62} + 62%|██████▏ | 1548/2500 [9:41:26<5:34:36, 21.09s/it] 62%|██████▏ | 1549/2500 [9:41:47<5:35:13, 21.15s/it] {'loss': 0.0003, 'grad_norm': 0.2335877753745026, 'learning_rate': 3.804e-07, 'completion_length': 152.75000762939453, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0072784423828125, 'epoch': 0.62} + 62%|██████▏ | 1549/2500 [9:41:47<5:35:13, 21.15s/it] 62%|██████▏ | 1550/2500 [9:42:07<5:29:06, 20.79s/it] {'loss': 0.0002, 'grad_norm': 1.6220598990540709, 'learning_rate': 3.7999999999999996e-07, 'completion_length': 127.39286422729492, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0046234130859375, 'epoch': 0.62} + 62%|██████▏ | 1550/2500 [9:42:07<5:29:06, 20.79s/it] 62%|██████▏ | 1551/2500 [9:42:30<5:36:49, 21.30s/it] {'loss': 0.0003, 'grad_norm': 0.4132560274312943, 'learning_rate': 3.796e-07, 'completion_length': 181.8214340209961, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.00726318359375, 'epoch': 0.62} + 62%|██████▏ | 1551/2500 [9:42:30<5:36:49, 21.30s/it] 62%|██████▏ | 1552/2500 [9:42:51<5:36:39, 21.31s/it] {'loss': 0.0002, 'grad_norm': 0.025357878659260653, 'learning_rate': 
3.7919999999999995e-07, 'completion_length': 137.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0057830810546875, 'epoch': 0.62} + 62%|██████▏ | 1552/2500 [9:42:51<5:36:39, 21.31s/it] 62%|██████▏ | 1553/2500 [9:43:14<5:41:10, 21.62s/it] {'loss': 0.0002, 'grad_norm': 0.3053599146869993, 'learning_rate': 3.7880000000000003e-07, 'completion_length': 155.80358123779297, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.005126953125, 'epoch': 0.62} + 62%|██████▏ | 1553/2500 [9:43:14<5:41:10, 21.62s/it] 62%|██████▏ | 1554/2500 [9:43:35<5:39:31, 21.53s/it] {'loss': 0.0002, 'grad_norm': 0.42041244927739785, 'learning_rate': 3.784e-07, 'completion_length': 158.2857208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.0048065185546875, 'epoch': 0.62} + 62%|██████▏ | 1554/2500 [9:43:35<5:39:31, 21.53s/it] 62%|██████▏ | 1555/2500 [9:43:59<5:51:03, 22.29s/it] {'loss': 0.0003, 'grad_norm': 0.45853852132383194, 'learning_rate': 3.7799999999999997e-07, 'completion_length': 166.55358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0064544677734375, 'epoch': 0.62} + 62%|██████▏ | 1555/2500 [9:43:59<5:51:03, 22.29s/it] 62%|██████▏ | 1556/2500 [9:44:22<5:52:25, 22.40s/it] {'loss': 0.0003, 'grad_norm': 0.8318454083345876, 'learning_rate': 3.776e-07, 'completion_length': 159.4107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0081634521484375, 'epoch': 0.62} + 62%|██████▏ | 1556/2500 [9:44:22<5:52:25, 22.40s/it] 62%|██████▏ | 1557/2500 [9:44:45<5:55:25, 22.61s/it] {'loss': 0.0002, 'grad_norm': 
0.14471767319675716, 'learning_rate': 3.7719999999999996e-07, 'completion_length': 172.87500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00604248046875, 'epoch': 0.62} + 62%|██████▏ | 1557/2500 [9:44:45<5:55:25, 22.61s/it] 62%|██████▏ | 1558/2500 [9:45:07<5:51:38, 22.40s/it] {'loss': 0.0002, 'grad_norm': 0.028222300958322864, 'learning_rate': 3.768e-07, 'completion_length': 150.1964340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0042572021484375, 'epoch': 0.62} + 62%|██████▏ | 1558/2500 [9:45:07<5:51:38, 22.40s/it] 62%|██████▏ | 1559/2500 [9:45:29<5:52:39, 22.49s/it] {'loss': 0.0002, 'grad_norm': 0.03493628410577496, 'learning_rate': 3.764e-07, 'completion_length': 163.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0042877197265625, 'epoch': 0.62} + 62%|██████▏ | 1559/2500 [9:45:29<5:52:39, 22.49s/it] 62%|██████▏ | 1560/2500 [9:45:51<5:48:29, 22.24s/it] {'loss': 0.0001, 'grad_norm': 0.02112537679054133, 'learning_rate': 3.76e-07, 'completion_length': 154.12500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00348663330078125, 'epoch': 0.62} + 62%|██████▏ | 1560/2500 [9:45:51<5:48:29, 22.24s/it] 62%|██████▏ | 1561/2500 [9:46:14<5:49:39, 22.34s/it] {'loss': 0.0004, 'grad_norm': 0.4367166430527568, 'learning_rate': 3.7559999999999995e-07, 'completion_length': 183.71429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.008880615234375, 'epoch': 0.62} + 62%|██████▏ | 1561/2500 [9:46:14<5:49:39, 22.34s/it] 62%|██████▏ | 1562/2500 [9:46:35<5:43:45, 21.99s/it] {'loss': 0.0001, 'grad_norm': 0.3140543412427965, 
'learning_rate': 3.7519999999999997e-07, 'completion_length': 148.9464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00348663330078125, 'epoch': 0.62} + 62%|██████▏ | 1562/2500 [9:46:35<5:43:45, 21.99s/it] 63%|██████▎ | 1563/2500 [9:46:57<5:43:11, 21.98s/it] {'loss': 0.0002, 'grad_norm': 0.031358464460501646, 'learning_rate': 3.748e-07, 'completion_length': 152.125, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005584716796875, 'epoch': 0.63} + 63%|██████▎ | 1563/2500 [9:46:57<5:43:11, 21.98s/it] 63%|██████▎ | 1564/2500 [9:47:21<5:53:14, 22.64s/it] {'loss': 0.0003, 'grad_norm': 0.14574821264467377, 'learning_rate': 3.744e-07, 'completion_length': 162.92858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006317138671875, 'epoch': 0.63} + 63%|██████▎ | 1564/2500 [9:47:21<5:53:14, 22.64s/it] 63%|██████▎ | 1565/2500 [9:47:41<5:43:17, 22.03s/it] {'loss': 0.0001, 'grad_norm': 0.014209142033443388, 'learning_rate': 3.74e-07, 'completion_length': 138.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00307464599609375, 'epoch': 0.63} + 63%|██████▎ | 1565/2500 [9:47:41<5:43:17, 22.03s/it] 63%|██████▎ | 1566/2500 [9:48:04<5:46:39, 22.27s/it] {'loss': 0.0003, 'grad_norm': 0.2950944163376232, 'learning_rate': 3.7359999999999996e-07, 'completion_length': 168.67858123779297, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0067138671875, 'epoch': 0.63} + 63%|██████▎ | 1566/2500 [9:48:04<5:46:39, 22.27s/it] 63%|██████▎ | 1567/2500 [9:48:26<5:45:33, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.5258443382953379, 'learning_rate': 3.732e-07, 
'completion_length': 150.39286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00466156005859375, 'epoch': 0.63} + 63%|██████▎ | 1567/2500 [9:48:26<5:45:33, 22.22s/it] 63%|██████▎ | 1568/2500 [9:48:47<5:35:49, 21.62s/it] {'loss': 0.0003, 'grad_norm': 0.8920920415263313, 'learning_rate': 3.728e-07, 'completion_length': 143.0357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0067901611328125, 'epoch': 0.63} + 63%|██████▎ | 1568/2500 [9:48:47<5:35:49, 21.62s/it] 63%|██████▎ | 1569/2500 [9:49:08<5:34:54, 21.58s/it] {'loss': 0.0002, 'grad_norm': 0.23265740679640212, 'learning_rate': 3.7239999999999997e-07, 'completion_length': 152.89286041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005157470703125, 'epoch': 0.63} + 63%|██████▎ | 1569/2500 [9:49:08<5:34:54, 21.58s/it] 63%|██████▎ | 1570/2500 [9:49:30<5:34:19, 21.57s/it] {'loss': 0.0002, 'grad_norm': 0.32810937633350573, 'learning_rate': 3.72e-07, 'completion_length': 143.9464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0058135986328125, 'epoch': 0.63} + 63%|██████▎ | 1570/2500 [9:49:30<5:34:19, 21.57s/it] 63%|██████▎ | 1571/2500 [9:49:51<5:34:52, 21.63s/it] {'loss': 0.0002, 'grad_norm': 0.24628057101528128, 'learning_rate': 3.7159999999999997e-07, 'completion_length': 157.55357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004913330078125, 'epoch': 0.63} + 63%|██████▎ | 1571/2500 [9:49:51<5:34:52, 21.63s/it] 63%|██████▎ | 1572/2500 [9:50:14<5:39:37, 21.96s/it] 
{'loss': 0.0003, 'grad_norm': 0.7405890199750054, 'learning_rate': 3.7119999999999994e-07, 'completion_length': 161.3214340209961, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0078277587890625, 'epoch': 0.63} + 63%|██████▎ | 1572/2500 [9:50:14<5:39:37, 21.96s/it] 63%|██████▎ | 1573/2500 [9:50:35<5:35:44, 21.73s/it] {'loss': 0.0002, 'grad_norm': 0.03470367209774367, 'learning_rate': 3.708e-07, 'completion_length': 158.8928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0062103271484375, 'epoch': 0.63} + 63%|██████▎ | 1573/2500 [9:50:35<5:35:44, 21.73s/it] 63%|██████▎ | 1574/2500 [9:50:57<5:36:23, 21.80s/it] {'loss': 0.0003, 'grad_norm': 0.02960450642244312, 'learning_rate': 3.704e-07, 'completion_length': 169.25000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.008026123046875, 'epoch': 0.63} + 63%|██████▎ | 1574/2500 [9:50:57<5:36:23, 21.80s/it] 63%|██████▎ | 1575/2500 [9:51:20<5:39:17, 22.01s/it] {'loss': 0.0002, 'grad_norm': 1.4309817309670254, 'learning_rate': 3.7e-07, 'completion_length': 160.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0058441162109375, 'epoch': 0.63} + 63%|██████▎ | 1575/2500 [9:51:20<5:39:17, 22.01s/it] 63%|██████▎ | 1576/2500 [9:51:41<5:35:46, 21.80s/it] {'loss': 0.0003, 'grad_norm': 0.5363972780269844, 'learning_rate': 3.696e-07, 'completion_length': 148.30358123779297, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0073699951171875, 'epoch': 0.63} + 63%|██████▎ | 1576/2500 [9:51:41<5:35:46, 21.80s/it] 63%|██████▎ | 1577/2500 
[9:52:03<5:34:23, 21.74s/it] {'loss': 0.0002, 'grad_norm': 0.019491156136669605, 'learning_rate': 3.6919999999999994e-07, 'completion_length': 150.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00403594970703125, 'epoch': 0.63} + 63%|██████▎ | 1577/2500 [9:52:03<5:34:23, 21.74s/it] 63%|██████▎ | 1578/2500 [9:52:24<5:33:39, 21.71s/it] {'loss': 0.0003, 'grad_norm': 0.466870468049027, 'learning_rate': 3.688e-07, 'completion_length': 155.05357360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.006317138671875, 'epoch': 0.63} + 63%|██████▎ | 1578/2500 [9:52:24<5:33:39, 21.71s/it] 63%|██████▎ | 1579/2500 [9:52:46<5:30:54, 21.56s/it] {'loss': 0.0001, 'grad_norm': 0.017534370479764368, 'learning_rate': 3.684e-07, 'completion_length': 147.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00327301025390625, 'epoch': 0.63} + 63%|██████▎ | 1579/2500 [9:52:46<5:30:54, 21.56s/it] 63%|██████▎ | 1580/2500 [9:53:08<5:35:29, 21.88s/it] {'loss': 0.0002, 'grad_norm': 0.5151728395860339, 'learning_rate': 3.6799999999999996e-07, 'completion_length': 158.21429443359375, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.0062408447265625, 'epoch': 0.63} + 63%|██████▎ | 1580/2500 [9:53:08<5:35:29, 21.88s/it] 63%|██████▎ | 1581/2500 [9:53:31<5:39:43, 22.18s/it] {'loss': 0.0003, 'grad_norm': 0.05189091647934994, 'learning_rate': 3.676e-07, 'completion_length': 180.71429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0071868896484375, 'epoch': 0.63} + 63%|██████▎ | 1581/2500 [9:53:31<5:39:43, 22.18s/it] 63%|██████▎ | 1582/2500 [9:53:52<5:32:59, 21.76s/it] {'loss': 0.0002, 'grad_norm': 
0.020678057829832806, 'learning_rate': 3.672e-07, 'completion_length': 146.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00470733642578125, 'epoch': 0.63} + 63%|██████▎ | 1582/2500 [9:53:52<5:32:59, 21.76s/it] 63%|██████▎ | 1583/2500 [9:54:14<5:34:39, 21.90s/it] {'loss': 0.0002, 'grad_norm': 0.02735156270432299, 'learning_rate': 3.668e-07, 'completion_length': 153.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004791259765625, 'epoch': 0.63} + 63%|██████▎ | 1583/2500 [9:54:14<5:34:39, 21.90s/it] 63%|██████▎ | 1584/2500 [9:54:36<5:36:51, 22.06s/it] {'loss': 0.0003, 'grad_norm': 0.2661303068167022, 'learning_rate': 3.664e-07, 'completion_length': 170.37500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007415771484375, 'epoch': 0.63} + 63%|██████▎ | 1584/2500 [9:54:37<5:36:51, 22.06s/it] 63%|██████▎ | 1585/2500 [9:54:59<5:37:38, 22.14s/it] {'loss': 0.0002, 'grad_norm': 0.02730709926245346, 'learning_rate': 3.6599999999999997e-07, 'completion_length': 164.5178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00418853759765625, 'epoch': 0.63} + 63%|██████▎ | 1585/2500 [9:54:59<5:37:38, 22.14s/it] 63%|██████▎ | 1586/2500 [9:55:22<5:40:46, 22.37s/it] {'loss': 0.0003, 'grad_norm': 0.42157715568360987, 'learning_rate': 3.6559999999999994e-07, 'completion_length': 170.7857208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00689697265625, 'epoch': 0.63} + 63%|██████▎ | 1586/2500 [9:55:22<5:40:46, 22.37s/it] 63%|██████▎ | 1587/2500 [9:55:44<5:39:08, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.2724382473087955, 'learning_rate': 3.652e-07, 
'completion_length': 151.6428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0059814453125, 'epoch': 0.63} + 63%|██████▎ | 1587/2500 [9:55:44<5:39:08, 22.29s/it] 64%|██████▎ | 1588/2500 [9:56:05<5:35:34, 22.08s/it] {'loss': 0.0002, 'grad_norm': 0.021843413547471296, 'learning_rate': 3.648e-07, 'completion_length': 160.42858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005950927734375, 'epoch': 0.64} + 64%|██████▎ | 1588/2500 [9:56:05<5:35:34, 22.08s/it] 64%|██████▎ | 1589/2500 [9:56:27<5:31:25, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.8505620515853899, 'learning_rate': 3.644e-07, 'completion_length': 152.4464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0057220458984375, 'epoch': 0.64} + 64%|██████▎ | 1589/2500 [9:56:27<5:31:25, 21.83s/it] 64%|██████▎ | 1590/2500 [9:56:49<5:31:36, 21.86s/it] {'loss': 0.0002, 'grad_norm': 0.01942712958759558, 'learning_rate': 3.64e-07, 'completion_length': 160.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006072998046875, 'epoch': 0.64} + 64%|██████▎ | 1590/2500 [9:56:49<5:31:36, 21.86s/it] 64%|██████▎ | 1591/2500 [9:57:11<5:32:10, 21.93s/it] {'loss': 0.0002, 'grad_norm': 0.8418736433941834, 'learning_rate': 3.6359999999999995e-07, 'completion_length': 147.85714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00505828857421875, 'epoch': 0.64} + 64%|██████▎ | 1591/2500 [9:57:11<5:32:10, 21.93s/it] 64%|██████▎ | 1592/2500 [9:57:33<5:32:21, 21.96s/it] {'loss': 0.0004, 'grad_norm': 2.855343102508509, 'learning_rate': 3.632e-07, 'completion_length': 
175.8928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00933837890625, 'epoch': 0.64} + 64%|██████▎ | 1592/2500 [9:57:33<5:32:21, 21.96s/it] 64%|██████▎ | 1593/2500 [9:57:54<5:30:45, 21.88s/it] {'loss': 0.0003, 'grad_norm': 0.019616623697383522, 'learning_rate': 3.628e-07, 'completion_length': 151.1428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0069122314453125, 'epoch': 0.64} + 64%|██████▎ | 1593/2500 [9:57:54<5:30:45, 21.88s/it] 64%|██████▍ | 1594/2500 [9:58:17<5:33:56, 22.12s/it] {'loss': 0.0002, 'grad_norm': 0.016706985089720742, 'learning_rate': 3.6239999999999996e-07, 'completion_length': 151.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052490234375, 'epoch': 0.64} + 64%|██████▍ | 1594/2500 [9:58:17<5:33:56, 22.12s/it] 64%|██████▍ | 1595/2500 [9:58:41<5:41:27, 22.64s/it] {'loss': 0.0002, 'grad_norm': 0.018366167337461145, 'learning_rate': 3.62e-07, 'completion_length': 174.58929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.004852294921875, 'epoch': 0.64} + 64%|██████▍ | 1595/2500 [9:58:41<5:41:27, 22.64s/it] 64%|██████▍ | 1596/2500 [9:59:03<5:40:10, 22.58s/it] {'loss': 0.0001, 'grad_norm': 0.15481959752803767, 'learning_rate': 3.6159999999999996e-07, 'completion_length': 152.26786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00365447998046875, 'epoch': 0.64} + 64%|██████▍ | 1596/2500 [9:59:03<5:40:10, 22.58s/it] 64%|██████▍ | 1597/2500 [9:59:25<5:34:48, 22.25s/it] {'loss': 0.0002, 'grad_norm': 0.23855970215354508, 'learning_rate': 3.612e-07, 'completion_length': 
150.01786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0041961669921875, 'epoch': 0.64} + 64%|██████▍ | 1597/2500 [9:59:25<5:34:48, 22.25s/it] 64%|██████▍ | 1598/2500 [9:59:46<5:31:49, 22.07s/it] {'loss': 0.0001, 'grad_norm': 0.018457649255353425, 'learning_rate': 3.608e-07, 'completion_length': 148.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00286102294921875, 'epoch': 0.64} + 64%|██████▍ | 1598/2500 [9:59:47<5:31:49, 22.07s/it] 64%|██████▍ | 1599/2500 [10:00:08<5:29:08, 21.92s/it] {'loss': 0.0002, 'grad_norm': 0.019144229642566065, 'learning_rate': 3.6039999999999997e-07, 'completion_length': 161.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00490570068359375, 'epoch': 0.64} + 64%|██████▍ | 1599/2500 [10:00:08<5:29:08, 21.92s/it] 64%|██████▍ | 1600/2500 [10:00:30<5:30:07, 22.01s/it] {'loss': 0.0003, 'grad_norm': 0.6463156658223221, 'learning_rate': 3.6e-07, 'completion_length': 165.03572845458984, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006683349609375, 'epoch': 0.64} + 64%|██████▍ | 1600/2500 [10:00:30<5:30:07, 22.01s/it] 64%|██████▍ | 1601/2500 [10:03:36<17:47:02, 71.22s/it] {'loss': 0.0002, 'grad_norm': 0.622169266484291, 'learning_rate': 3.5959999999999996e-07, 'completion_length': 163.35714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00435638427734375, 'epoch': 0.64} + 64%|██████▍ | 1601/2500 [10:03:36<17:47:02, 71.22s/it] 64%|██████▍ | 1602/2500 [10:03:53<13:39:45, 54.77s/it] {'loss': 0.0002, 'grad_norm': 0.02932296964299294, 'learning_rate': 3.592e-07, 'completion_length': 
162.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00531005859375, 'epoch': 0.64} + 64%|██████▍ | 1602/2500 [10:03:53<13:39:45, 54.77s/it] 64%|██████▍ | 1603/2500 [10:04:08<10:43:47, 43.06s/it] {'loss': 0.0002, 'grad_norm': 0.6454204394269701, 'learning_rate': 3.588e-07, 'completion_length': 161.25000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0051116943359375, 'epoch': 0.64} + 64%|██████▍ | 1603/2500 [10:04:08<10:43:47, 43.06s/it] 64%|██████▍ | 1604/2500 [10:04:26<8:48:56, 35.42s/it] {'loss': 0.0002, 'grad_norm': 0.025903369819582675, 'learning_rate': 3.584e-07, 'completion_length': 166.87500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0040130615234375, 'epoch': 0.64} + 64%|██████▍ | 1604/2500 [10:04:26<8:48:56, 35.42s/it] 64%|██████▍ | 1605/2500 [10:04:43<7:25:46, 29.88s/it] {'loss': 0.0002, 'grad_norm': 2.9472548627678363, 'learning_rate': 3.5799999999999995e-07, 'completion_length': 163.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00567626953125, 'epoch': 0.64} + 64%|██████▍ | 1605/2500 [10:04:43<7:25:46, 29.88s/it] 64%|██████▍ | 1606/2500 [10:05:00<6:29:07, 26.12s/it] {'loss': 0.0002, 'grad_norm': 0.2925472896633128, 'learning_rate': 3.5759999999999997e-07, 'completion_length': 160.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048828125, 'epoch': 0.64} + 64%|██████▍ | 1606/2500 [10:05:00<6:29:07, 26.12s/it] 64%|██████▍ | 1607/2500 [10:05:22<6:08:50, 24.78s/it] {'loss': 0.0002, 'grad_norm': 0.39341808546686313, 'learning_rate': 3.572e-07, 'completion_length': 
151.67857360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0048980712890625, 'epoch': 0.64} + 64%|██████▍ | 1607/2500 [10:05:22<6:08:50, 24.78s/it] 64%|██████▍ | 1608/2500 [10:05:44<5:55:11, 23.89s/it] {'loss': 0.0003, 'grad_norm': 0.481131069001222, 'learning_rate': 3.5679999999999997e-07, 'completion_length': 167.0178680419922, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.12371791899204254, 'kl': 0.0074920654296875, 'epoch': 0.64} + 64%|██████▍ | 1608/2500 [10:05:44<5:55:11, 23.89s/it] 64%|██████▍ | 1609/2500 [10:06:06<5:46:27, 23.33s/it] {'loss': 0.0002, 'grad_norm': 0.0210458554248035, 'learning_rate': 3.564e-07, 'completion_length': 157.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005645751953125, 'epoch': 0.64} + 64%|██████▍ | 1609/2500 [10:06:06<5:46:27, 23.33s/it] 64%|██████▍ | 1610/2500 [10:06:28<5:39:13, 22.87s/it] {'loss': 0.0002, 'grad_norm': 0.6511081527103765, 'learning_rate': 3.5599999999999996e-07, 'completion_length': 156.8214340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0053558349609375, 'epoch': 0.64} + 64%|██████▍ | 1610/2500 [10:06:28<5:39:13, 22.87s/it] 64%|██████▍ | 1611/2500 [10:06:50<5:35:19, 22.63s/it] {'loss': 0.0002, 'grad_norm': 0.020757662668571895, 'learning_rate': 3.5560000000000003e-07, 'completion_length': 147.75, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005706787109375, 'epoch': 0.64} + 64%|██████▍ | 1611/2500 [10:06:50<5:35:19, 22.63s/it] 64%|██████▍ | 1612/2500 [10:07:11<5:28:57, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.5726509962094503, 'learning_rate': 3.552e-07, 
'completion_length': 148.5714340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.005157470703125, 'epoch': 0.64} + 64%|██████▍ | 1612/2500 [10:07:11<5:28:57, 22.23s/it] 65%|██████▍ | 1613/2500 [10:07:34<5:32:09, 22.47s/it] {'loss': 0.0002, 'grad_norm': 0.028321108342342236, 'learning_rate': 3.548e-07, 'completion_length': 170.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00467681884765625, 'epoch': 0.65} + 65%|██████▍ | 1613/2500 [10:07:34<5:32:09, 22.47s/it] 65%|██████▍ | 1614/2500 [10:07:56<5:27:59, 22.21s/it] {'loss': 0.0004, 'grad_norm': 2.055206582346046, 'learning_rate': 3.544e-07, 'completion_length': 150.12500762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.1071428619325161, 'kl': 0.0098419189453125, 'epoch': 0.65} + 65%|██████▍ | 1614/2500 [10:07:56<5:27:59, 22.21s/it] 65%|██████▍ | 1615/2500 [10:08:17<5:24:00, 21.97s/it] {'loss': 0.0002, 'grad_norm': 0.022801314864980847, 'learning_rate': 3.5399999999999997e-07, 'completion_length': 156.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00395965576171875, 'epoch': 0.65} + 65%|██████▍ | 1615/2500 [10:08:17<5:24:00, 21.97s/it] 65%|██████▍ | 1616/2500 [10:08:39<5:24:10, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.2429387629617505, 'learning_rate': 3.536e-07, 'completion_length': 150.37500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004852294921875, 'epoch': 0.65} + 65%|██████▍ | 1616/2500 [10:08:39<5:24:10, 22.00s/it] 65%|██████▍ | 1617/2500 [10:09:03<5:31:06, 22.50s/it] {'loss': 0.0003, 'grad_norm': 0.20752619698056593, 'learning_rate': 3.532e-07, 'completion_length': 
173.5357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0067138671875, 'epoch': 0.65} + 65%|██████▍ | 1617/2500 [10:09:03<5:31:06, 22.50s/it] 65%|██████▍ | 1618/2500 [10:09:24<5:27:22, 22.27s/it] {'loss': 0.0003, 'grad_norm': 0.019071939970287943, 'learning_rate': 3.528e-07, 'completion_length': 164.05357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0076751708984375, 'epoch': 0.65} + 65%|██████▍ | 1618/2500 [10:09:25<5:27:22, 22.27s/it] 65%|██████▍ | 1619/2500 [10:09:47<5:27:01, 22.27s/it] {'loss': 0.0002, 'grad_norm': 0.02327609286486461, 'learning_rate': 3.5239999999999995e-07, 'completion_length': 161.87500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00620269775390625, 'epoch': 0.65} + 65%|██████▍ | 1619/2500 [10:09:47<5:27:01, 22.27s/it] 65%|██████▍ | 1620/2500 [10:10:09<5:25:47, 22.21s/it] {'loss': 0.0003, 'grad_norm': 0.07540583784498722, 'learning_rate': 3.52e-07, 'completion_length': 155.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0064544677734375, 'epoch': 0.65} + 65%|██████▍ | 1620/2500 [10:10:09<5:25:47, 22.21s/it] 65%|██████▍ | 1621/2500 [10:10:29<5:17:39, 21.68s/it] {'loss': 0.0002, 'grad_norm': 0.023419232947766287, 'learning_rate': 3.516e-07, 'completion_length': 146.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00482177734375, 'epoch': 0.65} + 65%|██████▍ | 1621/2500 [10:10:29<5:17:39, 21.68s/it] 65%|██████▍ | 1622/2500 [10:10:51<5:15:51, 21.59s/it] {'loss': 0.0002, 'grad_norm': 0.023224170347956468, 'learning_rate': 3.512e-07, 'completion_length': 144.4107208251953, 'rewards/accuracy_reward': 
0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00528717041015625, 'epoch': 0.65} + 65%|██████▍ | 1622/2500 [10:10:51<5:15:51, 21.59s/it] 65%|██████▍ | 1623/2500 [10:11:12<5:14:15, 21.50s/it] {'loss': 0.0003, 'grad_norm': 0.5001116560259349, 'learning_rate': 3.508e-07, 'completion_length': 149.6607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0064697265625, 'epoch': 0.65} + 65%|██████▍ | 1623/2500 [10:11:12<5:14:15, 21.50s/it] 65%|██████▍ | 1624/2500 [10:11:34<5:14:13, 21.52s/it] {'loss': 0.0002, 'grad_norm': 0.014801518599191734, 'learning_rate': 3.5039999999999996e-07, 'completion_length': 150.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004669189453125, 'epoch': 0.65} + 65%|██████▍ | 1624/2500 [10:11:34<5:14:13, 21.52s/it] 65%|██████▌ | 1625/2500 [10:11:55<5:14:59, 21.60s/it] {'loss': 0.0002, 'grad_norm': 0.016231285091537435, 'learning_rate': 3.5e-07, 'completion_length': 158.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005706787109375, 'epoch': 0.65} + 65%|██████▌ | 1625/2500 [10:11:55<5:14:59, 21.60s/it] 65%|██████▌ | 1626/2500 [10:12:17<5:13:39, 21.53s/it] {'loss': 0.0003, 'grad_norm': 0.3717490481054568, 'learning_rate': 3.496e-07, 'completion_length': 168.25000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.007904052734375, 'epoch': 0.65} + 65%|██████▌ | 1626/2500 [10:12:17<5:13:39, 21.53s/it] 65%|██████▌ | 1627/2500 [10:12:38<5:13:41, 21.56s/it] {'loss': 0.0003, 'grad_norm': 0.6497032667935898, 'learning_rate': 3.492e-07, 'completion_length': 158.6607208251953, 'rewards/accuracy_reward': 0.9107142984867096, 
'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0070953369140625, 'epoch': 0.65} + 65%|██████▌ | 1627/2500 [10:12:38<5:13:41, 21.56s/it] 65%|██████▌ | 1628/2500 [10:12:59<5:10:43, 21.38s/it] {'loss': 0.0002, 'grad_norm': 0.0453661660930203, 'learning_rate': 3.488e-07, 'completion_length': 153.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004669189453125, 'epoch': 0.65} + 65%|██████▌ | 1628/2500 [10:12:59<5:10:43, 21.38s/it] 65%|██████▌ | 1629/2500 [10:13:20<5:08:21, 21.24s/it] {'loss': 0.0002, 'grad_norm': 0.22479938091793977, 'learning_rate': 3.4839999999999997e-07, 'completion_length': 159.8214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0039825439453125, 'epoch': 0.65} + 65%|██████▌ | 1629/2500 [10:13:20<5:08:21, 21.24s/it] 65%|██████▌ | 1630/2500 [10:13:41<5:05:25, 21.06s/it] {'loss': 0.0002, 'grad_norm': 0.015713885564004666, 'learning_rate': 3.4799999999999994e-07, 'completion_length': 144.5, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00408935546875, 'epoch': 0.65} + 65%|██████▌ | 1630/2500 [10:13:41<5:05:25, 21.06s/it] 65%|██████▌ | 1631/2500 [10:14:01<5:03:01, 20.92s/it] {'loss': 0.0004, 'grad_norm': 0.5212147793013554, 'learning_rate': 3.476e-07, 'completion_length': 146.5, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0094451904296875, 'epoch': 0.65} + 65%|██████▌ | 1631/2500 [10:14:01<5:03:01, 20.92s/it] 65%|██████▌ | 1632/2500 [10:14:24<5:08:24, 21.32s/it] {'loss': 0.0002, 'grad_norm': 0.02540420200003225, 'learning_rate': 3.472e-07, 'completion_length': 169.14286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.004730224609375, 'epoch': 0.65} + 65%|██████▌ | 1632/2500 [10:14:24<5:08:24, 21.32s/it] 65%|██████▌ | 1633/2500 [10:14:45<5:07:10, 21.26s/it] {'loss': 0.0002, 'grad_norm': 0.24461862001069606, 'learning_rate': 3.4679999999999996e-07, 'completion_length': 147.05358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00494384765625, 'epoch': 0.65} + 65%|██████▌ | 1633/2500 [10:14:45<5:07:10, 21.26s/it] 65%|██████▌ | 1634/2500 [10:15:06<5:08:21, 21.36s/it] {'loss': 0.0003, 'grad_norm': 0.025426935912716372, 'learning_rate': 3.464e-07, 'completion_length': 165.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00872802734375, 'epoch': 0.65} + 65%|██████▌ | 1634/2500 [10:15:06<5:08:21, 21.36s/it] 65%|██████▌ | 1635/2500 [10:15:28<5:07:51, 21.35s/it] {'loss': 0.0002, 'grad_norm': 0.016528570318864018, 'learning_rate': 3.4599999999999995e-07, 'completion_length': 152.1607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00450897216796875, 'epoch': 0.65} + 65%|██████▌ | 1635/2500 [10:15:28<5:07:51, 21.35s/it] 65%|██████▌ | 1636/2500 [10:15:51<5:13:45, 21.79s/it] {'loss': 0.0003, 'grad_norm': 0.019777172226729943, 'learning_rate': 3.456e-07, 'completion_length': 160.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006439208984375, 'epoch': 0.65} + 65%|██████▌ | 1636/2500 [10:15:51<5:13:45, 21.79s/it] 65%|██████▌ | 1637/2500 [10:16:12<5:11:30, 21.66s/it] {'loss': 0.0002, 'grad_norm': 0.3054144174152557, 'learning_rate': 3.452e-07, 'completion_length': 150.8214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 
0.0044708251953125, 'epoch': 0.65} + 65%|██████▌ | 1637/2500 [10:16:12<5:11:30, 21.66s/it] 66%|██████▌ | 1638/2500 [10:16:34<5:11:31, 21.68s/it] {'loss': 0.0002, 'grad_norm': 0.26904365550345216, 'learning_rate': 3.4479999999999996e-07, 'completion_length': 157.05357360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00536346435546875, 'epoch': 0.66} + 66%|██████▌ | 1638/2500 [10:16:34<5:11:31, 21.68s/it] 66%|██████▌ | 1639/2500 [10:16:55<5:09:33, 21.57s/it] {'loss': 0.0002, 'grad_norm': 0.017083992774418154, 'learning_rate': 3.444e-07, 'completion_length': 147.2321548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045928955078125, 'epoch': 0.66} + 66%|██████▌ | 1639/2500 [10:16:55<5:09:33, 21.57s/it] 66%|██████▌ | 1640/2500 [10:17:17<5:11:25, 21.73s/it] {'loss': 0.0003, 'grad_norm': 0.026909923369612464, 'learning_rate': 3.4399999999999996e-07, 'completion_length': 147.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0072021484375, 'epoch': 0.66} + 66%|██████▌ | 1640/2500 [10:17:17<5:11:25, 21.73s/it] 66%|██████▌ | 1641/2500 [10:17:39<5:10:35, 21.69s/it] {'loss': 0.0001, 'grad_norm': 0.018188005141497234, 'learning_rate': 3.436e-07, 'completion_length': 147.08928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00333404541015625, 'epoch': 0.66} + 66%|██████▌ | 1641/2500 [10:17:39<5:10:35, 21.69s/it] 66%|██████▌ | 1642/2500 [10:18:01<5:12:09, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.33546535243282105, 'learning_rate': 3.432e-07, 'completion_length': 142.14286041259766, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00472259521484375, 'epoch': 0.66} + 
66%|██████▌ | 1642/2500 [10:18:01<5:12:09, 21.83s/it] 66%|██████▌ | 1643/2500 [10:18:23<5:12:36, 21.89s/it] {'loss': 0.0002, 'grad_norm': 0.031865556286506766, 'learning_rate': 3.4279999999999997e-07, 'completion_length': 161.2678680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0047760009765625, 'epoch': 0.66} + 66%|██████▌ | 1643/2500 [10:18:23<5:12:36, 21.89s/it] 66%|██████▌ | 1644/2500 [10:18:44<5:10:12, 21.74s/it] {'loss': 0.0002, 'grad_norm': 0.3965907679759688, 'learning_rate': 3.4239999999999994e-07, 'completion_length': 154.92858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0052642822265625, 'epoch': 0.66} + 66%|██████▌ | 1644/2500 [10:18:44<5:10:12, 21.74s/it] 66%|██████▌ | 1645/2500 [10:19:06<5:10:55, 21.82s/it] {'loss': 0.0001, 'grad_norm': 0.04135141212343755, 'learning_rate': 3.42e-07, 'completion_length': 144.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003448486328125, 'epoch': 0.66} + 66%|██████▌ | 1645/2500 [10:19:06<5:10:55, 21.82s/it] 66%|██████▌ | 1646/2500 [10:19:28<5:11:52, 21.91s/it] {'loss': 0.0002, 'grad_norm': 4.548394180605552, 'learning_rate': 3.416e-07, 'completion_length': 140.67857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00402069091796875, 'epoch': 0.66} + 66%|██████▌ | 1646/2500 [10:19:28<5:11:52, 21.91s/it] 66%|██████▌ | 1647/2500 [10:19:48<5:03:04, 21.32s/it] {'loss': 0.0001, 'grad_norm': 0.021142536549450825, 'learning_rate': 3.412e-07, 'completion_length': 133.8571548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00336456298828125, 'epoch': 0.66} + 66%|██████▌ | 
1647/2500 [10:19:48<5:03:04, 21.32s/it] 66%|██████▌ | 1648/2500 [10:20:12<5:11:33, 21.94s/it] {'loss': 0.0003, 'grad_norm': 0.9008159548873255, 'learning_rate': 3.408e-07, 'completion_length': 168.82144165039062, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.007568359375, 'epoch': 0.66} + 66%|██████▌ | 1648/2500 [10:20:12<5:11:33, 21.94s/it] 66%|██████▌ | 1649/2500 [10:20:33<5:09:21, 21.81s/it] {'loss': 0.0002, 'grad_norm': 0.2201679246188307, 'learning_rate': 3.4039999999999995e-07, 'completion_length': 141.4107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0040130615234375, 'epoch': 0.66} + 66%|██████▌ | 1649/2500 [10:20:33<5:09:21, 21.81s/it] 66%|██████▌ | 1650/2500 [10:20:55<5:07:23, 21.70s/it] {'loss': 0.0001, 'grad_norm': 0.01567600786320994, 'learning_rate': 3.4000000000000003e-07, 'completion_length': 157.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003040313720703125, 'epoch': 0.66} + 66%|██████▌ | 1650/2500 [10:20:55<5:07:23, 21.70s/it] 66%|██████▌ | 1651/2500 [10:21:16<5:04:29, 21.52s/it] {'loss': 0.0002, 'grad_norm': 0.4888727280649455, 'learning_rate': 3.396e-07, 'completion_length': 158.58929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.00579833984375, 'epoch': 0.66} + 66%|██████▌ | 1651/2500 [10:21:16<5:04:29, 21.52s/it] 66%|██████▌ | 1652/2500 [10:21:37<5:03:20, 21.46s/it] {'loss': 0.0002, 'grad_norm': 0.23548281391596465, 'learning_rate': 3.3919999999999997e-07, 'completion_length': 141.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058135986328125, 'epoch': 0.66} + 66%|██████▌ 
| 1652/2500 [10:21:37<5:03:20, 21.46s/it] 66%|██████▌ | 1653/2500 [10:21:59<5:03:24, 21.49s/it] {'loss': 0.0002, 'grad_norm': 0.018994496232009783, 'learning_rate': 3.388e-07, 'completion_length': 143.2321548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005401611328125, 'epoch': 0.66} + 66%|██████▌ | 1653/2500 [10:21:59<5:03:24, 21.49s/it] 66%|██████▌ | 1654/2500 [10:22:20<5:01:12, 21.36s/it] {'loss': 0.0003, 'grad_norm': 0.3802040369200348, 'learning_rate': 3.3839999999999996e-07, 'completion_length': 165.08929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0062713623046875, 'epoch': 0.66} + 66%|██████▌ | 1654/2500 [10:22:20<5:01:12, 21.36s/it] 66%|██████▌ | 1655/2500 [10:22:43<5:10:40, 22.06s/it] {'loss': 0.0002, 'grad_norm': 0.3145242470652572, 'learning_rate': 3.38e-07, 'completion_length': 179.12501525878906, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695358991622925, 'kl': 0.0057220458984375, 'epoch': 0.66} + 66%|██████▌ | 1655/2500 [10:22:43<5:10:40, 22.06s/it] 66%|██████▌ | 1656/2500 [10:23:04<5:04:45, 21.67s/it] {'loss': 0.0002, 'grad_norm': 0.026291518609815476, 'learning_rate': 3.376e-07, 'completion_length': 139.6964340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005950927734375, 'epoch': 0.66} + 66%|██████▌ | 1656/2500 [10:23:04<5:04:45, 21.67s/it] 66%|██████▋ | 1657/2500 [10:23:29<5:17:16, 22.58s/it] {'loss': 0.0005, 'grad_norm': 0.743302747990573, 'learning_rate': 3.372e-07, 'completion_length': 165.0357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.012298583984375, 'epoch': 0.66} + 
66%|██████▋ | 1657/2500 [10:23:29<5:17:16, 22.58s/it] 66%|██████▋ | 1658/2500 [10:23:51<5:15:11, 22.46s/it] {'loss': 0.0002, 'grad_norm': 0.17941464428443057, 'learning_rate': 3.368e-07, 'completion_length': 158.0357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0055694580078125, 'epoch': 0.66} + 66%|██████▋ | 1658/2500 [10:23:51<5:15:11, 22.46s/it] 66%|██████▋ | 1659/2500 [10:24:13<5:10:54, 22.18s/it] {'loss': 0.0001, 'grad_norm': 0.017749674926534304, 'learning_rate': 3.3639999999999997e-07, 'completion_length': 135.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0032501220703125, 'epoch': 0.66} + 66%|██████▋ | 1659/2500 [10:24:13<5:10:54, 22.18s/it] 66%|██████▋ | 1660/2500 [10:24:34<5:09:27, 22.10s/it] {'loss': 0.0002, 'grad_norm': 0.020881981342402132, 'learning_rate': 3.36e-07, 'completion_length': 152.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00403594970703125, 'epoch': 0.66} + 66%|██████▋ | 1660/2500 [10:24:34<5:09:27, 22.10s/it] 66%|██████▋ | 1661/2500 [10:24:57<5:10:05, 22.18s/it] {'loss': 0.0005, 'grad_norm': 1.714092826328302, 'learning_rate': 3.356e-07, 'completion_length': 170.9464340209961, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.1539071798324585, 'kl': 0.012786865234375, 'epoch': 0.66} + 66%|██████▋ | 1661/2500 [10:24:57<5:10:05, 22.18s/it] 66%|██████▋ | 1662/2500 [10:25:18<5:05:43, 21.89s/it] {'loss': 0.0002, 'grad_norm': 0.3919409132259595, 'learning_rate': 3.352e-07, 'completion_length': 155.83929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005462646484375, 'epoch': 0.66} + 66%|██████▋ | 1662/2500 
[10:25:18<5:05:43, 21.89s/it] 67%|██████▋ | 1663/2500 [10:25:40<5:04:01, 21.79s/it] {'loss': 0.0003, 'grad_norm': 0.058596063590411314, 'learning_rate': 3.3479999999999995e-07, 'completion_length': 159.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0067138671875, 'epoch': 0.67} + 67%|██████▋ | 1663/2500 [10:25:40<5:04:01, 21.79s/it] 67%|██████▋ | 1664/2500 [10:26:02<5:05:45, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.7567426468333509, 'learning_rate': 3.344e-07, 'completion_length': 160.5357208251953, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.00490570068359375, 'epoch': 0.67} + 67%|██████▋ | 1664/2500 [10:26:02<5:05:45, 21.94s/it] 67%|██████▋ | 1665/2500 [10:26:24<5:04:55, 21.91s/it] {'loss': 0.0001, 'grad_norm': 0.014997784219452423, 'learning_rate': 3.34e-07, 'completion_length': 148.375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0031280517578125, 'epoch': 0.67} + 67%|██████▋ | 1665/2500 [10:26:24<5:04:55, 21.91s/it] 67%|██████▋ | 1666/2500 [10:26:46<5:05:21, 21.97s/it] {'loss': 0.0002, 'grad_norm': 0.020988153757140696, 'learning_rate': 3.3359999999999997e-07, 'completion_length': 156.3928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004638671875, 'epoch': 0.67} + 67%|██████▋ | 1666/2500 [10:26:46<5:05:21, 21.97s/it] 67%|██████▋ | 1667/2500 [10:27:07<5:00:21, 21.63s/it] {'loss': 0.0001, 'grad_norm': 0.019151113025991585, 'learning_rate': 3.332e-07, 'completion_length': 141.00000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00293731689453125, 'epoch': 0.67} + 67%|██████▋ | 1667/2500 [10:27:07<5:00:21, 21.63s/it] 67%|██████▋ | 1668/2500 [10:27:31<5:11:10, 
22.44s/it] {'loss': 0.0002, 'grad_norm': 0.35389914424625113, 'learning_rate': 3.3279999999999996e-07, 'completion_length': 167.92857360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00506591796875, 'epoch': 0.67} + 67%|██████▋ | 1668/2500 [10:27:31<5:11:10, 22.44s/it] 67%|██████▋ | 1669/2500 [10:27:53<5:08:38, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.021763607507699324, 'learning_rate': 3.3239999999999993e-07, 'completion_length': 144.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051727294921875, 'epoch': 0.67} + 67%|██████▋ | 1669/2500 [10:27:53<5:08:38, 22.28s/it] 67%|██████▋ | 1670/2500 [10:28:14<5:04:49, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.04755004495047589, 'learning_rate': 3.32e-07, 'completion_length': 156.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045166015625, 'epoch': 0.67} + 67%|██████▋ | 1670/2500 [10:28:14<5:04:49, 22.04s/it] 67%|██████▋ | 1671/2500 [10:28:35<4:59:38, 21.69s/it] {'loss': 0.0002, 'grad_norm': 0.4399790135221658, 'learning_rate': 3.316e-07, 'completion_length': 149.92857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.006134033203125, 'epoch': 0.67} + 67%|██████▋ | 1671/2500 [10:28:35<4:59:38, 21.69s/it] 67%|██████▋ | 1672/2500 [10:28:56<4:54:46, 21.36s/it] {'loss': 0.0001, 'grad_norm': 0.018044808278951587, 'learning_rate': 3.312e-07, 'completion_length': 143.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00368499755859375, 'epoch': 0.67} + 67%|██████▋ | 1672/2500 [10:28:56<4:54:46, 21.36s/it] 67%|██████▋ | 1673/2500 [10:29:18<4:55:42, 21.45s/it] {'loss': 0.0002, 'grad_norm': 
0.013413769471598318, 'learning_rate': 3.3079999999999997e-07, 'completion_length': 153.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003997802734375, 'epoch': 0.67} + 67%|██████▋ | 1673/2500 [10:29:18<4:55:42, 21.45s/it] 67%|██████▋ | 1674/2500 [10:29:39<4:55:00, 21.43s/it] {'loss': 0.0001, 'grad_norm': 0.41478509186568085, 'learning_rate': 3.304e-07, 'completion_length': 142.9107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.003692626953125, 'epoch': 0.67} + 67%|██████▋ | 1674/2500 [10:29:39<4:55:00, 21.43s/it] 67%|██████▋ | 1675/2500 [10:30:00<4:53:54, 21.37s/it] {'loss': 0.0001, 'grad_norm': 0.014182969831752556, 'learning_rate': 3.3e-07, 'completion_length': 140.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0020599365234375, 'epoch': 0.67} + 67%|██████▋ | 1675/2500 [10:30:00<4:53:54, 21.37s/it] 67%|██████▋ | 1676/2500 [10:30:22<4:56:22, 21.58s/it] {'loss': 0.0003, 'grad_norm': 0.021178418757053, 'learning_rate': 3.296e-07, 'completion_length': 144.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007171630859375, 'epoch': 0.67} + 67%|██████▋ | 1676/2500 [10:30:22<4:56:22, 21.58s/it] 67%|██████▋ | 1677/2500 [10:30:43<4:54:33, 21.47s/it] {'loss': 0.0002, 'grad_norm': 0.023221887746337284, 'learning_rate': 3.2919999999999996e-07, 'completion_length': 150.75000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00507354736328125, 'epoch': 0.67} + 67%|██████▋ | 1677/2500 [10:30:43<4:54:33, 21.47s/it] 67%|██████▋ | 1678/2500 [10:31:06<4:58:48, 21.81s/it] {'loss': 0.0002, 'grad_norm': 0.4884934758750149, 'learning_rate': 3.288e-07, 'completion_length': 
151.1964340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00618743896484375, 'epoch': 0.67} + 67%|██████▋ | 1678/2500 [10:31:06<4:58:48, 21.81s/it] 67%|██████▋ | 1679/2500 [10:31:28<5:00:33, 21.97s/it] {'loss': 0.0002, 'grad_norm': 0.32291132850478393, 'learning_rate': 3.284e-07, 'completion_length': 168.8214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0058746337890625, 'epoch': 0.67} + 67%|██████▋ | 1679/2500 [10:31:28<5:00:33, 21.97s/it] 67%|██████▋ | 1680/2500 [10:31:50<5:00:43, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.04164072550443947, 'learning_rate': 3.28e-07, 'completion_length': 144.92858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00479888916015625, 'epoch': 0.67} + 67%|██████▋ | 1680/2500 [10:31:50<5:00:43, 22.00s/it] 67%|██████▋ | 1681/2500 [10:32:12<4:57:09, 21.77s/it] {'loss': 0.0002, 'grad_norm': 0.6580700239208594, 'learning_rate': 3.276e-07, 'completion_length': 152.9107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.004241943359375, 'epoch': 0.67} + 67%|██████▋ | 1681/2500 [10:32:12<4:57:09, 21.77s/it] 67%|██████▋ | 1682/2500 [10:32:33<4:54:59, 21.64s/it] {'loss': 0.0002, 'grad_norm': 0.01575573615834299, 'learning_rate': 3.2719999999999997e-07, 'completion_length': 148.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0044708251953125, 'epoch': 0.67} + 67%|██████▋ | 1682/2500 [10:32:33<4:54:59, 21.64s/it] 67%|██████▋ | 1683/2500 [10:32:54<4:53:44, 21.57s/it] {'loss': 0.0003, 'grad_norm': 0.4960254712954266, 'learning_rate': 3.268e-07, 'completion_length': 153.5178680419922, 
'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0063323974609375, 'epoch': 0.67} + 67%|██████▋ | 1683/2500 [10:32:54<4:53:44, 21.57s/it] 67%|██████▋ | 1684/2500 [10:33:16<4:55:26, 21.72s/it] {'loss': 0.0003, 'grad_norm': 0.02703618741263647, 'learning_rate': 3.264e-07, 'completion_length': 164.26786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00672149658203125, 'epoch': 0.67} + 67%|██████▋ | 1684/2500 [10:33:16<4:55:26, 21.72s/it] 67%|██████▋ | 1685/2500 [10:33:40<5:02:00, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.7220942596798257, 'learning_rate': 3.26e-07, 'completion_length': 165.71428680419922, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0824786126613617, 'kl': 0.0061187744140625, 'epoch': 0.67} + 67%|██████▋ | 1685/2500 [10:33:40<5:02:00, 22.23s/it] 67%|██████▋ | 1686/2500 [10:34:01<4:58:25, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.1672573265872212, 'learning_rate': 3.256e-07, 'completion_length': 155.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005889892578125, 'epoch': 0.67} + 67%|██████▋ | 1686/2500 [10:34:01<4:58:25, 22.00s/it] 67%|██████▋ | 1687/2500 [10:34:23<4:54:44, 21.75s/it] {'loss': 0.0003, 'grad_norm': 0.5260903913459329, 'learning_rate': 3.252e-07, 'completion_length': 145.89286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0078582763671875, 'epoch': 0.67} + 67%|██████▋ | 1687/2500 [10:34:23<4:54:44, 21.75s/it] 68%|██████▊ | 1688/2500 [10:34:46<4:59:31, 22.13s/it] {'loss': 0.0002, 'grad_norm': 0.2046988487633062, 'learning_rate': 3.2479999999999994e-07, 
'completion_length': 150.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050811767578125, 'epoch': 0.68} + 68%|██████▊ | 1688/2500 [10:34:46<4:59:31, 22.13s/it] 68%|██████▊ | 1689/2500 [10:35:09<5:02:30, 22.38s/it] {'loss': 0.0003, 'grad_norm': 0.4325186411383301, 'learning_rate': 3.244e-07, 'completion_length': 154.17858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00634765625, 'epoch': 0.68} + 68%|██████▊ | 1689/2500 [10:35:09<5:02:30, 22.38s/it] 68%|██████▊ | 1690/2500 [10:35:31<5:04:00, 22.52s/it] {'loss': 0.0002, 'grad_norm': 0.031664465073427996, 'learning_rate': 3.24e-07, 'completion_length': 156.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054168701171875, 'epoch': 0.68} + 68%|██████▊ | 1690/2500 [10:35:31<5:04:00, 22.52s/it] 68%|██████▊ | 1691/2500 [10:35:54<5:04:27, 22.58s/it] {'loss': 0.0001, 'grad_norm': 1.6778373554561334, 'learning_rate': 3.2359999999999996e-07, 'completion_length': 153.14286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00325775146484375, 'epoch': 0.68} + 68%|██████▊ | 1691/2500 [10:35:54<5:04:27, 22.58s/it] 68%|██████▊ | 1692/2500 [10:36:15<4:58:20, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.07222064854516431, 'learning_rate': 3.232e-07, 'completion_length': 146.71429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00433349609375, 'epoch': 0.68} + 68%|██████▊ | 1692/2500 [10:36:15<4:58:20, 22.15s/it] 68%|██████▊ | 1693/2500 [10:36:38<4:59:02, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.20796115294003595, 'learning_rate': 3.2279999999999995e-07, 
'completion_length': 174.30358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0041961669921875, 'epoch': 0.68} + 68%|██████▊ | 1693/2500 [10:36:38<4:59:02, 22.23s/it] 68%|██████▊ | 1694/2500 [10:36:59<4:55:48, 22.02s/it] {'loss': 0.0002, 'grad_norm': 0.029682602410959272, 'learning_rate': 3.2240000000000003e-07, 'completion_length': 164.1428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005615234375, 'epoch': 0.68} + 68%|██████▊ | 1694/2500 [10:36:59<4:55:48, 22.02s/it] 68%|██████▊ | 1695/2500 [10:37:20<4:51:52, 21.75s/it] {'loss': 0.0002, 'grad_norm': 0.6294052970532142, 'learning_rate': 3.22e-07, 'completion_length': 148.3214340209961, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0049591064453125, 'epoch': 0.68} + 68%|██████▊ | 1695/2500 [10:37:20<4:51:52, 21.75s/it] 68%|██████▊ | 1696/2500 [10:37:42<4:51:44, 21.77s/it] {'loss': 0.0002, 'grad_norm': 0.1777389582761134, 'learning_rate': 3.2159999999999997e-07, 'completion_length': 152.17858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00579833984375, 'epoch': 0.68} + 68%|██████▊ | 1696/2500 [10:37:42<4:51:44, 21.77s/it] 68%|██████▊ | 1697/2500 [10:38:03<4:48:04, 21.52s/it] {'loss': 0.0003, 'grad_norm': 1.4694016064878606, 'learning_rate': 3.212e-07, 'completion_length': 149.67857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.006500244140625, 'epoch': 0.68} + 68%|██████▊ | 1697/2500 [10:38:03<4:48:04, 21.52s/it] 68%|██████▊ | 1698/2500 [10:38:24<4:46:03, 21.40s/it] {'loss': 0.0002, 'grad_norm': 
0.1811313903155928, 'learning_rate': 3.2079999999999996e-07, 'completion_length': 150.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005950927734375, 'epoch': 0.68} + 68%|██████▊ | 1698/2500 [10:38:24<4:46:03, 21.40s/it] 68%|██████▊ | 1699/2500 [10:38:47<4:49:46, 21.71s/it] {'loss': 0.0002, 'grad_norm': 0.597842194337642, 'learning_rate': 3.204e-07, 'completion_length': 164.67858123779297, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.005523681640625, 'epoch': 0.68} + 68%|██████▊ | 1699/2500 [10:38:47<4:49:46, 21.71s/it] 68%|██████▊ | 1700/2500 [10:39:08<4:48:10, 21.61s/it] {'loss': 0.0002, 'grad_norm': 0.025044401393032854, 'learning_rate': 3.2e-07, 'completion_length': 153.3928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00396728515625, 'epoch': 0.68} + 68%|██████▊ | 1700/2500 [10:39:08<4:48:10, 21.61s/it] 68%|██████▊ | 1701/2500 [10:42:13<15:41:42, 70.72s/it] {'loss': 0.0002, 'grad_norm': 0.024634221751704547, 'learning_rate': 3.196e-07, 'completion_length': 155.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053253173828125, 'epoch': 0.68} + 68%|██████▊ | 1701/2500 [10:42:13<15:41:42, 70.72s/it] 68%|██████▊ | 1702/2500 [10:42:35<12:24:40, 55.99s/it] {'loss': 0.0003, 'grad_norm': 0.028224377370977213, 'learning_rate': 3.1919999999999995e-07, 'completion_length': 155.6428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0079803466796875, 'epoch': 0.68} + 68%|██████▊ | 1702/2500 [10:42:35<12:24:40, 55.99s/it] 68%|██████▊ | 1703/2500 [10:42:56<10:03:10, 45.41s/it] {'loss': 0.0002, 'grad_norm': 0.02175776030452189, 'learning_rate': 3.1879999999999997e-07, 
'completion_length': 148.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00543212890625, 'epoch': 0.68} + 68%|██████▊ | 1703/2500 [10:42:56<10:03:10, 45.41s/it] 68%|██████▊ | 1704/2500 [10:43:18<8:29:34, 38.41s/it] {'loss': 0.0003, 'grad_norm': 0.02158630437484418, 'learning_rate': 3.184e-07, 'completion_length': 176.69644165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0064697265625, 'epoch': 0.68} + 68%|██████▊ | 1704/2500 [10:43:18<8:29:34, 38.41s/it] 68%|██████▊ | 1705/2500 [10:43:40<7:26:34, 33.70s/it] {'loss': 0.0002, 'grad_norm': 1.4237520023546781, 'learning_rate': 3.18e-07, 'completion_length': 163.5178680419922, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.1428571529686451, 'kl': 0.00616455078125, 'epoch': 0.68} + 68%|██████▊ | 1705/2500 [10:43:40<7:26:34, 33.70s/it] 68%|██████▊ | 1706/2500 [10:44:02<6:38:45, 30.13s/it] {'loss': 0.0002, 'grad_norm': 0.027787909165686086, 'learning_rate': 3.176e-07, 'completion_length': 147.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0059051513671875, 'epoch': 0.68} + 68%|██████▊ | 1706/2500 [10:44:02<6:38:45, 30.13s/it] 68%|██████▊ | 1707/2500 [10:44:23<6:02:49, 27.45s/it] {'loss': 0.0002, 'grad_norm': 0.025364518006556716, 'learning_rate': 3.1719999999999996e-07, 'completion_length': 158.4464340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0056610107421875, 'epoch': 0.68} + 68%|██████▊ | 1707/2500 [10:44:23<6:02:49, 27.45s/it] 68%|██████▊ | 1708/2500 [10:44:46<5:44:02, 26.06s/it] {'loss': 0.0003, 'grad_norm': 0.36953863636521905, 'learning_rate': 3.1680000000000003e-07, 'completion_length': 158.3214340209961, 'rewards/accuracy_reward': 
0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007965087890625, 'epoch': 0.68} + 68%|██████▊ | 1708/2500 [10:44:46<5:44:02, 26.06s/it] 68%|██████▊ | 1709/2500 [10:45:08<5:24:55, 24.65s/it] {'loss': 0.0002, 'grad_norm': 0.038787840803456826, 'learning_rate': 3.164e-07, 'completion_length': 151.0178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052947998046875, 'epoch': 0.68} + 68%|██████▊ | 1709/2500 [10:45:08<5:24:55, 24.65s/it] 68%|██████▊ | 1710/2500 [10:45:29<5:12:49, 23.76s/it] {'loss': 0.0001, 'grad_norm': 0.38349199881435353, 'learning_rate': 3.1599999999999997e-07, 'completion_length': 154.26786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0037078857421875, 'epoch': 0.68} + 68%|██████▊ | 1710/2500 [10:45:29<5:12:49, 23.76s/it] 68%|██████▊ | 1711/2500 [10:45:51<5:05:32, 23.24s/it] {'loss': 0.0002, 'grad_norm': 0.218017410936999, 'learning_rate': 3.156e-07, 'completion_length': 161.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059814453125, 'epoch': 0.68} + 68%|██████▊ | 1711/2500 [10:45:51<5:05:32, 23.24s/it] 68%|██████▊ | 1712/2500 [10:46:14<5:01:12, 22.94s/it] {'loss': 0.0002, 'grad_norm': 0.20578237334097707, 'learning_rate': 3.1519999999999996e-07, 'completion_length': 159.25000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0055389404296875, 'epoch': 0.68} + 68%|██████▊ | 1712/2500 [10:46:14<5:01:12, 22.94s/it] 69%|██████▊ | 1713/2500 [10:46:36<5:00:01, 22.87s/it] {'loss': 0.0002, 'grad_norm': 0.4139891225960969, 'learning_rate': 3.148e-07, 'completion_length': 
172.4107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.005645751953125, 'epoch': 0.69} + 69%|██████▊ | 1713/2500 [10:46:36<5:00:01, 22.87s/it] 69%|██████▊ | 1714/2500 [10:46:59<4:58:42, 22.80s/it] {'loss': 0.0003, 'grad_norm': 0.3880617536217989, 'learning_rate': 3.144e-07, 'completion_length': 170.2321548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.006744384765625, 'epoch': 0.69} + 69%|██████▊ | 1714/2500 [10:46:59<4:58:42, 22.80s/it] 69%|██████▊ | 1715/2500 [10:47:21<4:55:40, 22.60s/it] {'loss': 0.0003, 'grad_norm': 0.6359004000834929, 'learning_rate': 3.14e-07, 'completion_length': 173.55358123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.007354736328125, 'epoch': 0.69} + 69%|██████▊ | 1715/2500 [10:47:21<4:55:40, 22.60s/it] 69%|██████▊ | 1716/2500 [10:47:43<4:52:46, 22.41s/it] {'loss': 0.0002, 'grad_norm': 0.8002777973949287, 'learning_rate': 3.1359999999999995e-07, 'completion_length': 145.98214721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.00445556640625, 'epoch': 0.69} + 69%|██████▊ | 1716/2500 [10:47:43<4:52:46, 22.41s/it] 69%|██████▊ | 1717/2500 [10:48:06<4:55:58, 22.68s/it] {'loss': 0.0002, 'grad_norm': 0.030336499361570863, 'learning_rate': 3.1319999999999997e-07, 'completion_length': 152.37500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00502777099609375, 'epoch': 0.69} + 69%|██████▊ | 1717/2500 [10:48:06<4:55:58, 22.68s/it] 69%|██████▊ | 1718/2500 [10:48:27<4:49:44, 22.23s/it] {'loss': 0.0002, 'grad_norm': 
0.05277990478063234, 'learning_rate': 3.128e-07, 'completion_length': 143.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0055084228515625, 'epoch': 0.69} + 69%|██████▊ | 1718/2500 [10:48:27<4:49:44, 22.23s/it] 69%|██████▉ | 1719/2500 [10:48:49<4:46:51, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.07576593448410866, 'learning_rate': 3.124e-07, 'completion_length': 158.375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00412750244140625, 'epoch': 0.69} + 69%|██████▉ | 1719/2500 [10:48:49<4:46:51, 22.04s/it] 69%|██████▉ | 1720/2500 [10:49:11<4:46:00, 22.00s/it] {'loss': 0.0002, 'grad_norm': 0.6689707751130457, 'learning_rate': 3.12e-07, 'completion_length': 163.73214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059967041015625, 'epoch': 0.69} + 69%|██████▉ | 1720/2500 [10:49:11<4:46:00, 22.00s/it] 69%|██████▉ | 1721/2500 [10:49:33<4:44:53, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.2739209538325343, 'learning_rate': 3.1159999999999996e-07, 'completion_length': 158.05358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00567626953125, 'epoch': 0.69} + 69%|██████▉ | 1721/2500 [10:49:33<4:44:53, 21.94s/it] 69%|██████▉ | 1722/2500 [10:49:55<4:45:30, 22.02s/it] {'loss': 0.0002, 'grad_norm': 0.4137065186725948, 'learning_rate': 3.112e-07, 'completion_length': 166.14286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.004913330078125, 'epoch': 0.69} + 69%|██████▉ | 1722/2500 [10:49:55<4:45:30, 22.02s/it] 69%|██████▉ | 1723/2500 [10:50:17<4:45:20, 22.03s/it] {'loss': 0.0002, 'grad_norm': 0.03217930372244392, 
'learning_rate': 3.108e-07, 'completion_length': 169.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005615234375, 'epoch': 0.69} + 69%|██████▉ | 1723/2500 [10:50:17<4:45:20, 22.03s/it] 69%|██████▉ | 1724/2500 [10:50:40<4:46:51, 22.18s/it] {'loss': 0.0001, 'grad_norm': 0.013052701121417709, 'learning_rate': 3.104e-07, 'completion_length': 150.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00362396240234375, 'epoch': 0.69} + 69%|██████▉ | 1724/2500 [10:50:40<4:46:51, 22.18s/it] 69%|██████▉ | 1725/2500 [10:51:01<4:43:00, 21.91s/it] {'loss': 0.0002, 'grad_norm': 0.17392610650293602, 'learning_rate': 3.1e-07, 'completion_length': 156.80358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00543212890625, 'epoch': 0.69} + 69%|██████▉ | 1725/2500 [10:51:01<4:43:00, 21.91s/it] 69%|██████▉ | 1726/2500 [10:51:23<4:43:40, 21.99s/it] {'loss': 0.0003, 'grad_norm': 0.04322334826483624, 'learning_rate': 3.0959999999999997e-07, 'completion_length': 162.73214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00823974609375, 'epoch': 0.69} + 69%|██████▉ | 1726/2500 [10:51:23<4:43:40, 21.99s/it] 69%|██████▉ | 1727/2500 [10:51:46<4:47:11, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.43291953804455485, 'learning_rate': 3.0919999999999994e-07, 'completion_length': 176.05358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0058441162109375, 'epoch': 0.69} + 69%|██████▉ | 1727/2500 [10:51:46<4:47:11, 22.29s/it] 69%|██████▉ | 1728/2500 [10:52:08<4:43:45, 22.05s/it] {'loss': 0.0002, 'grad_norm': 
0.11965332185690743, 'learning_rate': 3.088e-07, 'completion_length': 156.17858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058746337890625, 'epoch': 0.69} + 69%|██████▉ | 1728/2500 [10:52:08<4:43:45, 22.05s/it] 69%|██████▉ | 1729/2500 [10:52:31<4:47:50, 22.40s/it] {'loss': 0.0003, 'grad_norm': 0.2966157025374653, 'learning_rate': 3.084e-07, 'completion_length': 181.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0086669921875, 'epoch': 0.69} + 69%|██████▉ | 1729/2500 [10:52:31<4:47:50, 22.40s/it] 69%|██████▉ | 1730/2500 [10:52:51<4:40:44, 21.88s/it] {'loss': 0.0002, 'grad_norm': 0.08497169029825774, 'learning_rate': 3.08e-07, 'completion_length': 146.9821548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0042266845703125, 'epoch': 0.69} + 69%|██████▉ | 1730/2500 [10:52:51<4:40:44, 21.88s/it] 69%|██████▉ | 1731/2500 [10:53:12<4:35:04, 21.46s/it] {'loss': 0.0001, 'grad_norm': 0.2664678552458348, 'learning_rate': 3.076e-07, 'completion_length': 138.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00323486328125, 'epoch': 0.69} + 69%|██████▉ | 1731/2500 [10:53:12<4:35:04, 21.46s/it] 69%|██████▉ | 1732/2500 [10:53:35<4:39:55, 21.87s/it] {'loss': 0.0002, 'grad_norm': 0.015429553065526262, 'learning_rate': 3.0719999999999995e-07, 'completion_length': 158.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00377655029296875, 'epoch': 0.69} + 69%|██████▉ | 1732/2500 [10:53:35<4:39:55, 21.87s/it] 69%|██████▉ | 1733/2500 [10:53:57<4:41:30, 22.02s/it] {'loss': 0.0003, 'grad_norm': 1.8086487460862966, 'learning_rate': 
3.068e-07, 'completion_length': 160.4464340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0068511962890625, 'epoch': 0.69} + 69%|██████▉ | 1733/2500 [10:53:57<4:41:30, 22.02s/it] 69%|██████▉ | 1734/2500 [10:54:19<4:40:20, 21.96s/it] {'loss': 0.0002, 'grad_norm': 1.0296920599410169, 'learning_rate': 3.064e-07, 'completion_length': 154.35714721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.006134033203125, 'epoch': 0.69} + 69%|██████▉ | 1734/2500 [10:54:19<4:40:20, 21.96s/it] 69%|██████▉ | 1735/2500 [10:54:40<4:38:16, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.022184762432193812, 'learning_rate': 3.0599999999999996e-07, 'completion_length': 152.0714340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0045318603515625, 'epoch': 0.69} + 69%|██████▉ | 1735/2500 [10:54:40<4:38:16, 21.83s/it] 69%|██████▉ | 1736/2500 [10:55:02<4:37:22, 21.78s/it] {'loss': 0.0003, 'grad_norm': 0.31422532928362407, 'learning_rate': 3.056e-07, 'completion_length': 159.4464340209961, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0072479248046875, 'epoch': 0.69} + 69%|██████▉ | 1736/2500 [10:55:02<4:37:22, 21.78s/it] 69%|██████▉ | 1737/2500 [10:55:26<4:43:11, 22.27s/it] {'loss': 0.0003, 'grad_norm': 0.02777894878365479, 'learning_rate': 3.052e-07, 'completion_length': 172.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0070648193359375, 'epoch': 0.69} + 69%|██████▉ | 1737/2500 [10:55:26<4:43:11, 22.27s/it] 70%|██████▉ | 1738/2500 [10:55:48<4:42:33, 22.25s/it] {'loss': 0.0001, 'grad_norm': 0.013715396189831718, 
'learning_rate': 3.048e-07, 'completion_length': 144.3928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00357818603515625, 'epoch': 0.7} + 70%|██████▉ | 1738/2500 [10:55:48<4:42:33, 22.25s/it] 70%|██████▉ | 1739/2500 [10:56:09<4:40:09, 22.09s/it] {'loss': 0.0003, 'grad_norm': 0.024163213783552167, 'learning_rate': 3.044e-07, 'completion_length': 161.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006622314453125, 'epoch': 0.7} + 70%|██████▉ | 1739/2500 [10:56:09<4:40:09, 22.09s/it] 70%|██████▉ | 1740/2500 [10:56:31<4:39:30, 22.07s/it] {'loss': 0.0002, 'grad_norm': 0.4476744724346675, 'learning_rate': 3.0399999999999997e-07, 'completion_length': 166.10714721679688, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.00574493408203125, 'epoch': 0.7} + 70%|██████▉ | 1740/2500 [10:56:31<4:39:30, 22.07s/it] 70%|██████▉ | 1741/2500 [10:56:54<4:40:27, 22.17s/it] {'loss': 0.0002, 'grad_norm': 0.1792519473288559, 'learning_rate': 3.036e-07, 'completion_length': 160.0178680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0050201416015625, 'epoch': 0.7} + 70%|██████▉ | 1741/2500 [10:56:54<4:40:27, 22.17s/it] 70%|██████▉ | 1742/2500 [10:57:17<4:44:32, 22.52s/it] {'loss': 0.0002, 'grad_norm': 0.033944355644768595, 'learning_rate': 3.032e-07, 'completion_length': 171.44644165039062, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0052642822265625, 'epoch': 0.7} + 70%|██████▉ | 1742/2500 [10:57:17<4:44:32, 22.52s/it] 70%|██████▉ | 1743/2500 [10:57:41<4:48:32, 22.87s/it] {'loss': 0.0003, 'grad_norm': 0.4714276292281754, 'learning_rate': 3.028e-07, 
'completion_length': 170.25000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.006256103515625, 'epoch': 0.7} + 70%|██████▉ | 1743/2500 [10:57:41<4:48:32, 22.87s/it] 70%|██████▉ | 1744/2500 [10:58:04<4:49:12, 22.95s/it] {'loss': 0.0004, 'grad_norm': 0.7677872860423063, 'learning_rate': 3.024e-07, 'completion_length': 170.1964340209961, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.009613037109375, 'epoch': 0.7} + 70%|██████▉ | 1744/2500 [10:58:04<4:49:12, 22.95s/it] 70%|██████▉ | 1745/2500 [10:58:26<4:45:55, 22.72s/it] {'loss': 0.0003, 'grad_norm': 0.03686054127430348, 'learning_rate': 3.02e-07, 'completion_length': 166.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0066375732421875, 'epoch': 0.7} + 70%|██████▉ | 1745/2500 [10:58:26<4:45:55, 22.72s/it] 70%|██████▉ | 1746/2500 [10:58:48<4:40:52, 22.35s/it] {'loss': 0.0001, 'grad_norm': 0.015461385620249353, 'learning_rate': 3.0159999999999995e-07, 'completion_length': 149.67857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00353240966796875, 'epoch': 0.7} + 70%|██████▉ | 1746/2500 [10:58:48<4:40:52, 22.35s/it] 70%|██████▉ | 1747/2500 [10:59:10<4:42:07, 22.48s/it] {'loss': 0.0003, 'grad_norm': 0.02050949484149749, 'learning_rate': 3.012e-07, 'completion_length': 156.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00628662109375, 'epoch': 0.7} + 70%|██████▉ | 1747/2500 [10:59:10<4:42:07, 22.48s/it] 70%|██████▉ | 1748/2500 [10:59:32<4:37:50, 22.17s/it] {'loss': 0.0002, 'grad_norm': 0.03490311919347739, 'learning_rate': 3.008e-07, 'completion_length': 140.8214340209961, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0041351318359375, 'epoch': 0.7} + 70%|██████▉ | 1748/2500 [10:59:32<4:37:50, 22.17s/it] 70%|██████▉ | 1749/2500 [10:59:53<4:32:01, 21.73s/it] {'loss': 0.0002, 'grad_norm': 0.44687967764218395, 'learning_rate': 3.0039999999999996e-07, 'completion_length': 138.48215103149414, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00457000732421875, 'epoch': 0.7} + 70%|██████▉ | 1749/2500 [10:59:53<4:32:01, 21.73s/it] 70%|███████ | 1750/2500 [11:00:14<4:29:42, 21.58s/it] {'loss': 0.0002, 'grad_norm': 0.32079964398569877, 'learning_rate': 3e-07, 'completion_length': 142.9107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005340576171875, 'epoch': 0.7} + 70%|███████ | 1750/2500 [11:00:14<4:29:42, 21.58s/it] 70%|███████ | 1751/2500 [11:00:36<4:31:54, 21.78s/it] {'loss': 0.0002, 'grad_norm': 0.042852429376865925, 'learning_rate': 2.9959999999999996e-07, 'completion_length': 157.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052642822265625, 'epoch': 0.7} + 70%|███████ | 1751/2500 [11:00:36<4:31:54, 21.78s/it] 70%|███████ | 1752/2500 [11:00:59<4:35:52, 22.13s/it] {'loss': 0.0003, 'grad_norm': 0.018058306359367858, 'learning_rate': 2.9920000000000003e-07, 'completion_length': 160.12500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0064849853515625, 'epoch': 0.7} + 70%|███████ | 1752/2500 [11:00:59<4:35:52, 22.13s/it] 70%|███████ | 1753/2500 [11:01:21<4:35:42, 22.14s/it] {'loss': 0.0002, 'grad_norm': 0.02325726966902939, 'learning_rate': 2.988e-07, 'completion_length': 160.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.0056304931640625, 'epoch': 0.7} + 70%|███████ | 1753/2500 [11:01:21<4:35:42, 22.14s/it] 70%|███████ | 1754/2500 [11:01:43<4:33:37, 22.01s/it] {'loss': 0.0002, 'grad_norm': 0.02406190618637808, 'learning_rate': 2.9839999999999997e-07, 'completion_length': 158.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0047607421875, 'epoch': 0.7} + 70%|███████ | 1754/2500 [11:01:43<4:33:37, 22.01s/it] 70%|███████ | 1755/2500 [11:02:05<4:33:16, 22.01s/it] {'loss': 0.0002, 'grad_norm': 0.02054390058672208, 'learning_rate': 2.98e-07, 'completion_length': 158.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0055389404296875, 'epoch': 0.7} + 70%|███████ | 1755/2500 [11:02:05<4:33:16, 22.01s/it] 70%|███████ | 1756/2500 [11:02:26<4:31:18, 21.88s/it] {'loss': 0.0001, 'grad_norm': 0.0190788711166424, 'learning_rate': 2.9759999999999996e-07, 'completion_length': 164.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0034332275390625, 'epoch': 0.7} + 70%|███████ | 1756/2500 [11:02:27<4:31:18, 21.88s/it] 70%|███████ | 1757/2500 [11:02:49<4:33:47, 22.11s/it] {'loss': 0.0003, 'grad_norm': 0.31812478401353766, 'learning_rate': 2.972e-07, 'completion_length': 166.5357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0067596435546875, 'epoch': 0.7} + 70%|███████ | 1757/2500 [11:02:49<4:33:47, 22.11s/it] 70%|███████ | 1758/2500 [11:03:12<4:36:40, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.6331913989544211, 'learning_rate': 2.968e-07, 'completion_length': 143.5357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00447845458984375, 'epoch': 0.7}
+ 70%|███████ | 1758/2500 [11:03:12<4:36:40, 22.37s/it] 70%|███████ | 1759/2500 [11:03:35<4:38:03, 22.52s/it] {'loss': 0.0002, 'grad_norm': 0.27941748958635676, 'learning_rate': 2.964e-07, 'completion_length': 160.00000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00566864013671875, 'epoch': 0.7} + 70%|███████ | 1759/2500 [11:03:35<4:38:03, 22.52s/it] 70%|███████ | 1760/2500 [11:03:57<4:37:24, 22.49s/it] {'loss': 0.0002, 'grad_norm': 0.015727767211364693, 'learning_rate': 2.9599999999999995e-07, 'completion_length': 161.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00403594970703125, 'epoch': 0.7} + 70%|███████ | 1760/2500 [11:03:57<4:37:24, 22.49s/it] 70%|███████ | 1761/2500 [11:04:19<4:35:08, 22.34s/it] {'loss': 0.0004, 'grad_norm': 1.2483185310383391, 'learning_rate': 2.9559999999999997e-07, 'completion_length': 161.58929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.01080322265625, 'epoch': 0.7} + 70%|███████ | 1761/2500 [11:04:19<4:35:08, 22.34s/it] 70%|███████ | 1762/2500 [11:04:41<4:33:43, 22.25s/it] {'loss': 0.0002, 'grad_norm': 0.03739390480741309, 'learning_rate': 2.952e-07, 'completion_length': 163.12500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00557708740234375, 'epoch': 0.7} + 70%|███████ | 1762/2500 [11:04:41<4:33:43, 22.25s/it] 71%|███████ | 1763/2500 [11:05:04<4:33:03, 22.23s/it] {'loss': 0.0002, 'grad_norm': 0.9878439810467562, 'learning_rate': 2.948e-07, 'completion_length': 156.67858123779297, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0052490234375, 'epoch': 
0.71} + 71%|███████ | 1763/2500 [11:05:04<4:33:03, 22.23s/it] 71%|███████ | 1764/2500 [11:05:25<4:30:22, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.03358201570425201, 'learning_rate': 2.944e-07, 'completion_length': 160.78572845458984, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048065185546875, 'epoch': 0.71} + 71%|███████ | 1764/2500 [11:05:25<4:30:22, 22.04s/it] 71%|███████ | 1765/2500 [11:05:48<4:32:04, 22.21s/it] {'loss': 0.0002, 'grad_norm': 0.023143684428121477, 'learning_rate': 2.9399999999999996e-07, 'completion_length': 143.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045318603515625, 'epoch': 0.71} + 71%|███████ | 1765/2500 [11:05:48<4:32:04, 22.21s/it] 71%|███████ | 1766/2500 [11:06:10<4:30:11, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.016125843062076914, 'learning_rate': 2.9360000000000003e-07, 'completion_length': 155.7678680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0059967041015625, 'epoch': 0.71} + 71%|███████ | 1766/2500 [11:06:10<4:30:11, 22.09s/it] 71%|███████ | 1767/2500 [11:06:32<4:32:01, 22.27s/it] {'loss': 0.0004, 'grad_norm': 0.022246619995543566, 'learning_rate': 2.932e-07, 'completion_length': 166.89286041259766, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.01043701171875, 'epoch': 0.71} + 71%|███████ | 1767/2500 [11:06:32<4:32:01, 22.27s/it] 71%|███████ | 1768/2500 [11:06:55<4:32:29, 22.34s/it] {'loss': 0.0002, 'grad_norm': 0.2610039587643328, 'learning_rate': 2.928e-07, 'completion_length': 162.48214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0051422119140625, 'epoch': 0.71} + 71%|███████ | 1768/2500 
[11:06:55<4:32:29, 22.34s/it] 71%|███████ | 1769/2500 [11:07:17<4:30:03, 22.17s/it] {'loss': 0.0003, 'grad_norm': 0.02373002477954463, 'learning_rate': 2.924e-07, 'completion_length': 167.8928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.006317138671875, 'epoch': 0.71} + 71%|███████ | 1769/2500 [11:07:17<4:30:03, 22.17s/it] 71%|███████ | 1770/2500 [11:07:37<4:24:14, 21.72s/it] {'loss': 0.0002, 'grad_norm': 0.05376010554005764, 'learning_rate': 2.9199999999999997e-07, 'completion_length': 142.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00487518310546875, 'epoch': 0.71} + 71%|███████ | 1770/2500 [11:07:37<4:24:14, 21.72s/it] 71%|███████ | 1771/2500 [11:07:59<4:22:44, 21.62s/it] {'loss': 0.0002, 'grad_norm': 0.22962330709422304, 'learning_rate': 2.916e-07, 'completion_length': 143.2678680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0057220458984375, 'epoch': 0.71} + 71%|███████ | 1771/2500 [11:07:59<4:22:44, 21.62s/it] 71%|███████ | 1772/2500 [11:08:21<4:23:32, 21.72s/it] {'loss': 0.0003, 'grad_norm': 0.6085727319370511, 'learning_rate': 2.912e-07, 'completion_length': 155.48214721679688, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.006744384765625, 'epoch': 0.71} + 71%|███████ | 1772/2500 [11:08:21<4:23:32, 21.72s/it] 71%|███████ | 1773/2500 [11:08:43<4:24:30, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.017998690471201107, 'learning_rate': 2.908e-07, 'completion_length': 164.67858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00418853759765625, 'epoch': 0.71} + 71%|███████ | 1773/2500 
[11:08:43<4:24:30, 21.83s/it] 71%|███████ | 1774/2500 [11:09:05<4:25:14, 21.92s/it] {'loss': 0.0002, 'grad_norm': 0.5203303972132544, 'learning_rate': 2.9039999999999995e-07, 'completion_length': 167.10714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0050201416015625, 'epoch': 0.71} + 71%|███████ | 1774/2500 [11:09:05<4:25:14, 21.92s/it] 71%|███████ | 1775/2500 [11:09:27<4:27:21, 22.13s/it] {'loss': 0.0002, 'grad_norm': 0.17865493096889326, 'learning_rate': 2.9e-07, 'completion_length': 153.89286041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.004791259765625, 'epoch': 0.71} + 71%|███████ | 1775/2500 [11:09:27<4:27:21, 22.13s/it] 71%|███████ | 1776/2500 [11:09:50<4:26:52, 22.12s/it] {'loss': 0.0002, 'grad_norm': 0.05982804886749132, 'learning_rate': 2.896e-07, 'completion_length': 149.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004913330078125, 'epoch': 0.71} + 71%|███████ | 1776/2500 [11:09:50<4:26:52, 22.12s/it] 71%|███████ | 1777/2500 [11:10:12<4:28:00, 22.24s/it] {'loss': 0.0002, 'grad_norm': 0.01698085937616258, 'learning_rate': 2.892e-07, 'completion_length': 155.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0041656494140625, 'epoch': 0.71} + 71%|███████ | 1777/2500 [11:10:12<4:28:00, 22.24s/it] 71%|███████ | 1778/2500 [11:10:35<4:28:57, 22.35s/it] {'loss': 0.0002, 'grad_norm': 0.19788710620807545, 'learning_rate': 2.888e-07, 'completion_length': 160.37500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00604248046875, 'epoch': 0.71} + 71%|███████ | 1778/2500 [11:10:35<4:28:57, 
22.35s/it] 71%|███████ | 1779/2500 [11:10:56<4:25:27, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.9675944923736826, 'learning_rate': 2.8839999999999996e-07, 'completion_length': 158.58929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0043182373046875, 'epoch': 0.71} + 71%|███████ | 1779/2500 [11:10:56<4:25:27, 22.09s/it] 71%|███████ | 1780/2500 [11:11:19<4:26:51, 22.24s/it] {'loss': 0.0003, 'grad_norm': 0.3947111702826691, 'learning_rate': 2.88e-07, 'completion_length': 160.75, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0062713623046875, 'epoch': 0.71} + 71%|███████ | 1780/2500 [11:11:19<4:26:51, 22.24s/it] 71%|███████ | 1781/2500 [11:11:40<4:23:27, 21.99s/it] {'loss': 0.0003, 'grad_norm': 0.0310848124236155, 'learning_rate': 2.876e-07, 'completion_length': 166.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0068206787109375, 'epoch': 0.71} + 71%|███████ | 1781/2500 [11:11:40<4:23:27, 21.99s/it] 71%|███████▏ | 1782/2500 [11:12:02<4:23:56, 22.06s/it] {'loss': 0.0002, 'grad_norm': 0.018567079441170604, 'learning_rate': 2.872e-07, 'completion_length': 175.42858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.006011962890625, 'epoch': 0.71} + 71%|███████▏ | 1782/2500 [11:12:02<4:23:56, 22.06s/it] 71%|███████▏ | 1783/2500 [11:12:25<4:24:49, 22.16s/it] {'loss': 0.0002, 'grad_norm': 0.15385817470929844, 'learning_rate': 2.868e-07, 'completion_length': 150.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0055084228515625, 'epoch': 0.71} + 71%|███████▏ | 1783/2500 
[11:12:25<4:24:49, 22.16s/it] 71%|███████▏ | 1784/2500 [11:12:46<4:22:26, 21.99s/it] {'loss': 0.0003, 'grad_norm': 0.027577769614200894, 'learning_rate': 2.8639999999999997e-07, 'completion_length': 167.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0074615478515625, 'epoch': 0.71} + 71%|███████▏ | 1784/2500 [11:12:46<4:22:26, 21.99s/it] 71%|███████▏ | 1785/2500 [11:13:09<4:23:34, 22.12s/it] {'loss': 0.0002, 'grad_norm': 0.1441040597161544, 'learning_rate': 2.8599999999999994e-07, 'completion_length': 154.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0040435791015625, 'epoch': 0.71} + 71%|███████▏ | 1785/2500 [11:13:09<4:23:34, 22.12s/it] 71%|███████▏ | 1786/2500 [11:13:30<4:20:53, 21.92s/it] {'loss': 0.0002, 'grad_norm': 2.913916634008109, 'learning_rate': 2.856e-07, 'completion_length': 148.50000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005523681640625, 'epoch': 0.71} + 71%|███████▏ | 1786/2500 [11:13:30<4:20:53, 21.92s/it] 71%|███████▏ | 1787/2500 [11:13:52<4:19:49, 21.86s/it] {'loss': 0.0002, 'grad_norm': 0.4949682781544798, 'learning_rate': 2.852e-07, 'completion_length': 156.35714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.0054168701171875, 'epoch': 0.71} + 71%|███████▏ | 1787/2500 [11:13:52<4:19:49, 21.86s/it] 72%|███████▏ | 1788/2500 [11:14:13<4:15:39, 21.54s/it] {'loss': 0.0001, 'grad_norm': 0.02352546958038029, 'learning_rate': 2.848e-07, 'completion_length': 149.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00302886962890625, 'epoch': 0.72} + 72%|███████▏ | 1788/2500 [11:14:13<4:15:39, 21.54s/it] 72%|███████▏ 
| 1789/2500 [11:14:35<4:17:45, 21.75s/it] {'loss': 0.0002, 'grad_norm': 0.024083983350527322, 'learning_rate': 2.844e-07, 'completion_length': 163.26786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0040130615234375, 'epoch': 0.72} + 72%|███████▏ | 1789/2500 [11:14:35<4:17:45, 21.75s/it] 72%|███████▏ | 1790/2500 [11:14:57<4:17:57, 21.80s/it] {'loss': 0.0002, 'grad_norm': 2.37448152498569, 'learning_rate': 2.8399999999999995e-07, 'completion_length': 152.6964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0042724609375, 'epoch': 0.72} + 72%|███████▏ | 1790/2500 [11:14:57<4:17:57, 21.80s/it] 72%|███████▏ | 1791/2500 [11:15:19<4:18:33, 21.88s/it] {'loss': 0.0002, 'grad_norm': 0.4256304628378134, 'learning_rate': 2.836e-07, 'completion_length': 142.58928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0053558349609375, 'epoch': 0.72} + 72%|███████▏ | 1791/2500 [11:15:19<4:18:33, 21.88s/it] 72%|███████▏ | 1792/2500 [11:15:40<4:14:26, 21.56s/it] {'loss': 0.0002, 'grad_norm': 0.02328506918891171, 'learning_rate': 2.832e-07, 'completion_length': 138.5, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052947998046875, 'epoch': 0.72} + 72%|███████▏ | 1792/2500 [11:15:40<4:14:26, 21.56s/it] 72%|███████▏ | 1793/2500 [11:16:01<4:13:45, 21.54s/it] {'loss': 0.0003, 'grad_norm': 0.501997807918209, 'learning_rate': 2.8279999999999996e-07, 'completion_length': 154.37500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00682830810546875, 'epoch': 0.72} + 72%|███████▏ | 1793/2500 [11:16:01<4:13:45, 
21.54s/it] 72%|███████▏ | 1794/2500 [11:16:23<4:13:28, 21.54s/it] {'loss': 0.0002, 'grad_norm': 0.04944384937893587, 'learning_rate': 2.824e-07, 'completion_length': 151.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005950927734375, 'epoch': 0.72} + 72%|███████▏ | 1794/2500 [11:16:23<4:13:28, 21.54s/it] 72%|███████▏ | 1795/2500 [11:16:46<4:18:31, 22.00s/it] {'loss': 0.0003, 'grad_norm': 0.2720886232204405, 'learning_rate': 2.8199999999999996e-07, 'completion_length': 166.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00689697265625, 'epoch': 0.72} + 72%|███████▏ | 1795/2500 [11:16:46<4:18:31, 22.00s/it] 72%|███████▏ | 1796/2500 [11:17:08<4:20:14, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.4711620435620274, 'learning_rate': 2.816e-07, 'completion_length': 166.0178680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0045928955078125, 'epoch': 0.72} + 72%|███████▏ | 1796/2500 [11:17:09<4:20:14, 22.18s/it] 72%|███████▏ | 1797/2500 [11:17:30<4:17:16, 21.96s/it] {'loss': 0.0003, 'grad_norm': 1.0686668483725699, 'learning_rate': 2.812e-07, 'completion_length': 153.55357360839844, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.1428571492433548, 'kl': 0.00689697265625, 'epoch': 0.72} + 72%|███████▏ | 1797/2500 [11:17:30<4:17:16, 21.96s/it] 72%|███████▏ | 1798/2500 [11:17:53<4:19:18, 22.16s/it] {'loss': 0.0002, 'grad_norm': 0.020421196314548078, 'learning_rate': 2.8079999999999997e-07, 'completion_length': 157.58929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0060577392578125, 'epoch': 0.72} + 72%|███████▏ | 
1798/2500 [11:17:53<4:19:18, 22.16s/it] 72%|███████▏ | 1799/2500 [11:18:14<4:17:14, 22.02s/it] {'loss': 0.0002, 'grad_norm': 0.02185542805881339, 'learning_rate': 2.804e-07, 'completion_length': 155.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006011962890625, 'epoch': 0.72} + 72%|███████▏ | 1799/2500 [11:18:14<4:17:14, 22.02s/it] 72%|███████▏ | 1800/2500 [11:18:37<4:20:07, 22.30s/it] {'loss': 0.0002, 'grad_norm': 0.25889793072609685, 'learning_rate': 2.8e-07, 'completion_length': 158.9464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005401611328125, 'epoch': 0.72} + 72%|███████▏ | 1800/2500 [11:18:37<4:20:07, 22.30s/it] 72%|███████▏ | 1801/2500 [11:21:33<13:15:33, 68.29s/it] {'loss': 0.0002, 'grad_norm': 0.02181546304119289, 'learning_rate': 2.796e-07, 'completion_length': 163.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00382232666015625, 'epoch': 0.72} + 72%|███████▏ | 1801/2500 [11:21:33<13:15:33, 68.29s/it] 72%|███████▏ | 1802/2500 [11:21:49<10:12:59, 52.69s/it] {'loss': 0.0002, 'grad_norm': 0.2994971055166227, 'learning_rate': 2.792e-07, 'completion_length': 158.7857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0051727294921875, 'epoch': 0.72} + 72%|███████▏ | 1802/2500 [11:21:49<10:12:59, 52.69s/it] 72%|███████▏ | 1803/2500 [11:22:06<8:07:59, 42.01s/it] {'loss': 0.0003, 'grad_norm': 0.7201326291845352, 'learning_rate': 2.788e-07, 'completion_length': 153.71428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.007232666015625, 'epoch': 0.72} + 72%|███████▏ | 1803/2500 
[11:22:06<8:07:59, 42.01s/it] 72%|███████▏ | 1804/2500 [11:22:22<6:35:00, 34.05s/it] {'loss': 0.0002, 'grad_norm': 0.3036949345083664, 'learning_rate': 2.7839999999999995e-07, 'completion_length': 145.01786041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005950927734375, 'epoch': 0.72} + 72%|███████▏ | 1804/2500 [11:22:22<6:35:00, 34.05s/it] 72%|███████▏ | 1805/2500 [11:22:39<5:35:02, 28.92s/it] {'loss': 0.0002, 'grad_norm': 0.021646372936166395, 'learning_rate': 2.7800000000000003e-07, 'completion_length': 150.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00463104248046875, 'epoch': 0.72} + 72%|███████▏ | 1805/2500 [11:22:39<5:35:02, 28.92s/it] 72%|███████▏ | 1806/2500 [11:22:55<4:52:17, 25.27s/it] {'loss': 0.0002, 'grad_norm': 0.3029822683734618, 'learning_rate': 2.776e-07, 'completion_length': 164.1964340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0059661865234375, 'epoch': 0.72} + 72%|███████▏ | 1806/2500 [11:22:55<4:52:17, 25.27s/it] 72%|███████▏ | 1807/2500 [11:23:11<4:19:24, 22.46s/it] {'loss': 0.0003, 'grad_norm': 0.931067142563749, 'learning_rate': 2.7719999999999997e-07, 'completion_length': 148.1964340209961, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.006317138671875, 'epoch': 0.72} + 72%|███████▏ | 1807/2500 [11:23:11<4:19:24, 22.46s/it] 72%|███████▏ | 1808/2500 [11:23:28<3:59:03, 20.73s/it] {'loss': 0.0002, 'grad_norm': 0.4089073952828324, 'learning_rate': 2.768e-07, 'completion_length': 163.23214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 
0.00543212890625, 'epoch': 0.72} + 72%|███████▏ | 1808/2500 [11:23:28<3:59:03, 20.73s/it] 72%|███████▏ | 1809/2500 [11:23:44<3:42:13, 19.30s/it] {'loss': 0.0002, 'grad_norm': 0.2272725823864556, 'learning_rate': 2.7639999999999996e-07, 'completion_length': 148.35714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00482177734375, 'epoch': 0.72} + 72%|███████▏ | 1809/2500 [11:23:44<3:42:13, 19.30s/it] 72%|███████▏ | 1810/2500 [11:24:01<3:32:54, 18.51s/it] {'loss': 0.0002, 'grad_norm': 1.3254514587108661, 'learning_rate': 2.7600000000000004e-07, 'completion_length': 164.5178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00542449951171875, 'epoch': 0.72} + 72%|███████▏ | 1810/2500 [11:24:01<3:32:54, 18.51s/it] 72%|███████▏ | 1811/2500 [11:24:17<3:24:41, 17.82s/it] {'loss': 0.0002, 'grad_norm': 0.15453086998140622, 'learning_rate': 2.756e-07, 'completion_length': 158.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0040740966796875, 'epoch': 0.72} + 72%|███████▏ | 1811/2500 [11:24:17<3:24:41, 17.82s/it] 72%|███████▏ | 1812/2500 [11:24:34<3:22:52, 17.69s/it] {'loss': 0.0002, 'grad_norm': 0.028091994798550315, 'learning_rate': 2.752e-07, 'completion_length': 162.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045928955078125, 'epoch': 0.72} + 72%|███████▏ | 1812/2500 [11:24:34<3:22:52, 17.69s/it] 73%|███████▎ | 1813/2500 [11:24:50<3:16:05, 17.13s/it] {'loss': 0.0001, 'grad_norm': 0.019848090830099575, 'learning_rate': 2.748e-07, 'completion_length': 146.46429061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 
'kl': 0.002483367919921875, 'epoch': 0.73} + 73%|███████▎ | 1813/2500 [11:24:50<3:16:05, 17.13s/it] 73%|███████▎ | 1814/2500 [11:25:08<3:17:19, 17.26s/it] {'loss': 0.0002, 'grad_norm': 0.29483948348516914, 'learning_rate': 2.7439999999999997e-07, 'completion_length': 163.1607208251953, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.004425048828125, 'epoch': 0.73} + 73%|███████▎ | 1814/2500 [11:25:08<3:17:19, 17.26s/it] 73%|███████▎ | 1815/2500 [11:25:24<3:12:34, 16.87s/it] {'loss': 0.0003, 'grad_norm': 0.29484537598681737, 'learning_rate': 2.74e-07, 'completion_length': 149.08928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0064697265625, 'epoch': 0.73} + 73%|███████▎ | 1815/2500 [11:25:24<3:12:34, 16.87s/it] 73%|███████▎ | 1816/2500 [11:25:40<3:10:30, 16.71s/it] {'loss': 0.0003, 'grad_norm': 0.2092822591267017, 'learning_rate': 2.736e-07, 'completion_length': 167.21428680419922, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0062713623046875, 'epoch': 0.73} + 73%|███████▎ | 1816/2500 [11:25:40<3:10:30, 16.71s/it] 73%|███████▎ | 1817/2500 [11:25:55<3:05:24, 16.29s/it] {'loss': 0.0001, 'grad_norm': 0.4614246063015595, 'learning_rate': 2.732e-07, 'completion_length': 143.58929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.003509521484375, 'epoch': 0.73} + 73%|███████▎ | 1817/2500 [11:25:55<3:05:24, 16.29s/it] 73%|███████▎ | 1818/2500 [11:26:15<3:17:18, 17.36s/it] {'loss': 0.0003, 'grad_norm': 0.3258998486508936, 'learning_rate': 2.7279999999999995e-07, 'completion_length': 198.9107208251953, 'rewards/accuracy_reward': 
0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.11266788095235825, 'kl': 0.008392333984375, 'epoch': 0.73} + 73%|███████▎ | 1818/2500 [11:26:15<3:17:18, 17.36s/it] 73%|███████▎ | 1819/2500 [11:26:31<3:11:07, 16.84s/it] {'loss': 0.0002, 'grad_norm': 0.24237954521995034, 'learning_rate': 2.724e-07, 'completion_length': 150.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00506591796875, 'epoch': 0.73} + 73%|███████▎ | 1819/2500 [11:26:31<3:11:07, 16.84s/it] 73%|███████▎ | 1820/2500 [11:26:46<3:05:12, 16.34s/it] {'loss': 0.0002, 'grad_norm': 0.3223808780601306, 'learning_rate': 2.72e-07, 'completion_length': 141.62500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00534820556640625, 'epoch': 0.73} + 73%|███████▎ | 1820/2500 [11:26:46<3:05:12, 16.34s/it] 73%|███████▎ | 1821/2500 [11:27:02<3:03:20, 16.20s/it] {'loss': 0.0003, 'grad_norm': 0.275375627122185, 'learning_rate': 2.7159999999999997e-07, 'completion_length': 153.6071548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00738525390625, 'epoch': 0.73} + 73%|███████▎ | 1821/2500 [11:27:02<3:03:20, 16.20s/it] 73%|███████▎ | 1822/2500 [11:27:18<3:03:52, 16.27s/it] {'loss': 0.0002, 'grad_norm': 0.019311659229175188, 'learning_rate': 2.712e-07, 'completion_length': 167.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052642822265625, 'epoch': 0.73} + 73%|███████▎ | 1822/2500 [11:27:18<3:03:52, 16.27s/it] 73%|███████▎ | 1823/2500 [11:27:34<3:02:41, 16.19s/it] {'loss': 0.0002, 'grad_norm': 0.3405456595382296, 'learning_rate': 2.7079999999999996e-07, 'completion_length': 
160.23214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0052947998046875, 'epoch': 0.73} + 73%|███████▎ | 1823/2500 [11:27:34<3:02:41, 16.19s/it] 73%|███████▎ | 1824/2500 [11:27:49<2:59:23, 15.92s/it] {'loss': 0.0002, 'grad_norm': 0.6366730113742369, 'learning_rate': 2.704e-07, 'completion_length': 157.10714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0040740966796875, 'epoch': 0.73} + 73%|███████▎ | 1824/2500 [11:27:49<2:59:23, 15.92s/it] 73%|███████▎ | 1825/2500 [11:28:11<3:19:31, 17.73s/it] {'loss': 0.0002, 'grad_norm': 1.1222686529229697, 'learning_rate': 2.7e-07, 'completion_length': 157.62500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00435638427734375, 'epoch': 0.73} + 73%|███████▎ | 1825/2500 [11:28:11<3:19:31, 17.73s/it] 73%|███████▎ | 1826/2500 [11:28:33<3:33:37, 19.02s/it] {'loss': 0.0002, 'grad_norm': 1.1243927155511912, 'learning_rate': 2.696e-07, 'completion_length': 154.01786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0045623779296875, 'epoch': 0.73} + 73%|███████▎ | 1826/2500 [11:28:33<3:33:37, 19.02s/it] 73%|███████▎ | 1827/2500 [11:28:54<3:39:01, 19.53s/it] {'loss': 0.0001, 'grad_norm': 0.018137063524621314, 'learning_rate': 2.692e-07, 'completion_length': 146.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.002960205078125, 'epoch': 0.73} + 73%|███████▎ | 1827/2500 [11:28:54<3:39:01, 19.53s/it] 73%|███████▎ | 1828/2500 [11:29:16<3:48:06, 20.37s/it] {'loss': 0.0002, 'grad_norm': 0.3460301404950591, 
'learning_rate': 2.6879999999999997e-07, 'completion_length': 154.1607208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0051727294921875, 'epoch': 0.73} + 73%|███████▎ | 1828/2500 [11:29:16<3:48:06, 20.37s/it] 73%|███████▎ | 1829/2500 [11:29:38<3:51:55, 20.74s/it] {'loss': 0.0002, 'grad_norm': 0.24188536321366597, 'learning_rate': 2.684e-07, 'completion_length': 147.62500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0045623779296875, 'epoch': 0.73} + 73%|███████▎ | 1829/2500 [11:29:38<3:51:55, 20.74s/it] 73%|███████▎ | 1830/2500 [11:30:00<3:54:57, 21.04s/it] {'loss': 0.0003, 'grad_norm': 0.42421784765389214, 'learning_rate': 2.68e-07, 'completion_length': 159.60714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00632476806640625, 'epoch': 0.73} + 73%|███████▎ | 1830/2500 [11:30:00<3:54:57, 21.04s/it] 73%|███████▎ | 1831/2500 [11:30:22<3:58:31, 21.39s/it] {'loss': 0.0002, 'grad_norm': 0.31592196500448244, 'learning_rate': 2.676e-07, 'completion_length': 156.42858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0053253173828125, 'epoch': 0.73} + 73%|███████▎ | 1831/2500 [11:30:22<3:58:31, 21.39s/it] 73%|███████▎ | 1832/2500 [11:30:43<3:57:53, 21.37s/it] {'loss': 0.0001, 'grad_norm': 0.20621668045033245, 'learning_rate': 2.6719999999999996e-07, 'completion_length': 150.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0027923583984375, 'epoch': 0.73} + 73%|███████▎ | 1832/2500 [11:30:43<3:57:53, 21.37s/it] 
73%|███████▎ | 1833/2500 [11:31:05<3:59:52, 21.58s/it] {'loss': 0.0002, 'grad_norm': 0.20897672617299476, 'learning_rate': 2.668e-07, 'completion_length': 156.42858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0038909912109375, 'epoch': 0.73} + 73%|███████▎ | 1833/2500 [11:31:05<3:59:52, 21.58s/it] 73%|███████▎ | 1834/2500 [11:31:26<3:56:03, 21.27s/it] {'loss': 0.0003, 'grad_norm': 0.038655477194871925, 'learning_rate': 2.664e-07, 'completion_length': 149.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0069732666015625, 'epoch': 0.73} + 73%|███████▎ | 1834/2500 [11:31:26<3:56:03, 21.27s/it] 73%|███████▎ | 1835/2500 [11:31:47<3:55:26, 21.24s/it] {'loss': 0.0003, 'grad_norm': 0.37737550072775217, 'learning_rate': 2.66e-07, 'completion_length': 153.9464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0066070556640625, 'epoch': 0.73} + 73%|███████▎ | 1835/2500 [11:31:47<3:55:26, 21.24s/it] 73%|███████▎ | 1836/2500 [11:32:11<4:04:08, 22.06s/it] {'loss': 0.0002, 'grad_norm': 0.16365331734120672, 'learning_rate': 2.656e-07, 'completion_length': 180.0357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00616455078125, 'epoch': 0.73} + 73%|███████▎ | 1836/2500 [11:32:11<4:04:08, 22.06s/it] 73%|███████▎ | 1837/2500 [11:32:34<4:07:52, 22.43s/it] {'loss': 0.0002, 'grad_norm': 0.415014334879573, 'learning_rate': 2.6519999999999997e-07, 'completion_length': 156.55357360839844, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0055694580078125, 'epoch': 0.73} + 73%|███████▎ | 
1837/2500 [11:32:34<4:07:52, 22.43s/it] 74%|███████▎ | 1838/2500 [11:32:57<4:08:19, 22.51s/it] {'loss': 0.0004, 'grad_norm': 0.7666298084304581, 'learning_rate': 2.648e-07, 'completion_length': 158.25000762939453, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.0103759765625, 'epoch': 0.74} + 74%|███████▎ | 1838/2500 [11:32:57<4:08:19, 22.51s/it] 74%|███████▎ | 1839/2500 [11:33:19<4:04:30, 22.19s/it] {'loss': 0.0002, 'grad_norm': 0.036624158785023474, 'learning_rate': 2.644e-07, 'completion_length': 159.8928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00384521484375, 'epoch': 0.74} + 74%|███████▎ | 1839/2500 [11:33:19<4:04:30, 22.19s/it] 74%|███████▎ | 1840/2500 [11:33:40<4:00:44, 21.89s/it] {'loss': 0.0002, 'grad_norm': 0.017818626921343943, 'learning_rate': 2.64e-07, 'completion_length': 151.39286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00439453125, 'epoch': 0.74} + 74%|███████▎ | 1840/2500 [11:33:40<4:00:44, 21.89s/it] 74%|███████▎ | 1841/2500 [11:34:03<4:05:09, 22.32s/it] {'loss': 0.0003, 'grad_norm': 1.2659636635840643, 'learning_rate': 2.636e-07, 'completion_length': 177.9821548461914, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0075836181640625, 'epoch': 0.74} + 74%|███████▎ | 1841/2500 [11:34:03<4:05:09, 22.32s/it] 74%|███████▎ | 1842/2500 [11:34:24<4:01:19, 22.00s/it] {'loss': 0.0001, 'grad_norm': 0.38462845703562104, 'learning_rate': 2.632e-07, 'completion_length': 149.7857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00305938720703125, 'epoch': 0.74} + 74%|███████▎ | 1842/2500 [11:34:24<4:01:19, 
22.00s/it] 74%|███████▎ | 1843/2500 [11:34:47<4:03:14, 22.21s/it] {'loss': 0.0002, 'grad_norm': 0.3293787148584087, 'learning_rate': 2.6279999999999994e-07, 'completion_length': 159.89286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0059661865234375, 'epoch': 0.74} + 74%|███████▎ | 1843/2500 [11:34:47<4:03:14, 22.21s/it] 74%|███████▍ | 1844/2500 [11:35:08<3:59:37, 21.92s/it] {'loss': 0.0002, 'grad_norm': 0.21708247760550822, 'learning_rate': 2.624e-07, 'completion_length': 151.92858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00429534912109375, 'epoch': 0.74} + 74%|███████▍ | 1844/2500 [11:35:08<3:59:37, 21.92s/it] 74%|███████▍ | 1845/2500 [11:35:31<4:02:21, 22.20s/it] {'loss': 0.0002, 'grad_norm': 0.40187869332985626, 'learning_rate': 2.62e-07, 'completion_length': 155.73214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00604248046875, 'epoch': 0.74} + 74%|███████▍ | 1845/2500 [11:35:31<4:02:21, 22.20s/it] 74%|███████▍ | 1846/2500 [11:35:52<3:57:13, 21.76s/it] {'loss': 0.0002, 'grad_norm': 0.022134882476954807, 'learning_rate': 2.616e-07, 'completion_length': 163.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0047454833984375, 'epoch': 0.74} + 74%|███████▍ | 1846/2500 [11:35:52<3:57:13, 21.76s/it] 74%|███████▍ | 1847/2500 [11:36:14<3:58:24, 21.91s/it] {'loss': 0.0003, 'grad_norm': 1.1169110320682074, 'learning_rate': 2.612e-07, 'completion_length': 154.0714340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00732421875, 'epoch': 0.74} + 74%|███████▍ | 
1847/2500 [11:36:14<3:58:24, 21.91s/it] 74%|███████▍ | 1848/2500 [11:36:36<3:57:07, 21.82s/it] {'loss': 0.0001, 'grad_norm': 0.9264719240521979, 'learning_rate': 2.6079999999999995e-07, 'completion_length': 138.05358123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00331878662109375, 'epoch': 0.74} + 74%|███████▍ | 1848/2500 [11:36:36<3:57:07, 21.82s/it] 74%|███████▍ | 1849/2500 [11:36:58<3:59:43, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.22381210937629542, 'learning_rate': 2.6040000000000003e-07, 'completion_length': 162.4107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00461578369140625, 'epoch': 0.74} + 74%|███████▍ | 1849/2500 [11:36:58<3:59:43, 22.09s/it] 74%|███████▍ | 1850/2500 [11:37:20<3:56:27, 21.83s/it] {'loss': 0.0002, 'grad_norm': 0.02176316419827156, 'learning_rate': 2.6e-07, 'completion_length': 142.55357360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0050506591796875, 'epoch': 0.74} + 74%|███████▍ | 1850/2500 [11:37:20<3:56:27, 21.83s/it] 74%|███████▍ | 1851/2500 [11:37:41<3:56:06, 21.83s/it] {'loss': 0.0003, 'grad_norm': 0.27741946070197443, 'learning_rate': 2.5959999999999997e-07, 'completion_length': 174.75000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.007537841796875, 'epoch': 0.74} + 74%|███████▍ | 1851/2500 [11:37:42<3:56:06, 21.83s/it] 74%|███████▍ | 1852/2500 [11:38:05<4:01:56, 22.40s/it] {'loss': 0.0003, 'grad_norm': 0.17392992453120998, 'learning_rate': 2.592e-07, 'completion_length': 196.33929443359375, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 
'reward_std': 0.0357142873108387, 'kl': 0.0075531005859375, 'epoch': 0.74} + 74%|███████▍ | 1852/2500 [11:38:05<4:01:56, 22.40s/it] 74%|███████▍ | 1853/2500 [11:38:28<4:03:57, 22.62s/it] {'loss': 0.0003, 'grad_norm': 0.29010440221532063, 'learning_rate': 2.5879999999999996e-07, 'completion_length': 170.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0066986083984375, 'epoch': 0.74} + 74%|███████▍ | 1853/2500 [11:38:28<4:03:57, 22.62s/it] 74%|███████▍ | 1854/2500 [11:38:50<4:01:08, 22.40s/it] {'loss': 0.0003, 'grad_norm': 0.02316383485434148, 'learning_rate': 2.584e-07, 'completion_length': 157.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0062408447265625, 'epoch': 0.74} + 74%|███████▍ | 1854/2500 [11:38:50<4:01:08, 22.40s/it] 74%|███████▍ | 1855/2500 [11:39:12<3:58:08, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.02458667060885073, 'learning_rate': 2.58e-07, 'completion_length': 167.125, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004791259765625, 'epoch': 0.74} + 74%|███████▍ | 1855/2500 [11:39:12<3:58:08, 22.15s/it] 74%|███████▍ | 1856/2500 [11:39:34<3:57:28, 22.13s/it] {'loss': 0.0002, 'grad_norm': 0.2254404227315995, 'learning_rate': 2.576e-07, 'completion_length': 148.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048980712890625, 'epoch': 0.74} + 74%|███████▍ | 1856/2500 [11:39:34<3:57:28, 22.13s/it] 74%|███████▍ | 1857/2500 [11:39:54<3:50:33, 21.51s/it] {'loss': 0.0002, 'grad_norm': 0.018976175060556766, 'learning_rate': 2.5719999999999995e-07, 'completion_length': 149.0178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.00384521484375, 'epoch': 0.74} + 74%|███████▍ | 1857/2500 [11:39:54<3:50:33, 21.51s/it] 74%|███████▍ | 1858/2500 [11:40:16<3:51:45, 21.66s/it] {'loss': 0.0001, 'grad_norm': 0.024909104646240042, 'learning_rate': 2.5679999999999997e-07, 'completion_length': 156.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00347137451171875, 'epoch': 0.74} + 74%|███████▍ | 1858/2500 [11:40:16<3:51:45, 21.66s/it] 74%|███████▍ | 1859/2500 [11:40:39<3:56:53, 22.17s/it] {'loss': 0.0002, 'grad_norm': 0.02275999593855142, 'learning_rate': 2.564e-07, 'completion_length': 154.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00379180908203125, 'epoch': 0.74} + 74%|███████▍ | 1859/2500 [11:40:39<3:56:53, 22.17s/it] 74%|███████▍ | 1860/2500 [11:41:01<3:54:59, 22.03s/it] {'loss': 0.0001, 'grad_norm': 0.013380757123945767, 'learning_rate': 2.56e-07, 'completion_length': 158.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00368499755859375, 'epoch': 0.74} + 74%|███████▍ | 1860/2500 [11:41:01<3:54:59, 22.03s/it] 74%|███████▍ | 1861/2500 [11:41:23<3:53:38, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.2181586486021572, 'learning_rate': 2.556e-07, 'completion_length': 150.62500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0057830810546875, 'epoch': 0.74} + 74%|███████▍ | 1861/2500 [11:41:23<3:53:38, 21.94s/it] 74%|███████▍ | 1862/2500 [11:41:45<3:55:22, 22.14s/it] {'loss': 0.0002, 'grad_norm': 0.2367491909987232, 'learning_rate': 2.5519999999999996e-07, 'completion_length': 148.23214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00518798828125, 'epoch': 0.74} + 
74%|███████▍ | 1862/2500 [11:41:45<3:55:22, 22.14s/it] 75%|███████▍ | 1863/2500 [11:42:08<3:56:30, 22.28s/it] {'loss': 0.0002, 'grad_norm': 0.017263273004539552, 'learning_rate': 2.5480000000000003e-07, 'completion_length': 154.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006195068359375, 'epoch': 0.75} + 75%|███████▍ | 1863/2500 [11:42:08<3:56:30, 22.28s/it] 75%|███████▍ | 1864/2500 [11:42:29<3:51:47, 21.87s/it] {'loss': 0.0002, 'grad_norm': 0.24545869020309488, 'learning_rate': 2.544e-07, 'completion_length': 160.48214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00412750244140625, 'epoch': 0.75} + 75%|███████▍ | 1864/2500 [11:42:29<3:51:47, 21.87s/it] 75%|███████▍ | 1865/2500 [11:42:51<3:51:08, 21.84s/it] {'loss': 0.0002, 'grad_norm': 0.43267972259335874, 'learning_rate': 2.5399999999999997e-07, 'completion_length': 150.35714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0056304931640625, 'epoch': 0.75} + 75%|███████▍ | 1865/2500 [11:42:51<3:51:08, 21.84s/it] 75%|███████▍ | 1866/2500 [11:43:12<3:50:16, 21.79s/it] {'loss': 0.0002, 'grad_norm': 0.027813110950228245, 'learning_rate': 2.536e-07, 'completion_length': 153.80358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0059814453125, 'epoch': 0.75} + 75%|███████▍ | 1866/2500 [11:43:12<3:50:16, 21.79s/it] 75%|███████▍ | 1867/2500 [11:43:33<3:47:05, 21.53s/it] {'loss': 0.0002, 'grad_norm': 0.01840737703728911, 'learning_rate': 2.5319999999999996e-07, 'completion_length': 146.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004119873046875, 'epoch': 0.75} 
+ 75%|███████▍ | 1867/2500 [11:43:33<3:47:05, 21.53s/it] 75%|███████▍ | 1868/2500 [11:43:54<3:45:06, 21.37s/it] {'loss': 0.0001, 'grad_norm': 0.017480398829619038, 'learning_rate': 2.528e-07, 'completion_length': 159.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00347900390625, 'epoch': 0.75} + 75%|███████▍ | 1868/2500 [11:43:54<3:45:06, 21.37s/it] 75%|███████▍ | 1869/2500 [11:44:16<3:45:00, 21.39s/it] {'loss': 0.0001, 'grad_norm': 0.01734553955396444, 'learning_rate': 2.524e-07, 'completion_length': 163.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00286865234375, 'epoch': 0.75} + 75%|███████▍ | 1869/2500 [11:44:16<3:45:00, 21.39s/it] 75%|███████▍ | 1870/2500 [11:44:38<3:46:19, 21.55s/it] {'loss': 0.0001, 'grad_norm': 0.015057300997020361, 'learning_rate': 2.52e-07, 'completion_length': 148.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00365447998046875, 'epoch': 0.75} + 75%|███████▍ | 1870/2500 [11:44:38<3:46:19, 21.55s/it] 75%|███████▍ | 1871/2500 [11:44:59<3:44:37, 21.43s/it] {'loss': 0.0001, 'grad_norm': 0.014039408299012216, 'learning_rate': 2.516e-07, 'completion_length': 141.0, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00275421142578125, 'epoch': 0.75} + 75%|███████▍ | 1871/2500 [11:44:59<3:44:37, 21.43s/it] 75%|███████▍ | 1872/2500 [11:45:21<3:46:17, 21.62s/it] {'loss': 0.0002, 'grad_norm': 0.7260977525287943, 'learning_rate': 2.5119999999999997e-07, 'completion_length': 171.08929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005340576171875, 'epoch': 0.75} + 75%|███████▍ | 1872/2500 [11:45:21<3:46:17, 21.62s/it] 75%|███████▍ | 1873/2500 [11:45:42<3:43:32, 21.39s/it] 
{'loss': 0.0001, 'grad_norm': 0.016123690884458242, 'learning_rate': 2.508e-07, 'completion_length': 147.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00318145751953125, 'epoch': 0.75} + 75%|███████▍ | 1873/2500 [11:45:42<3:43:32, 21.39s/it] 75%|███████▍ | 1874/2500 [11:46:04<3:45:15, 21.59s/it] {'loss': 0.0002, 'grad_norm': 0.4850877428748892, 'learning_rate': 2.504e-07, 'completion_length': 135.92858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004791259765625, 'epoch': 0.75} + 75%|███████▍ | 1874/2500 [11:46:04<3:45:15, 21.59s/it] 75%|███████▌ | 1875/2500 [11:46:26<3:47:19, 21.82s/it] {'loss': 0.0002, 'grad_norm': 0.2168970567589975, 'learning_rate': 2.5e-07, 'completion_length': 175.2678680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0060882568359375, 'epoch': 0.75} + 75%|███████▌ | 1875/2500 [11:46:26<3:47:19, 21.82s/it] 75%|███████▌ | 1876/2500 [11:46:48<3:47:24, 21.87s/it] {'loss': 0.0001, 'grad_norm': 0.03378759386846228, 'learning_rate': 2.4959999999999996e-07, 'completion_length': 150.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003021240234375, 'epoch': 0.75} + 75%|███████▌ | 1876/2500 [11:46:48<3:47:24, 21.87s/it] 75%|███████▌ | 1877/2500 [11:47:11<3:50:40, 22.22s/it] {'loss': 0.0004, 'grad_norm': 0.22991925497571045, 'learning_rate': 2.492e-07, 'completion_length': 171.05357360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.008819580078125, 'epoch': 0.75} + 75%|███████▌ | 1877/2500 [11:47:11<3:50:40, 22.22s/it] 75%|███████▌ | 1878/2500 [11:47:33<3:49:05, 22.10s/it] {'loss': 0.0003, 'grad_norm': 0.02072330691305356, 
'learning_rate': 2.488e-07, 'completion_length': 167.92858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006378173828125, 'epoch': 0.75} + 75%|███████▌ | 1878/2500 [11:47:33<3:49:05, 22.10s/it] 75%|███████▌ | 1879/2500 [11:47:55<3:48:44, 22.10s/it] {'loss': 0.0003, 'grad_norm': 0.5242542757014796, 'learning_rate': 2.484e-07, 'completion_length': 164.05357360839844, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0078277587890625, 'epoch': 0.75} + 75%|███████▌ | 1879/2500 [11:47:55<3:48:44, 22.10s/it] 75%|███████▌ | 1880/2500 [11:48:17<3:46:50, 21.95s/it] {'loss': 0.0002, 'grad_norm': 0.2343327213511984, 'learning_rate': 2.48e-07, 'completion_length': 167.50000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005615234375, 'epoch': 0.75} + 75%|███████▌ | 1880/2500 [11:48:17<3:46:50, 21.95s/it] 75%|███████▌ | 1881/2500 [11:48:38<3:46:05, 21.91s/it] {'loss': 0.0002, 'grad_norm': 0.21791483889727706, 'learning_rate': 2.4759999999999997e-07, 'completion_length': 158.6964340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00525665283203125, 'epoch': 0.75} + 75%|███████▌ | 1881/2500 [11:48:38<3:46:05, 21.91s/it] 75%|███████▌ | 1882/2500 [11:49:00<3:45:25, 21.89s/it] {'loss': 0.0002, 'grad_norm': 0.21629601734133974, 'learning_rate': 2.472e-07, 'completion_length': 161.71428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006011962890625, 'epoch': 0.75} + 75%|███████▌ | 1882/2500 [11:49:00<3:45:25, 21.89s/it] 75%|███████▌ | 1883/2500 [11:49:23<3:46:10, 21.99s/it] {'loss': 0.0003, 
'grad_norm': 0.3313128391703195, 'learning_rate': 2.4679999999999996e-07, 'completion_length': 182.21429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006988525390625, 'epoch': 0.75} + 75%|███████▌ | 1883/2500 [11:49:23<3:46:10, 21.99s/it] 75%|███████▌ | 1884/2500 [11:49:44<3:43:12, 21.74s/it] {'loss': 0.0001, 'grad_norm': 0.25617317724742406, 'learning_rate': 2.464e-07, 'completion_length': 150.25000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00357818603515625, 'epoch': 0.75} + 75%|███████▌ | 1884/2500 [11:49:44<3:43:12, 21.74s/it] 75%|███████▌ | 1885/2500 [11:50:07<3:48:24, 22.28s/it] {'loss': 0.0003, 'grad_norm': 0.18113631070887615, 'learning_rate': 2.46e-07, 'completion_length': 183.19644165039062, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006866455078125, 'epoch': 0.75} + 75%|███████▌ | 1885/2500 [11:50:07<3:48:24, 22.28s/it] 75%|███████▌ | 1886/2500 [11:50:32<3:54:33, 22.92s/it] {'loss': 0.0003, 'grad_norm': 0.1997896419623949, 'learning_rate': 2.456e-07, 'completion_length': 180.46429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.007171630859375, 'epoch': 0.75} + 75%|███████▌ | 1886/2500 [11:50:32<3:54:33, 22.92s/it] 75%|███████▌ | 1887/2500 [11:50:53<3:50:13, 22.53s/it] {'loss': 0.0001, 'grad_norm': 0.03054313512603256, 'learning_rate': 2.452e-07, 'completion_length': 150.42858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00334930419921875, 'epoch': 0.75} + 75%|███████▌ | 1887/2500 [11:50:53<3:50:13, 22.53s/it] 76%|███████▌ | 1888/2500 
[11:51:14<3:44:29, 22.01s/it] {'loss': 0.0002, 'grad_norm': 0.01685438408557941, 'learning_rate': 2.4479999999999997e-07, 'completion_length': 148.08929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00469970703125, 'epoch': 0.76} + 76%|███████▌ | 1888/2500 [11:51:14<3:44:29, 22.01s/it] 76%|███████▌ | 1889/2500 [11:51:36<3:44:21, 22.03s/it] {'loss': 0.0002, 'grad_norm': 0.031478143747117784, 'learning_rate': 2.444e-07, 'completion_length': 164.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006195068359375, 'epoch': 0.76} + 76%|███████▌ | 1889/2500 [11:51:36<3:44:21, 22.03s/it] 76%|███████▌ | 1890/2500 [11:51:58<3:43:44, 22.01s/it] {'loss': 0.0002, 'grad_norm': 0.02120387837017205, 'learning_rate': 2.4399999999999996e-07, 'completion_length': 160.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00421142578125, 'epoch': 0.76} + 76%|███████▌ | 1890/2500 [11:51:58<3:43:44, 22.01s/it] 76%|███████▌ | 1891/2500 [11:52:20<3:44:28, 22.12s/it] {'loss': 0.0002, 'grad_norm': 0.025242215648580438, 'learning_rate': 2.436e-07, 'completion_length': 167.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00453948974609375, 'epoch': 0.76} + 76%|███████▌ | 1891/2500 [11:52:20<3:44:28, 22.12s/it] 76%|███████▌ | 1892/2500 [11:52:42<3:42:43, 21.98s/it] {'loss': 0.0002, 'grad_norm': 0.020761814536968296, 'learning_rate': 2.432e-07, 'completion_length': 166.3928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00386810302734375, 'epoch': 0.76} + 76%|███████▌ | 1892/2500 [11:52:42<3:42:43, 21.98s/it] 76%|███████▌ | 1893/2500 [11:53:04<3:43:31, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.17027961778215325, 'learning_rate': 2.428e-07, 'completion_length': 
159.4107208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.005859375, 'epoch': 0.76} + 76%|███████▌ | 1893/2500 [11:53:04<3:43:31, 22.09s/it] 76%|███████▌ | 1894/2500 [11:53:28<3:47:22, 22.51s/it] {'loss': 0.0004, 'grad_norm': 0.27857924920685934, 'learning_rate': 2.424e-07, 'completion_length': 175.55358123779297, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.0714285746216774, 'kl': 0.011199951171875, 'epoch': 0.76} + 76%|███████▌ | 1894/2500 [11:53:28<3:47:22, 22.51s/it] 76%|███████▌ | 1895/2500 [11:53:50<3:45:33, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.3002090413288133, 'learning_rate': 2.4199999999999997e-07, 'completion_length': 163.83929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.004791259765625, 'epoch': 0.76} + 76%|███████▌ | 1895/2500 [11:53:50<3:45:33, 22.37s/it] 76%|███████▌ | 1896/2500 [11:54:11<3:40:45, 21.93s/it] {'loss': 0.0001, 'grad_norm': 0.024468582535142395, 'learning_rate': 2.416e-07, 'completion_length': 153.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00339508056640625, 'epoch': 0.76} + 76%|███████▌ | 1896/2500 [11:54:11<3:40:45, 21.93s/it] 76%|███████▌ | 1897/2500 [11:54:33<3:39:35, 21.85s/it] {'loss': 0.0002, 'grad_norm': 0.02095053957254611, 'learning_rate': 2.4119999999999996e-07, 'completion_length': 161.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0055084228515625, 'epoch': 0.76} + 76%|███████▌ | 1897/2500 [11:54:33<3:39:35, 21.85s/it] 76%|███████▌ | 1898/2500 [11:54:55<3:41:06, 22.04s/it] {'loss': 0.0003, 'grad_norm': 0.31254269937776064, 'learning_rate': 2.408e-07, 'completion_length': 
173.12500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00677490234375, 'epoch': 0.76} + 76%|███████▌ | 1898/2500 [11:54:55<3:41:06, 22.04s/it] 76%|███████▌ | 1899/2500 [11:55:17<3:41:47, 22.14s/it] {'loss': 0.0003, 'grad_norm': 1.210241482500224, 'learning_rate': 2.404e-07, 'completion_length': 168.2678680419922, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.12371791899204254, 'kl': 0.00634765625, 'epoch': 0.76} + 76%|███████▌ | 1899/2500 [11:55:17<3:41:47, 22.14s/it] 76%|███████▌ | 1900/2500 [11:55:40<3:41:45, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.023888102746157054, 'learning_rate': 2.4e-07, 'completion_length': 170.6607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0059356689453125, 'epoch': 0.76} + 76%|███████▌ | 1900/2500 [11:55:40<3:41:45, 22.18s/it] 76%|███████▌ | 1901/2500 [11:58:31<11:09:18, 67.04s/it] {'loss': 0.0001, 'grad_norm': 0.011292366573221795, 'learning_rate': 2.396e-07, 'completion_length': 153.5178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00240325927734375, 'epoch': 0.76} + 76%|███████▌ | 1901/2500 [11:58:31<11:09:18, 67.04s/it] 76%|███████▌ | 1902/2500 [11:58:47<8:35:17, 51.70s/it] {'loss': 0.0002, 'grad_norm': 0.032627846466241964, 'learning_rate': 2.3919999999999997e-07, 'completion_length': 155.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0043487548828125, 'epoch': 0.76} + 76%|███████▌ | 1902/2500 [11:58:47<8:35:17, 51.70s/it] 76%|███████▌ | 1903/2500 [11:59:04<6:50:12, 41.23s/it] {'loss': 0.0002, 'grad_norm': 0.2550143115915636, 'learning_rate': 2.388e-07, 'completion_length': 163.9464340209961, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005645751953125, 'epoch': 0.76} + 76%|███████▌ | 1903/2500 [11:59:04<6:50:12, 41.23s/it] 76%|███████▌ | 1904/2500 [11:59:21<5:37:37, 33.99s/it] {'loss': 0.0003, 'grad_norm': 0.022753072851983847, 'learning_rate': 2.384e-07, 'completion_length': 165.48214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0075836181640625, 'epoch': 0.76} + 76%|███████▌ | 1904/2500 [11:59:21<5:37:37, 33.99s/it] 76%|███████▌ | 1905/2500 [11:59:38<4:44:59, 28.74s/it] {'loss': 0.0002, 'grad_norm': 0.015840752463322103, 'learning_rate': 2.38e-07, 'completion_length': 163.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00441741943359375, 'epoch': 0.76} + 76%|███████▌ | 1905/2500 [11:59:38<4:44:59, 28.74s/it] 76%|███████▌ | 1906/2500 [11:59:58<4:19:57, 26.26s/it] {'loss': 0.0003, 'grad_norm': 0.25270303751089884, 'learning_rate': 2.3759999999999998e-07, 'completion_length': 158.23214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0066070556640625, 'epoch': 0.76} + 76%|███████▌ | 1906/2500 [11:59:58<4:19:57, 26.26s/it] 76%|███████▋ | 1907/2500 [12:00:21<4:09:25, 25.24s/it] {'loss': 0.0002, 'grad_norm': 0.018120960057861055, 'learning_rate': 2.3719999999999998e-07, 'completion_length': 163.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004241943359375, 'epoch': 0.76} + 76%|███████▋ | 1907/2500 [12:00:21<4:09:25, 25.24s/it] 76%|███████▋ | 1908/2500 [12:00:43<3:58:05, 24.13s/it] {'loss': 0.0001, 'grad_norm': 0.021340644359850794, 'learning_rate': 2.368e-07, 'completion_length': 161.67857360839844, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00337982177734375, 'epoch': 0.76} + 76%|███████▋ | 1908/2500 [12:00:43<3:58:05, 24.13s/it] 76%|███████▋ | 1909/2500 [12:01:03<3:46:10, 22.96s/it] {'loss': 0.0002, 'grad_norm': 0.06922672051416817, 'learning_rate': 2.364e-07, 'completion_length': 138.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00592041015625, 'epoch': 0.76} + 76%|███████▋ | 1909/2500 [12:01:03<3:46:10, 22.96s/it] 76%|███████▋ | 1910/2500 [12:01:25<3:42:48, 22.66s/it] {'loss': 0.0002, 'grad_norm': 0.01920441529943934, 'learning_rate': 2.3599999999999997e-07, 'completion_length': 149.5357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0057373046875, 'epoch': 0.76} + 76%|███████▋ | 1910/2500 [12:01:25<3:42:48, 22.66s/it] 76%|███████▋ | 1911/2500 [12:01:49<3:46:07, 23.04s/it] {'loss': 0.0003, 'grad_norm': 0.034276492092035614, 'learning_rate': 2.356e-07, 'completion_length': 175.51786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00628662109375, 'epoch': 0.76} + 76%|███████▋ | 1911/2500 [12:01:49<3:46:07, 23.04s/it] 76%|███████▋ | 1912/2500 [12:02:11<3:45:02, 22.96s/it] {'loss': 0.0003, 'grad_norm': 0.22177536859761865, 'learning_rate': 2.352e-07, 'completion_length': 169.4107208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00811767578125, 'epoch': 0.76} + 76%|███████▋ | 1912/2500 [12:02:11<3:45:02, 22.96s/it] 77%|███████▋ | 1913/2500 [12:02:34<3:42:20, 22.73s/it] {'loss': 0.0002, 'grad_norm': 0.019452072611243917, 'learning_rate': 2.3479999999999998e-07, 'completion_length': 160.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 
'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0039520263671875, 'epoch': 0.77} + 77%|███████▋ | 1913/2500 [12:02:34<3:42:20, 22.73s/it] 77%|███████▋ | 1914/2500 [12:02:55<3:38:48, 22.40s/it] {'loss': 0.0002, 'grad_norm': 0.016938564807625085, 'learning_rate': 2.3439999999999998e-07, 'completion_length': 161.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0039215087890625, 'epoch': 0.77} + 77%|███████▋ | 1914/2500 [12:02:55<3:38:48, 22.40s/it] 77%|███████▋ | 1915/2500 [12:03:16<3:33:42, 21.92s/it] {'loss': 0.0001, 'grad_norm': 0.031315627223403086, 'learning_rate': 2.34e-07, 'completion_length': 143.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00308990478515625, 'epoch': 0.77} + 77%|███████▋ | 1915/2500 [12:03:16<3:33:42, 21.92s/it] 77%|███████▋ | 1916/2500 [12:03:39<3:35:50, 22.18s/it] {'loss': 0.0002, 'grad_norm': 0.3162489477782431, 'learning_rate': 2.336e-07, 'completion_length': 169.10714721679688, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0052032470703125, 'epoch': 0.77} + 77%|███████▋ | 1916/2500 [12:03:39<3:35:50, 22.18s/it] 77%|███████▋ | 1917/2500 [12:04:03<3:41:27, 22.79s/it] {'loss': 0.0002, 'grad_norm': 0.29716484752609934, 'learning_rate': 2.3319999999999997e-07, 'completion_length': 180.64286041259766, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0054779052734375, 'epoch': 0.77} + 77%|███████▋ | 1917/2500 [12:04:03<3:41:27, 22.79s/it] 77%|███████▋ | 1918/2500 [12:04:25<3:39:43, 22.65s/it] {'loss': 0.0001, 'grad_norm': 0.015367553741121389, 'learning_rate': 2.328e-07, 'completion_length': 160.17858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00360107421875, 'epoch': 0.77} + 77%|███████▋ | 1918/2500 [12:04:25<3:39:43, 22.65s/it] 77%|███████▋ | 1919/2500 [12:04:49<3:43:05, 23.04s/it] {'loss': 0.0003, 'grad_norm': 0.3146904465857744, 'learning_rate': 2.324e-07, 'completion_length': 181.96429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006500244140625, 'epoch': 0.77} + 77%|███████▋ | 1919/2500 [12:04:49<3:43:05, 23.04s/it] 77%|███████▋ | 1920/2500 [12:05:12<3:41:02, 22.87s/it] {'loss': 0.0002, 'grad_norm': 0.024304920405224938, 'learning_rate': 2.32e-07, 'completion_length': 149.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00440216064453125, 'epoch': 0.77} + 77%|███████▋ | 1920/2500 [12:05:12<3:41:02, 22.87s/it] 77%|███████▋ | 1921/2500 [12:05:34<3:37:48, 22.57s/it] {'loss': 0.0002, 'grad_norm': 0.5232118255698918, 'learning_rate': 2.3159999999999998e-07, 'completion_length': 177.12500762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.005523681640625, 'epoch': 0.77} + 77%|███████▋ | 1921/2500 [12:05:34<3:37:48, 22.57s/it] 77%|███████▋ | 1922/2500 [12:05:55<3:34:01, 22.22s/it] {'loss': 0.0002, 'grad_norm': 0.2784215888203309, 'learning_rate': 2.3119999999999998e-07, 'completion_length': 158.5357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0053863525390625, 'epoch': 0.77} + 77%|███████▋ | 1922/2500 [12:05:55<3:34:01, 22.22s/it] 77%|███████▋ | 1923/2500 [12:06:17<3:33:20, 22.18s/it] {'loss': 0.0003, 'grad_norm': 0.15081020513631024, 'learning_rate': 2.308e-07, 'completion_length': 174.8571548461914, 'rewards/accuracy_reward': 0.9464285969734192, 
'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0064697265625, 'epoch': 0.77} + 77%|███████▋ | 1923/2500 [12:06:17<3:33:20, 22.18s/it] 77%|███████▋ | 1924/2500 [12:06:40<3:35:45, 22.47s/it] {'loss': 0.0002, 'grad_norm': 0.01707486975499615, 'learning_rate': 2.3039999999999997e-07, 'completion_length': 177.3928680419922, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0056610107421875, 'epoch': 0.77} + 77%|███████▋ | 1924/2500 [12:06:40<3:35:45, 22.47s/it] 77%|███████▋ | 1925/2500 [12:07:02<3:32:27, 22.17s/it] {'loss': 0.0002, 'grad_norm': 0.025344051385953005, 'learning_rate': 2.3e-07, 'completion_length': 157.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0044403076171875, 'epoch': 0.77} + 77%|███████▋ | 1925/2500 [12:07:02<3:32:27, 22.17s/it] 77%|███████▋ | 1926/2500 [12:07:24<3:32:28, 22.21s/it] {'loss': 0.0002, 'grad_norm': 0.024250150273102684, 'learning_rate': 2.296e-07, 'completion_length': 156.6428680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00543212890625, 'epoch': 0.77} + 77%|███████▋ | 1926/2500 [12:07:24<3:32:28, 22.21s/it] 77%|███████▋ | 1927/2500 [12:07:45<3:28:46, 21.86s/it] {'loss': 0.0002, 'grad_norm': 0.01736164564428919, 'learning_rate': 2.292e-07, 'completion_length': 148.33928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00482177734375, 'epoch': 0.77} + 77%|███████▋ | 1927/2500 [12:07:45<3:28:46, 21.86s/it] 77%|███████▋ | 1928/2500 [12:08:08<3:31:54, 22.23s/it] {'loss': 0.0003, 'grad_norm': 0.4211210666743066, 'learning_rate': 2.2879999999999998e-07, 'completion_length': 183.30358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 
'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.006927490234375, 'epoch': 0.77} + 77%|███████▋ | 1928/2500 [12:08:08<3:31:54, 22.23s/it] 77%|███████▋ | 1929/2500 [12:08:31<3:31:43, 22.25s/it] {'loss': 0.0002, 'grad_norm': 0.25821522102659134, 'learning_rate': 2.2839999999999998e-07, 'completion_length': 162.05358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006011962890625, 'epoch': 0.77} + 77%|███████▋ | 1929/2500 [12:08:31<3:31:43, 22.25s/it] 77%|███████▋ | 1930/2500 [12:08:52<3:29:24, 22.04s/it] {'loss': 0.0002, 'grad_norm': 0.01899600669619979, 'learning_rate': 2.28e-07, 'completion_length': 168.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054168701171875, 'epoch': 0.77} + 77%|███████▋ | 1930/2500 [12:08:52<3:29:24, 22.04s/it] 77%|███████▋ | 1931/2500 [12:09:16<3:33:30, 22.51s/it] {'loss': 0.0002, 'grad_norm': 0.21728237372362097, 'learning_rate': 2.2759999999999997e-07, 'completion_length': 166.05358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00595855712890625, 'epoch': 0.77} + 77%|███████▋ | 1931/2500 [12:09:16<3:33:30, 22.51s/it] 77%|███████▋ | 1932/2500 [12:09:37<3:29:23, 22.12s/it] {'loss': 0.0002, 'grad_norm': 0.05517575311947045, 'learning_rate': 2.272e-07, 'completion_length': 156.26786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00438690185546875, 'epoch': 0.77} + 77%|███████▋ | 1932/2500 [12:09:37<3:29:23, 22.12s/it] 77%|███████▋ | 1933/2500 [12:09:58<3:24:59, 21.69s/it] {'loss': 0.0002, 'grad_norm': 0.021073693158815878, 'learning_rate': 2.268e-07, 'completion_length': 142.0357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 
'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00443267822265625, 'epoch': 0.77} + 77%|███████▋ | 1933/2500 [12:09:58<3:24:59, 21.69s/it] 77%|███████▋ | 1934/2500 [12:10:20<3:25:33, 21.79s/it] {'loss': 0.0002, 'grad_norm': 0.060194580939463074, 'learning_rate': 2.264e-07, 'completion_length': 162.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058135986328125, 'epoch': 0.77} + 77%|███████▋ | 1934/2500 [12:10:20<3:25:33, 21.79s/it] 77%|███████▋ | 1935/2500 [12:10:42<3:26:33, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.19259268359190718, 'learning_rate': 2.2599999999999999e-07, 'completion_length': 162.21429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0054473876953125, 'epoch': 0.77} + 77%|███████▋ | 1935/2500 [12:10:42<3:26:33, 21.94s/it] 77%|███████▋ | 1936/2500 [12:11:03<3:24:49, 21.79s/it] {'loss': 0.0002, 'grad_norm': 0.0190351393523608, 'learning_rate': 2.2559999999999998e-07, 'completion_length': 153.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0058746337890625, 'epoch': 0.77} + 77%|███████▋ | 1936/2500 [12:11:03<3:24:49, 21.79s/it] 77%|███████▋ | 1937/2500 [12:11:25<3:25:19, 21.88s/it] {'loss': 0.0002, 'grad_norm': 0.29521170366067007, 'learning_rate': 2.252e-07, 'completion_length': 163.1607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006011962890625, 'epoch': 0.77} + 77%|███████▋ | 1937/2500 [12:11:25<3:25:19, 21.88s/it] 78%|███████▊ | 1938/2500 [12:11:47<3:24:12, 21.80s/it] {'loss': 0.0002, 'grad_norm': 0.025284984592690827, 'learning_rate': 2.248e-07, 'completion_length': 163.5178680419922, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 
1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0059356689453125, 'epoch': 0.78} + 78%|███████▊ | 1938/2500 [12:11:47<3:24:12, 21.80s/it] 78%|███████▊ | 1939/2500 [12:12:09<3:24:19, 21.85s/it] {'loss': 0.0002, 'grad_norm': 0.029616356634860175, 'learning_rate': 2.2439999999999997e-07, 'completion_length': 164.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006195068359375, 'epoch': 0.78} + 78%|███████▊ | 1939/2500 [12:12:09<3:24:19, 21.85s/it] 78%|███████▊ | 1940/2500 [12:12:31<3:23:58, 21.86s/it] {'loss': 0.0003, 'grad_norm': 0.37707309599463285, 'learning_rate': 2.24e-07, 'completion_length': 159.58929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00653076171875, 'epoch': 0.78} + 78%|███████▊ | 1940/2500 [12:12:31<3:23:58, 21.86s/it] 78%|███████▊ | 1941/2500 [12:12:52<3:21:21, 21.61s/it] {'loss': 0.0001, 'grad_norm': 0.025146445730408088, 'learning_rate': 2.236e-07, 'completion_length': 146.1428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00299072265625, 'epoch': 0.78} + 78%|███████▊ | 1941/2500 [12:12:52<3:21:21, 21.61s/it] 78%|███████▊ | 1942/2500 [12:13:15<3:25:40, 22.11s/it] {'loss': 0.0002, 'grad_norm': 0.3743261604865017, 'learning_rate': 2.232e-07, 'completion_length': 146.12500762939453, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.04123930633068085, 'kl': 0.0062255859375, 'epoch': 0.78} + 78%|███████▊ | 1942/2500 [12:13:15<3:25:40, 22.11s/it] 78%|███████▊ | 1943/2500 [12:13:36<3:21:17, 21.68s/it] {'loss': 0.0002, 'grad_norm': 0.4740126864888072, 'learning_rate': 2.2279999999999998e-07, 'completion_length': 140.5178680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 
'reward_std': 0.0714285746216774, 'kl': 0.0042266845703125, 'epoch': 0.78} + 78%|███████▊ | 1943/2500 [12:13:36<3:21:17, 21.68s/it] 78%|███████▊ | 1944/2500 [12:13:59<3:24:41, 22.09s/it] {'loss': 0.0001, 'grad_norm': 0.01579651152667831, 'learning_rate': 2.2239999999999998e-07, 'completion_length': 145.33929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0037384033203125, 'epoch': 0.78} + 78%|███████▊ | 1944/2500 [12:13:59<3:24:41, 22.09s/it] 78%|███████▊ | 1945/2500 [12:14:22<3:25:42, 22.24s/it] {'loss': 0.0002, 'grad_norm': 1.0890538063212454, 'learning_rate': 2.22e-07, 'completion_length': 141.46429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.00399017333984375, 'epoch': 0.78} + 78%|███████▊ | 1945/2500 [12:14:22<3:25:42, 22.24s/it] 78%|███████▊ | 1946/2500 [12:14:43<3:22:33, 21.94s/it] {'loss': 0.0002, 'grad_norm': 0.5743994475108335, 'learning_rate': 2.2159999999999997e-07, 'completion_length': 158.55358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00543975830078125, 'epoch': 0.78} + 78%|███████▊ | 1946/2500 [12:14:43<3:22:33, 21.94s/it] 78%|███████▊ | 1947/2500 [12:15:05<3:23:23, 22.07s/it] {'loss': 0.0002, 'grad_norm': 0.01891860905432865, 'learning_rate': 2.212e-07, 'completion_length': 160.0357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00402069091796875, 'epoch': 0.78} + 78%|███████▊ | 1947/2500 [12:15:05<3:23:23, 22.07s/it] 78%|███████▊ | 1948/2500 [12:15:27<3:23:14, 22.09s/it] {'loss': 0.0003, 'grad_norm': 0.9269743855557875, 'learning_rate': 2.208e-07, 'completion_length': 168.78572845458984, 'rewards/accuracy_reward': 
0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0068359375, 'epoch': 0.78} + 78%|███████▊ | 1948/2500 [12:15:27<3:23:14, 22.09s/it] 78%|███████▊ | 1949/2500 [12:15:50<3:23:25, 22.15s/it] {'loss': 0.0002, 'grad_norm': 0.039021244118791626, 'learning_rate': 2.2040000000000001e-07, 'completion_length': 149.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054931640625, 'epoch': 0.78} + 78%|███████▊ | 1949/2500 [12:15:50<3:23:25, 22.15s/it] 78%|███████▊ | 1950/2500 [12:16:13<3:26:42, 22.55s/it] {'loss': 0.0003, 'grad_norm': 0.01780642871397245, 'learning_rate': 2.1999999999999998e-07, 'completion_length': 169.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00738525390625, 'epoch': 0.78} + 78%|███████▊ | 1950/2500 [12:16:13<3:26:42, 22.55s/it] 78%|███████▊ | 1951/2500 [12:16:34<3:22:49, 22.17s/it] {'loss': 0.0001, 'grad_norm': 0.010181673393732141, 'learning_rate': 2.1959999999999998e-07, 'completion_length': 148.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003021240234375, 'epoch': 0.78} + 78%|███████▊ | 1951/2500 [12:16:34<3:22:49, 22.17s/it] 78%|███████▊ | 1952/2500 [12:16:58<3:25:31, 22.50s/it] {'loss': 0.0002, 'grad_norm': 0.36522064595392195, 'learning_rate': 2.192e-07, 'completion_length': 152.60714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0053253173828125, 'epoch': 0.78} + 78%|███████▊ | 1952/2500 [12:16:58<3:25:31, 22.50s/it] 78%|███████▊ | 1953/2500 [12:17:20<3:23:56, 22.37s/it] {'loss': 0.0002, 'grad_norm': 0.016129493659709667, 'learning_rate': 2.1879999999999997e-07, 'completion_length': 150.9107208251953, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00429534912109375, 'epoch': 0.78} + 78%|███████▊ | 1953/2500 [12:17:20<3:23:56, 22.37s/it] 78%|███████▊ | 1954/2500 [12:17:41<3:20:21, 22.02s/it] {'loss': 0.0003, 'grad_norm': 0.0254915499638284, 'learning_rate': 2.184e-07, 'completion_length': 146.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0074310302734375, 'epoch': 0.78} + 78%|███████▊ | 1954/2500 [12:17:41<3:20:21, 22.02s/it] 78%|███████▊ | 1955/2500 [12:18:04<3:22:11, 22.26s/it] {'loss': 0.0003, 'grad_norm': 0.27574277269537584, 'learning_rate': 2.18e-07, 'completion_length': 169.19644165039062, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0070953369140625, 'epoch': 0.78} + 78%|███████▊ | 1955/2500 [12:18:04<3:22:11, 22.26s/it] 78%|███████▊ | 1956/2500 [12:18:26<3:20:46, 22.14s/it] {'loss': 0.0002, 'grad_norm': 0.856995417194313, 'learning_rate': 2.176e-07, 'completion_length': 157.6607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0043487548828125, 'epoch': 0.78} + 78%|███████▊ | 1956/2500 [12:18:26<3:20:46, 22.14s/it] 78%|███████▊ | 1957/2500 [12:18:47<3:18:56, 21.98s/it] {'loss': 0.0002, 'grad_norm': 0.542203908991639, 'learning_rate': 2.1719999999999999e-07, 'completion_length': 141.4464340209961, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.00516510009765625, 'epoch': 0.78} + 78%|███████▊ | 1957/2500 [12:18:47<3:18:56, 21.98s/it] 78%|███████▊ | 1958/2500 [12:19:08<3:16:31, 21.76s/it] {'loss': 0.0002, 'grad_norm': 0.022188939658769356, 'learning_rate': 2.1679999999999998e-07, 'completion_length': 146.6071548461914, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00482177734375, 'epoch': 0.78} + 78%|███████▊ | 1958/2500 [12:19:08<3:16:31, 21.76s/it] 78%|███████▊ | 1959/2500 [12:19:30<3:17:01, 21.85s/it] {'loss': 0.0002, 'grad_norm': 0.01728188472543581, 'learning_rate': 2.164e-07, 'completion_length': 158.46428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003936767578125, 'epoch': 0.78} + 78%|███████▊ | 1959/2500 [12:19:30<3:17:01, 21.85s/it] 78%|███████▊ | 1960/2500 [12:19:52<3:16:56, 21.88s/it] {'loss': 0.0002, 'grad_norm': 0.6743701766764097, 'learning_rate': 2.1599999999999998e-07, 'completion_length': 161.4464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00395965576171875, 'epoch': 0.78} + 78%|███████▊ | 1960/2500 [12:19:52<3:16:56, 21.88s/it] 78%|███████▊ | 1961/2500 [12:20:14<3:16:58, 21.93s/it] {'loss': 0.0001, 'grad_norm': 0.028434084460461816, 'learning_rate': 2.156e-07, 'completion_length': 152.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0029449462890625, 'epoch': 0.78} + 78%|███████▊ | 1961/2500 [12:20:14<3:16:58, 21.93s/it] 78%|███████▊ | 1962/2500 [12:20:37<3:18:44, 22.16s/it] {'loss': 0.0001, 'grad_norm': 0.02043661168571394, 'learning_rate': 2.152e-07, 'completion_length': 160.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00335693359375, 'epoch': 0.78} + 78%|███████▊ | 1962/2500 [12:20:37<3:18:44, 22.16s/it] 79%|███████▊ | 1963/2500 [12:20:59<3:17:40, 22.09s/it] {'loss': 0.0002, 'grad_norm': 0.024603075799862113, 'learning_rate': 2.148e-07, 'completion_length': 148.9464340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 
0.00492095947265625, 'epoch': 0.79} + 79%|███████▊ | 1963/2500 [12:20:59<3:17:40, 22.09s/it] 79%|███████▊ | 1964/2500 [12:21:21<3:15:43, 21.91s/it] {'loss': 0.0002, 'grad_norm': 0.19581531274216574, 'learning_rate': 2.144e-07, 'completion_length': 158.89286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048065185546875, 'epoch': 0.79} + 79%|███████▊ | 1964/2500 [12:21:21<3:15:43, 21.91s/it] 79%|███████▊ | 1965/2500 [12:21:43<3:15:44, 21.95s/it] {'loss': 0.0002, 'grad_norm': 0.5965349584719146, 'learning_rate': 2.1399999999999998e-07, 'completion_length': 159.2321548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.004730224609375, 'epoch': 0.79} + 79%|███████▊ | 1965/2500 [12:21:43<3:15:44, 21.95s/it] 79%|███████▊ | 1966/2500 [12:22:05<3:15:51, 22.01s/it] {'loss': 0.0002, 'grad_norm': 0.31255501844435496, 'learning_rate': 2.136e-07, 'completion_length': 160.55358123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0054168701171875, 'epoch': 0.79} + 79%|███████▊ | 1966/2500 [12:22:05<3:15:51, 22.01s/it] 79%|███████▊ | 1967/2500 [12:22:28<3:20:02, 22.52s/it] {'loss': 0.0002, 'grad_norm': 1.0940619997133645, 'learning_rate': 2.132e-07, 'completion_length': 182.1428680419922, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.1539071835577488, 'kl': 0.0051422119140625, 'epoch': 0.79} + 79%|███████▊ | 1967/2500 [12:22:29<3:20:02, 22.52s/it] 79%|███████▊ | 1968/2500 [12:22:52<3:21:04, 22.68s/it] {'loss': 0.0001, 'grad_norm': 0.34871111604437705, 'learning_rate': 2.1279999999999997e-07, 'completion_length': 161.0714340209961, 'rewards/accuracy_reward': 0.9821428656578064, 
'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0033416748046875, 'epoch': 0.79} + 79%|███████▊ | 1968/2500 [12:22:52<3:21:04, 22.68s/it] 79%|███████▉ | 1969/2500 [12:23:13<3:17:21, 22.30s/it] {'loss': 0.0001, 'grad_norm': 0.6355106335200192, 'learning_rate': 2.124e-07, 'completion_length': 152.9821548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00330352783203125, 'epoch': 0.79} + 79%|███████▉ | 1969/2500 [12:23:13<3:17:21, 22.30s/it] 79%|███████▉ | 1970/2500 [12:23:37<3:20:56, 22.75s/it] {'loss': 0.0003, 'grad_norm': 0.3242851304556102, 'learning_rate': 2.12e-07, 'completion_length': 179.05357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006256103515625, 'epoch': 0.79} + 79%|███████▉ | 1970/2500 [12:23:37<3:20:56, 22.75s/it] 79%|███████▉ | 1971/2500 [12:23:59<3:20:03, 22.69s/it] {'loss': 0.0002, 'grad_norm': 0.025567670765025117, 'learning_rate': 2.116e-07, 'completion_length': 161.00000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00548553466796875, 'epoch': 0.79} + 79%|███████▉ | 1971/2500 [12:23:59<3:20:03, 22.69s/it] 79%|███████▉ | 1972/2500 [12:24:21<3:17:16, 22.42s/it] {'loss': 0.0002, 'grad_norm': 0.17295532772641434, 'learning_rate': 2.1119999999999999e-07, 'completion_length': 164.30358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0055999755859375, 'epoch': 0.79} + 79%|███████▉ | 1972/2500 [12:24:21<3:17:16, 22.42s/it] 79%|███████▉ | 1973/2500 [12:24:42<3:14:12, 22.11s/it] {'loss': 0.0002, 'grad_norm': 0.40912909022650734, 'learning_rate': 2.1079999999999998e-07, 'completion_length': 
154.67857360839844, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0050811767578125, 'epoch': 0.79} + 79%|███████▉ | 1973/2500 [12:24:42<3:14:12, 22.11s/it] 79%|███████▉ | 1974/2500 [12:25:02<3:08:03, 21.45s/it] {'loss': 0.0002, 'grad_norm': 0.017798223665987444, 'learning_rate': 2.104e-07, 'completion_length': 141.3571548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00406646728515625, 'epoch': 0.79} + 79%|███████▉ | 1974/2500 [12:25:02<3:08:03, 21.45s/it] 79%|███████▉ | 1975/2500 [12:25:25<3:10:21, 21.76s/it] {'loss': 0.0003, 'grad_norm': 0.25630107911043276, 'learning_rate': 2.0999999999999997e-07, 'completion_length': 181.92858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006439208984375, 'epoch': 0.79} + 79%|███████▉ | 1975/2500 [12:25:25<3:10:21, 21.76s/it] 79%|███████▉ | 1976/2500 [12:25:46<3:07:32, 21.47s/it] {'loss': 0.0002, 'grad_norm': 0.19006578354173756, 'learning_rate': 2.096e-07, 'completion_length': 142.08928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0040130615234375, 'epoch': 0.79} + 79%|███████▉ | 1976/2500 [12:25:46<3:07:32, 21.47s/it] 79%|███████▉ | 1977/2500 [12:26:08<3:08:08, 21.58s/it] {'loss': 0.0002, 'grad_norm': 0.02537720271043806, 'learning_rate': 2.092e-07, 'completion_length': 155.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004241943359375, 'epoch': 0.79} + 79%|███████▉ | 1977/2500 [12:26:08<3:08:08, 21.58s/it] 79%|███████▉ | 1978/2500 [12:26:30<3:09:17, 21.76s/it] {'loss': 0.0001, 'grad_norm': 0.01577044173118386, 'learning_rate': 2.0880000000000002e-07, 
'completion_length': 156.39286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00360870361328125, 'epoch': 0.79} + 79%|███████▉ | 1978/2500 [12:26:30<3:09:17, 21.76s/it] 79%|███████▉ | 1979/2500 [12:26:53<3:12:01, 22.11s/it] {'loss': 0.0002, 'grad_norm': 0.019688188249891478, 'learning_rate': 2.0839999999999999e-07, 'completion_length': 171.69644165039062, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0058135986328125, 'epoch': 0.79} + 79%|███████▉ | 1979/2500 [12:26:53<3:12:01, 22.11s/it] 79%|███████▉ | 1980/2500 [12:27:15<3:12:07, 22.17s/it] {'loss': 0.0002, 'grad_norm': 0.017422813486438923, 'learning_rate': 2.0799999999999998e-07, 'completion_length': 155.83929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00405120849609375, 'epoch': 0.79} + 79%|███████▉ | 1980/2500 [12:27:15<3:12:07, 22.17s/it] 79%|███████▉ | 1981/2500 [12:27:37<3:12:46, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.01919665868988967, 'learning_rate': 2.076e-07, 'completion_length': 181.92858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00437164306640625, 'epoch': 0.79} + 79%|███████▉ | 1981/2500 [12:27:37<3:12:46, 22.29s/it] 79%|███████▉ | 1982/2500 [12:27:59<3:09:50, 21.99s/it] {'loss': 0.0002, 'grad_norm': 0.42748718810933817, 'learning_rate': 2.0719999999999998e-07, 'completion_length': 161.55358123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00516510009765625, 'epoch': 0.79} + 79%|███████▉ | 1982/2500 [12:27:59<3:09:50, 21.99s/it] 79%|███████▉ | 1983/2500 [12:28:20<3:07:32, 21.77s/it] {'loss': 0.0002, 'grad_norm': 0.020702972908633193, 'learning_rate': 2.068e-07, 'completion_length': 
142.8928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00420379638671875, 'epoch': 0.79} + 79%|███████▉ | 1983/2500 [12:28:20<3:07:32, 21.77s/it] 79%|███████▉ | 1984/2500 [12:28:41<3:05:29, 21.57s/it] {'loss': 0.0002, 'grad_norm': 0.4884324921367634, 'learning_rate': 2.064e-07, 'completion_length': 148.14286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0039520263671875, 'epoch': 0.79} + 79%|███████▉ | 1984/2500 [12:28:41<3:05:29, 21.57s/it] 79%|███████▉ | 1985/2500 [12:29:04<3:08:05, 21.91s/it] {'loss': 0.0001, 'grad_norm': 0.20051878494929495, 'learning_rate': 2.06e-07, 'completion_length': 158.7857208251953, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00331878662109375, 'epoch': 0.79} + 79%|███████▉ | 1985/2500 [12:29:04<3:08:05, 21.91s/it] 79%|███████▉ | 1986/2500 [12:29:26<3:07:08, 21.85s/it] {'loss': 0.0002, 'grad_norm': 1.2732171024930397, 'learning_rate': 2.056e-07, 'completion_length': 163.2321548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0054779052734375, 'epoch': 0.79} + 79%|███████▉ | 1986/2500 [12:29:26<3:07:08, 21.85s/it] 79%|███████▉ | 1987/2500 [12:29:46<3:04:03, 21.53s/it] {'loss': 0.0001, 'grad_norm': 0.018659570618174866, 'learning_rate': 2.0519999999999998e-07, 'completion_length': 150.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00344085693359375, 'epoch': 0.79} + 79%|███████▉ | 1987/2500 [12:29:46<3:04:03, 21.53s/it] 80%|███████▉ | 1988/2500 [12:30:06<3:00:07, 21.11s/it] {'loss': 0.0002, 'grad_norm': 0.020607678446049455, 'learning_rate': 2.048e-07, 'completion_length': 
154.87500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053558349609375, 'epoch': 0.8} + 80%|███████▉ | 1988/2500 [12:30:06<3:00:07, 21.11s/it] 80%|███████▉ | 1989/2500 [12:30:28<3:01:48, 21.35s/it] {'loss': 0.0002, 'grad_norm': 0.4255343514397338, 'learning_rate': 2.0439999999999998e-07, 'completion_length': 155.58928680419922, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0058746337890625, 'epoch': 0.8} + 80%|███████▉ | 1989/2500 [12:30:28<3:01:48, 21.35s/it] 80%|███████▉ | 1990/2500 [12:30:50<3:03:23, 21.58s/it] {'loss': 0.0003, 'grad_norm': 0.3907731926634687, 'learning_rate': 2.0399999999999997e-07, 'completion_length': 153.30358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0063934326171875, 'epoch': 0.8} + 80%|███████▉ | 1990/2500 [12:30:50<3:03:23, 21.58s/it] 80%|███████▉ | 1991/2500 [12:31:11<3:00:07, 21.23s/it] {'loss': 0.0001, 'grad_norm': 0.3988098835077658, 'learning_rate': 2.036e-07, 'completion_length': 138.7857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.003421783447265625, 'epoch': 0.8} + 80%|███████▉ | 1991/2500 [12:31:11<3:00:07, 21.23s/it] 80%|███████▉ | 1992/2500 [12:31:33<3:00:53, 21.36s/it] {'loss': 0.0002, 'grad_norm': 0.04613814145899962, 'learning_rate': 2.032e-07, 'completion_length': 163.58928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004852294921875, 'epoch': 0.8} + 80%|███████▉ | 1992/2500 [12:31:33<3:00:53, 21.36s/it] 80%|███████▉ | 1993/2500 [12:31:54<3:01:32, 21.48s/it] {'loss': 0.0003, 'grad_norm': 0.5476943389133982, 'learning_rate': 2.028e-07, 'completion_length': 
167.08928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00665283203125, 'epoch': 0.8} + 80%|███████▉ | 1993/2500 [12:31:54<3:01:32, 21.48s/it] 80%|███████▉ | 1994/2500 [12:32:16<3:01:57, 21.58s/it] {'loss': 0.0001, 'grad_norm': 0.027630849194091724, 'learning_rate': 2.0239999999999999e-07, 'completion_length': 155.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0035247802734375, 'epoch': 0.8} + 80%|███████▉ | 1994/2500 [12:32:16<3:01:57, 21.58s/it] 80%|███████▉ | 1995/2500 [12:32:38<3:02:39, 21.70s/it] {'loss': 0.0001, 'grad_norm': 0.19050910890631653, 'learning_rate': 2.02e-07, 'completion_length': 162.3571548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00341033935546875, 'epoch': 0.8} + 80%|███████▉ | 1995/2500 [12:32:38<3:02:39, 21.70s/it] 80%|███████▉ | 1996/2500 [12:32:59<3:00:22, 21.47s/it] {'loss': 0.0002, 'grad_norm': 0.025800861131583055, 'learning_rate': 2.016e-07, 'completion_length': 142.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0056610107421875, 'epoch': 0.8} + 80%|███████▉ | 1996/2500 [12:32:59<3:00:22, 21.47s/it] 80%|███████▉ | 1997/2500 [12:33:20<2:57:57, 21.23s/it] {'loss': 0.0001, 'grad_norm': 0.21705756727608058, 'learning_rate': 2.0119999999999998e-07, 'completion_length': 143.55357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0027923583984375, 'epoch': 0.8} + 80%|███████▉ | 1997/2500 [12:33:20<2:57:57, 21.23s/it] 80%|███████▉ | 1998/2500 [12:33:42<2:59:38, 21.47s/it] {'loss': 0.0001, 'grad_norm': 0.1597813251208638, 'learning_rate': 2.008e-07, 'completion_length': 
158.9464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00339508056640625, 'epoch': 0.8} + 80%|███████▉ | 1998/2500 [12:33:42<2:59:38, 21.47s/it] 80%|███████▉ | 1999/2500 [12:34:04<3:00:31, 21.62s/it] {'loss': 0.0002, 'grad_norm': 0.02512275688954378, 'learning_rate': 2.004e-07, 'completion_length': 165.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048065185546875, 'epoch': 0.8} + 80%|███████▉ | 1999/2500 [12:34:04<3:00:31, 21.62s/it] 80%|████████ | 2000/2500 [12:34:25<2:58:12, 21.38s/it] {'loss': 0.0002, 'grad_norm': 0.37619435712634663, 'learning_rate': 2e-07, 'completion_length': 135.58928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004150390625, 'epoch': 0.8} + 80%|████████ | 2000/2500 [12:34:25<2:58:12, 21.38s/it] 80%|████████ | 2001/2500 [12:37:35<9:59:05, 72.03s/it] {'loss': 0.0001, 'grad_norm': 0.01654587779258813, 'learning_rate': 1.996e-07, 'completion_length': 154.26786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00321197509765625, 'epoch': 0.8} + 80%|████████ | 2001/2500 [12:37:35<9:59:05, 72.03s/it] 80%|████████ | 2002/2500 [12:37:55<7:49:40, 56.59s/it] {'loss': 0.0002, 'grad_norm': 0.028099431490050432, 'learning_rate': 1.9919999999999998e-07, 'completion_length': 154.8928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00487518310546875, 'epoch': 0.8} + 80%|████████ | 2002/2500 [12:37:55<7:49:40, 56.59s/it] 80%|████████ | 2003/2500 [12:38:15<6:17:48, 45.61s/it] {'loss': 0.0001, 'grad_norm': 0.01947853162693151, 'learning_rate': 1.988e-07, 'completion_length': 141.44644165039062, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00354766845703125, 'epoch': 0.8} + 80%|████████ | 2003/2500 [12:38:15<6:17:48, 45.61s/it] 80%|████████ | 2004/2500 [12:38:36<5:15:19, 38.15s/it] {'loss': 0.0002, 'grad_norm': 0.37996286679764746, 'learning_rate': 1.9839999999999998e-07, 'completion_length': 151.58928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00537109375, 'epoch': 0.8} + 80%|████████ | 2004/2500 [12:38:36<5:15:19, 38.15s/it] 80%|████████ | 2005/2500 [12:38:56<4:30:47, 32.82s/it] {'loss': 0.0003, 'grad_norm': 0.9916908790298159, 'learning_rate': 1.98e-07, 'completion_length': 158.9107208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00811767578125, 'epoch': 0.8} + 80%|████████ | 2005/2500 [12:38:56<4:30:47, 32.82s/it] 80%|████████ | 2006/2500 [12:39:18<4:03:17, 29.55s/it] {'loss': 0.0002, 'grad_norm': 0.18874095288714943, 'learning_rate': 1.976e-07, 'completion_length': 149.64286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004241943359375, 'epoch': 0.8} + 80%|████████ | 2006/2500 [12:39:18<4:03:17, 29.55s/it] 80%|████████ | 2007/2500 [12:39:41<3:46:24, 27.56s/it] {'loss': 0.0002, 'grad_norm': 0.05244920239695816, 'learning_rate': 1.9719999999999997e-07, 'completion_length': 178.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00513458251953125, 'epoch': 0.8} + 80%|████████ | 2007/2500 [12:39:41<3:46:24, 27.56s/it] 80%|████████ | 2008/2500 [12:40:02<3:27:58, 25.36s/it] {'loss': 0.0002, 'grad_norm': 0.019175211090173554, 'learning_rate': 1.968e-07, 'completion_length': 146.89286041259766, 
'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0041656494140625, 'epoch': 0.8} + 80%|████████ | 2008/2500 [12:40:02<3:27:58, 25.36s/it] 80%|████████ | 2009/2500 [12:40:23<3:17:56, 24.19s/it] {'loss': 0.0003, 'grad_norm': 0.7303327163609453, 'learning_rate': 1.9639999999999999e-07, 'completion_length': 161.6071548461914, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0065155029296875, 'epoch': 0.8} + 80%|████████ | 2009/2500 [12:40:23<3:17:56, 24.19s/it] 80%|████████ | 2010/2500 [12:40:44<3:10:25, 23.32s/it] {'loss': 0.0002, 'grad_norm': 0.35669959288570985, 'learning_rate': 1.96e-07, 'completion_length': 163.55358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0052490234375, 'epoch': 0.8} + 80%|████████ | 2010/2500 [12:40:44<3:10:25, 23.32s/it] 80%|████████ | 2011/2500 [12:41:04<3:01:40, 22.29s/it] {'loss': 0.0002, 'grad_norm': 0.02038473179721561, 'learning_rate': 1.9559999999999998e-07, 'completion_length': 163.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052947998046875, 'epoch': 0.8} + 80%|████████ | 2011/2500 [12:41:04<3:01:40, 22.29s/it] 80%|████████ | 2012/2500 [12:41:24<2:55:40, 21.60s/it] {'loss': 0.0002, 'grad_norm': 1.5087208875330806, 'learning_rate': 1.952e-07, 'completion_length': 159.64286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0057373046875, 'epoch': 0.8} + 80%|████████ | 2012/2500 [12:41:24<2:55:40, 21.60s/it] 81%|████████ | 2013/2500 [12:41:45<2:54:21, 21.48s/it] {'loss': 0.0001, 'grad_norm': 0.04036425963946446, 'learning_rate': 1.948e-07, 
'completion_length': 161.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0037384033203125, 'epoch': 0.81} + 81%|████████ | 2013/2500 [12:41:45<2:54:21, 21.48s/it] 81%|████████ | 2014/2500 [12:42:05<2:49:10, 20.89s/it] {'loss': 0.0001, 'grad_norm': 0.016997419541886934, 'learning_rate': 1.944e-07, 'completion_length': 145.0357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00316619873046875, 'epoch': 0.81} + 81%|████████ | 2014/2500 [12:42:05<2:49:10, 20.89s/it] 81%|████████ | 2015/2500 [12:42:24<2:43:59, 20.29s/it] {'loss': 0.0001, 'grad_norm': 0.035895245737716255, 'learning_rate': 1.94e-07, 'completion_length': 153.83928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00313568115234375, 'epoch': 0.81} + 81%|████████ | 2015/2500 [12:42:24<2:43:59, 20.29s/it] 81%|████████ | 2016/2500 [12:42:41<2:37:31, 19.53s/it] {'loss': 0.0002, 'grad_norm': 0.02382456982429101, 'learning_rate': 1.9359999999999999e-07, 'completion_length': 147.46429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00402069091796875, 'epoch': 0.81} + 81%|████████ | 2016/2500 [12:42:41<2:37:31, 19.53s/it] 81%|████████ | 2017/2500 [12:43:00<2:34:25, 19.18s/it] {'loss': 0.0003, 'grad_norm': 0.3315546464470334, 'learning_rate': 1.932e-07, 'completion_length': 175.5357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0084228515625, 'epoch': 0.81} + 81%|████████ | 2017/2500 [12:43:00<2:34:25, 19.18s/it] 81%|████████ | 2018/2500 [12:43:18<2:31:31, 18.86s/it] {'loss': 0.0002, 'grad_norm': 0.015504258807118513, 'learning_rate': 1.9279999999999998e-07, 'completion_length': 162.1607208251953, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00501251220703125, 'epoch': 0.81} + 81%|████████ | 2018/2500 [12:43:18<2:31:31, 18.86s/it] 81%|████████ | 2019/2500 [12:43:37<2:31:54, 18.95s/it] {'loss': 0.0002, 'grad_norm': 0.021227053881553867, 'learning_rate': 1.9239999999999998e-07, 'completion_length': 173.92857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0055694580078125, 'epoch': 0.81} + 81%|████████ | 2019/2500 [12:43:37<2:31:54, 18.95s/it] 81%|████████ | 2020/2500 [12:43:56<2:30:51, 18.86s/it] {'loss': 0.0002, 'grad_norm': 0.0208177010839514, 'learning_rate': 1.92e-07, 'completion_length': 171.2678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046844482421875, 'epoch': 0.81} + 81%|████████ | 2020/2500 [12:43:56<2:30:51, 18.86s/it] 81%|████████ | 2021/2500 [12:44:14<2:28:18, 18.58s/it] {'loss': 0.0001, 'grad_norm': 0.013057072129756361, 'learning_rate': 1.916e-07, 'completion_length': 152.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00244140625, 'epoch': 0.81} + 81%|████████ | 2021/2500 [12:44:14<2:28:18, 18.58s/it] 81%|████████ | 2022/2500 [12:44:32<2:28:01, 18.58s/it] {'loss': 0.0002, 'grad_norm': 0.44466474479544016, 'learning_rate': 1.912e-07, 'completion_length': 157.35714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00551605224609375, 'epoch': 0.81} + 81%|████████ | 2022/2500 [12:44:32<2:28:01, 18.58s/it] 81%|████████ | 2023/2500 [12:44:51<2:28:28, 18.68s/it] {'loss': 0.0002, 'grad_norm': 0.32866257591389153, 'learning_rate': 1.908e-07, 'completion_length': 162.33928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 
0.0357142873108387, 'kl': 0.0040435791015625, 'epoch': 0.81} + 81%|████████ | 2023/2500 [12:44:51<2:28:28, 18.68s/it] 81%|████████ | 2024/2500 [12:45:11<2:31:01, 19.04s/it] {'loss': 0.0003, 'grad_norm': 0.02308528608571965, 'learning_rate': 1.904e-07, 'completion_length': 160.35714721679688, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.006378173828125, 'epoch': 0.81} + 81%|████████ | 2024/2500 [12:45:11<2:31:01, 19.04s/it] 81%|████████ | 2025/2500 [12:45:30<2:31:13, 19.10s/it] {'loss': 0.0002, 'grad_norm': 0.2961499608916158, 'learning_rate': 1.8999999999999998e-07, 'completion_length': 162.42858123779297, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.07695359364151955, 'kl': 0.0062103271484375, 'epoch': 0.81} + 81%|████████ | 2025/2500 [12:45:30<2:31:13, 19.10s/it] 81%|████████ | 2026/2500 [12:45:50<2:31:19, 19.16s/it] {'loss': 0.0002, 'grad_norm': 0.6124655534590263, 'learning_rate': 1.8959999999999998e-07, 'completion_length': 161.6964340209961, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.0062255859375, 'epoch': 0.81} + 81%|████████ | 2026/2500 [12:45:50<2:31:19, 19.16s/it] 81%|████████ | 2027/2500 [12:46:09<2:30:40, 19.11s/it] {'loss': 0.0001, 'grad_norm': 0.016431765637125912, 'learning_rate': 1.892e-07, 'completion_length': 155.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.002445220947265625, 'epoch': 0.81} + 81%|████████ | 2027/2500 [12:46:09<2:30:40, 19.11s/it] 81%|████████ | 2028/2500 [12:46:27<2:27:59, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.03349476747553219, 'learning_rate': 1.888e-07, 'completion_length': 154.58928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.00457763671875, 'epoch': 0.81} + 81%|████████ | 2028/2500 [12:46:27<2:27:59, 18.81s/it] 81%|████████ | 2029/2500 [12:46:45<2:26:29, 18.66s/it] {'loss': 0.0002, 'grad_norm': 0.3390185208034192, 'learning_rate': 1.884e-07, 'completion_length': 141.64286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00424957275390625, 'epoch': 0.81} + 81%|████████ | 2029/2500 [12:46:45<2:26:29, 18.66s/it] 81%|████████ | 2030/2500 [12:47:04<2:26:38, 18.72s/it] {'loss': 0.0002, 'grad_norm': 0.023910446189214975, 'learning_rate': 1.88e-07, 'completion_length': 153.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00567626953125, 'epoch': 0.81} + 81%|████████ | 2030/2500 [12:47:04<2:26:38, 18.72s/it] 81%|████████ | 2031/2500 [12:47:23<2:26:26, 18.74s/it] {'loss': 0.0002, 'grad_norm': 1.6437689702236848, 'learning_rate': 1.8759999999999999e-07, 'completion_length': 147.1607208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.005157470703125, 'epoch': 0.81} + 81%|████████ | 2031/2500 [12:47:23<2:26:26, 18.74s/it] 81%|████████▏ | 2032/2500 [12:47:42<2:26:29, 18.78s/it] {'loss': 0.0003, 'grad_norm': 0.01858320030038651, 'learning_rate': 1.872e-07, 'completion_length': 151.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006988525390625, 'epoch': 0.81} + 81%|████████▏ | 2032/2500 [12:47:42<2:26:29, 18.78s/it] 81%|████████▏ | 2033/2500 [12:47:59<2:24:07, 18.52s/it] {'loss': 0.0002, 'grad_norm': 0.17545211028002164, 'learning_rate': 1.8679999999999998e-07, 'completion_length': 152.60714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 
0.0357142873108387, 'kl': 0.00428009033203125, 'epoch': 0.81} + 81%|████████▏ | 2033/2500 [12:47:59<2:24:07, 18.52s/it] 81%|████████▏ | 2034/2500 [12:48:18<2:23:18, 18.45s/it] {'loss': 0.0002, 'grad_norm': 2.803993621131053, 'learning_rate': 1.864e-07, 'completion_length': 162.2857208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.005462646484375, 'epoch': 0.81} + 81%|████████▏ | 2034/2500 [12:48:18<2:23:18, 18.45s/it] 81%|████████▏ | 2035/2500 [12:48:36<2:22:40, 18.41s/it] {'loss': 0.0002, 'grad_norm': 0.021218989200345553, 'learning_rate': 1.86e-07, 'completion_length': 146.7321548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005401611328125, 'epoch': 0.81} + 81%|████████▏ | 2035/2500 [12:48:36<2:22:40, 18.41s/it] 81%|████████▏ | 2036/2500 [12:48:55<2:24:22, 18.67s/it] {'loss': 0.0001, 'grad_norm': 0.036687278272194576, 'learning_rate': 1.8559999999999997e-07, 'completion_length': 152.4464340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.003082275390625, 'epoch': 0.81} + 81%|████████▏ | 2036/2500 [12:48:55<2:24:22, 18.67s/it] 81%|████████▏ | 2037/2500 [12:49:15<2:25:35, 18.87s/it] {'loss': 0.0002, 'grad_norm': 0.03143366749028759, 'learning_rate': 1.852e-07, 'completion_length': 154.5178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054168701171875, 'epoch': 0.81} + 81%|████████▏ | 2037/2500 [12:49:15<2:25:35, 18.87s/it] 82%|████████▏ | 2038/2500 [12:49:34<2:25:44, 18.93s/it] {'loss': 0.0003, 'grad_norm': 0.34650322425877716, 'learning_rate': 1.848e-07, 'completion_length': 147.2857208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 
0.04123930633068085, 'kl': 0.0074005126953125, 'epoch': 0.82} + 82%|████████▏ | 2038/2500 [12:49:34<2:25:44, 18.93s/it] 82%|████████▏ | 2039/2500 [12:49:52<2:24:29, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.019829850269071116, 'learning_rate': 1.844e-07, 'completion_length': 158.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004608154296875, 'epoch': 0.82} + 82%|████████▏ | 2039/2500 [12:49:52<2:24:29, 18.81s/it] 82%|████████▏ | 2040/2500 [12:50:11<2:25:13, 18.94s/it] {'loss': 0.0002, 'grad_norm': 0.6377737142452017, 'learning_rate': 1.8399999999999998e-07, 'completion_length': 152.73214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00466156005859375, 'epoch': 0.82} + 82%|████████▏ | 2040/2500 [12:50:11<2:25:13, 18.94s/it] 82%|████████▏ | 2041/2500 [12:50:30<2:23:13, 18.72s/it] {'loss': 0.0002, 'grad_norm': 0.2905962086531071, 'learning_rate': 1.836e-07, 'completion_length': 149.96429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.004669189453125, 'epoch': 0.82} + 82%|████████▏ | 2041/2500 [12:50:30<2:23:13, 18.72s/it] 82%|████████▏ | 2042/2500 [12:50:49<2:23:49, 18.84s/it] {'loss': 0.0001, 'grad_norm': 0.017131016761328707, 'learning_rate': 1.832e-07, 'completion_length': 158.39286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00328826904296875, 'epoch': 0.82} + 82%|████████▏ | 2042/2500 [12:50:49<2:23:49, 18.84s/it] 82%|████████▏ | 2043/2500 [12:51:08<2:23:14, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.024544776723747858, 'learning_rate': 1.8279999999999997e-07, 'completion_length': 164.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.00435638427734375, 'epoch': 0.82} + 82%|████████▏ | 2043/2500 [12:51:08<2:23:14, 18.81s/it] 82%|████████▏ | 2044/2500 [12:51:26<2:23:01, 18.82s/it] {'loss': 0.0002, 'grad_norm': 0.5059115287023354, 'learning_rate': 1.824e-07, 'completion_length': 155.00000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00382232666015625, 'epoch': 0.82} + 82%|████████▏ | 2044/2500 [12:51:26<2:23:01, 18.82s/it] 82%|████████▏ | 2045/2500 [12:51:45<2:22:20, 18.77s/it] {'loss': 0.0002, 'grad_norm': 0.02686040810579285, 'learning_rate': 1.82e-07, 'completion_length': 150.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004669189453125, 'epoch': 0.82} + 82%|████████▏ | 2045/2500 [12:51:45<2:22:20, 18.77s/it] 82%|████████▏ | 2046/2500 [12:52:04<2:23:00, 18.90s/it] {'loss': 0.0002, 'grad_norm': 0.27994496311163336, 'learning_rate': 1.816e-07, 'completion_length': 162.92857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0055999755859375, 'epoch': 0.82} + 82%|████████▏ | 2046/2500 [12:52:04<2:23:00, 18.90s/it] 82%|████████▏ | 2047/2500 [12:52:24<2:24:07, 19.09s/it] {'loss': 0.0002, 'grad_norm': 0.02073041287110919, 'learning_rate': 1.8119999999999998e-07, 'completion_length': 166.4821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045166015625, 'epoch': 0.82} + 82%|████████▏ | 2047/2500 [12:52:24<2:24:07, 19.09s/it] 82%|████████▏ | 2048/2500 [12:52:43<2:22:59, 18.98s/it] {'loss': 0.0002, 'grad_norm': 0.023869296124398677, 'learning_rate': 1.8079999999999998e-07, 'completion_length': 152.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0044403076171875, 'epoch': 
0.82} + 82%|████████▏ | 2048/2500 [12:52:43<2:22:59, 18.98s/it] 82%|████████▏ | 2049/2500 [12:53:02<2:24:07, 19.17s/it] {'loss': 0.0002, 'grad_norm': 0.41774632149092594, 'learning_rate': 1.804e-07, 'completion_length': 161.60714721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0055999755859375, 'epoch': 0.82} + 82%|████████▏ | 2049/2500 [12:53:02<2:24:07, 19.17s/it] 82%|████████▏ | 2050/2500 [12:53:22<2:24:47, 19.30s/it] {'loss': 0.0002, 'grad_norm': 0.030469970761090495, 'learning_rate': 1.8e-07, 'completion_length': 153.7321548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.004638671875, 'epoch': 0.82} + 82%|████████▏ | 2050/2500 [12:53:22<2:24:47, 19.30s/it] 82%|████████▏ | 2051/2500 [12:53:41<2:24:09, 19.26s/it] {'loss': 0.0002, 'grad_norm': 0.36866467066513914, 'learning_rate': 1.796e-07, 'completion_length': 158.62500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050506591796875, 'epoch': 0.82} + 82%|████████▏ | 2051/2500 [12:53:41<2:24:09, 19.26s/it] 82%|████████▏ | 2052/2500 [12:54:01<2:26:15, 19.59s/it] {'loss': 0.0002, 'grad_norm': 0.22922787977679557, 'learning_rate': 1.792e-07, 'completion_length': 156.35714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0040435791015625, 'epoch': 0.82} + 82%|████████▏ | 2052/2500 [12:54:01<2:26:15, 19.59s/it] 82%|████████▏ | 2053/2500 [12:54:23<2:31:31, 20.34s/it] {'loss': 0.0002, 'grad_norm': 0.22844071319662937, 'learning_rate': 1.7879999999999999e-07, 'completion_length': 166.7678680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 
1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0049285888671875, 'epoch': 0.82} + 82%|████████▏ | 2053/2500 [12:54:23<2:31:31, 20.34s/it] 82%|████████▏ | 2054/2500 [12:54:44<2:31:44, 20.41s/it] {'loss': 0.0001, 'grad_norm': 0.022695135625867342, 'learning_rate': 1.7839999999999998e-07, 'completion_length': 150.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00334930419921875, 'epoch': 0.82} + 82%|████████▏ | 2054/2500 [12:54:44<2:31:44, 20.41s/it] 82%|████████▏ | 2055/2500 [12:55:03<2:28:06, 19.97s/it] {'loss': 0.0002, 'grad_norm': 0.02090401696932743, 'learning_rate': 1.7799999999999998e-07, 'completion_length': 150.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00441741943359375, 'epoch': 0.82} + 82%|████████▏ | 2055/2500 [12:55:03<2:28:06, 19.97s/it] 82%|████████▏ | 2056/2500 [12:55:21<2:23:55, 19.45s/it] {'loss': 0.0001, 'grad_norm': 1.1928095890408716, 'learning_rate': 1.776e-07, 'completion_length': 145.42857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0034942626953125, 'epoch': 0.82} + 82%|████████▏ | 2056/2500 [12:55:21<2:23:55, 19.45s/it] 82%|████████▏ | 2057/2500 [12:55:39<2:19:37, 18.91s/it] {'loss': 0.0002, 'grad_norm': 0.027026287661001583, 'learning_rate': 1.772e-07, 'completion_length': 152.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051422119140625, 'epoch': 0.82} + 82%|████████▏ | 2057/2500 [12:55:39<2:19:37, 18.91s/it] 82%|████████▏ | 2058/2500 [12:55:58<2:19:39, 18.96s/it] {'loss': 0.0002, 'grad_norm': 0.23445126620844078, 'learning_rate': 1.768e-07, 'completion_length': 166.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 
'reward_std': 0.0357142873108387, 'kl': 0.00579833984375, 'epoch': 0.82} + 82%|████████▏ | 2058/2500 [12:55:58<2:19:39, 18.96s/it] 82%|████████▏ | 2059/2500 [12:56:17<2:19:10, 18.94s/it] {'loss': 0.0002, 'grad_norm': 0.3754778308794995, 'learning_rate': 1.764e-07, 'completion_length': 172.08929443359375, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.0050811767578125, 'epoch': 0.82} + 82%|████████▏ | 2059/2500 [12:56:17<2:19:10, 18.94s/it] 82%|████████▏ | 2060/2500 [12:56:36<2:19:54, 19.08s/it] {'loss': 0.0002, 'grad_norm': 0.19151097533825823, 'learning_rate': 1.76e-07, 'completion_length': 154.6428680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.004547119140625, 'epoch': 0.82} + 82%|████████▏ | 2060/2500 [12:56:36<2:19:54, 19.08s/it] 82%|████████▏ | 2061/2500 [12:56:55<2:19:05, 19.01s/it] {'loss': 0.0002, 'grad_norm': 0.1916908689944388, 'learning_rate': 1.756e-07, 'completion_length': 144.4107208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.004974365234375, 'epoch': 0.82} + 82%|████████▏ | 2061/2500 [12:56:55<2:19:05, 19.01s/it] 82%|████████▏ | 2062/2500 [12:57:14<2:17:59, 18.90s/it] {'loss': 0.0002, 'grad_norm': 0.34838025245248605, 'learning_rate': 1.7519999999999998e-07, 'completion_length': 156.39286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0048370361328125, 'epoch': 0.82} + 82%|████████▏ | 2062/2500 [12:57:14<2:17:59, 18.90s/it] 83%|████████▎ | 2063/2500 [12:57:33<2:18:52, 19.07s/it] {'loss': 0.0002, 'grad_norm': 0.3783928661393937, 'learning_rate': 1.748e-07, 'completion_length': 149.1428680419922, 
'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0046844482421875, 'epoch': 0.83} + 83%|████████▎ | 2063/2500 [12:57:33<2:18:52, 19.07s/it] 83%|████████▎ | 2064/2500 [12:57:51<2:15:55, 18.71s/it] {'loss': 0.0002, 'grad_norm': 0.028855791828695333, 'learning_rate': 1.744e-07, 'completion_length': 158.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005645751953125, 'epoch': 0.83} + 83%|████████▎ | 2064/2500 [12:57:51<2:15:55, 18.71s/it] 83%|████████▎ | 2065/2500 [12:58:11<2:18:06, 19.05s/it] {'loss': 0.0002, 'grad_norm': 0.45637567093275505, 'learning_rate': 1.7399999999999997e-07, 'completion_length': 164.4107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00403594970703125, 'epoch': 0.83} + 83%|████████▎ | 2065/2500 [12:58:11<2:18:06, 19.05s/it] 83%|████████▎ | 2066/2500 [12:58:29<2:15:46, 18.77s/it] {'loss': 0.0003, 'grad_norm': 0.031797323620448965, 'learning_rate': 1.736e-07, 'completion_length': 156.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00653076171875, 'epoch': 0.83} + 83%|████████▎ | 2066/2500 [12:58:29<2:15:46, 18.77s/it] 83%|████████▎ | 2067/2500 [12:58:47<2:13:02, 18.43s/it] {'loss': 0.0001, 'grad_norm': 0.014953767550867482, 'learning_rate': 1.732e-07, 'completion_length': 153.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00327301025390625, 'epoch': 0.83} + 83%|████████▎ | 2067/2500 [12:58:47<2:13:02, 18.43s/it] 83%|████████▎ | 2068/2500 [12:59:05<2:13:04, 18.48s/it] {'loss': 0.0002, 'grad_norm': 0.02971729667907337, 'learning_rate': 1.728e-07, 'completion_length': 159.6964340209961, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00616455078125, 'epoch': 0.83} + 83%|████████▎ | 2068/2500 [12:59:05<2:13:04, 18.48s/it] 83%|████████▎ | 2069/2500 [12:59:23<2:11:22, 18.29s/it] {'loss': 0.0002, 'grad_norm': 0.3661624486728554, 'learning_rate': 1.7239999999999998e-07, 'completion_length': 146.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004638671875, 'epoch': 0.83} + 83%|████████▎ | 2069/2500 [12:59:23<2:11:22, 18.29s/it] 83%|████████▎ | 2070/2500 [12:59:41<2:10:38, 18.23s/it] {'loss': 0.0002, 'grad_norm': 0.5782628260965965, 'learning_rate': 1.7199999999999998e-07, 'completion_length': 150.00000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0045166015625, 'epoch': 0.83} + 83%|████████▎ | 2070/2500 [12:59:41<2:10:38, 18.23s/it] 83%|████████▎ | 2071/2500 [12:59:59<2:10:40, 18.28s/it] {'loss': 0.0001, 'grad_norm': 0.5230861108170441, 'learning_rate': 1.716e-07, 'completion_length': 145.6607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.003570556640625, 'epoch': 0.83} + 83%|████████▎ | 2071/2500 [12:59:59<2:10:40, 18.28s/it] 83%|████████▎ | 2072/2500 [13:00:18<2:10:40, 18.32s/it] {'loss': 0.0002, 'grad_norm': 0.037297272671214174, 'learning_rate': 1.7119999999999997e-07, 'completion_length': 155.89286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0050201416015625, 'epoch': 0.83} + 83%|████████▎ | 2072/2500 [13:00:18<2:10:40, 18.32s/it] 83%|████████▎ | 2073/2500 [13:00:36<2:10:48, 18.38s/it] {'loss': 0.0001, 'grad_norm': 0.3053641255438528, 'learning_rate': 1.708e-07, 'completion_length': 
160.6607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00372314453125, 'epoch': 0.83} + 83%|████████▎ | 2073/2500 [13:00:36<2:10:48, 18.38s/it] 83%|████████▎ | 2074/2500 [13:00:56<2:12:19, 18.64s/it] {'loss': 0.0002, 'grad_norm': 0.27368131444127997, 'learning_rate': 1.704e-07, 'completion_length': 167.69644165039062, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00574493408203125, 'epoch': 0.83} + 83%|████████▎ | 2074/2500 [13:00:56<2:12:19, 18.64s/it] 83%|████████▎ | 2075/2500 [13:01:14<2:12:26, 18.70s/it] {'loss': 0.0002, 'grad_norm': 0.29459124897795563, 'learning_rate': 1.7000000000000001e-07, 'completion_length': 161.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0037841796875, 'epoch': 0.83} + 83%|████████▎ | 2075/2500 [13:01:14<2:12:26, 18.70s/it] 83%|████████▎ | 2076/2500 [13:01:32<2:09:38, 18.35s/it] {'loss': 0.0001, 'grad_norm': 0.05261644219080523, 'learning_rate': 1.6959999999999998e-07, 'completion_length': 147.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.002685546875, 'epoch': 0.83} + 83%|████████▎ | 2076/2500 [13:01:32<2:09:38, 18.35s/it] 83%|████████▎ | 2077/2500 [13:01:51<2:09:43, 18.40s/it] {'loss': 0.0001, 'grad_norm': 0.015199384942329598, 'learning_rate': 1.6919999999999998e-07, 'completion_length': 152.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00341796875, 'epoch': 0.83} + 83%|████████▎ | 2077/2500 [13:01:51<2:09:43, 18.40s/it] 83%|████████▎ | 2078/2500 [13:02:09<2:10:07, 18.50s/it] {'loss': 0.0001, 'grad_norm': 0.01383598615785065, 'learning_rate': 1.688e-07, 
'completion_length': 144.87500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00293731689453125, 'epoch': 0.83} + 83%|████████▎ | 2078/2500 [13:02:09<2:10:07, 18.50s/it] 83%|████████▎ | 2079/2500 [13:02:28<2:11:03, 18.68s/it] {'loss': 0.0001, 'grad_norm': 1.598456049931896, 'learning_rate': 1.684e-07, 'completion_length': 157.48214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00365447998046875, 'epoch': 0.83} + 83%|████████▎ | 2079/2500 [13:02:28<2:11:03, 18.68s/it] 83%|████████▎ | 2080/2500 [13:02:47<2:09:36, 18.52s/it] {'loss': 0.0001, 'grad_norm': 0.13367868275655154, 'learning_rate': 1.68e-07, 'completion_length': 155.8928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003631591796875, 'epoch': 0.83} + 83%|████████▎ | 2080/2500 [13:02:47<2:09:36, 18.52s/it] 83%|████████▎ | 2081/2500 [13:03:05<2:10:09, 18.64s/it] {'loss': 0.0002, 'grad_norm': 0.26725798094346, 'learning_rate': 1.676e-07, 'completion_length': 172.92858123779297, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0055694580078125, 'epoch': 0.83} + 83%|████████▎ | 2081/2500 [13:03:05<2:10:09, 18.64s/it] 83%|████████▎ | 2082/2500 [13:03:25<2:10:45, 18.77s/it] {'loss': 0.0002, 'grad_norm': 0.02389952068054117, 'learning_rate': 1.672e-07, 'completion_length': 153.21429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0042572021484375, 'epoch': 0.83} + 83%|████████▎ | 2082/2500 [13:03:25<2:10:45, 18.77s/it] 83%|████████▎ | 2083/2500 [13:03:45<2:14:16, 19.32s/it] {'loss': 0.0002, 'grad_norm': 0.0712169720700416, 'learning_rate': 1.6679999999999998e-07, 'completion_length': 149.89286041259766, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00505828857421875, 'epoch': 0.83} + 83%|████████▎ | 2083/2500 [13:03:45<2:14:16, 19.32s/it] 83%|████████▎ | 2084/2500 [13:04:03<2:11:20, 18.94s/it] {'loss': 0.0002, 'grad_norm': 0.012227811588103859, 'learning_rate': 1.6639999999999998e-07, 'completion_length': 156.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00414276123046875, 'epoch': 0.83} + 83%|████████▎ | 2084/2500 [13:04:03<2:11:20, 18.94s/it] 83%|████████▎ | 2085/2500 [13:04:21<2:08:04, 18.52s/it] {'loss': 0.0001, 'grad_norm': 0.016451417326514922, 'learning_rate': 1.66e-07, 'completion_length': 148.30358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00262451171875, 'epoch': 0.83} + 83%|████████▎ | 2085/2500 [13:04:21<2:08:04, 18.52s/it] 83%|████████▎ | 2086/2500 [13:04:41<2:11:20, 19.04s/it] {'loss': 0.0002, 'grad_norm': 0.17774856155421387, 'learning_rate': 1.656e-07, 'completion_length': 164.1071548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00421142578125, 'epoch': 0.83} + 83%|████████▎ | 2086/2500 [13:04:41<2:11:20, 19.04s/it] 83%|████████▎ | 2087/2500 [13:05:00<2:11:58, 19.17s/it] {'loss': 0.0002, 'grad_norm': 0.06788740340892159, 'learning_rate': 1.652e-07, 'completion_length': 165.6428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006195068359375, 'epoch': 0.83} + 83%|████████▎ | 2087/2500 [13:05:00<2:11:58, 19.17s/it] 84%|████████▎ | 2088/2500 [13:05:20<2:11:54, 19.21s/it] {'loss': 0.0002, 'grad_norm': 0.23506413673312418, 'learning_rate': 1.648e-07, 'completion_length': 161.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 
'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0062103271484375, 'epoch': 0.84} + 84%|████████▎ | 2088/2500 [13:05:20<2:11:54, 19.21s/it] 84%|████████▎ | 2089/2500 [13:05:38<2:08:44, 18.79s/it] {'loss': 0.0001, 'grad_norm': 0.022950257094518527, 'learning_rate': 1.644e-07, 'completion_length': 151.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00372314453125, 'epoch': 0.84} + 84%|████████▎ | 2089/2500 [13:05:38<2:08:44, 18.79s/it] 84%|████████▎ | 2090/2500 [13:05:57<2:09:51, 19.00s/it] {'loss': 0.0002, 'grad_norm': 0.02782505073307675, 'learning_rate': 1.64e-07, 'completion_length': 156.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00408172607421875, 'epoch': 0.84} + 84%|████████▎ | 2090/2500 [13:05:57<2:09:51, 19.00s/it] 84%|████████▎ | 2091/2500 [13:06:16<2:09:41, 19.03s/it] {'loss': 0.0003, 'grad_norm': 0.34326371222079227, 'learning_rate': 1.6359999999999998e-07, 'completion_length': 170.6964340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.006378173828125, 'epoch': 0.84} + 84%|████████▎ | 2091/2500 [13:06:16<2:09:41, 19.03s/it] 84%|████████▎ | 2092/2500 [13:06:35<2:09:16, 19.01s/it] {'loss': 0.0003, 'grad_norm': 0.017900552672158704, 'learning_rate': 1.632e-07, 'completion_length': 170.7678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006561279296875, 'epoch': 0.84} + 84%|████████▎ | 2092/2500 [13:06:35<2:09:16, 19.01s/it] 84%|████████▎ | 2093/2500 [13:06:55<2:11:21, 19.36s/it] {'loss': 0.0001, 'grad_norm': 0.015873796174215235, 'learning_rate': 1.628e-07, 'completion_length': 167.57144165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 
0.0, 'kl': 0.00359344482421875, 'epoch': 0.84} + 84%|████████▎ | 2093/2500 [13:06:55<2:11:21, 19.36s/it] 84%|████████▍ | 2094/2500 [13:07:13<2:07:56, 18.91s/it] {'loss': 0.0001, 'grad_norm': 0.018407839646955765, 'learning_rate': 1.6239999999999997e-07, 'completion_length': 147.51786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.003520965576171875, 'epoch': 0.84} + 84%|████████▍ | 2094/2500 [13:07:13<2:07:56, 18.91s/it] 84%|████████▍ | 2095/2500 [13:07:31<2:05:37, 18.61s/it] {'loss': 0.0002, 'grad_norm': 0.021643781524059758, 'learning_rate': 1.62e-07, 'completion_length': 150.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00415802001953125, 'epoch': 0.84} + 84%|████████▍ | 2095/2500 [13:07:31<2:05:37, 18.61s/it] 84%|████████▍ | 2096/2500 [13:07:50<2:06:55, 18.85s/it] {'loss': 0.0002, 'grad_norm': 0.341800621158568, 'learning_rate': 1.616e-07, 'completion_length': 164.21429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0059967041015625, 'epoch': 0.84} + 84%|████████▍ | 2096/2500 [13:07:50<2:06:55, 18.85s/it] 84%|████████▍ | 2097/2500 [13:08:09<2:06:48, 18.88s/it] {'loss': 0.0002, 'grad_norm': 0.03811296299842578, 'learning_rate': 1.6120000000000001e-07, 'completion_length': 157.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003875732421875, 'epoch': 0.84} + 84%|████████▍ | 2097/2500 [13:08:09<2:06:48, 18.88s/it] 84%|████████▍ | 2098/2500 [13:08:29<2:08:42, 19.21s/it] {'loss': 0.0002, 'grad_norm': 0.018518057472871555, 'learning_rate': 1.6079999999999998e-07, 'completion_length': 154.21429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 
'reward_std': 0.0, 'kl': 0.0055999755859375, 'epoch': 0.84} + 84%|████████▍ | 2098/2500 [13:08:29<2:08:42, 19.21s/it] 84%|████████▍ | 2099/2500 [13:08:48<2:07:44, 19.11s/it] {'loss': 0.0003, 'grad_norm': 6.982838758719167, 'learning_rate': 1.6039999999999998e-07, 'completion_length': 160.6607208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.12371791899204254, 'kl': 0.008392333984375, 'epoch': 0.84} + 84%|████████▍ | 2099/2500 [13:08:48<2:07:44, 19.11s/it] 84%|████████▍ | 2100/2500 [13:09:07<2:05:42, 18.86s/it] {'loss': 0.0002, 'grad_norm': 1.4784041629957019, 'learning_rate': 1.6e-07, 'completion_length': 150.3571548461914, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.005462646484375, 'epoch': 0.84} + 84%|████████▍ | 2100/2500 [13:09:07<2:05:42, 18.86s/it] 84%|████████▍ | 2101/2500 [13:12:10<7:33:43, 68.23s/it] {'loss': 0.0002, 'grad_norm': 0.20984014735825623, 'learning_rate': 1.5959999999999997e-07, 'completion_length': 166.73214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005157470703125, 'epoch': 0.84} + 84%|████████▍ | 2101/2500 [13:12:10<7:33:43, 68.23s/it] 84%|████████▍ | 2102/2500 [13:12:29<5:55:23, 53.58s/it] {'loss': 0.0002, 'grad_norm': 0.6683834957884954, 'learning_rate': 1.592e-07, 'completion_length': 168.30357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0047760009765625, 'epoch': 0.84} + 84%|████████▍ | 2102/2500 [13:12:29<5:55:23, 53.58s/it] 84%|████████▍ | 2103/2500 [13:12:49<4:47:39, 43.47s/it] {'loss': 0.0002, 'grad_norm': 0.019659192704044468, 'learning_rate': 1.588e-07, 'completion_length': 163.0357208251953, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004364013671875, 'epoch': 0.84} + 84%|████████▍ | 2103/2500 [13:12:49<4:47:39, 43.47s/it] 84%|████████▍ | 2104/2500 [13:13:09<3:59:14, 36.25s/it] {'loss': 0.0002, 'grad_norm': 0.02331885374243402, 'learning_rate': 1.5840000000000002e-07, 'completion_length': 165.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00458526611328125, 'epoch': 0.84} + 84%|████████▍ | 2104/2500 [13:13:09<3:59:14, 36.25s/it] 84%|████████▍ | 2105/2500 [13:13:26<3:21:53, 30.67s/it] {'loss': 0.0002, 'grad_norm': 0.7010812361610826, 'learning_rate': 1.5799999999999999e-07, 'completion_length': 153.9107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0054473876953125, 'epoch': 0.84} + 84%|████████▍ | 2105/2500 [13:13:26<3:21:53, 30.67s/it] 84%|████████▍ | 2106/2500 [13:13:45<2:57:10, 26.98s/it] {'loss': 0.0001, 'grad_norm': 0.014358174928037936, 'learning_rate': 1.5759999999999998e-07, 'completion_length': 153.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0036468505859375, 'epoch': 0.84} + 84%|████████▍ | 2106/2500 [13:13:45<2:57:10, 26.98s/it] 84%|████████▍ | 2107/2500 [13:14:06<2:45:22, 25.25s/it] {'loss': 0.0002, 'grad_norm': 0.026529169860109596, 'learning_rate': 1.572e-07, 'completion_length': 166.4107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00435638427734375, 'epoch': 0.84} + 84%|████████▍ | 2107/2500 [13:14:06<2:45:22, 25.25s/it] 84%|████████▍ | 2108/2500 [13:14:23<2:29:28, 22.88s/it] {'loss': 0.0001, 'grad_norm': 0.026814891123243097, 'learning_rate': 1.5679999999999997e-07, 'completion_length': 143.21429443359375, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0028839111328125, 'epoch': 0.84} + 84%|████████▍ | 2108/2500 [13:14:23<2:29:28, 22.88s/it] 84%|████████▍ | 2109/2500 [13:14:42<2:20:09, 21.51s/it] {'loss': 0.0001, 'grad_norm': 0.0184005790898102, 'learning_rate': 1.564e-07, 'completion_length': 151.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00296783447265625, 'epoch': 0.84} + 84%|████████▍ | 2109/2500 [13:14:42<2:20:09, 21.51s/it] 84%|████████▍ | 2110/2500 [13:14:59<2:11:39, 20.26s/it] {'loss': 0.0002, 'grad_norm': 0.2522993238265654, 'learning_rate': 1.56e-07, 'completion_length': 152.25000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004730224609375, 'epoch': 0.84} + 84%|████████▍ | 2110/2500 [13:14:59<2:11:39, 20.26s/it] 84%|████████▍ | 2111/2500 [13:15:17<2:07:00, 19.59s/it] {'loss': 0.0001, 'grad_norm': 0.008893452835362557, 'learning_rate': 1.556e-07, 'completion_length': 143.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00174713134765625, 'epoch': 0.84} + 84%|████████▍ | 2111/2500 [13:15:17<2:07:00, 19.59s/it] 84%|████████▍ | 2112/2500 [13:15:35<2:02:52, 19.00s/it] {'loss': 0.0002, 'grad_norm': 0.20557568033224355, 'learning_rate': 1.552e-07, 'completion_length': 141.23214721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0040435791015625, 'epoch': 0.84} + 84%|████████▍ | 2112/2500 [13:15:35<2:02:52, 19.00s/it] 85%|████████▍ | 2113/2500 [13:15:53<2:02:15, 18.96s/it] {'loss': 0.0002, 'grad_norm': 0.21130342163064048, 'learning_rate': 1.5479999999999998e-07, 'completion_length': 169.3214340209961, 'rewards/accuracy_reward': 
0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00428009033203125, 'epoch': 0.85} + 85%|████████▍ | 2113/2500 [13:15:53<2:02:15, 18.96s/it] 85%|████████▍ | 2114/2500 [13:16:12<2:01:21, 18.86s/it] {'loss': 0.0002, 'grad_norm': 0.2261282662524638, 'learning_rate': 1.544e-07, 'completion_length': 172.5357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005035400390625, 'epoch': 0.85} + 85%|████████▍ | 2114/2500 [13:16:12<2:01:21, 18.86s/it] 85%|████████▍ | 2115/2500 [13:16:29<1:57:37, 18.33s/it] {'loss': 0.0001, 'grad_norm': 0.021053361657078516, 'learning_rate': 1.54e-07, 'completion_length': 139.1071548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0035247802734375, 'epoch': 0.85} + 85%|████████▍ | 2115/2500 [13:16:29<1:57:37, 18.33s/it] 85%|████████▍ | 2116/2500 [13:16:47<1:56:58, 18.28s/it] {'loss': 0.0001, 'grad_norm': 0.3678627931102728, 'learning_rate': 1.5359999999999997e-07, 'completion_length': 149.48214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.002349853515625, 'epoch': 0.85} + 85%|████████▍ | 2116/2500 [13:16:47<1:56:58, 18.28s/it] 85%|████████▍ | 2117/2500 [13:17:06<1:57:09, 18.35s/it] {'loss': 0.0001, 'grad_norm': 0.01744503120840002, 'learning_rate': 1.532e-07, 'completion_length': 159.67857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00345611572265625, 'epoch': 0.85} + 85%|████████▍ | 2117/2500 [13:17:06<1:57:09, 18.35s/it] 85%|████████▍ | 2118/2500 [13:17:23<1:55:27, 18.13s/it] {'loss': 0.0003, 'grad_norm': 0.040337565841155665, 'learning_rate': 1.528e-07, 'completion_length': 147.51786041259766, 
'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0072174072265625, 'epoch': 0.85} + 85%|████████▍ | 2118/2500 [13:17:23<1:55:27, 18.13s/it] 85%|████████▍ | 2119/2500 [13:17:42<1:56:51, 18.40s/it] {'loss': 0.0002, 'grad_norm': 0.022028789774460895, 'learning_rate': 1.524e-07, 'completion_length': 166.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0044097900390625, 'epoch': 0.85} + 85%|████████▍ | 2119/2500 [13:17:42<1:56:51, 18.40s/it] 85%|████████▍ | 2120/2500 [13:18:00<1:53:58, 18.00s/it] {'loss': 0.0002, 'grad_norm': 3.1159297648856557, 'learning_rate': 1.5199999999999998e-07, 'completion_length': 151.73215103149414, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0047607421875, 'epoch': 0.85} + 85%|████████▍ | 2120/2500 [13:18:00<1:53:58, 18.00s/it] 85%|████████▍ | 2121/2500 [13:18:16<1:51:22, 17.63s/it] {'loss': 0.0001, 'grad_norm': 0.025532260396557076, 'learning_rate': 1.516e-07, 'completion_length': 135.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0027008056640625, 'epoch': 0.85} + 85%|████████▍ | 2121/2500 [13:18:16<1:51:22, 17.63s/it] 85%|████████▍ | 2122/2500 [13:18:35<1:52:49, 17.91s/it] {'loss': 0.0002, 'grad_norm': 0.018319147713263654, 'learning_rate': 1.512e-07, 'completion_length': 145.57144165039062, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004276275634765625, 'epoch': 0.85} + 85%|████████▍ | 2122/2500 [13:18:35<1:52:49, 17.91s/it] 85%|████████▍ | 2123/2500 [13:18:53<1:52:29, 17.90s/it] {'loss': 0.0001, 'grad_norm': 0.3360871753439912, 'learning_rate': 1.5079999999999997e-07, 'completion_length': 153.46429443359375, 'rewards/accuracy_reward': 
0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00333404541015625, 'epoch': 0.85} + 85%|████████▍ | 2123/2500 [13:18:53<1:52:29, 17.90s/it] 85%|████████▍ | 2124/2500 [13:19:10<1:51:31, 17.80s/it] {'loss': 0.0002, 'grad_norm': 1.7833263368830867, 'learning_rate': 1.504e-07, 'completion_length': 145.42857360839844, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.00432586669921875, 'epoch': 0.85} + 85%|████████▍ | 2124/2500 [13:19:10<1:51:31, 17.80s/it] 85%|████████▌ | 2125/2500 [13:19:30<1:54:04, 18.25s/it] {'loss': 0.0003, 'grad_norm': 0.5669233593980864, 'learning_rate': 1.5e-07, 'completion_length': 165.6607208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0064697265625, 'epoch': 0.85} + 85%|████████▌ | 2125/2500 [13:19:30<1:54:04, 18.25s/it] 85%|████████▌ | 2126/2500 [13:19:49<1:55:02, 18.46s/it] {'loss': 0.0002, 'grad_norm': 0.020840089341518705, 'learning_rate': 1.4960000000000002e-07, 'completion_length': 154.1607208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.003814697265625, 'epoch': 0.85} + 85%|████████▌ | 2126/2500 [13:19:49<1:55:02, 18.46s/it] 85%|████████▌ | 2127/2500 [13:20:07<1:54:10, 18.37s/it] {'loss': 0.0001, 'grad_norm': 0.07327474758141024, 'learning_rate': 1.4919999999999999e-07, 'completion_length': 160.37500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00275421142578125, 'epoch': 0.85} + 85%|████████▌ | 2127/2500 [13:20:07<1:54:10, 18.37s/it] 85%|████████▌ | 2128/2500 [13:20:25<1:54:26, 18.46s/it] {'loss': 0.0003, 'grad_norm': 0.019398635466281514, 'learning_rate': 1.4879999999999998e-07, 
'completion_length': 155.12500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0068817138671875, 'epoch': 0.85} + 85%|████████▌ | 2128/2500 [13:20:25<1:54:26, 18.46s/it] 85%|████████▌ | 2129/2500 [13:20:44<1:54:26, 18.51s/it] {'loss': 0.0002, 'grad_norm': 10.163295402449764, 'learning_rate': 1.484e-07, 'completion_length': 133.12500762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.005950927734375, 'epoch': 0.85} + 85%|████████▌ | 2129/2500 [13:20:44<1:54:26, 18.51s/it] 85%|████████▌ | 2130/2500 [13:21:02<1:53:56, 18.48s/it] {'loss': 0.0001, 'grad_norm': 0.016477557211798874, 'learning_rate': 1.4799999999999998e-07, 'completion_length': 155.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0029449462890625, 'epoch': 0.85} + 85%|████████▌ | 2130/2500 [13:21:02<1:53:56, 18.48s/it] 85%|████████▌ | 2131/2500 [13:21:20<1:51:56, 18.20s/it] {'loss': 0.0002, 'grad_norm': 0.4067044235938164, 'learning_rate': 1.476e-07, 'completion_length': 150.62500762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00411224365234375, 'epoch': 0.85} + 85%|████████▌ | 2131/2500 [13:21:20<1:51:56, 18.20s/it] 85%|████████▌ | 2132/2500 [13:21:37<1:49:55, 17.92s/it] {'loss': 0.0001, 'grad_norm': 0.014587137124937387, 'learning_rate': 1.472e-07, 'completion_length': 142.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00281524658203125, 'epoch': 0.85} + 85%|████████▌ | 2132/2500 [13:21:37<1:49:55, 17.92s/it] 85%|████████▌ | 2133/2500 [13:21:54<1:48:10, 17.69s/it] {'loss': 0.0002, 'grad_norm': 0.03440218627784185, 'learning_rate': 1.4680000000000002e-07, 'completion_length': 
146.7321548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00508880615234375, 'epoch': 0.85} + 85%|████████▌ | 2133/2500 [13:21:54<1:48:10, 17.69s/it] 85%|████████▌ | 2134/2500 [13:22:12<1:48:21, 17.76s/it] {'loss': 0.0003, 'grad_norm': 0.9158869645421596, 'learning_rate': 1.464e-07, 'completion_length': 150.5357208251953, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.0067291259765625, 'epoch': 0.85} + 85%|████████▌ | 2134/2500 [13:22:12<1:48:21, 17.76s/it] 85%|████████▌ | 2135/2500 [13:22:30<1:48:24, 17.82s/it] {'loss': 0.0001, 'grad_norm': 0.02023167305719922, 'learning_rate': 1.4599999999999998e-07, 'completion_length': 156.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00250244140625, 'epoch': 0.85} + 85%|████████▌ | 2135/2500 [13:22:30<1:48:24, 17.82s/it] 85%|████████▌ | 2136/2500 [13:22:47<1:46:57, 17.63s/it] {'loss': 0.0001, 'grad_norm': 0.015978376846666278, 'learning_rate': 1.456e-07, 'completion_length': 141.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0031280517578125, 'epoch': 0.85} + 85%|████████▌ | 2136/2500 [13:22:47<1:46:57, 17.63s/it] 85%|████████▌ | 2137/2500 [13:23:06<1:47:45, 17.81s/it] {'loss': 0.0002, 'grad_norm': 0.023412139371961255, 'learning_rate': 1.4519999999999998e-07, 'completion_length': 153.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00396728515625, 'epoch': 0.85} + 85%|████████▌ | 2137/2500 [13:23:06<1:47:45, 17.81s/it] 86%|████████▌ | 2138/2500 [13:23:24<1:48:14, 17.94s/it] {'loss': 0.0002, 'grad_norm': 0.03254761313334463, 'learning_rate': 1.448e-07, 'completion_length': 148.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 
1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003936767578125, 'epoch': 0.86} + 86%|████████▌ | 2138/2500 [13:23:24<1:48:14, 17.94s/it] 86%|████████▌ | 2139/2500 [13:23:42<1:48:49, 18.09s/it] {'loss': 0.0002, 'grad_norm': 0.4006530105209356, 'learning_rate': 1.444e-07, 'completion_length': 158.51786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0053558349609375, 'epoch': 0.86} + 86%|████████▌ | 2139/2500 [13:23:42<1:48:49, 18.09s/it] 86%|████████▌ | 2140/2500 [13:24:01<1:50:26, 18.41s/it] {'loss': 0.0002, 'grad_norm': 0.0221368937575754, 'learning_rate': 1.44e-07, 'completion_length': 165.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006011962890625, 'epoch': 0.86} + 86%|████████▌ | 2140/2500 [13:24:01<1:50:26, 18.41s/it] 86%|████████▌ | 2141/2500 [13:24:19<1:48:10, 18.08s/it] {'loss': 0.0001, 'grad_norm': 0.029780045672333996, 'learning_rate': 1.436e-07, 'completion_length': 139.42857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0036163330078125, 'epoch': 0.86} + 86%|████████▌ | 2141/2500 [13:24:19<1:48:10, 18.08s/it] 86%|████████▌ | 2142/2500 [13:24:38<1:50:33, 18.53s/it] {'loss': 0.0002, 'grad_norm': 0.01985285103704142, 'learning_rate': 1.4319999999999999e-07, 'completion_length': 171.28572845458984, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0052032470703125, 'epoch': 0.86} + 86%|████████▌ | 2142/2500 [13:24:38<1:50:33, 18.53s/it] 86%|████████▌ | 2143/2500 [13:24:57<1:50:12, 18.52s/it] {'loss': 0.0002, 'grad_norm': 0.018599723900355046, 'learning_rate': 1.428e-07, 'completion_length': 162.4821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00382232666015625, 'epoch': 0.86} + 
86%|████████▌ | 2143/2500 [13:24:57<1:50:12, 18.52s/it] 86%|████████▌ | 2144/2500 [13:25:15<1:49:22, 18.43s/it] {'loss': 0.0002, 'grad_norm': 0.22201674420703763, 'learning_rate': 1.424e-07, 'completion_length': 162.3214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00461578369140625, 'epoch': 0.86} + 86%|████████▌ | 2144/2500 [13:25:15<1:49:22, 18.43s/it] 86%|████████▌ | 2145/2500 [13:25:35<1:51:07, 18.78s/it] {'loss': 0.0001, 'grad_norm': 0.014272392011763562, 'learning_rate': 1.4199999999999997e-07, 'completion_length': 164.83929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0030670166015625, 'epoch': 0.86} + 86%|████████▌ | 2145/2500 [13:25:35<1:51:07, 18.78s/it] 86%|████████▌ | 2146/2500 [13:25:54<1:50:54, 18.80s/it] {'loss': 0.0003, 'grad_norm': 0.3465130303941362, 'learning_rate': 1.416e-07, 'completion_length': 153.0357208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0079345703125, 'epoch': 0.86} + 86%|████████▌ | 2146/2500 [13:25:54<1:50:54, 18.80s/it] 86%|████████▌ | 2147/2500 [13:26:14<1:53:31, 19.30s/it] {'loss': 0.0003, 'grad_norm': 0.3494894160224744, 'learning_rate': 1.412e-07, 'completion_length': 190.7321548461914, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0068206787109375, 'epoch': 0.86} + 86%|████████▌ | 2147/2500 [13:26:14<1:53:31, 19.30s/it] 86%|████████▌ | 2148/2500 [13:26:33<1:53:16, 19.31s/it] {'loss': 0.0002, 'grad_norm': 0.2900021170532861, 'learning_rate': 1.408e-07, 'completion_length': 154.7857208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 
'reward_std': 0.0714285746216774, 'kl': 0.0051727294921875, 'epoch': 0.86} + 86%|████████▌ | 2148/2500 [13:26:33<1:53:16, 19.31s/it] 86%|████████▌ | 2149/2500 [13:26:51<1:50:25, 18.88s/it] {'loss': 0.0002, 'grad_norm': 0.026411165623194628, 'learning_rate': 1.4039999999999999e-07, 'completion_length': 153.33928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00506591796875, 'epoch': 0.86} + 86%|████████▌ | 2149/2500 [13:26:51<1:50:25, 18.88s/it] 86%|████████▌ | 2150/2500 [13:27:10<1:49:41, 18.80s/it] {'loss': 0.0002, 'grad_norm': 0.8668895207406624, 'learning_rate': 1.4e-07, 'completion_length': 161.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0045928955078125, 'epoch': 0.86} + 86%|████████▌ | 2150/2500 [13:27:10<1:49:41, 18.80s/it] 86%|████████▌ | 2151/2500 [13:27:29<1:49:39, 18.85s/it] {'loss': 0.0003, 'grad_norm': 0.4031987674825288, 'learning_rate': 1.396e-07, 'completion_length': 166.0714340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.006561279296875, 'epoch': 0.86} + 86%|████████▌ | 2151/2500 [13:27:29<1:49:39, 18.85s/it] 86%|████████▌ | 2152/2500 [13:27:47<1:48:03, 18.63s/it] {'loss': 0.0001, 'grad_norm': 0.013505523234636475, 'learning_rate': 1.3919999999999998e-07, 'completion_length': 148.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00261688232421875, 'epoch': 0.86} + 86%|████████▌ | 2152/2500 [13:27:47<1:48:03, 18.63s/it] 86%|████████▌ | 2153/2500 [13:28:05<1:46:35, 18.43s/it] {'loss': 0.0002, 'grad_norm': 0.27529314708936525, 'learning_rate': 1.388e-07, 'completion_length': 160.96429443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 
1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.005615234375, 'epoch': 0.86} + 86%|████████▌ | 2153/2500 [13:28:05<1:46:35, 18.43s/it] 86%|████████▌ | 2154/2500 [13:28:22<1:44:35, 18.14s/it] {'loss': 0.0002, 'grad_norm': 0.013651189340950196, 'learning_rate': 1.384e-07, 'completion_length': 152.9821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00390625, 'epoch': 0.86} + 86%|████████▌ | 2154/2500 [13:28:22<1:44:35, 18.14s/it] 86%|████████▌ | 2155/2500 [13:28:41<1:45:38, 18.37s/it] {'loss': 0.0001, 'grad_norm': 0.19102882611802732, 'learning_rate': 1.3800000000000002e-07, 'completion_length': 165.25000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00341796875, 'epoch': 0.86} + 86%|████████▌ | 2155/2500 [13:28:41<1:45:38, 18.37s/it] 86%|████████▌ | 2156/2500 [13:29:00<1:45:21, 18.38s/it] {'loss': 0.0002, 'grad_norm': 1.3713465030729963, 'learning_rate': 1.376e-07, 'completion_length': 152.7321548461914, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0047607421875, 'epoch': 0.86} + 86%|████████▌ | 2156/2500 [13:29:00<1:45:21, 18.38s/it] 86%|████████▋ | 2157/2500 [13:29:18<1:45:15, 18.41s/it] {'loss': 0.0002, 'grad_norm': 0.020097886611502632, 'learning_rate': 1.3719999999999998e-07, 'completion_length': 161.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0043792724609375, 'epoch': 0.86} + 86%|████████▋ | 2157/2500 [13:29:18<1:45:15, 18.41s/it] 86%|████████▋ | 2158/2500 [13:29:37<1:46:18, 18.65s/it] {'loss': 0.0003, 'grad_norm': 0.017216130713944233, 'learning_rate': 1.368e-07, 'completion_length': 169.87500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0062408447265625, 'epoch': 0.86} + 86%|████████▋ | 2158/2500 [13:29:37<1:46:18, 18.65s/it] 86%|████████▋ | 2159/2500 [13:29:55<1:44:38, 18.41s/it] {'loss': 0.0003, 'grad_norm': 0.019711308584821348, 'learning_rate': 1.3639999999999998e-07, 'completion_length': 156.42858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0073394775390625, 'epoch': 0.86} + 86%|████████▋ | 2159/2500 [13:29:55<1:44:38, 18.41s/it] 86%|████████▋ | 2160/2500 [13:30:14<1:44:48, 18.50s/it] {'loss': 0.0001, 'grad_norm': 0.038879992817436954, 'learning_rate': 1.36e-07, 'completion_length': 164.1071548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0033416748046875, 'epoch': 0.86} + 86%|████████▋ | 2160/2500 [13:30:14<1:44:48, 18.50s/it] 86%|████████▋ | 2161/2500 [13:30:32<1:44:17, 18.46s/it] {'loss': 0.0002, 'grad_norm': 0.011898770593498437, 'learning_rate': 1.356e-07, 'completion_length': 157.6428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00390625, 'epoch': 0.86} + 86%|████████▋ | 2161/2500 [13:30:32<1:44:17, 18.46s/it] 86%|████████▋ | 2162/2500 [13:30:50<1:43:26, 18.36s/it] {'loss': 0.0001, 'grad_norm': 0.02873212174500648, 'learning_rate': 1.352e-07, 'completion_length': 149.08928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00351715087890625, 'epoch': 0.86} + 86%|████████▋ | 2162/2500 [13:30:50<1:43:26, 18.36s/it] 87%|████████▋ | 2163/2500 [13:31:09<1:44:08, 18.54s/it] {'loss': 0.0003, 'grad_norm': 0.7050696571803656, 'learning_rate': 1.348e-07, 'completion_length': 171.08929443359375, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.0066070556640625, 'epoch': 0.87} + 
87%|████████▋ | 2163/2500 [13:31:09<1:44:08, 18.54s/it] 87%|████████▋ | 2164/2500 [13:31:28<1:43:25, 18.47s/it] {'loss': 0.0002, 'grad_norm': 0.019213732894254754, 'learning_rate': 1.3439999999999999e-07, 'completion_length': 164.5357208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004913330078125, 'epoch': 0.87} + 87%|████████▋ | 2164/2500 [13:31:28<1:43:25, 18.47s/it] 87%|████████▋ | 2165/2500 [13:31:46<1:43:27, 18.53s/it] {'loss': 0.0002, 'grad_norm': 0.6330466434784322, 'learning_rate': 1.34e-07, 'completion_length': 153.05358123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00572967529296875, 'epoch': 0.87} + 87%|████████▋ | 2165/2500 [13:31:46<1:43:27, 18.53s/it] 87%|████████▋ | 2166/2500 [13:32:06<1:44:32, 18.78s/it] {'loss': 0.0001, 'grad_norm': 0.024091301041030445, 'learning_rate': 1.3359999999999998e-07, 'completion_length': 154.92857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00357818603515625, 'epoch': 0.87} + 87%|████████▋ | 2166/2500 [13:32:06<1:44:32, 18.78s/it] 87%|████████▋ | 2167/2500 [13:32:25<1:45:11, 18.95s/it] {'loss': 0.0001, 'grad_norm': 0.35469259351717103, 'learning_rate': 1.332e-07, 'completion_length': 163.7678680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0036468505859375, 'epoch': 0.87} + 87%|████████▋ | 2167/2500 [13:32:25<1:45:11, 18.95s/it] 87%|████████▋ | 2168/2500 [13:32:43<1:42:24, 18.51s/it] {'loss': 0.0001, 'grad_norm': 0.47231458626272, 'learning_rate': 1.328e-07, 'completion_length': 148.5357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 
'kl': 0.00341796875, 'epoch': 0.87} + 87%|████████▋ | 2168/2500 [13:32:43<1:42:24, 18.51s/it] 87%|████████▋ | 2169/2500 [13:33:02<1:44:07, 18.88s/it] {'loss': 0.0002, 'grad_norm': 0.0202695865579836, 'learning_rate': 1.324e-07, 'completion_length': 157.83929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0061187744140625, 'epoch': 0.87} + 87%|████████▋ | 2169/2500 [13:33:02<1:44:07, 18.88s/it] 87%|████████▋ | 2170/2500 [13:33:22<1:45:21, 19.15s/it] {'loss': 0.0002, 'grad_norm': 0.015248501011995305, 'learning_rate': 1.32e-07, 'completion_length': 163.4821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00405120849609375, 'epoch': 0.87} + 87%|████████▋ | 2170/2500 [13:33:22<1:45:21, 19.15s/it] 87%|████████▋ | 2171/2500 [13:33:40<1:43:09, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.45801244929595725, 'learning_rate': 1.316e-07, 'completion_length': 153.51786041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00396728515625, 'epoch': 0.87} + 87%|████████▋ | 2171/2500 [13:33:40<1:43:09, 18.81s/it] 87%|████████▋ | 2172/2500 [13:33:58<1:41:53, 18.64s/it] {'loss': 0.0001, 'grad_norm': 0.014855935723718662, 'learning_rate': 1.312e-07, 'completion_length': 138.75000381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00296783447265625, 'epoch': 0.87} + 87%|████████▋ | 2172/2500 [13:33:58<1:41:53, 18.64s/it] 87%|████████▋ | 2173/2500 [13:34:16<1:40:43, 18.48s/it] {'loss': 0.0002, 'grad_norm': 0.09099502735861832, 'learning_rate': 1.308e-07, 'completion_length': 146.2678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00389862060546875, 'epoch': 0.87} + 87%|████████▋ | 2173/2500 
[13:34:16<1:40:43, 18.48s/it] 87%|████████▋ | 2174/2500 [13:34:35<1:40:24, 18.48s/it] {'loss': 0.0002, 'grad_norm': 0.2701309840597939, 'learning_rate': 1.3039999999999998e-07, 'completion_length': 170.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00531005859375, 'epoch': 0.87} + 87%|████████▋ | 2174/2500 [13:34:35<1:40:24, 18.48s/it] 87%|████████▋ | 2175/2500 [13:34:54<1:41:42, 18.78s/it] {'loss': 0.0002, 'grad_norm': 0.21301479791642858, 'learning_rate': 1.3e-07, 'completion_length': 151.1071548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0050048828125, 'epoch': 0.87} + 87%|████████▋ | 2175/2500 [13:34:54<1:41:42, 18.78s/it] 87%|████████▋ | 2176/2500 [13:35:14<1:42:02, 18.90s/it] {'loss': 0.0002, 'grad_norm': 0.6172285382836314, 'learning_rate': 1.296e-07, 'completion_length': 148.71428680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.004364013671875, 'epoch': 0.87} + 87%|████████▋ | 2176/2500 [13:35:14<1:42:02, 18.90s/it] 87%|████████▋ | 2177/2500 [13:35:31<1:40:04, 18.59s/it] {'loss': 0.0001, 'grad_norm': 0.5705106663918783, 'learning_rate': 1.292e-07, 'completion_length': 143.6964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.003204345703125, 'epoch': 0.87} + 87%|████████▋ | 2177/2500 [13:35:31<1:40:04, 18.59s/it] 87%|████████▋ | 2178/2500 [13:35:51<1:40:56, 18.81s/it] {'loss': 0.0003, 'grad_norm': 3.854524748361431, 'learning_rate': 1.288e-07, 'completion_length': 158.67857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 
0.0714285746216774, 'kl': 0.0065765380859375, 'epoch': 0.87} + 87%|████████▋ | 2178/2500 [13:35:51<1:40:56, 18.81s/it] 87%|████████▋ | 2179/2500 [13:36:10<1:41:27, 18.97s/it] {'loss': 0.0002, 'grad_norm': 0.026478665679235805, 'learning_rate': 1.2839999999999999e-07, 'completion_length': 162.05358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0043182373046875, 'epoch': 0.87} + 87%|████████▋ | 2179/2500 [13:36:10<1:41:27, 18.97s/it] 87%|████████▋ | 2180/2500 [13:36:28<1:39:56, 18.74s/it] {'loss': 0.0002, 'grad_norm': 0.6539718576064494, 'learning_rate': 1.28e-07, 'completion_length': 154.9464340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00482940673828125, 'epoch': 0.87} + 87%|████████▋ | 2180/2500 [13:36:28<1:39:56, 18.74s/it] 87%|████████▋ | 2181/2500 [13:36:46<1:37:56, 18.42s/it] {'loss': 0.0001, 'grad_norm': 0.015344613083948792, 'learning_rate': 1.2759999999999998e-07, 'completion_length': 146.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00243377685546875, 'epoch': 0.87} + 87%|████████▋ | 2181/2500 [13:36:46<1:37:56, 18.42s/it] 87%|████████▋ | 2182/2500 [13:37:04<1:36:24, 18.19s/it] {'loss': 0.0002, 'grad_norm': 0.07692235438506803, 'learning_rate': 1.272e-07, 'completion_length': 156.1428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0040435791015625, 'epoch': 0.87} + 87%|████████▋ | 2182/2500 [13:37:04<1:36:24, 18.19s/it] 87%|████████▋ | 2183/2500 [13:37:22<1:36:01, 18.17s/it] {'loss': 0.0001, 'grad_norm': 0.050510966160544186, 'learning_rate': 1.268e-07, 'completion_length': 155.1428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00334930419921875, 'epoch': 
0.87} + 87%|████████▋ | 2183/2500 [13:37:22<1:36:01, 18.17s/it] 87%|████████▋ | 2184/2500 [13:37:40<1:35:58, 18.22s/it] {'loss': 0.0001, 'grad_norm': 0.016719363825224626, 'learning_rate': 1.264e-07, 'completion_length': 144.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0034027099609375, 'epoch': 0.87} + 87%|████████▋ | 2184/2500 [13:37:40<1:35:58, 18.22s/it] 87%|████████▋ | 2185/2500 [13:38:00<1:37:31, 18.58s/it] {'loss': 0.0001, 'grad_norm': 0.011278689655467956, 'learning_rate': 1.26e-07, 'completion_length': 157.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00325775146484375, 'epoch': 0.87} + 87%|████████▋ | 2185/2500 [13:38:00<1:37:31, 18.58s/it] 87%|████████▋ | 2186/2500 [13:38:17<1:35:35, 18.27s/it] {'loss': 0.0002, 'grad_norm': 0.023477874686336313, 'learning_rate': 1.2559999999999999e-07, 'completion_length': 156.67858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00521087646484375, 'epoch': 0.87} + 87%|████████▋ | 2186/2500 [13:38:17<1:35:35, 18.27s/it] 87%|████████▋ | 2187/2500 [13:38:36<1:37:04, 18.61s/it] {'loss': 0.0002, 'grad_norm': 0.2784750600149867, 'learning_rate': 1.252e-07, 'completion_length': 155.5178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004119873046875, 'epoch': 0.87} + 87%|████████▋ | 2187/2500 [13:38:36<1:37:04, 18.61s/it] 88%|████████▊ | 2188/2500 [13:38:55<1:37:21, 18.72s/it] {'loss': 0.0001, 'grad_norm': 0.016353108529417737, 'learning_rate': 1.2479999999999998e-07, 'completion_length': 148.08929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00315093994140625, 'epoch': 0.88} + 88%|████████▊ | 2188/2500 
[13:38:55<1:37:21, 18.72s/it] 88%|████████▊ | 2189/2500 [13:39:15<1:38:57, 19.09s/it] {'loss': 0.0002, 'grad_norm': 0.02550930953395299, 'learning_rate': 1.244e-07, 'completion_length': 166.1964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00457763671875, 'epoch': 0.88} + 88%|████████▊ | 2189/2500 [13:39:15<1:38:57, 19.09s/it] 88%|████████▊ | 2190/2500 [13:39:34<1:37:21, 18.84s/it] {'loss': 0.0002, 'grad_norm': 0.014244558365489107, 'learning_rate': 1.24e-07, 'completion_length': 153.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00395965576171875, 'epoch': 0.88} + 88%|████████▊ | 2190/2500 [13:39:34<1:37:21, 18.84s/it] 88%|████████▊ | 2191/2500 [13:39:52<1:35:39, 18.57s/it] {'loss': 0.0002, 'grad_norm': 0.33958136878180367, 'learning_rate': 1.236e-07, 'completion_length': 159.21428680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.004730224609375, 'epoch': 0.88} + 88%|████████▊ | 2191/2500 [13:39:52<1:35:39, 18.57s/it] 88%|████████▊ | 2192/2500 [13:40:10<1:35:47, 18.66s/it] {'loss': 0.0003, 'grad_norm': 0.03757100557939164, 'learning_rate': 1.232e-07, 'completion_length': 164.42858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0076446533203125, 'epoch': 0.88} + 88%|████████▊ | 2192/2500 [13:40:10<1:35:47, 18.66s/it] 88%|████████▊ | 2193/2500 [13:40:28<1:34:07, 18.40s/it] {'loss': 0.0001, 'grad_norm': 0.03643927978438975, 'learning_rate': 1.228e-07, 'completion_length': 139.9107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00337982177734375, 'epoch': 0.88} + 88%|████████▊ | 2193/2500 [13:40:28<1:34:07, 18.40s/it] 88%|████████▊ | 2194/2500 [13:40:47<1:33:56, 18.42s/it] {'loss': 0.0002, 
'grad_norm': 0.28575424434936986, 'learning_rate': 1.2239999999999998e-07, 'completion_length': 154.30358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00495147705078125, 'epoch': 0.88} + 88%|████████▊ | 2194/2500 [13:40:47<1:33:56, 18.42s/it] 88%|████████▊ | 2195/2500 [13:41:05<1:33:36, 18.41s/it] {'loss': 0.0002, 'grad_norm': 0.020325224369539947, 'learning_rate': 1.2199999999999998e-07, 'completion_length': 159.46429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0044097900390625, 'epoch': 0.88} + 88%|████████▊ | 2195/2500 [13:41:05<1:33:36, 18.41s/it] 88%|████████▊ | 2196/2500 [13:41:23<1:32:00, 18.16s/it] {'loss': 0.0002, 'grad_norm': 0.023901943182874037, 'learning_rate': 1.216e-07, 'completion_length': 155.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00467681884765625, 'epoch': 0.88} + 88%|████████▊ | 2196/2500 [13:41:23<1:32:00, 18.16s/it] 88%|████████▊ | 2197/2500 [13:41:41<1:31:34, 18.14s/it] {'loss': 0.0001, 'grad_norm': 0.03895320962108639, 'learning_rate': 1.212e-07, 'completion_length': 158.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00284576416015625, 'epoch': 0.88} + 88%|████████▊ | 2197/2500 [13:41:41<1:31:34, 18.14s/it] 88%|████████▊ | 2198/2500 [13:42:01<1:34:53, 18.85s/it] {'loss': 0.0002, 'grad_norm': 0.6182425813976481, 'learning_rate': 1.208e-07, 'completion_length': 151.1428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0042724609375, 'epoch': 0.88} + 88%|████████▊ | 2198/2500 [13:42:01<1:34:53, 18.85s/it] 88%|████████▊ | 2199/2500 [13:42:20<1:34:24, 18.82s/it] {'loss': 0.0001, 'grad_norm': 
0.31972727794322225, 'learning_rate': 1.204e-07, 'completion_length': 169.05358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00365447998046875, 'epoch': 0.88} + 88%|████████▊ | 2199/2500 [13:42:20<1:34:24, 18.82s/it] 88%|████████▊ | 2200/2500 [13:42:38<1:32:59, 18.60s/it] {'loss': 0.0001, 'grad_norm': 0.015581322709280967, 'learning_rate': 1.2e-07, 'completion_length': 144.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0026092529296875, 'epoch': 0.88} + 88%|████████▊ | 2200/2500 [13:42:38<1:32:59, 18.60s/it] 88%|████████▊ | 2201/2500 [13:45:31<5:22:50, 64.78s/it] {'loss': 0.0002, 'grad_norm': 0.048582422136993396, 'learning_rate': 1.1959999999999999e-07, 'completion_length': 147.2678680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0044403076171875, 'epoch': 0.88} + 88%|████████▊ | 2201/2500 [13:45:31<5:22:50, 64.78s/it] 88%|████████▊ | 2202/2500 [13:45:50<4:14:27, 51.23s/it] {'loss': 0.0003, 'grad_norm': 0.027203465978554584, 'learning_rate': 1.192e-07, 'completion_length': 148.00000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.006927490234375, 'epoch': 0.88} + 88%|████████▊ | 2202/2500 [13:45:50<4:14:27, 51.23s/it] 88%|████████▊ | 2203/2500 [13:46:11<3:28:02, 42.03s/it] {'loss': 0.0002, 'grad_norm': 0.2928446746780798, 'learning_rate': 1.1879999999999999e-07, 'completion_length': 163.80358123779297, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0053863525390625, 'epoch': 0.88} + 88%|████████▊ | 2203/2500 [13:46:11<3:28:02, 42.03s/it] 88%|████████▊ | 2204/2500 [13:46:29<2:51:56, 34.85s/it] {'loss': 0.0001, 'grad_norm': 
0.017509317512706315, 'learning_rate': 1.184e-07, 'completion_length': 154.5357208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.003204345703125, 'epoch': 0.88} + 88%|████████▊ | 2204/2500 [13:46:29<2:51:56, 34.85s/it] 88%|████████▊ | 2205/2500 [13:46:46<2:25:49, 29.66s/it] {'loss': 0.0002, 'grad_norm': 0.2622647576718808, 'learning_rate': 1.1799999999999998e-07, 'completion_length': 152.5357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00376129150390625, 'epoch': 0.88} + 88%|████████▊ | 2205/2500 [13:46:46<2:25:49, 29.66s/it] 88%|████████▊ | 2206/2500 [13:47:08<2:12:42, 27.08s/it] {'loss': 0.0002, 'grad_norm': 0.3501690624040022, 'learning_rate': 1.176e-07, 'completion_length': 154.5, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.0038299560546875, 'epoch': 0.88} + 88%|████████▊ | 2206/2500 [13:47:08<2:12:42, 27.08s/it] 88%|████████▊ | 2207/2500 [13:47:28<2:02:58, 25.18s/it] {'loss': 0.0002, 'grad_norm': 0.6278132109032212, 'learning_rate': 1.1719999999999999e-07, 'completion_length': 145.9464340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0043792724609375, 'epoch': 0.88} + 88%|████████▊ | 2207/2500 [13:47:28<2:02:58, 25.18s/it] 88%|████████▊ | 2208/2500 [13:47:49<1:56:11, 23.87s/it] {'loss': 0.0002, 'grad_norm': 0.4865832414887373, 'learning_rate': 1.168e-07, 'completion_length': 157.6607208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0051422119140625, 'epoch': 0.88} + 88%|████████▊ | 2208/2500 [13:47:49<1:56:11, 23.87s/it] 88%|████████▊ 
| 2209/2500 [13:48:08<1:48:48, 22.43s/it] {'loss': 0.0002, 'grad_norm': 0.07886355459565944, 'learning_rate': 1.164e-07, 'completion_length': 146.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0038909912109375, 'epoch': 0.88} + 88%|████████▊ | 2209/2500 [13:48:08<1:48:48, 22.43s/it] 88%|████████▊ | 2210/2500 [13:48:27<1:43:48, 21.48s/it] {'loss': 0.0001, 'grad_norm': 0.025019772415183196, 'learning_rate': 1.16e-07, 'completion_length': 167.71429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0030670166015625, 'epoch': 0.88} + 88%|████████▊ | 2210/2500 [13:48:27<1:43:48, 21.48s/it] 88%|████████▊ | 2211/2500 [13:48:46<1:39:02, 20.56s/it] {'loss': 0.0002, 'grad_norm': 0.749386709524471, 'learning_rate': 1.1559999999999999e-07, 'completion_length': 147.42858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00533294677734375, 'epoch': 0.88} + 88%|████████▊ | 2211/2500 [13:48:46<1:39:02, 20.56s/it] 88%|████████▊ | 2212/2500 [13:49:06<1:37:37, 20.34s/it] {'loss': 0.0002, 'grad_norm': 0.8019320422900021, 'learning_rate': 1.1519999999999999e-07, 'completion_length': 161.8214340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00469970703125, 'epoch': 0.88} + 88%|████████▊ | 2212/2500 [13:49:06<1:37:37, 20.34s/it] 89%|████████▊ | 2213/2500 [13:49:25<1:35:41, 20.01s/it] {'loss': 0.0002, 'grad_norm': 0.02955818397430691, 'learning_rate': 1.148e-07, 'completion_length': 155.58929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0057220458984375, 'epoch': 0.89} + 89%|████████▊ | 2213/2500 [13:49:25<1:35:41, 20.01s/it] 89%|████████▊ | 2214/2500 [13:49:43<1:32:26, 
19.39s/it] {'loss': 0.0003, 'grad_norm': 0.06588119039659227, 'learning_rate': 1.1439999999999999e-07, 'completion_length': 149.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0068817138671875, 'epoch': 0.89} + 89%|████████▊ | 2214/2500 [13:49:43<1:32:26, 19.39s/it] 89%|████████▊ | 2215/2500 [13:50:02<1:31:44, 19.31s/it] {'loss': 0.0001, 'grad_norm': 0.01906358532780111, 'learning_rate': 1.14e-07, 'completion_length': 143.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00328826904296875, 'epoch': 0.89} + 89%|████████▊ | 2215/2500 [13:50:02<1:31:44, 19.31s/it] 89%|████████▊ | 2216/2500 [13:50:22<1:33:01, 19.65s/it] {'loss': 0.0002, 'grad_norm': 0.39405172824535767, 'learning_rate': 1.136e-07, 'completion_length': 179.98214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005035400390625, 'epoch': 0.89} + 89%|████████▊ | 2216/2500 [13:50:22<1:33:01, 19.65s/it] 89%|████████▊ | 2217/2500 [13:50:41<1:31:50, 19.47s/it] {'loss': 0.0001, 'grad_norm': 0.012136819817111707, 'learning_rate': 1.132e-07, 'completion_length': 148.2678680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.003143310546875, 'epoch': 0.89} + 89%|████████▊ | 2217/2500 [13:50:41<1:31:50, 19.47s/it] 89%|████████▊ | 2218/2500 [13:51:00<1:30:36, 19.28s/it] {'loss': 0.0003, 'grad_norm': 1.8461639435439452, 'learning_rate': 1.1279999999999999e-07, 'completion_length': 152.4107208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0079193115234375, 'epoch': 0.89} + 89%|████████▊ | 2218/2500 [13:51:00<1:30:36, 19.28s/it] 89%|████████▉ | 2219/2500 [13:51:21<1:31:51, 
19.61s/it] {'loss': 0.0002, 'grad_norm': 0.5560047110959092, 'learning_rate': 1.124e-07, 'completion_length': 152.39286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.0048675537109375, 'epoch': 0.89} + 89%|████████▉ | 2219/2500 [13:51:21<1:31:51, 19.61s/it] 89%|████████▉ | 2220/2500 [13:51:39<1:29:30, 19.18s/it] {'loss': 0.0001, 'grad_norm': 0.4296788852653064, 'learning_rate': 1.12e-07, 'completion_length': 138.00000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00261688232421875, 'epoch': 0.89} + 89%|████████▉ | 2220/2500 [13:51:39<1:29:30, 19.18s/it] 89%|████████▉ | 2221/2500 [13:51:59<1:30:44, 19.51s/it] {'loss': 0.0003, 'grad_norm': 0.33221974627884787, 'learning_rate': 1.116e-07, 'completion_length': 177.89286041259766, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0068817138671875, 'epoch': 0.89} + 89%|████████▉ | 2221/2500 [13:51:59<1:30:44, 19.51s/it] 89%|████████▉ | 2222/2500 [13:52:17<1:28:36, 19.13s/it] {'loss': 0.0001, 'grad_norm': 0.22451312214603247, 'learning_rate': 1.1119999999999999e-07, 'completion_length': 153.48214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0034637451171875, 'epoch': 0.89} + 89%|████████▉ | 2222/2500 [13:52:17<1:28:36, 19.13s/it] 89%|████████▉ | 2223/2500 [13:52:36<1:27:39, 18.99s/it] {'loss': 0.0001, 'grad_norm': 0.9883920077738759, 'learning_rate': 1.1079999999999999e-07, 'completion_length': 149.17858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00333404541015625, 'epoch': 0.89} 
+ 89%|████████▉ | 2223/2500 [13:52:36<1:27:39, 18.99s/it] 89%|████████▉ | 2224/2500 [13:52:55<1:27:30, 19.02s/it] {'loss': 0.0001, 'grad_norm': 2.3840106581346254, 'learning_rate': 1.104e-07, 'completion_length': 161.00000762939453, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0036773681640625, 'epoch': 0.89} + 89%|████████▉ | 2224/2500 [13:52:55<1:27:30, 19.02s/it] 89%|████████▉ | 2225/2500 [13:53:16<1:29:32, 19.54s/it] {'loss': 0.0002, 'grad_norm': 0.1416595457274939, 'learning_rate': 1.0999999999999999e-07, 'completion_length': 138.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00392913818359375, 'epoch': 0.89} + 89%|████████▉ | 2225/2500 [13:53:16<1:29:32, 19.54s/it] 89%|████████▉ | 2226/2500 [13:53:35<1:28:46, 19.44s/it] {'loss': 0.0001, 'grad_norm': 0.7137737554133408, 'learning_rate': 1.096e-07, 'completion_length': 148.83929443359375, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00366973876953125, 'epoch': 0.89} + 89%|████████▉ | 2226/2500 [13:53:35<1:28:46, 19.44s/it] 89%|████████▉ | 2227/2500 [13:53:53<1:26:20, 18.98s/it] {'loss': 0.0001, 'grad_norm': 0.01652905627814211, 'learning_rate': 1.092e-07, 'completion_length': 143.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00360870361328125, 'epoch': 0.89} + 89%|████████▉ | 2227/2500 [13:53:53<1:26:20, 18.98s/it] 89%|████████▉ | 2228/2500 [13:54:11<1:25:12, 18.79s/it] {'loss': 0.0002, 'grad_norm': 0.02815441627751038, 'learning_rate': 1.088e-07, 'completion_length': 156.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004638671875, 'epoch': 0.89} + 89%|████████▉ | 2228/2500 [13:54:11<1:25:12, 
18.79s/it] 89%|████████▉ | 2229/2500 [13:54:31<1:25:45, 18.99s/it] {'loss': 0.0002, 'grad_norm': 0.013309464044633616, 'learning_rate': 1.0839999999999999e-07, 'completion_length': 154.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045166015625, 'epoch': 0.89} + 89%|████████▉ | 2229/2500 [13:54:31<1:25:45, 18.99s/it] 89%|████████▉ | 2230/2500 [13:54:50<1:25:49, 19.07s/it] {'loss': 0.0003, 'grad_norm': 0.7342452827551219, 'learning_rate': 1.0799999999999999e-07, 'completion_length': 166.67858123779297, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0066375732421875, 'epoch': 0.89} + 89%|████████▉ | 2230/2500 [13:54:50<1:25:49, 19.07s/it] 89%|████████▉ | 2231/2500 [13:55:08<1:24:27, 18.84s/it] {'loss': 0.0002, 'grad_norm': 0.011612860031940538, 'learning_rate': 1.076e-07, 'completion_length': 154.3928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0040130615234375, 'epoch': 0.89} + 89%|████████▉ | 2231/2500 [13:55:08<1:24:27, 18.84s/it] 89%|████████▉ | 2232/2500 [13:55:28<1:25:19, 19.10s/it] {'loss': 0.0002, 'grad_norm': 0.3476282703498305, 'learning_rate': 1.072e-07, 'completion_length': 164.92857360839844, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.00476837158203125, 'epoch': 0.89} + 89%|████████▉ | 2232/2500 [13:55:28<1:25:19, 19.10s/it] 89%|████████▉ | 2233/2500 [13:55:47<1:24:33, 19.00s/it] {'loss': 0.0002, 'grad_norm': 0.019813214822062077, 'learning_rate': 1.068e-07, 'completion_length': 157.55358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00470733642578125, 'epoch': 0.89} + 89%|████████▉ | 2233/2500 [13:55:47<1:24:33, 19.00s/it] 89%|████████▉ | 
2234/2500 [13:56:05<1:23:39, 18.87s/it] {'loss': 0.0002, 'grad_norm': 0.2067577345282743, 'learning_rate': 1.0639999999999999e-07, 'completion_length': 150.08929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00377655029296875, 'epoch': 0.89} + 89%|████████▉ | 2234/2500 [13:56:05<1:23:39, 18.87s/it] 89%|████████▉ | 2235/2500 [13:56:25<1:24:09, 19.06s/it] {'loss': 0.0002, 'grad_norm': 0.343530285448631, 'learning_rate': 1.06e-07, 'completion_length': 156.89286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004669189453125, 'epoch': 0.89} + 89%|████████▉ | 2235/2500 [13:56:25<1:24:09, 19.06s/it] 89%|████████▉ | 2236/2500 [13:56:43<1:23:10, 18.90s/it] {'loss': 0.0001, 'grad_norm': 0.01891945836140843, 'learning_rate': 1.0559999999999999e-07, 'completion_length': 158.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00363922119140625, 'epoch': 0.89} + 89%|████████▉ | 2236/2500 [13:56:43<1:23:10, 18.90s/it] 89%|████████▉ | 2237/2500 [13:57:02<1:21:57, 18.70s/it] {'loss': 0.0002, 'grad_norm': 0.7055969781192086, 'learning_rate': 1.052e-07, 'completion_length': 144.5714340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005340576171875, 'epoch': 0.89} + 89%|████████▉ | 2237/2500 [13:57:02<1:21:57, 18.70s/it] 90%|████████▉ | 2238/2500 [13:57:19<1:20:17, 18.39s/it] {'loss': 0.0001, 'grad_norm': 0.024470018292481136, 'learning_rate': 1.048e-07, 'completion_length': 140.96429443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.002559661865234375, 'epoch': 0.9} + 90%|████████▉ | 2238/2500 
[13:57:19<1:20:17, 18.39s/it] 90%|████████▉ | 2239/2500 [13:57:38<1:19:58, 18.39s/it] {'loss': 0.0001, 'grad_norm': 0.3076888186262084, 'learning_rate': 1.0440000000000001e-07, 'completion_length': 145.4821548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00325775146484375, 'epoch': 0.9} + 90%|████████▉ | 2239/2500 [13:57:38<1:19:58, 18.39s/it] 90%|████████▉ | 2240/2500 [13:57:57<1:21:03, 18.71s/it] {'loss': 0.0003, 'grad_norm': 0.6647332507238441, 'learning_rate': 1.0399999999999999e-07, 'completion_length': 165.8571548461914, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.11266788095235825, 'kl': 0.0072479248046875, 'epoch': 0.9} + 90%|████████▉ | 2240/2500 [13:57:57<1:21:03, 18.71s/it] 90%|████████▉ | 2241/2500 [13:58:16<1:20:27, 18.64s/it] {'loss': 0.0001, 'grad_norm': 0.03065336670520752, 'learning_rate': 1.0359999999999999e-07, 'completion_length': 146.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00327301025390625, 'epoch': 0.9} + 90%|████████▉ | 2241/2500 [13:58:16<1:20:27, 18.64s/it] 90%|████████▉ | 2242/2500 [13:58:36<1:21:58, 19.07s/it] {'loss': 0.0002, 'grad_norm': 0.019702559260846582, 'learning_rate': 1.032e-07, 'completion_length': 162.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0048828125, 'epoch': 0.9} + 90%|████████▉ | 2242/2500 [13:58:36<1:21:58, 19.07s/it] 90%|████████▉ | 2243/2500 [13:58:54<1:21:08, 18.94s/it] {'loss': 0.0001, 'grad_norm': 0.024823049744225615, 'learning_rate': 1.028e-07, 'completion_length': 146.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003173828125, 'epoch': 0.9} + 90%|████████▉ | 2243/2500 [13:58:54<1:21:08, 18.94s/it] 
90%|████████▉ | 2244/2500 [13:59:11<1:18:13, 18.33s/it] {'loss': 0.0001, 'grad_norm': 0.2226029318285301, 'learning_rate': 1.024e-07, 'completion_length': 138.37500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0023040771484375, 'epoch': 0.9} + 90%|████████▉ | 2244/2500 [13:59:11<1:18:13, 18.33s/it] 90%|████████▉ | 2245/2500 [13:59:29<1:16:32, 18.01s/it] {'loss': 0.0001, 'grad_norm': 0.2797858627471949, 'learning_rate': 1.0199999999999999e-07, 'completion_length': 141.3571548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00298309326171875, 'epoch': 0.9} + 90%|████████▉ | 2245/2500 [13:59:29<1:16:32, 18.01s/it] 90%|████████▉ | 2246/2500 [13:59:48<1:17:28, 18.30s/it] {'loss': 0.0003, 'grad_norm': 0.13560669140127268, 'learning_rate': 1.016e-07, 'completion_length': 169.58929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.007049560546875, 'epoch': 0.9} + 90%|████████▉ | 2246/2500 [13:59:48<1:17:28, 18.30s/it] 90%|████████▉ | 2247/2500 [14:00:06<1:17:22, 18.35s/it] {'loss': 0.0002, 'grad_norm': 0.3376519357135722, 'learning_rate': 1.0119999999999999e-07, 'completion_length': 150.125, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00408935546875, 'epoch': 0.9} + 90%|████████▉ | 2247/2500 [14:00:06<1:17:22, 18.35s/it] 90%|████████▉ | 2248/2500 [14:00:25<1:18:07, 18.60s/it] {'loss': 0.0002, 'grad_norm': 1.2486726684938305, 'learning_rate': 1.008e-07, 'completion_length': 169.58929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0060882568359375, 'epoch': 0.9} + 90%|████████▉ | 
2248/2500 [14:00:25<1:18:07, 18.60s/it] 90%|████████▉ | 2249/2500 [14:00:43<1:16:49, 18.36s/it] {'loss': 0.0001, 'grad_norm': 0.02017615579366532, 'learning_rate': 1.004e-07, 'completion_length': 140.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00244903564453125, 'epoch': 0.9} + 90%|████████▉ | 2249/2500 [14:00:43<1:16:49, 18.36s/it] 90%|█████████ | 2250/2500 [14:01:01<1:15:52, 18.21s/it] {'loss': 0.0002, 'grad_norm': 0.02437308748929227, 'learning_rate': 1e-07, 'completion_length': 152.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00415802001953125, 'epoch': 0.9} + 90%|█████████ | 2250/2500 [14:01:01<1:15:52, 18.21s/it] 90%|█████████ | 2251/2500 [14:01:21<1:17:29, 18.67s/it] {'loss': 0.0002, 'grad_norm': 0.1449845278208209, 'learning_rate': 9.959999999999999e-08, 'completion_length': 159.8214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0043487548828125, 'epoch': 0.9} + 90%|█████████ | 2251/2500 [14:01:21<1:17:29, 18.67s/it] 90%|█████████ | 2252/2500 [14:01:39<1:17:14, 18.69s/it] {'loss': 0.0003, 'grad_norm': 0.4821218891481401, 'learning_rate': 9.919999999999999e-08, 'completion_length': 146.7857208251953, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0065155029296875, 'epoch': 0.9} + 90%|█████████ | 2252/2500 [14:01:39<1:17:14, 18.69s/it] 90%|█████████ | 2253/2500 [14:01:59<1:17:55, 18.93s/it] {'loss': 0.0001, 'grad_norm': 0.08532974899869973, 'learning_rate': 9.88e-08, 'completion_length': 160.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0027618408203125, 'epoch': 0.9} + 90%|█████████ | 2253/2500 [14:01:59<1:17:55, 18.93s/it] 
90%|█████████ | 2254/2500 [14:02:17<1:17:13, 18.84s/it] {'loss': 0.0002, 'grad_norm': 0.03393862400009751, 'learning_rate': 9.84e-08, 'completion_length': 148.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0062103271484375, 'epoch': 0.9} + 90%|█████████ | 2254/2500 [14:02:17<1:17:13, 18.84s/it] 90%|█████████ | 2255/2500 [14:02:36<1:16:48, 18.81s/it] {'loss': 0.0001, 'grad_norm': 0.036870749752957434, 'learning_rate': 9.8e-08, 'completion_length': 156.625, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0034637451171875, 'epoch': 0.9} + 90%|█████████ | 2255/2500 [14:02:36<1:16:48, 18.81s/it] 90%|█████████ | 2256/2500 [14:02:54<1:15:19, 18.52s/it] {'loss': 0.0001, 'grad_norm': 0.019560269467455354, 'learning_rate': 9.76e-08, 'completion_length': 162.87500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003326416015625, 'epoch': 0.9} + 90%|█████████ | 2256/2500 [14:02:54<1:15:19, 18.52s/it] 90%|█████████ | 2257/2500 [14:03:14<1:16:26, 18.87s/it] {'loss': 0.0003, 'grad_norm': 0.6329650943471584, 'learning_rate': 9.72e-08, 'completion_length': 170.17858123779297, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.00640869140625, 'epoch': 0.9} + 90%|█████████ | 2257/2500 [14:03:14<1:16:26, 18.87s/it] 90%|█████████ | 2258/2500 [14:03:33<1:16:49, 19.05s/it] {'loss': 0.0002, 'grad_norm': 0.02010852678521651, 'learning_rate': 9.679999999999999e-08, 'completion_length': 168.0178680419922, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0045928955078125, 'epoch': 0.9} + 90%|█████████ | 2258/2500 [14:03:33<1:16:49, 19.05s/it] 90%|█████████ | 2259/2500 [14:03:52<1:15:47, 18.87s/it] {'loss': 0.0002, 
'grad_norm': 0.2645700693649353, 'learning_rate': 9.639999999999999e-08, 'completion_length': 158.5714340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00467681884765625, 'epoch': 0.9} + 90%|█████████ | 2259/2500 [14:03:52<1:15:47, 18.87s/it] 90%|█████████ | 2260/2500 [14:04:09<1:13:11, 18.30s/it] {'loss': 0.0001, 'grad_norm': 0.020746158631822165, 'learning_rate': 9.6e-08, 'completion_length': 133.89286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00363922119140625, 'epoch': 0.9} + 90%|█████████ | 2260/2500 [14:04:09<1:13:11, 18.30s/it] 90%|█████████ | 2261/2500 [14:04:27<1:12:33, 18.22s/it] {'loss': 0.0002, 'grad_norm': 0.05253824357070875, 'learning_rate': 9.56e-08, 'completion_length': 168.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050201416015625, 'epoch': 0.9} + 90%|█████████ | 2261/2500 [14:04:27<1:12:33, 18.22s/it] 90%|█████████ | 2262/2500 [14:04:45<1:12:21, 18.24s/it] {'loss': 0.0003, 'grad_norm': 0.43323917606528645, 'learning_rate': 9.52e-08, 'completion_length': 172.9107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006317138671875, 'epoch': 0.9} + 90%|█████████ | 2262/2500 [14:04:45<1:12:21, 18.24s/it] 91%|█████████ | 2263/2500 [14:05:03<1:12:21, 18.32s/it] {'loss': 0.0002, 'grad_norm': 0.021106437527627163, 'learning_rate': 9.479999999999999e-08, 'completion_length': 163.46429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0054931640625, 'epoch': 0.91} + 91%|█████████ | 2263/2500 [14:05:03<1:12:21, 18.32s/it] 91%|█████████ | 2264/2500 [14:05:23<1:13:19, 18.64s/it] {'loss': 0.0002, 
'grad_norm': 0.053081488148102846, 'learning_rate': 9.44e-08, 'completion_length': 164.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00616455078125, 'epoch': 0.91} + 91%|█████████ | 2264/2500 [14:05:23<1:13:19, 18.64s/it] 91%|█████████ | 2265/2500 [14:05:42<1:13:13, 18.70s/it] {'loss': 0.0002, 'grad_norm': 0.05110374930132264, 'learning_rate': 9.4e-08, 'completion_length': 162.42858123779297, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0059967041015625, 'epoch': 0.91} + 91%|█████████ | 2265/2500 [14:05:42<1:13:13, 18.70s/it] 91%|█████████ | 2266/2500 [14:06:01<1:13:26, 18.83s/it] {'loss': 0.0002, 'grad_norm': 0.32989008165623845, 'learning_rate': 9.36e-08, 'completion_length': 167.10714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0058441162109375, 'epoch': 0.91} + 91%|█████████ | 2266/2500 [14:06:01<1:13:26, 18.83s/it] 91%|█████████ | 2267/2500 [14:06:21<1:14:40, 19.23s/it] {'loss': 0.0002, 'grad_norm': 0.022457757497065627, 'learning_rate': 9.32e-08, 'completion_length': 160.75000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050048828125, 'epoch': 0.91} + 91%|█████████ | 2267/2500 [14:06:21<1:14:40, 19.23s/it] 91%|█████████ | 2268/2500 [14:06:40<1:13:59, 19.14s/it] {'loss': 0.0002, 'grad_norm': 0.35802297301597014, 'learning_rate': 9.279999999999998e-08, 'completion_length': 164.44644165039062, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00440216064453125, 'epoch': 0.91} + 91%|█████████ | 2268/2500 [14:06:40<1:13:59, 19.14s/it] 91%|█████████ | 2269/2500 [14:06:58<1:12:43, 18.89s/it] {'loss': 0.0001, 'grad_norm': 
0.022020971063075434, 'learning_rate': 9.24e-08, 'completion_length': 148.12500762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00337982177734375, 'epoch': 0.91} + 91%|█████████ | 2269/2500 [14:06:58<1:12:43, 18.89s/it] 91%|█████████ | 2270/2500 [14:07:16<1:11:26, 18.64s/it] {'loss': 0.0002, 'grad_norm': 0.040988067414378875, 'learning_rate': 9.199999999999999e-08, 'completion_length': 157.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053253173828125, 'epoch': 0.91} + 91%|█████████ | 2270/2500 [14:07:16<1:11:26, 18.64s/it] 91%|█████████ | 2271/2500 [14:07:34<1:10:06, 18.37s/it] {'loss': 0.0001, 'grad_norm': 0.26043444634659574, 'learning_rate': 9.16e-08, 'completion_length': 155.76786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00347900390625, 'epoch': 0.91} + 91%|█████████ | 2271/2500 [14:07:34<1:10:06, 18.37s/it] 91%|█████████ | 2272/2500 [14:07:52<1:08:59, 18.16s/it] {'loss': 0.0002, 'grad_norm': 0.03971889417657282, 'learning_rate': 9.12e-08, 'completion_length': 153.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00379180908203125, 'epoch': 0.91} + 91%|█████████ | 2272/2500 [14:07:52<1:08:59, 18.16s/it] 91%|█████████ | 2273/2500 [14:08:11<1:09:57, 18.49s/it] {'loss': 0.0002, 'grad_norm': 0.11867036993451693, 'learning_rate': 9.08e-08, 'completion_length': 157.17858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00394439697265625, 'epoch': 0.91} + 91%|█████████ | 2273/2500 [14:08:11<1:09:57, 18.49s/it] 91%|█████████ | 2274/2500 [14:08:29<1:09:24, 18.43s/it] {'loss': 0.0002, 'grad_norm': 0.027400237406498773, 'learning_rate': 
9.039999999999999e-08, 'completion_length': 157.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051116943359375, 'epoch': 0.91} + 91%|█████████ | 2274/2500 [14:08:29<1:09:24, 18.43s/it] 91%|█████████ | 2275/2500 [14:08:47<1:08:55, 18.38s/it] {'loss': 0.0002, 'grad_norm': 0.1712880052034698, 'learning_rate': 9e-08, 'completion_length': 155.67858123779297, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00431060791015625, 'epoch': 0.91} + 91%|█████████ | 2275/2500 [14:08:47<1:08:55, 18.38s/it] 91%|█████████ | 2276/2500 [14:09:06<1:09:20, 18.57s/it] {'loss': 0.0002, 'grad_norm': 0.3699671730732658, 'learning_rate': 8.96e-08, 'completion_length': 162.1964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005279541015625, 'epoch': 0.91} + 91%|█████████ | 2276/2500 [14:09:06<1:09:20, 18.57s/it] 91%|█████████ | 2277/2500 [14:09:26<1:10:20, 18.93s/it] {'loss': 0.0002, 'grad_norm': 0.3995308363246812, 'learning_rate': 8.919999999999999e-08, 'completion_length': 142.9464340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00414276123046875, 'epoch': 0.91} + 91%|█████████ | 2277/2500 [14:09:26<1:10:20, 18.93s/it] 91%|█████████ | 2278/2500 [14:09:45<1:10:19, 19.00s/it] {'loss': 0.0002, 'grad_norm': 0.5643161655901846, 'learning_rate': 8.88e-08, 'completion_length': 140.4107208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.00389862060546875, 'epoch': 0.91} + 91%|█████████ | 2278/2500 [14:09:45<1:10:19, 19.00s/it] 91%|█████████ | 2279/2500 [14:10:04<1:09:35, 18.89s/it] {'loss': 0.0002, 
'grad_norm': 0.5935633064043608, 'learning_rate': 8.84e-08, 'completion_length': 160.14286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00493621826171875, 'epoch': 0.91} + 91%|█████████ | 2279/2500 [14:10:04<1:09:35, 18.89s/it] 91%|█████████ | 2280/2500 [14:10:23<1:09:29, 18.95s/it] {'loss': 0.0002, 'grad_norm': 0.022144498443434912, 'learning_rate': 8.8e-08, 'completion_length': 164.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0047607421875, 'epoch': 0.91} + 91%|█████████ | 2280/2500 [14:10:23<1:09:29, 18.95s/it] 91%|█████████ | 2281/2500 [14:10:43<1:09:40, 19.09s/it] {'loss': 0.0001, 'grad_norm': 0.17619285799967696, 'learning_rate': 8.759999999999999e-08, 'completion_length': 162.73214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0036468505859375, 'epoch': 0.91} + 91%|█████████ | 2281/2500 [14:10:43<1:09:40, 19.09s/it] 91%|█████████▏| 2282/2500 [14:11:02<1:10:06, 19.30s/it] {'loss': 0.0001, 'grad_norm': 0.020613118187268714, 'learning_rate': 8.72e-08, 'completion_length': 152.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00325775146484375, 'epoch': 0.91} + 91%|█████████▏| 2282/2500 [14:11:02<1:10:06, 19.30s/it] 91%|█████████▏| 2283/2500 [14:11:21<1:09:36, 19.25s/it] {'loss': 0.0001, 'grad_norm': 0.31822919087126594, 'learning_rate': 8.68e-08, 'completion_length': 150.4107208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.003387451171875, 'epoch': 0.91} + 91%|█████████▏| 2283/2500 [14:11:21<1:09:36, 19.25s/it] 91%|█████████▏| 2284/2500 [14:11:40<1:08:34, 19.05s/it] {'loss': 0.0002, 
'grad_norm': 0.01859376219247177, 'learning_rate': 8.64e-08, 'completion_length': 168.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004638671875, 'epoch': 0.91} + 91%|█████████▏| 2284/2500 [14:11:40<1:08:34, 19.05s/it] 91%|█████████▏| 2285/2500 [14:12:00<1:09:14, 19.32s/it] {'loss': 0.0002, 'grad_norm': 0.26264190552388145, 'learning_rate': 8.599999999999999e-08, 'completion_length': 160.7678680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0055084228515625, 'epoch': 0.91} + 91%|█████████▏| 2285/2500 [14:12:00<1:09:14, 19.32s/it] 91%|█████████▏| 2286/2500 [14:12:20<1:09:25, 19.46s/it] {'loss': 0.0002, 'grad_norm': 0.9877158316153536, 'learning_rate': 8.559999999999999e-08, 'completion_length': 169.33929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00567626953125, 'epoch': 0.91} + 91%|█████████▏| 2286/2500 [14:12:20<1:09:25, 19.46s/it] 91%|█████████▏| 2287/2500 [14:12:40<1:10:06, 19.75s/it] {'loss': 0.0002, 'grad_norm': 0.19855977133642624, 'learning_rate': 8.52e-08, 'completion_length': 180.12500762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00392913818359375, 'epoch': 0.91} + 91%|█████████▏| 2287/2500 [14:12:40<1:10:06, 19.75s/it] 92%|█████████▏| 2288/2500 [14:12:58<1:07:59, 19.24s/it] {'loss': 0.0001, 'grad_norm': 0.015113463566083372, 'learning_rate': 8.479999999999999e-08, 'completion_length': 151.33929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00301361083984375, 'epoch': 0.92} + 92%|█████████▏| 2288/2500 [14:12:58<1:07:59, 19.24s/it] 92%|█████████▏| 2289/2500 [14:13:18<1:07:56, 
19.32s/it] {'loss': 0.0001, 'grad_norm': 0.012185727228585507, 'learning_rate': 8.44e-08, 'completion_length': 158.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00274658203125, 'epoch': 0.92} + 92%|█████████▏| 2289/2500 [14:13:18<1:07:56, 19.32s/it] 92%|█████████▏| 2290/2500 [14:13:36<1:06:11, 18.91s/it] {'loss': 0.0002, 'grad_norm': 0.03125123330823901, 'learning_rate': 8.4e-08, 'completion_length': 153.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051116943359375, 'epoch': 0.92} + 92%|█████████▏| 2290/2500 [14:13:36<1:06:11, 18.91s/it] 92%|█████████▏| 2291/2500 [14:13:54<1:05:14, 18.73s/it] {'loss': 0.0002, 'grad_norm': 0.019168017868499077, 'learning_rate': 8.36e-08, 'completion_length': 167.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050506591796875, 'epoch': 0.92} + 92%|█████████▏| 2291/2500 [14:13:54<1:05:14, 18.73s/it] 92%|█████████▏| 2292/2500 [14:14:13<1:05:29, 18.89s/it] {'loss': 0.0002, 'grad_norm': 0.022561343471514582, 'learning_rate': 8.319999999999999e-08, 'completion_length': 163.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004241943359375, 'epoch': 0.92} + 92%|█████████▏| 2292/2500 [14:14:13<1:05:29, 18.89s/it] 92%|█████████▏| 2293/2500 [14:14:33<1:05:43, 19.05s/it] {'loss': 0.0002, 'grad_norm': 0.023109795281103074, 'learning_rate': 8.28e-08, 'completion_length': 169.30358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00469970703125, 'epoch': 0.92} + 92%|█████████▏| 2293/2500 [14:14:33<1:05:43, 19.05s/it] 92%|█████████▏| 2294/2500 [14:14:51<1:04:46, 18.86s/it] {'loss': 0.0001, 'grad_norm': 0.5852069574234254, 'learning_rate': 8.24e-08, 'completion_length': 154.00000762939453, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0824786126613617, 'kl': 0.003753662109375, 'epoch': 0.92} + 92%|█████████▏| 2294/2500 [14:14:51<1:04:46, 18.86s/it] 92%|█████████▏| 2295/2500 [14:15:10<1:04:12, 18.79s/it] {'loss': 0.0002, 'grad_norm': 0.025167301339309063, 'learning_rate': 8.2e-08, 'completion_length': 160.2678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00469970703125, 'epoch': 0.92} + 92%|█████████▏| 2295/2500 [14:15:10<1:04:12, 18.79s/it] 92%|█████████▏| 2296/2500 [14:15:30<1:04:56, 19.10s/it] {'loss': 0.0002, 'grad_norm': 0.2497637928624014, 'learning_rate': 8.16e-08, 'completion_length': 158.4821548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00388336181640625, 'epoch': 0.92} + 92%|█████████▏| 2296/2500 [14:15:30<1:04:56, 19.10s/it] 92%|█████████▏| 2297/2500 [14:15:48<1:03:48, 18.86s/it] {'loss': 0.0001, 'grad_norm': 0.013091859656690413, 'learning_rate': 8.119999999999999e-08, 'completion_length': 157.12500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0031280517578125, 'epoch': 0.92} + 92%|█████████▏| 2297/2500 [14:15:48<1:03:48, 18.86s/it] 92%|█████████▏| 2298/2500 [14:16:06<1:03:10, 18.77s/it] {'loss': 0.0002, 'grad_norm': 0.47789983610583014, 'learning_rate': 8.08e-08, 'completion_length': 151.1607208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00467681884765625, 'epoch': 0.92} + 92%|█████████▏| 2298/2500 [14:16:06<1:03:10, 18.77s/it] 92%|█████████▏| 2299/2500 [14:16:25<1:03:02, 18.82s/it] {'loss': 0.0002, 'grad_norm': 0.017492637757977263, 'learning_rate': 8.039999999999999e-08, 'completion_length': 
155.21428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00457763671875, 'epoch': 0.92} + 92%|█████████▏| 2299/2500 [14:16:25<1:03:02, 18.82s/it] 92%|█████████▏| 2300/2500 [14:16:44<1:02:42, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.016350857025947303, 'learning_rate': 8e-08, 'completion_length': 156.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045928955078125, 'epoch': 0.92} + 92%|█████████▏| 2300/2500 [14:16:44<1:02:42, 18.81s/it] 92%|█████████▏| 2301/2500 [14:19:47<3:45:31, 68.00s/it] {'loss': 0.0003, 'grad_norm': 0.03783943899602781, 'learning_rate': 7.96e-08, 'completion_length': 156.50000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0064239501953125, 'epoch': 0.92} + 92%|█████████▏| 2301/2500 [14:19:47<3:45:31, 68.00s/it] 92%|█████████▏| 2302/2500 [14:20:07<2:57:13, 53.70s/it] {'loss': 0.0002, 'grad_norm': 0.4068915982176331, 'learning_rate': 7.920000000000001e-08, 'completion_length': 159.4107208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005645751953125, 'epoch': 0.92} + 92%|█████████▏| 2302/2500 [14:20:07<2:57:13, 53.70s/it] 92%|█████████▏| 2303/2500 [14:20:26<2:22:10, 43.30s/it] {'loss': 0.0002, 'grad_norm': 0.020046465877497425, 'learning_rate': 7.879999999999999e-08, 'completion_length': 156.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004180908203125, 'epoch': 0.92} + 92%|█████████▏| 2303/2500 [14:20:26<2:22:10, 43.30s/it] 92%|█████████▏| 2304/2500 [14:20:45<1:57:10, 35.87s/it] {'loss': 0.0001, 'grad_norm': 0.014026367152635527, 'learning_rate': 7.839999999999999e-08, 'completion_length': 152.3928680419922, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00328826904296875, 'epoch': 0.92} + 92%|█████████▏| 2304/2500 [14:20:45<1:57:10, 35.87s/it] 92%|█████████▏| 2305/2500 [14:21:05<1:41:06, 31.11s/it] {'loss': 0.0002, 'grad_norm': 0.21455915795007738, 'learning_rate': 7.8e-08, 'completion_length': 165.6428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0058135986328125, 'epoch': 0.92} + 92%|█████████▏| 2305/2500 [14:21:05<1:41:06, 31.11s/it] 92%|█████████▏| 2306/2500 [14:21:23<1:27:36, 27.09s/it] {'loss': 0.0002, 'grad_norm': 0.02618368442738798, 'learning_rate': 7.76e-08, 'completion_length': 158.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00469970703125, 'epoch': 0.92} + 92%|█████████▏| 2306/2500 [14:21:23<1:27:36, 27.09s/it] 92%|█████████▏| 2307/2500 [14:21:42<1:19:29, 24.71s/it] {'loss': 0.0002, 'grad_norm': 0.4000706740869663, 'learning_rate': 7.72e-08, 'completion_length': 168.12500762939453, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.00579833984375, 'epoch': 0.92} + 92%|█████████▏| 2307/2500 [14:21:42<1:19:29, 24.71s/it] 92%|█████████▏| 2308/2500 [14:22:00<1:12:42, 22.72s/it] {'loss': 0.0001, 'grad_norm': 0.5948769454151047, 'learning_rate': 7.679999999999999e-08, 'completion_length': 140.94644165039062, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00354766845703125, 'epoch': 0.92} + 92%|█████████▏| 2308/2500 [14:22:00<1:12:42, 22.72s/it] 92%|█████████▏| 2309/2500 [14:22:18<1:08:24, 21.49s/it] {'loss': 0.0002, 'grad_norm': 0.32837439355851805, 'learning_rate': 7.64e-08, 'completion_length': 147.7857208251953, 'rewards/accuracy_reward': 
0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00383758544921875, 'epoch': 0.92} + 92%|█████████▏| 2309/2500 [14:22:18<1:08:24, 21.49s/it] 92%|█████████▏| 2310/2500 [14:22:38<1:06:13, 20.91s/it] {'loss': 0.0002, 'grad_norm': 0.26202583881920016, 'learning_rate': 7.599999999999999e-08, 'completion_length': 158.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0052490234375, 'epoch': 0.92} + 92%|█████████▏| 2310/2500 [14:22:38<1:06:13, 20.91s/it] 92%|█████████▏| 2311/2500 [14:22:55<1:02:38, 19.89s/it] {'loss': 0.0001, 'grad_norm': 0.0198500876183428, 'learning_rate': 7.56e-08, 'completion_length': 145.33929061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.001922607421875, 'epoch': 0.92} + 92%|█████████▏| 2311/2500 [14:22:55<1:02:38, 19.89s/it] 92%|█████████▏| 2312/2500 [14:23:15<1:02:22, 19.91s/it] {'loss': 0.0004, 'grad_norm': 0.6047933545537861, 'learning_rate': 7.52e-08, 'completion_length': 170.0178680419922, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.010162353515625, 'epoch': 0.92} + 92%|█████████▏| 2312/2500 [14:23:15<1:02:22, 19.91s/it] 93%|█████████▎| 2313/2500 [14:23:34<1:00:26, 19.39s/it] {'loss': 0.0001, 'grad_norm': 1.1571640971269526, 'learning_rate': 7.480000000000001e-08, 'completion_length': 143.21429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00243377685546875, 'epoch': 0.93} + 93%|█████████▎| 2313/2500 [14:23:34<1:00:26, 19.39s/it] 93%|█████████▎| 2314/2500 [14:23:53<59:56, 19.33s/it] {'loss': 0.0002, 'grad_norm': 0.024401182757642657, 'learning_rate': 7.439999999999999e-08, 
'completion_length': 155.4464340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0052337646484375, 'epoch': 0.93} + 93%|█████████▎| 2314/2500 [14:23:53<59:56, 19.33s/it] 93%|█████████▎| 2315/2500 [14:24:11<58:24, 18.94s/it] {'loss': 0.0002, 'grad_norm': 0.01660334838533366, 'learning_rate': 7.399999999999999e-08, 'completion_length': 160.5178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00390625, 'epoch': 0.93} + 93%|█████████▎| 2315/2500 [14:24:11<58:24, 18.94s/it] 93%|█████████▎| 2316/2500 [14:24:30<58:06, 18.95s/it] {'loss': 0.0004, 'grad_norm': 0.29642043655097244, 'learning_rate': 7.36e-08, 'completion_length': 157.3928680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0088653564453125, 'epoch': 0.93} + 93%|█████████▎| 2316/2500 [14:24:30<58:06, 18.95s/it] 93%|█████████▎| 2317/2500 [14:24:49<58:05, 19.05s/it] {'loss': 0.0002, 'grad_norm': 0.7491120024597585, 'learning_rate': 7.32e-08, 'completion_length': 167.83929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0047454833984375, 'epoch': 0.93} + 93%|█████████▎| 2317/2500 [14:24:49<58:05, 19.05s/it] 93%|█████████▎| 2318/2500 [14:25:09<58:26, 19.27s/it] {'loss': 0.0002, 'grad_norm': 0.017014628010861326, 'learning_rate': 7.28e-08, 'completion_length': 158.6607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004913330078125, 'epoch': 0.93} + 93%|█████████▎| 2318/2500 [14:25:09<58:26, 19.27s/it] 93%|█████████▎| 2319/2500 [14:25:26<56:08, 18.61s/it] {'loss': 0.0002, 'grad_norm': 0.014510484352189072, 'learning_rate': 7.24e-08, 'completion_length': 136.46429443359375, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004486083984375, 'epoch': 0.93} + 93%|█████████▎| 2319/2500 [14:25:26<56:08, 18.61s/it] 93%|█████████▎| 2320/2500 [14:25:46<57:10, 19.06s/it] {'loss': 0.0003, 'grad_norm': 0.19487246236818354, 'learning_rate': 7.2e-08, 'completion_length': 165.5714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006256103515625, 'epoch': 0.93} + 93%|█████████▎| 2320/2500 [14:25:46<57:10, 19.06s/it] 93%|█████████▎| 2321/2500 [14:26:05<56:18, 18.88s/it] {'loss': 0.0002, 'grad_norm': 1.5915827510388068, 'learning_rate': 7.159999999999999e-08, 'completion_length': 156.92858123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00385284423828125, 'epoch': 0.93} + 93%|█████████▎| 2321/2500 [14:26:05<56:18, 18.88s/it] 93%|█████████▎| 2322/2500 [14:26:23<55:23, 18.67s/it] {'loss': 0.0002, 'grad_norm': 0.02032731166244475, 'learning_rate': 7.12e-08, 'completion_length': 154.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0059051513671875, 'epoch': 0.93} + 93%|█████████▎| 2322/2500 [14:26:23<55:23, 18.67s/it] 93%|█████████▎| 2323/2500 [14:26:42<55:21, 18.77s/it] {'loss': 0.0001, 'grad_norm': 0.012827464710328161, 'learning_rate': 7.08e-08, 'completion_length': 174.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003326416015625, 'epoch': 0.93} + 93%|█████████▎| 2323/2500 [14:26:42<55:21, 18.77s/it] 93%|█████████▎| 2324/2500 [14:27:00<54:20, 18.52s/it] {'loss': 0.0002, 'grad_norm': 0.03923997707862744, 'learning_rate': 7.04e-08, 'completion_length': 156.3928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 
'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004241943359375, 'epoch': 0.93} + 93%|█████████▎| 2324/2500 [14:27:00<54:20, 18.52s/it] 93%|█████████▎| 2325/2500 [14:27:18<53:47, 18.44s/it] {'loss': 0.0002, 'grad_norm': 0.06007778837506625, 'learning_rate': 7e-08, 'completion_length': 144.375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046234130859375, 'epoch': 0.93} + 93%|█████████▎| 2325/2500 [14:27:18<53:47, 18.44s/it] 93%|█████████▎| 2326/2500 [14:27:37<53:42, 18.52s/it] {'loss': 0.0001, 'grad_norm': 0.23329483160594336, 'learning_rate': 6.959999999999999e-08, 'completion_length': 140.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.002407073974609375, 'epoch': 0.93} + 93%|█████████▎| 2326/2500 [14:27:37<53:42, 18.52s/it] 93%|█████████▎| 2327/2500 [14:27:55<53:32, 18.57s/it] {'loss': 0.0003, 'grad_norm': 0.04597403603252296, 'learning_rate': 6.92e-08, 'completion_length': 157.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0080413818359375, 'epoch': 0.93} + 93%|█████████▎| 2327/2500 [14:27:55<53:32, 18.57s/it] 93%|█████████▎| 2328/2500 [14:28:15<54:20, 18.96s/it] {'loss': 0.0002, 'grad_norm': 0.030129483024356946, 'learning_rate': 6.88e-08, 'completion_length': 160.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00444793701171875, 'epoch': 0.93} + 93%|█████████▎| 2328/2500 [14:28:15<54:20, 18.96s/it] 93%|█████████▎| 2329/2500 [14:28:33<53:26, 18.75s/it] {'loss': 0.0001, 'grad_norm': 0.01801489697446204, 'learning_rate': 6.84e-08, 'completion_length': 145.3214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0035552978515625, 'epoch': 0.93} + 93%|█████████▎| 2329/2500 [14:28:33<53:26, 
18.75s/it] 93%|█████████▎| 2330/2500 [14:28:53<53:46, 18.98s/it] {'loss': 0.0002, 'grad_norm': 0.5842784602477292, 'learning_rate': 6.8e-08, 'completion_length': 159.55358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00604248046875, 'epoch': 0.93} + 93%|█████████▎| 2330/2500 [14:28:53<53:46, 18.98s/it] 93%|█████████▎| 2331/2500 [14:29:12<53:08, 18.87s/it] {'loss': 0.0001, 'grad_norm': 0.019109467988384046, 'learning_rate': 6.76e-08, 'completion_length': 156.1607208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00325775146484375, 'epoch': 0.93} + 93%|█████████▎| 2331/2500 [14:29:12<53:08, 18.87s/it] 93%|█████████▎| 2332/2500 [14:29:32<53:57, 19.27s/it] {'loss': 0.0002, 'grad_norm': 0.258910937579488, 'learning_rate': 6.719999999999999e-08, 'completion_length': 169.0357208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.00383758544921875, 'epoch': 0.93} + 93%|█████████▎| 2332/2500 [14:29:32<53:57, 19.27s/it] 93%|█████████▎| 2333/2500 [14:29:51<53:36, 19.26s/it] {'loss': 0.0002, 'grad_norm': 0.19226382235391962, 'learning_rate': 6.679999999999999e-08, 'completion_length': 165.58928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00516510009765625, 'epoch': 0.93} + 93%|█████████▎| 2333/2500 [14:29:51<53:36, 19.26s/it] 93%|█████████▎| 2334/2500 [14:30:11<53:46, 19.44s/it] {'loss': 0.0002, 'grad_norm': 1.2175204447194483, 'learning_rate': 6.64e-08, 'completion_length': 168.76786041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0048065185546875, 'epoch': 0.93} + 
93%|█████████▎| 2334/2500 [14:30:11<53:46, 19.44s/it] 93%|█████████▎| 2335/2500 [14:30:29<52:11, 18.98s/it] {'loss': 0.0003, 'grad_norm': 0.5171364469486067, 'learning_rate': 6.6e-08, 'completion_length': 160.7857208251953, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.006500244140625, 'epoch': 0.93} + 93%|█████████▎| 2335/2500 [14:30:29<52:11, 18.98s/it] 93%|█████████▎| 2336/2500 [14:30:48<52:29, 19.21s/it] {'loss': 0.0002, 'grad_norm': 0.5546560991677193, 'learning_rate': 6.56e-08, 'completion_length': 172.58929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0059814453125, 'epoch': 0.93} + 93%|█████████▎| 2336/2500 [14:30:48<52:29, 19.21s/it] 93%|█████████▎| 2337/2500 [14:31:07<51:50, 19.08s/it] {'loss': 0.0002, 'grad_norm': 0.5339928454594708, 'learning_rate': 6.519999999999999e-08, 'completion_length': 153.50000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0045623779296875, 'epoch': 0.93} + 93%|█████████▎| 2337/2500 [14:31:07<51:50, 19.08s/it] 94%|█████████▎| 2338/2500 [14:31:26<51:20, 19.02s/it] {'loss': 0.0002, 'grad_norm': 0.18329253438870735, 'learning_rate': 6.48e-08, 'completion_length': 152.7678680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00499725341796875, 'epoch': 0.94} + 94%|█████████▎| 2338/2500 [14:31:26<51:20, 19.02s/it] 94%|█████████▎| 2339/2500 [14:31:47<52:11, 19.45s/it] {'loss': 0.0003, 'grad_norm': 0.29592028732239756, 'learning_rate': 6.44e-08, 'completion_length': 185.35714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 
0.0824786126613617, 'kl': 0.008026123046875, 'epoch': 0.94} + 94%|█████████▎| 2339/2500 [14:31:47<52:11, 19.45s/it] 94%|█████████▎| 2340/2500 [14:32:04<50:29, 18.93s/it] {'loss': 0.0001, 'grad_norm': 0.01942386189868548, 'learning_rate': 6.4e-08, 'completion_length': 146.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00359344482421875, 'epoch': 0.94} + 94%|█████████▎| 2340/2500 [14:32:04<50:29, 18.93s/it] 94%|█████████▎| 2341/2500 [14:32:25<51:29, 19.43s/it] {'loss': 0.0002, 'grad_norm': 0.11970802758741282, 'learning_rate': 6.36e-08, 'completion_length': 168.7678680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0045318603515625, 'epoch': 0.94} + 94%|█████████▎| 2341/2500 [14:32:25<51:29, 19.43s/it] 94%|█████████▎| 2342/2500 [14:32:44<50:47, 19.29s/it] {'loss': 0.0002, 'grad_norm': 0.23202663035541984, 'learning_rate': 6.32e-08, 'completion_length': 162.5357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00472259521484375, 'epoch': 0.94} + 94%|█████████▎| 2342/2500 [14:32:44<50:47, 19.29s/it] 94%|█████████▎| 2343/2500 [14:33:04<51:01, 19.50s/it] {'loss': 0.0002, 'grad_norm': 0.2682400737714235, 'learning_rate': 6.279999999999999e-08, 'completion_length': 149.23214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0040740966796875, 'epoch': 0.94} + 94%|█████████▎| 2343/2500 [14:33:04<51:01, 19.50s/it] 94%|█████████▍| 2344/2500 [14:33:23<50:18, 19.35s/it] {'loss': 0.0002, 'grad_norm': 1.2880813253686307, 'learning_rate': 6.239999999999999e-08, 'completion_length': 163.00000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 
1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0050048828125, 'epoch': 0.94} + 94%|█████████▍| 2344/2500 [14:33:23<50:18, 19.35s/it] 94%|█████████▍| 2345/2500 [14:33:42<49:57, 19.34s/it] {'loss': 0.0002, 'grad_norm': 0.20237668033783482, 'learning_rate': 6.2e-08, 'completion_length': 162.80358123779297, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00452423095703125, 'epoch': 0.94} + 94%|█████████▍| 2345/2500 [14:33:42<49:57, 19.34s/it] 94%|█████████▍| 2346/2500 [14:34:01<49:09, 19.15s/it] {'loss': 0.0002, 'grad_norm': 0.2904136990908822, 'learning_rate': 6.16e-08, 'completion_length': 154.08929061889648, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0824786126613617, 'kl': 0.004425048828125, 'epoch': 0.94} + 94%|█████████▍| 2346/2500 [14:34:01<49:09, 19.15s/it] 94%|█████████▍| 2347/2500 [14:34:20<48:47, 19.13s/it] {'loss': 0.0002, 'grad_norm': 0.02016832850007984, 'learning_rate': 6.119999999999999e-08, 'completion_length': 159.0178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0061187744140625, 'epoch': 0.94} + 94%|█████████▍| 2347/2500 [14:34:20<48:47, 19.13s/it] 94%|█████████▍| 2348/2500 [14:34:38<47:19, 18.68s/it] {'loss': 0.0001, 'grad_norm': 0.29119773687104844, 'learning_rate': 6.08e-08, 'completion_length': 154.42858123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.003265380859375, 'epoch': 0.94} + 94%|█████████▍| 2348/2500 [14:34:38<47:19, 18.68s/it] 94%|█████████▍| 2349/2500 [14:34:58<48:35, 19.31s/it] {'loss': 0.0002, 'grad_norm': 0.20897199060742017, 'learning_rate': 6.04e-08, 'completion_length': 181.67858123779297, 'rewards/accuracy_reward': 0.9464285969734192, 
'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.005828857421875, 'epoch': 0.94} + 94%|█████████▍| 2349/2500 [14:34:58<48:35, 19.31s/it] 94%|█████████▍| 2350/2500 [14:35:18<48:26, 19.38s/it] {'loss': 0.0003, 'grad_norm': 1.0994396067680507, 'learning_rate': 6e-08, 'completion_length': 178.7857208251953, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.006439208984375, 'epoch': 0.94} + 94%|█████████▍| 2350/2500 [14:35:18<48:26, 19.38s/it] 94%|█████████▍| 2351/2500 [14:35:36<47:20, 19.06s/it] {'loss': 0.0002, 'grad_norm': 0.2138725488513131, 'learning_rate': 5.96e-08, 'completion_length': 159.62500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0049591064453125, 'epoch': 0.94} + 94%|█████████▍| 2351/2500 [14:35:36<47:20, 19.06s/it] 94%|█████████▍| 2352/2500 [14:35:56<47:39, 19.32s/it] {'loss': 0.0002, 'grad_norm': 0.015482518868957587, 'learning_rate': 5.92e-08, 'completion_length': 161.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00386810302734375, 'epoch': 0.94} + 94%|█████████▍| 2352/2500 [14:35:56<47:39, 19.32s/it] 94%|█████████▍| 2353/2500 [14:36:16<47:44, 19.49s/it] {'loss': 0.0002, 'grad_norm': 0.3066229173225495, 'learning_rate': 5.88e-08, 'completion_length': 159.73214721679688, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0062103271484375, 'epoch': 0.94} + 94%|█████████▍| 2353/2500 [14:36:16<47:44, 19.49s/it] 94%|█████████▍| 2354/2500 [14:36:34<46:22, 19.06s/it] {'loss': 0.0001, 'grad_norm': 0.6570784709093551, 'learning_rate': 5.84e-08, 'completion_length': 157.12500762939453, 'rewards/accuracy_reward': 
0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0035552978515625, 'epoch': 0.94} + 94%|█████████▍| 2354/2500 [14:36:34<46:22, 19.06s/it] 94%|█████████▍| 2355/2500 [14:36:53<45:37, 18.88s/it] {'loss': 0.0002, 'grad_norm': 0.015200879366477585, 'learning_rate': 5.8e-08, 'completion_length': 148.25000762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004730224609375, 'epoch': 0.94} + 94%|█████████▍| 2355/2500 [14:36:53<45:37, 18.88s/it] 94%|█████████▍| 2356/2500 [14:37:11<44:39, 18.61s/it] {'loss': 0.0001, 'grad_norm': 3.683761279078533, 'learning_rate': 5.759999999999999e-08, 'completion_length': 150.4464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0034942626953125, 'epoch': 0.94} + 94%|█████████▍| 2356/2500 [14:37:11<44:39, 18.61s/it] 94%|█████████▍| 2357/2500 [14:37:29<43:57, 18.44s/it] {'loss': 0.0002, 'grad_norm': 0.40162846714140976, 'learning_rate': 5.7199999999999996e-08, 'completion_length': 159.46429443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.004486083984375, 'epoch': 0.94} + 94%|█████████▍| 2357/2500 [14:37:29<43:57, 18.44s/it] 94%|█████████▍| 2358/2500 [14:37:46<42:44, 18.06s/it] {'loss': 0.0002, 'grad_norm': 0.19202998028194426, 'learning_rate': 5.68e-08, 'completion_length': 149.5357208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00438690185546875, 'epoch': 0.94} + 94%|█████████▍| 2358/2500 [14:37:46<42:44, 18.06s/it] 94%|█████████▍| 2359/2500 [14:38:04<42:25, 18.05s/it] {'loss': 0.0001, 'grad_norm': 0.02486126080411369, 'learning_rate': 5.6399999999999995e-08, 'completion_length': 
151.71428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00354766845703125, 'epoch': 0.94} + 94%|█████████▍| 2359/2500 [14:38:04<42:25, 18.05s/it] 94%|█████████▍| 2360/2500 [14:38:22<42:18, 18.13s/it] {'loss': 0.0002, 'grad_norm': 0.05507821354183743, 'learning_rate': 5.6e-08, 'completion_length': 162.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004791259765625, 'epoch': 0.94} + 94%|█████████▍| 2360/2500 [14:38:22<42:18, 18.13s/it] 94%|█████████▍| 2361/2500 [14:38:40<41:53, 18.08s/it] {'loss': 0.0001, 'grad_norm': 0.5581867708576488, 'learning_rate': 5.5599999999999995e-08, 'completion_length': 145.2678680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.003509521484375, 'epoch': 0.94} + 94%|█████████▍| 2361/2500 [14:38:40<41:53, 18.08s/it] 94%|█████████▍| 2362/2500 [14:38:59<42:10, 18.34s/it] {'loss': 0.0002, 'grad_norm': 0.014830561626095057, 'learning_rate': 5.52e-08, 'completion_length': 169.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0047454833984375, 'epoch': 0.94} + 94%|█████████▍| 2362/2500 [14:38:59<42:10, 18.34s/it] 95%|█████████▍| 2363/2500 [14:39:18<42:16, 18.52s/it] {'loss': 0.0003, 'grad_norm': 0.1945624517720174, 'learning_rate': 5.48e-08, 'completion_length': 162.92857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006500244140625, 'epoch': 0.95} + 95%|█████████▍| 2363/2500 [14:39:18<42:16, 18.52s/it] 95%|█████████▍| 2364/2500 [14:39:37<42:14, 18.64s/it] {'loss': 0.0002, 'grad_norm': 0.49208817029756574, 'learning_rate': 5.44e-08, 'completion_length': 181.16072845458984, 'rewards/accuracy_reward': 0.9285714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.005706787109375, 'epoch': 0.95} + 95%|█████████▍| 2364/2500 [14:39:37<42:14, 18.64s/it] 95%|█████████▍| 2365/2500 [14:39:56<42:18, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.04317966635583562, 'learning_rate': 5.3999999999999994e-08, 'completion_length': 161.58929443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005584716796875, 'epoch': 0.95} + 95%|█████████▍| 2365/2500 [14:39:56<42:18, 18.81s/it] 95%|█████████▍| 2366/2500 [14:40:14<41:40, 18.66s/it] {'loss': 0.0002, 'grad_norm': 1.585354279592838, 'learning_rate': 5.36e-08, 'completion_length': 146.4464340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.004638671875, 'epoch': 0.95} + 95%|█████████▍| 2366/2500 [14:40:14<41:40, 18.66s/it] 95%|█████████▍| 2367/2500 [14:40:32<40:47, 18.40s/it] {'loss': 0.0002, 'grad_norm': 0.34698430055951707, 'learning_rate': 5.319999999999999e-08, 'completion_length': 144.46429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00414276123046875, 'epoch': 0.95} + 95%|█████████▍| 2367/2500 [14:40:32<40:47, 18.40s/it] 95%|█████████▍| 2368/2500 [14:40:51<40:28, 18.40s/it] {'loss': 0.0001, 'grad_norm': 0.017349136837248976, 'learning_rate': 5.2799999999999996e-08, 'completion_length': 156.17858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00295257568359375, 'epoch': 0.95} + 95%|█████████▍| 2368/2500 [14:40:51<40:28, 18.40s/it] 95%|█████████▍| 2369/2500 [14:41:10<40:44, 18.66s/it] {'loss': 0.0002, 'grad_norm': 0.03267217077572451, 'learning_rate': 5.24e-08, 'completion_length': 154.64286041259766, 'rewards/accuracy_reward': 
0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.005279541015625, 'epoch': 0.95} + 95%|█████████▍| 2369/2500 [14:41:10<40:44, 18.66s/it] 95%|█████████▍| 2370/2500 [14:41:28<39:57, 18.44s/it] {'loss': 0.0002, 'grad_norm': 1.0235113756818588, 'learning_rate': 5.1999999999999996e-08, 'completion_length': 165.2678680419922, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0046844482421875, 'epoch': 0.95} + 95%|█████████▍| 2370/2500 [14:41:28<39:57, 18.44s/it] 95%|█████████▍| 2371/2500 [14:41:47<39:56, 18.58s/it] {'loss': 0.0002, 'grad_norm': 0.27171958380601363, 'learning_rate': 5.16e-08, 'completion_length': 157.12500762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0059661865234375, 'epoch': 0.95} + 95%|█████████▍| 2371/2500 [14:41:47<39:56, 18.58s/it] 95%|█████████▍| 2372/2500 [14:42:05<39:31, 18.53s/it] {'loss': 0.0002, 'grad_norm': 0.026529654687853517, 'learning_rate': 5.12e-08, 'completion_length': 145.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00482177734375, 'epoch': 0.95} + 95%|█████████▍| 2372/2500 [14:42:05<39:31, 18.53s/it] 95%|█████████▍| 2373/2500 [14:42:25<39:47, 18.80s/it] {'loss': 0.0001, 'grad_norm': 0.18085942536325456, 'learning_rate': 5.08e-08, 'completion_length': 159.46429443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00298309326171875, 'epoch': 0.95} + 95%|█████████▍| 2373/2500 [14:42:25<39:47, 18.80s/it] 95%|█████████▍| 2374/2500 [14:42:43<39:01, 18.58s/it] {'loss': 0.0002, 'grad_norm': 0.2816823866967879, 'learning_rate': 5.04e-08, 'completion_length': 161.58929443359375, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00452423095703125, 'epoch': 0.95} + 95%|█████████▍| 2374/2500 [14:42:43<39:01, 18.58s/it] 95%|█████████▌| 2375/2500 [14:43:00<38:16, 18.37s/it] {'loss': 0.0002, 'grad_norm': 0.016216825029912636, 'learning_rate': 5e-08, 'completion_length': 162.51786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0042877197265625, 'epoch': 0.95} + 95%|█████████▌| 2375/2500 [14:43:00<38:16, 18.37s/it] 95%|█████████▌| 2376/2500 [14:43:18<37:25, 18.11s/it] {'loss': 0.0002, 'grad_norm': 0.3798530903688949, 'learning_rate': 4.9599999999999994e-08, 'completion_length': 161.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0042877197265625, 'epoch': 0.95} + 95%|█████████▌| 2376/2500 [14:43:18<37:25, 18.11s/it] 95%|█████████▌| 2377/2500 [14:43:37<37:36, 18.34s/it] {'loss': 0.0002, 'grad_norm': 0.0912038085908239, 'learning_rate': 4.92e-08, 'completion_length': 161.9464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051727294921875, 'epoch': 0.95} + 95%|█████████▌| 2377/2500 [14:43:37<37:36, 18.34s/it] 95%|█████████▌| 2378/2500 [14:43:56<37:37, 18.51s/it] {'loss': 0.0002, 'grad_norm': 0.29616506200152687, 'learning_rate': 4.88e-08, 'completion_length': 157.7857208251953, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0041656494140625, 'epoch': 0.95} + 95%|█████████▌| 2378/2500 [14:43:56<37:37, 18.51s/it] 95%|█████████▌| 2379/2500 [14:44:14<37:18, 18.50s/it] {'loss': 0.0002, 'grad_norm': 0.020216109338567403, 'learning_rate': 4.8399999999999997e-08, 'completion_length': 
159.08929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00557708740234375, 'epoch': 0.95} + 95%|█████████▌| 2379/2500 [14:44:14<37:18, 18.50s/it] 95%|█████████▌| 2380/2500 [14:44:33<37:01, 18.51s/it] {'loss': 0.0002, 'grad_norm': 0.21081541276438617, 'learning_rate': 4.8e-08, 'completion_length': 161.08929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0044403076171875, 'epoch': 0.95} + 95%|█████████▌| 2380/2500 [14:44:33<37:01, 18.51s/it] 95%|█████████▌| 2381/2500 [14:44:51<36:48, 18.56s/it] {'loss': 0.0001, 'grad_norm': 0.31438363472714065, 'learning_rate': 4.76e-08, 'completion_length': 150.9107208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0034332275390625, 'epoch': 0.95} + 95%|█████████▌| 2381/2500 [14:44:51<36:48, 18.56s/it] 95%|█████████▌| 2382/2500 [14:45:10<36:28, 18.55s/it] {'loss': 0.0001, 'grad_norm': 0.017808983130061053, 'learning_rate': 4.72e-08, 'completion_length': 155.4107208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0029754638671875, 'epoch': 0.95} + 95%|█████████▌| 2382/2500 [14:45:10<36:28, 18.55s/it] 95%|█████████▌| 2383/2500 [14:45:30<36:45, 18.85s/it] {'loss': 0.0003, 'grad_norm': 0.5341277026276497, 'learning_rate': 4.68e-08, 'completion_length': 176.71429443359375, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.007080078125, 'epoch': 0.95} + 95%|█████████▌| 2383/2500 [14:45:30<36:45, 18.85s/it] 95%|█████████▌| 2384/2500 [14:45:48<36:10, 18.71s/it] {'loss': 0.0002, 'grad_norm': 0.01509587411943402, 'learning_rate': 4.639999999999999e-08, 
'completion_length': 155.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00421142578125, 'epoch': 0.95} + 95%|█████████▌| 2384/2500 [14:45:48<36:10, 18.71s/it] 95%|█████████▌| 2385/2500 [14:46:08<36:40, 19.14s/it] {'loss': 0.0003, 'grad_norm': 1.1866623591546128, 'learning_rate': 4.5999999999999995e-08, 'completion_length': 162.83929443359375, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.0062408447265625, 'epoch': 0.95} + 95%|█████████▌| 2385/2500 [14:46:08<36:40, 19.14s/it] 95%|█████████▌| 2386/2500 [14:46:27<36:12, 19.06s/it] {'loss': 0.0002, 'grad_norm': 0.047044462254084804, 'learning_rate': 4.56e-08, 'completion_length': 152.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0050506591796875, 'epoch': 0.95} + 95%|█████████▌| 2386/2500 [14:46:27<36:12, 19.06s/it] 95%|█████████▌| 2387/2500 [14:46:45<35:07, 18.65s/it] {'loss': 0.0001, 'grad_norm': 0.013987507547923796, 'learning_rate': 4.5199999999999994e-08, 'completion_length': 143.375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00312042236328125, 'epoch': 0.95} + 95%|█████████▌| 2387/2500 [14:46:45<35:07, 18.65s/it] 96%|█████████▌| 2388/2500 [14:47:04<35:14, 18.88s/it] {'loss': 0.0003, 'grad_norm': 0.1977310952995122, 'learning_rate': 4.48e-08, 'completion_length': 183.5714340209961, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0066070556640625, 'epoch': 0.96} + 96%|█████████▌| 2388/2500 [14:47:04<35:14, 18.88s/it] 96%|█████████▌| 2389/2500 [14:47:23<35:02, 18.95s/it] {'loss': 0.0002, 'grad_norm': 0.08105788642303294, 'learning_rate': 4.44e-08, 'completion_length': 162.9821548461914, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0041656494140625, 'epoch': 0.96} + 96%|█████████▌| 2389/2500 [14:47:23<35:02, 18.95s/it] 96%|█████████▌| 2390/2500 [14:47:41<34:00, 18.55s/it] {'loss': 0.0002, 'grad_norm': 0.06768253802332969, 'learning_rate': 4.4e-08, 'completion_length': 143.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00390625, 'epoch': 0.96} + 96%|█████████▌| 2390/2500 [14:47:41<34:00, 18.55s/it] 96%|█████████▌| 2391/2500 [14:47:59<33:32, 18.47s/it] {'loss': 0.0001, 'grad_norm': 0.06882032937395655, 'learning_rate': 4.36e-08, 'completion_length': 148.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00287628173828125, 'epoch': 0.96} + 96%|█████████▌| 2391/2500 [14:47:59<33:32, 18.47s/it] 96%|█████████▌| 2392/2500 [14:48:17<33:12, 18.45s/it] {'loss': 0.0002, 'grad_norm': 0.2889751307433468, 'learning_rate': 4.32e-08, 'completion_length': 155.0357208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00402069091796875, 'epoch': 0.96} + 96%|█████████▌| 2392/2500 [14:48:17<33:12, 18.45s/it] 96%|█████████▌| 2393/2500 [14:48:36<33:04, 18.55s/it] {'loss': 0.0002, 'grad_norm': 0.018334891740797818, 'learning_rate': 4.279999999999999e-08, 'completion_length': 160.67858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00406646728515625, 'epoch': 0.96} + 96%|█████████▌| 2393/2500 [14:48:36<33:04, 18.55s/it] 96%|█████████▌| 2394/2500 [14:48:54<32:15, 18.26s/it] {'loss': 0.0002, 'grad_norm': 0.018662015338432522, 'learning_rate': 4.2399999999999996e-08, 'completion_length': 149.48214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00391387939453125, 'epoch': 0.96} + 96%|█████████▌| 2394/2500 [14:48:54<32:15, 18.26s/it] 96%|█████████▌| 2395/2500 [14:49:13<32:26, 18.54s/it] {'loss': 0.0002, 'grad_norm': 0.20248803881169328, 'learning_rate': 4.2e-08, 'completion_length': 166.57144165039062, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0045623779296875, 'epoch': 0.96} + 96%|█████████▌| 2395/2500 [14:49:13<32:26, 18.54s/it] 96%|█████████▌| 2396/2500 [14:49:32<32:28, 18.74s/it] {'loss': 0.0002, 'grad_norm': 0.36539866541648863, 'learning_rate': 4.1599999999999995e-08, 'completion_length': 153.8928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00421142578125, 'epoch': 0.96} + 96%|█████████▌| 2396/2500 [14:49:32<32:28, 18.74s/it] 96%|█████████▌| 2397/2500 [14:49:51<32:17, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.25853729353012694, 'learning_rate': 4.12e-08, 'completion_length': 161.76786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004241943359375, 'epoch': 0.96} + 96%|█████████▌| 2397/2500 [14:49:51<32:17, 18.81s/it] 96%|█████████▌| 2398/2500 [14:50:10<31:51, 18.74s/it] {'loss': 0.0002, 'grad_norm': 0.05299094670415581, 'learning_rate': 4.08e-08, 'completion_length': 156.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0043182373046875, 'epoch': 0.96} + 96%|█████████▌| 2398/2500 [14:50:10<31:51, 18.74s/it] 96%|█████████▌| 2399/2500 [14:50:29<31:44, 18.86s/it] {'loss': 0.0002, 'grad_norm': 0.024779619258704165, 'learning_rate': 4.04e-08, 'completion_length': 166.89286041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 
'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0052947998046875, 'epoch': 0.96} + 96%|█████████▌| 2399/2500 [14:50:29<31:44, 18.86s/it] 96%|█████████▌| 2400/2500 [14:50:48<31:19, 18.79s/it] {'loss': 0.0002, 'grad_norm': 0.24197160678988686, 'learning_rate': 4e-08, 'completion_length': 147.1607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.0048370361328125, 'epoch': 0.96} + 96%|█████████▌| 2400/2500 [14:50:48<31:19, 18.79s/it] 96%|█████████▌| 2401/2500 [14:54:09<2:01:25, 73.59s/it] {'loss': 0.0002, 'grad_norm': 0.04050872771335347, 'learning_rate': 3.9600000000000004e-08, 'completion_length': 164.2857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053863525390625, 'epoch': 0.96} + 96%|█████████▌| 2401/2500 [14:54:09<2:01:25, 73.59s/it] 96%|█████████▌| 2402/2500 [14:54:28<1:33:13, 57.07s/it] {'loss': 0.0001, 'grad_norm': 0.18970042095779263, 'learning_rate': 3.9199999999999994e-08, 'completion_length': 145.87500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.003570556640625, 'epoch': 0.96} + 96%|█████████▌| 2402/2500 [14:54:28<1:33:13, 57.07s/it] 96%|█████████▌| 2403/2500 [14:54:46<1:13:42, 45.60s/it] {'loss': 0.0001, 'grad_norm': 0.3073606664966341, 'learning_rate': 3.88e-08, 'completion_length': 164.58928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.003265380859375, 'epoch': 0.96} + 96%|█████████▌| 2403/2500 [14:54:46<1:13:42, 45.60s/it] 96%|█████████▌| 2404/2500 [14:55:04<59:31, 37.20s/it] {'loss': 0.0001, 'grad_norm': 0.026507901117139607, 'learning_rate': 3.839999999999999e-08, 'completion_length': 152.6607208251953, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00360870361328125, 'epoch': 0.96} + 96%|█████████▌| 2404/2500 [14:55:04<59:31, 37.20s/it] 96%|█████████▌| 2405/2500 [14:55:23<50:18, 31.78s/it] {'loss': 0.0002, 'grad_norm': 0.23767715496734232, 'learning_rate': 3.7999999999999996e-08, 'completion_length': 168.3214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.00524139404296875, 'epoch': 0.96} + 96%|█████████▌| 2405/2500 [14:55:23<50:18, 31.78s/it] 96%|█████████▌| 2406/2500 [14:55:41<43:17, 27.64s/it] {'loss': 0.0001, 'grad_norm': 0.01657599096973292, 'learning_rate': 3.76e-08, 'completion_length': 152.9821548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00315093994140625, 'epoch': 0.96} + 96%|█████████▌| 2406/2500 [14:55:41<43:17, 27.64s/it] 96%|█████████▋| 2407/2500 [14:56:00<38:58, 25.15s/it] {'loss': 0.0002, 'grad_norm': 0.020125126593790024, 'learning_rate': 3.7199999999999996e-08, 'completion_length': 171.75001525878906, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0051116943359375, 'epoch': 0.96} + 96%|█████████▋| 2407/2500 [14:56:00<38:58, 25.15s/it] 96%|█████████▋| 2408/2500 [14:56:18<34:57, 22.80s/it] {'loss': 0.0001, 'grad_norm': 0.023254754814814017, 'learning_rate': 3.68e-08, 'completion_length': 140.4464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00315093994140625, 'epoch': 0.96} + 96%|█████████▋| 2408/2500 [14:56:18<34:57, 22.80s/it] 96%|█████████▋| 2409/2500 [14:56:36<32:42, 21.57s/it] {'loss': 0.0002, 'grad_norm': 0.34274066114207646, 'learning_rate': 3.64e-08, 'completion_length': 150.51786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 
0.0714285746216774, 'kl': 0.00443267822265625, 'epoch': 0.96} + 96%|█████████▋| 2409/2500 [14:56:36<32:42, 21.57s/it] 96%|█████████▋| 2410/2500 [14:56:55<30:49, 20.55s/it] {'loss': 0.0002, 'grad_norm': 0.823134215288847, 'learning_rate': 3.6e-08, 'completion_length': 151.80358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0037841796875, 'epoch': 0.96} + 96%|█████████▋| 2410/2500 [14:56:55<30:49, 20.55s/it] 96%|█████████▋| 2411/2500 [14:57:13<29:25, 19.84s/it] {'loss': 0.0002, 'grad_norm': 0.033193656501981195, 'learning_rate': 3.56e-08, 'completion_length': 153.71429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00527191162109375, 'epoch': 0.96} + 96%|█████████▋| 2411/2500 [14:57:13<29:25, 19.84s/it] 96%|█████████▋| 2412/2500 [14:57:32<28:39, 19.54s/it] {'loss': 0.0003, 'grad_norm': 0.4492762253486053, 'learning_rate': 3.52e-08, 'completion_length': 159.3571548461914, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00762939453125, 'epoch': 0.96} + 96%|█████████▋| 2412/2500 [14:57:32<28:39, 19.54s/it] 97%|█████████▋| 2413/2500 [14:57:50<27:46, 19.16s/it] {'loss': 0.0002, 'grad_norm': 0.020706138044723543, 'learning_rate': 3.4799999999999994e-08, 'completion_length': 153.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0060882568359375, 'epoch': 0.97} + 97%|█████████▋| 2413/2500 [14:57:50<27:46, 19.16s/it] 97%|█████████▋| 2414/2500 [14:58:10<27:40, 19.31s/it] {'loss': 0.0003, 'grad_norm': 0.514928873438858, 'learning_rate': 3.44e-08, 'completion_length': 154.1964340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 
0.00679779052734375, 'epoch': 0.97} + 97%|█████████▋| 2414/2500 [14:58:10<27:40, 19.31s/it] 97%|█████████▋| 2415/2500 [14:58:28<27:11, 19.20s/it] {'loss': 0.0001, 'grad_norm': 0.043261830098739414, 'learning_rate': 3.4e-08, 'completion_length': 158.8214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00341033935546875, 'epoch': 0.97} + 97%|█████████▋| 2415/2500 [14:58:28<27:11, 19.20s/it] 97%|█████████▋| 2416/2500 [14:58:47<26:27, 18.90s/it] {'loss': 0.0002, 'grad_norm': 0.18169460575283547, 'learning_rate': 3.3599999999999996e-08, 'completion_length': 148.58929443359375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00464630126953125, 'epoch': 0.97} + 97%|█████████▋| 2416/2500 [14:58:47<26:27, 18.90s/it] 97%|█████████▋| 2417/2500 [14:59:05<25:53, 18.71s/it] {'loss': 0.0002, 'grad_norm': 0.4011317175523694, 'learning_rate': 3.32e-08, 'completion_length': 153.2857208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00424957275390625, 'epoch': 0.97} + 97%|█████████▋| 2417/2500 [14:59:05<25:53, 18.71s/it] 97%|█████████▋| 2418/2500 [14:59:23<25:16, 18.49s/it] {'loss': 0.0001, 'grad_norm': 0.09189058715802924, 'learning_rate': 3.28e-08, 'completion_length': 162.64286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00337982177734375, 'epoch': 0.97} + 97%|█████████▋| 2418/2500 [14:59:23<25:16, 18.49s/it] 97%|█████████▋| 2419/2500 [14:59:41<24:42, 18.30s/it] {'loss': 0.0001, 'grad_norm': 0.030478337569908226, 'learning_rate': 3.24e-08, 'completion_length': 150.30358123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0025634765625, 'epoch': 0.97} 
+ 97%|█████████▋| 2419/2500 [14:59:41<24:42, 18.30s/it] 97%|█████████▋| 2420/2500 [15:00:00<24:43, 18.55s/it] {'loss': 0.0002, 'grad_norm': 0.5237450683944264, 'learning_rate': 3.2e-08, 'completion_length': 161.7321548461914, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00417327880859375, 'epoch': 0.97} + 97%|█████████▋| 2420/2500 [15:00:00<24:43, 18.55s/it] 97%|█████████▋| 2421/2500 [15:00:18<24:06, 18.31s/it] {'loss': 0.0001, 'grad_norm': 0.7826347255400304, 'learning_rate': 3.16e-08, 'completion_length': 140.50000762939453, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.002593994140625, 'epoch': 0.97} + 97%|█████████▋| 2421/2500 [15:00:18<24:06, 18.31s/it] 97%|█████████▋| 2422/2500 [15:00:36<23:50, 18.33s/it] {'loss': 0.0001, 'grad_norm': 0.3012395961965348, 'learning_rate': 3.1199999999999995e-08, 'completion_length': 148.33928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.003204345703125, 'epoch': 0.97} + 97%|█████████▋| 2422/2500 [15:00:36<23:50, 18.33s/it] 97%|█████████▋| 2423/2500 [15:00:54<23:32, 18.34s/it] {'loss': 0.0002, 'grad_norm': 0.34770900150639933, 'learning_rate': 3.08e-08, 'completion_length': 155.0357208251953, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.00551605224609375, 'epoch': 0.97} + 97%|█████████▋| 2423/2500 [15:00:54<23:32, 18.34s/it] 97%|█████████▋| 2424/2500 [15:01:15<23:59, 18.94s/it] {'loss': 0.0002, 'grad_norm': 0.08317647434759173, 'learning_rate': 3.04e-08, 'completion_length': 181.67858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 
'reward_std': 0.0, 'kl': 0.0053253173828125, 'epoch': 0.97} + 97%|█████████▋| 2424/2500 [15:01:15<23:59, 18.94s/it] 97%|█████████▋| 2425/2500 [15:01:33<23:22, 18.70s/it] {'loss': 0.0002, 'grad_norm': 0.2680722113282763, 'learning_rate': 3e-08, 'completion_length': 166.1964340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00481414794921875, 'epoch': 0.97} + 97%|█████████▋| 2425/2500 [15:01:33<23:22, 18.70s/it] 97%|█████████▋| 2426/2500 [15:01:51<22:57, 18.62s/it] {'loss': 0.0001, 'grad_norm': 0.28989333582285987, 'learning_rate': 2.96e-08, 'completion_length': 156.58929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0032806396484375, 'epoch': 0.97} + 97%|█████████▋| 2426/2500 [15:01:51<22:57, 18.62s/it] 97%|█████████▋| 2427/2500 [15:02:11<22:56, 18.86s/it] {'loss': 0.0003, 'grad_norm': 0.21984089569399848, 'learning_rate': 2.92e-08, 'completion_length': 179.9821548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00653076171875, 'epoch': 0.97} + 97%|█████████▋| 2427/2500 [15:02:11<22:56, 18.86s/it] 97%|█████████▋| 2428/2500 [15:02:29<22:21, 18.64s/it] {'loss': 0.0002, 'grad_norm': 0.597245462510145, 'learning_rate': 2.8799999999999996e-08, 'completion_length': 154.9107208251953, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.0054473876953125, 'epoch': 0.97} + 97%|█████████▋| 2428/2500 [15:02:29<22:21, 18.64s/it] 97%|█████████▋| 2429/2500 [15:02:47<22:00, 18.60s/it] {'loss': 0.0003, 'grad_norm': 0.44502564100307246, 'learning_rate': 2.84e-08, 'completion_length': 162.33928680419922, 'rewards/accuracy_reward': 0.8928571939468384, 
'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.00766754150390625, 'epoch': 0.97} + 97%|█████████▋| 2429/2500 [15:02:47<22:00, 18.60s/it] 97%|█████████▋| 2430/2500 [15:03:07<21:59, 18.85s/it] {'loss': 0.0002, 'grad_norm': 0.22768322976278718, 'learning_rate': 2.8e-08, 'completion_length': 157.96429443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0048370361328125, 'epoch': 0.97} + 97%|█████████▋| 2430/2500 [15:03:07<21:59, 18.85s/it] 97%|█████████▋| 2431/2500 [15:03:25<21:26, 18.65s/it] {'loss': 0.0002, 'grad_norm': 1.8894978783332315, 'learning_rate': 2.76e-08, 'completion_length': 143.00000762939453, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0062255859375, 'epoch': 0.97} + 97%|█████████▋| 2431/2500 [15:03:25<21:26, 18.65s/it] 97%|█████████▋| 2432/2500 [15:03:44<21:24, 18.90s/it] {'loss': 0.0002, 'grad_norm': 0.275432608797155, 'learning_rate': 2.72e-08, 'completion_length': 157.50000762939453, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.0055084228515625, 'epoch': 0.97} + 97%|█████████▋| 2432/2500 [15:03:44<21:24, 18.90s/it] 97%|█████████▋| 2433/2500 [15:04:03<20:50, 18.67s/it] {'loss': 0.0001, 'grad_norm': 0.01574472721453072, 'learning_rate': 2.68e-08, 'completion_length': 152.80358123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00321197509765625, 'epoch': 0.97} + 97%|█████████▋| 2433/2500 [15:04:03<20:50, 18.67s/it] 97%|█████████▋| 2434/2500 [15:04:22<20:48, 18.91s/it] {'loss': 0.0003, 'grad_norm': 0.24830771403913582, 'learning_rate': 2.6399999999999998e-08, 'completion_length': 179.7678680419922, 'rewards/accuracy_reward': 
0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0073699951171875, 'epoch': 0.97} + 97%|█████████▋| 2434/2500 [15:04:22<20:48, 18.91s/it] 97%|█████████▋| 2435/2500 [15:04:41<20:22, 18.80s/it] {'loss': 0.0002, 'grad_norm': 0.01633550429010978, 'learning_rate': 2.5999999999999998e-08, 'completion_length': 164.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0042266845703125, 'epoch': 0.97} + 97%|█████████▋| 2435/2500 [15:04:41<20:22, 18.80s/it] 97%|█████████▋| 2436/2500 [15:04:59<19:59, 18.75s/it] {'loss': 0.0002, 'grad_norm': 0.021385686986709375, 'learning_rate': 2.56e-08, 'completion_length': 154.50000762939453, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.004974365234375, 'epoch': 0.97} + 97%|█████████▋| 2436/2500 [15:04:59<19:59, 18.75s/it] 97%|█████████▋| 2437/2500 [15:05:18<19:49, 18.87s/it] {'loss': 0.0002, 'grad_norm': 0.5455377683696755, 'learning_rate': 2.52e-08, 'completion_length': 167.3571548461914, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.04123930633068085, 'kl': 0.004913330078125, 'epoch': 0.97} + 97%|█████████▋| 2437/2500 [15:05:18<19:49, 18.87s/it] 98%|█████████▊| 2438/2500 [15:05:36<19:15, 18.64s/it] {'loss': 0.0002, 'grad_norm': 0.2824815938602609, 'learning_rate': 2.4799999999999997e-08, 'completion_length': 162.3928680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.00521087646484375, 'epoch': 0.98} + 98%|█████████▊| 2438/2500 [15:05:36<19:15, 18.64s/it] 98%|█████████▊| 2439/2500 [15:05:55<18:53, 18.58s/it] {'loss': 0.0002, 'grad_norm': 0.025355171620905288, 'learning_rate': 2.44e-08, 'completion_length': 154.71428680419922, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0038909912109375, 'epoch': 0.98} + 98%|█████████▊| 2439/2500 [15:05:55<18:53, 18.58s/it] 98%|█████████▊| 2440/2500 [15:06:14<18:40, 18.67s/it] {'loss': 0.0002, 'grad_norm': 0.018555349371050606, 'learning_rate': 2.4e-08, 'completion_length': 154.62500762939453, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00446319580078125, 'epoch': 0.98} + 98%|█████████▊| 2440/2500 [15:06:14<18:40, 18.67s/it] 98%|█████████▊| 2441/2500 [15:06:33<18:25, 18.74s/it] {'loss': 0.0003, 'grad_norm': 2.193960369100718, 'learning_rate': 2.36e-08, 'completion_length': 171.50000762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.00701904296875, 'epoch': 0.98} + 98%|█████████▊| 2441/2500 [15:06:33<18:25, 18.74s/it] 98%|█████████▊| 2442/2500 [15:06:51<18:01, 18.64s/it] {'loss': 0.0002, 'grad_norm': 0.03008075060807917, 'learning_rate': 2.3199999999999996e-08, 'completion_length': 162.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.005462646484375, 'epoch': 0.98} + 98%|█████████▊| 2442/2500 [15:06:51<18:01, 18.64s/it] 98%|█████████▊| 2443/2500 [15:07:10<17:53, 18.83s/it] {'loss': 0.0002, 'grad_norm': 0.38967431842985656, 'learning_rate': 2.28e-08, 'completion_length': 152.6607208251953, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00616455078125, 'epoch': 0.98} + 98%|█████████▊| 2443/2500 [15:07:10<17:53, 18.83s/it] 98%|█████████▊| 2444/2500 [15:07:29<17:34, 18.84s/it] {'loss': 0.0003, 'grad_norm': 0.41780261914036815, 'learning_rate': 2.24e-08, 'completion_length': 157.9821548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 
1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0073699951171875, 'epoch': 0.98} + 98%|█████████▊| 2444/2500 [15:07:29<17:34, 18.84s/it] 98%|█████████▊| 2445/2500 [15:07:47<17:02, 18.60s/it] {'loss': 0.0002, 'grad_norm': 0.014642711452423494, 'learning_rate': 2.2e-08, 'completion_length': 159.51786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00445556640625, 'epoch': 0.98} + 98%|█████████▊| 2445/2500 [15:07:47<17:02, 18.60s/it] 98%|█████████▊| 2446/2500 [15:08:06<16:43, 18.58s/it] {'loss': 0.0001, 'grad_norm': 0.2093109078698039, 'learning_rate': 2.16e-08, 'completion_length': 147.3571548461914, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0032501220703125, 'epoch': 0.98} + 98%|█████████▊| 2446/2500 [15:08:06<16:43, 18.58s/it] 98%|█████████▊| 2447/2500 [15:08:24<16:17, 18.45s/it] {'loss': 0.0001, 'grad_norm': 0.019647196068557655, 'learning_rate': 2.1199999999999998e-08, 'completion_length': 136.01786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00315093994140625, 'epoch': 0.98} + 98%|█████████▊| 2447/2500 [15:08:24<16:17, 18.45s/it] 98%|█████████▊| 2448/2500 [15:08:43<16:14, 18.75s/it] {'loss': 0.0003, 'grad_norm': 0.33642753401391684, 'learning_rate': 2.0799999999999998e-08, 'completion_length': 169.25000762939453, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.0081787109375, 'epoch': 0.98} + 98%|█████████▊| 2448/2500 [15:08:43<16:14, 18.75s/it] 98%|█████████▊| 2449/2500 [15:09:02<15:58, 18.79s/it] {'loss': 0.0003, 'grad_norm': 0.021692586595027815, 'learning_rate': 2.04e-08, 'completion_length': 175.21429443359375, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0070953369140625, 'epoch': 0.98} + 98%|█████████▊| 2449/2500 [15:09:02<15:58, 18.79s/it] 98%|█████████▊| 2450/2500 [15:09:21<15:43, 18.87s/it] {'loss': 0.0002, 'grad_norm': 0.3778997620874851, 'learning_rate': 2e-08, 'completion_length': 166.33929443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0059661865234375, 'epoch': 0.98} + 98%|█████████▊| 2450/2500 [15:09:21<15:43, 18.87s/it] 98%|█████████▊| 2451/2500 [15:09:39<15:01, 18.40s/it] {'loss': 0.0002, 'grad_norm': 0.013578576314421154, 'learning_rate': 1.9599999999999997e-08, 'completion_length': 150.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00408172607421875, 'epoch': 0.98} + 98%|█████████▊| 2451/2500 [15:09:39<15:01, 18.40s/it] 98%|█████████▊| 2452/2500 [15:09:58<14:51, 18.58s/it] {'loss': 0.0003, 'grad_norm': 0.6395722759018105, 'learning_rate': 1.9199999999999997e-08, 'completion_length': 166.51786041259766, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.006561279296875, 'epoch': 0.98} + 98%|█████████▊| 2452/2500 [15:09:58<14:51, 18.58s/it] 98%|█████████▊| 2453/2500 [15:10:16<14:27, 18.45s/it] {'loss': 0.0002, 'grad_norm': 2.755833485788607, 'learning_rate': 1.88e-08, 'completion_length': 137.96428680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.004852294921875, 'epoch': 0.98} + 98%|█████████▊| 2453/2500 [15:10:16<14:27, 18.45s/it] 98%|█████████▊| 2454/2500 [15:10:35<14:15, 18.60s/it] {'loss': 0.0002, 'grad_norm': 0.0380315315920441, 'learning_rate': 1.84e-08, 'completion_length': 153.67857360839844, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0046844482421875, 'epoch': 0.98} + 98%|█████████▊| 2454/2500 [15:10:35<14:15, 18.60s/it] 98%|█████████▊| 2455/2500 [15:10:55<14:21, 19.14s/it] {'loss': 0.0002, 'grad_norm': 0.3593251734399432, 'learning_rate': 1.8e-08, 'completion_length': 162.07144165039062, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0042572021484375, 'epoch': 0.98} + 98%|█████████▊| 2455/2500 [15:10:55<14:21, 19.14s/it] 98%|█████████▊| 2456/2500 [15:11:14<13:52, 18.93s/it] {'loss': 0.0001, 'grad_norm': 0.0224448052978729, 'learning_rate': 1.76e-08, 'completion_length': 148.14286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.003276824951171875, 'epoch': 0.98} + 98%|█████████▊| 2456/2500 [15:11:14<13:52, 18.93s/it] 98%|█████████▊| 2457/2500 [15:11:32<13:28, 18.80s/it] {'loss': 0.0002, 'grad_norm': 0.020418050163016666, 'learning_rate': 1.72e-08, 'completion_length': 172.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00469970703125, 'epoch': 0.98} + 98%|█████████▊| 2457/2500 [15:11:32<13:28, 18.80s/it] 98%|█████████▊| 2458/2500 [15:11:51<13:15, 18.95s/it] {'loss': 0.0002, 'grad_norm': 0.3551181388837755, 'learning_rate': 1.6799999999999998e-08, 'completion_length': 151.46429443359375, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00531005859375, 'epoch': 0.98} + 98%|█████████▊| 2458/2500 [15:11:51<13:15, 18.95s/it] 98%|█████████▊| 2459/2500 [15:12:12<13:17, 19.45s/it] {'loss': 0.0003, 'grad_norm': 0.2306836628139854, 'learning_rate': 1.64e-08, 'completion_length': 174.76786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 
1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.006256103515625, 'epoch': 0.98} + 98%|█████████▊| 2459/2500 [15:12:12<13:17, 19.45s/it] 98%|█████████▊| 2460/2500 [15:12:33<13:10, 19.77s/it] {'loss': 0.0002, 'grad_norm': 0.05289798068950066, 'learning_rate': 1.6e-08, 'completion_length': 169.05358123779297, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.00543975830078125, 'epoch': 0.98} + 98%|█████████▊| 2460/2500 [15:12:33<13:10, 19.77s/it] 98%|█████████▊| 2461/2500 [15:12:51<12:38, 19.46s/it] {'loss': 0.0002, 'grad_norm': 0.0763926189087686, 'learning_rate': 1.5599999999999997e-08, 'completion_length': 144.3035774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00396728515625, 'epoch': 0.98} + 98%|█████████▊| 2461/2500 [15:12:51<12:38, 19.46s/it] 98%|█████████▊| 2462/2500 [15:13:09<11:57, 18.89s/it] {'loss': 0.0001, 'grad_norm': 0.24342353204455333, 'learning_rate': 1.52e-08, 'completion_length': 152.3928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00351715087890625, 'epoch': 0.98} + 98%|█████████▊| 2462/2500 [15:13:09<11:57, 18.89s/it] 99%|█████████▊| 2463/2500 [15:13:28<11:38, 18.87s/it] {'loss': 0.0001, 'grad_norm': 0.01716351308476866, 'learning_rate': 1.48e-08, 'completion_length': 158.6964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00357818603515625, 'epoch': 0.99} + 99%|█████████▊| 2463/2500 [15:13:28<11:38, 18.87s/it] 99%|█████████▊| 2464/2500 [15:13:46<11:11, 18.65s/it] {'loss': 0.0001, 'grad_norm': 0.01980165338193501, 'learning_rate': 1.4399999999999998e-08, 'completion_length': 157.92858123779297, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.00281524658203125, 'epoch': 0.99} + 99%|█████████▊| 2464/2500 [15:13:46<11:11, 18.65s/it] 99%|█████████▊| 2465/2500 [15:14:05<10:56, 18.76s/it] {'loss': 0.0002, 'grad_norm': 0.2734295934271051, 'learning_rate': 1.4e-08, 'completion_length': 161.25000762939453, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0041351318359375, 'epoch': 0.99} + 99%|█████████▊| 2465/2500 [15:14:05<10:56, 18.76s/it] 99%|█████████▊| 2466/2500 [15:14:25<10:50, 19.13s/it] {'loss': 0.0002, 'grad_norm': 0.4810696201005601, 'learning_rate': 1.36e-08, 'completion_length': 163.60714721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.00475311279296875, 'epoch': 0.99} + 99%|█████████▊| 2466/2500 [15:14:25<10:50, 19.13s/it] 99%|█████████▊| 2467/2500 [15:14:45<10:44, 19.54s/it] {'loss': 0.0003, 'grad_norm': 0.23082490013825954, 'learning_rate': 1.3199999999999999e-08, 'completion_length': 160.76786041259766, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.006378173828125, 'epoch': 0.99} + 99%|█████████▊| 2467/2500 [15:14:45<10:44, 19.54s/it] 99%|█████████▊| 2468/2500 [15:15:04<10:21, 19.43s/it] {'loss': 0.0002, 'grad_norm': 0.028874145690384736, 'learning_rate': 1.28e-08, 'completion_length': 144.51786041259766, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0057525634765625, 'epoch': 0.99} + 99%|█████████▊| 2468/2500 [15:15:04<10:21, 19.43s/it] 99%|█████████▉| 2469/2500 [15:15:21<09:37, 18.63s/it] {'loss': 0.0002, 'grad_norm': 0.02601724924648944, 'learning_rate': 1.2399999999999999e-08, 'completion_length': 135.7857208251953, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.0047149658203125, 'epoch': 0.99} + 99%|█████████▉| 2469/2500 [15:15:21<09:37, 18.63s/it] 99%|█████████▉| 2470/2500 [15:15:38<08:59, 17.98s/it] {'loss': 0.0002, 'grad_norm': 0.2831007046724547, 'learning_rate': 1.2e-08, 'completion_length': 140.64286422729492, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00421905517578125, 'epoch': 0.99} + 99%|█████████▉| 2470/2500 [15:15:38<08:59, 17.98s/it] 99%|█████████▉| 2471/2500 [15:15:56<08:46, 18.16s/it] {'loss': 0.0002, 'grad_norm': 0.18101551362320198, 'learning_rate': 1.1599999999999998e-08, 'completion_length': 148.55358123779297, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.006103515625, 'epoch': 0.99} + 99%|█████████▉| 2471/2500 [15:15:56<08:46, 18.16s/it] 99%|█████████▉| 2472/2500 [15:16:15<08:30, 18.24s/it] {'loss': 0.0003, 'grad_norm': 0.020669381855623486, 'learning_rate': 1.12e-08, 'completion_length': 159.89286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00689697265625, 'epoch': 0.99} + 99%|█████████▉| 2472/2500 [15:16:15<08:30, 18.24s/it] 99%|█████████▉| 2473/2500 [15:16:33<08:14, 18.33s/it] {'loss': 0.0001, 'grad_norm': 0.014563654130509735, 'learning_rate': 1.08e-08, 'completion_length': 154.21428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.002895355224609375, 'epoch': 0.99} + 99%|█████████▉| 2473/2500 [15:16:33<08:14, 18.33s/it] 99%|█████████▉| 2474/2500 [15:16:52<08:00, 18.47s/it] {'loss': 0.0002, 'grad_norm': 1.0967331533610194, 'learning_rate': 1.0399999999999999e-08, 'completion_length': 161.42858123779297, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 
0.0714285746216774, 'kl': 0.0048065185546875, 'epoch': 0.99} + 99%|█████████▉| 2474/2500 [15:16:52<08:00, 18.47s/it] 99%|█████████▉| 2475/2500 [15:17:11<07:42, 18.50s/it] {'loss': 0.0002, 'grad_norm': 0.27461470748821554, 'learning_rate': 1e-08, 'completion_length': 160.5714340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.003936767578125, 'epoch': 0.99} + 99%|█████████▉| 2475/2500 [15:17:11<07:42, 18.50s/it] 99%|█████████▉| 2476/2500 [15:17:28<07:18, 18.28s/it] {'loss': 0.0002, 'grad_norm': 0.24253382350853775, 'learning_rate': 9.599999999999998e-09, 'completion_length': 149.62500762939453, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.04123930633068085, 'kl': 0.00390625, 'epoch': 0.99} + 99%|█████████▉| 2476/2500 [15:17:28<07:18, 18.28s/it] 99%|█████████▉| 2477/2500 [15:17:48<07:08, 18.64s/it] {'loss': 0.0002, 'grad_norm': 0.23533509796550245, 'learning_rate': 9.2e-09, 'completion_length': 147.10714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.005096435546875, 'epoch': 0.99} + 99%|█████████▉| 2477/2500 [15:17:48<07:08, 18.64s/it] 99%|█████████▉| 2478/2500 [15:18:07<06:53, 18.81s/it] {'loss': 0.0002, 'grad_norm': 0.19023717346130264, 'learning_rate': 8.8e-09, 'completion_length': 157.76786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00443267822265625, 'epoch': 0.99} + 99%|█████████▉| 2478/2500 [15:18:07<06:53, 18.81s/it] 99%|█████████▉| 2479/2500 [15:18:26<06:36, 18.88s/it] {'loss': 0.0002, 'grad_norm': 0.034160238956009156, 'learning_rate': 8.399999999999999e-09, 'completion_length': 170.7857208251953, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00460052490234375, 'epoch': 0.99} + 99%|█████████▉| 2479/2500 [15:18:26<06:36, 18.88s/it] 99%|█████████▉| 2480/2500 [15:18:46<06:23, 19.16s/it] {'loss': 0.0003, 'grad_norm': 0.7229005481045045, 'learning_rate': 8e-09, 'completion_length': 186.33929443359375, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0065460205078125, 'epoch': 0.99} + 99%|█████████▉| 2480/2500 [15:18:46<06:23, 19.16s/it] 99%|█████████▉| 2481/2500 [15:19:06<06:07, 19.37s/it] {'loss': 0.0002, 'grad_norm': 0.5113840146821736, 'learning_rate': 7.6e-09, 'completion_length': 167.55357360839844, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1071428656578064, 'kl': 0.00428009033203125, 'epoch': 0.99} + 99%|█████████▉| 2481/2500 [15:19:06<06:07, 19.37s/it] 99%|█████████▉| 2482/2500 [15:19:24<05:43, 19.07s/it] {'loss': 0.0002, 'grad_norm': 0.3087792965169337, 'learning_rate': 7.199999999999999e-09, 'completion_length': 161.0714340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.0042572021484375, 'epoch': 0.99} + 99%|█████████▉| 2482/2500 [15:19:24<05:43, 19.07s/it] 99%|█████████▉| 2483/2500 [15:19:44<05:28, 19.34s/it] {'loss': 0.0002, 'grad_norm': 0.18890226397319754, 'learning_rate': 6.8e-09, 'completion_length': 164.6607208251953, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.004428863525390625, 'epoch': 0.99} + 99%|█████████▉| 2483/2500 [15:19:44<05:28, 19.34s/it] 99%|█████████▉| 2484/2500 [15:20:03<05:05, 19.12s/it] {'loss': 0.0002, 'grad_norm': 0.23932318973383335, 'learning_rate': 6.4e-09, 'completion_length': 150.21429443359375, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.006015777587890625, 'epoch': 0.99} + 99%|█████████▉| 2484/2500 [15:20:03<05:05, 19.12s/it] 99%|█████████▉| 2485/2500 [15:20:22<04:46, 19.08s/it] {'loss': 0.0003, 'grad_norm': 0.3651353521320805, 'learning_rate': 6e-09, 'completion_length': 155.83929443359375, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.0070953369140625, 'epoch': 0.99} + 99%|█████████▉| 2485/2500 [15:20:22<04:46, 19.08s/it] 99%|█████████▉| 2486/2500 [15:20:40<04:25, 18.95s/it] {'loss': 0.0002, 'grad_norm': 0.0317860363397245, 'learning_rate': 5.6e-09, 'completion_length': 155.0, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0045318603515625, 'epoch': 0.99} + 99%|█████████▉| 2486/2500 [15:20:40<04:25, 18.95s/it] 99%|█████████▉| 2487/2500 [15:21:00<04:08, 19.08s/it] {'loss': 0.0003, 'grad_norm': 0.14638995099005714, 'learning_rate': 5.1999999999999994e-09, 'completion_length': 151.0357208251953, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.00635528564453125, 'epoch': 0.99} + 99%|█████████▉| 2487/2500 [15:21:00<04:08, 19.08s/it] 100%|█████████▉| 2488/2500 [15:21:19<03:50, 19.21s/it] {'loss': 0.0002, 'grad_norm': 0.484898710365812, 'learning_rate': 4.799999999999999e-09, 'completion_length': 156.1964340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.00382232666015625, 'epoch': 1.0} + 100%|█████████▉| 2488/2500 [15:21:19<03:50, 19.21s/it] 100%|█████████▉| 2489/2500 [15:21:39<03:32, 19.35s/it] {'loss': 0.0002, 'grad_norm': 0.22939462804901922, 'learning_rate': 4.4e-09, 'completion_length': 
167.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.00478363037109375, 'epoch': 1.0} + 100%|█████████▉| 2489/2500 [15:21:39<03:32, 19.35s/it] 100%|█████████▉| 2490/2500 [15:21:58<03:11, 19.16s/it] {'loss': 0.0002, 'grad_norm': 0.15831464127104888, 'learning_rate': 4e-09, 'completion_length': 154.35714721679688, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.0048675537109375, 'epoch': 1.0} + 100%|█████████▉| 2490/2500 [15:21:58<03:11, 19.16s/it] 100%|█████████▉| 2491/2500 [15:22:16<02:49, 18.78s/it] {'loss': 0.0001, 'grad_norm': 0.03639373745498538, 'learning_rate': 3.5999999999999996e-09, 'completion_length': 147.6964340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.00341796875, 'epoch': 1.0} + 100%|█████████▉| 2491/2500 [15:22:16<02:49, 18.78s/it] 100%|█████████▉| 2492/2500 [15:22:35<02:32, 19.07s/it] {'loss': 0.0002, 'grad_norm': 0.33147284280688366, 'learning_rate': 3.2e-09, 'completion_length': 175.33929443359375, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.0040130615234375, 'epoch': 1.0} + 100%|█████████▉| 2492/2500 [15:22:35<02:32, 19.07s/it] 100%|█████████▉| 2493/2500 [15:22:56<02:16, 19.49s/it] {'loss': 0.0002, 'grad_norm': 0.6015044530190338, 'learning_rate': 2.8e-09, 'completion_length': 170.62500762939453, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.0052490234375, 'epoch': 1.0} + 100%|█████████▉| 2493/2500 [15:22:56<02:16, 19.49s/it] 100%|█████████▉| 2494/2500 [15:23:14<01:55, 19.24s/it] {'loss': 0.0002, 'grad_norm': 0.024290426578273645, 
'learning_rate': 2.3999999999999996e-09, 'completion_length': 161.41072845458984, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.004974365234375, 'epoch': 1.0} + 100%|█████████▉| 2494/2500 [15:23:14<01:55, 19.24s/it] 100%|█████████▉| 2495/2500 [15:23:32<01:33, 18.74s/it] {'loss': 0.0001, 'grad_norm': 0.024326870076090225, 'learning_rate': 2e-09, 'completion_length': 147.96429443359375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.00322723388671875, 'epoch': 1.0} + 100%|█████████▉| 2495/2500 [15:23:32<01:33, 18.74s/it] 100%|█████████▉| 2496/2500 [15:23:51<01:14, 18.72s/it] {'loss': 0.0001, 'grad_norm': 0.4562999804960021, 'learning_rate': 1.6e-09, 'completion_length': 153.67857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.0033721923828125, 'epoch': 1.0} + 100%|█████████▉| 2496/2500 [15:23:51<01:14, 18.72s/it] 100%|█████████▉| 2497/2500 [15:24:08<00:54, 18.26s/it] {'loss': 0.0002, 'grad_norm': 0.08359875733792856, 'learning_rate': 1.1999999999999998e-09, 'completion_length': 140.6071548461914, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0053863525390625, 'epoch': 1.0} + 100%|█████████▉| 2497/2500 [15:24:08<00:54, 18.26s/it] 100%|█████████▉| 2498/2500 [15:24:27<00:37, 18.62s/it] {'loss': 0.0003, 'grad_norm': 0.02322713965327548, 'learning_rate': 8e-10, 'completion_length': 163.0357208251953, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.006805419921875, 'epoch': 1.0} + 100%|█████████▉| 2498/2500 [15:24:27<00:37, 18.62s/it] 100%|█████████▉| 2499/2500 [15:24:45<00:18, 18.31s/it] {'loss': 0.0002, 'grad_norm': 0.02103508148569482, 'learning_rate': 4e-10, 'completion_length': 147.87500762939453, 
'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0040740966796875, 'epoch': 1.0} + 100%|█████████▉| 2499/2500 [15:24:45<00:18, 18.31s/it] 100%|██████████| 2500/2500 [15:25:03<00:00, 18.19s/it] {'loss': 0.0002, 'grad_norm': 0.016196487333409008, 'learning_rate': 0.0, 'completion_length': 147.375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0041961669921875, 'epoch': 1.0} + 100%|██████████| 2500/2500 [15:25:03<00:00, 18.19s/it] {'train_runtime': 55665.8442, 'train_samples_per_second': 0.629, 'train_steps_per_second': 0.045, 'train_loss': 0.0002181483645139217, 'epoch': 1.0} + 100%|██████████| 2500/2500 [15:27:36<00:00, 18.19s/it] 100%|██████████| 2500/2500 [15:27:36<00:00, 22.26s/it] +wandb: +wandb: 🚀 View run R1-Resume-COT-VLLM-Correct-Qwen2-VL-7B-GRPO-ClevrMath-35k-2025-02-18-19-35-02 at: https://wandb.ai/tanhuajie264-peking-university/vison-open-r1/runs/wm8v4nwu +wandb: Find logs at: wandb/run-20250218_194259-wm8v4nwu/logs