[2025-02-12 22:45:57,106] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-12 22:45:57,131] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-12 22:45:57,131] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-12 22:45:57,131] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-12 22:45:57,132] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-12 22:45:57,140] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-12 22:45:57,144] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 02-12 22:46:02 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 22:46:02 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 22:46:02 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 22:46:02 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 22:46:02 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 22:46:02 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 22:46:02 __init__.py:190] Automatically detected platform cuda.
[2025-02-12 22:46:10,577] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-12 22:46:10,579] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-12 22:46:10,579] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-12 22:46:10,580] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-12 22:46:10,580] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-12 22:46:10,580] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-12 22:46:10,580] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-02-12 22:46:10,580] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-12 22:46:11,661] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
[2025-02-12 22:46:11,661] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
[2025-02-12 22:46:11,661] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
[2025-02-12 22:46:11,661] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
[2025-02-12 22:46:11,707] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument.
Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1169086 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1169086 [0] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1169086 [0] NCCL INFO cudaDriverVersion 12040 NCCL version 2.21.5+cuda12.4 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1169087 [1] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1169092 [6] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1169087 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1169092 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1169088 [2] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1169090 [4] 
NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1169088 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1169090 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1169087 [1] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1169092 [6] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1169088 [2] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1169090 [4] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO NCCL_SOCKET_IFNAME set 
by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Using non-device 
net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Using network IBext_v8 [2025-02-12 22:46:16,506] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7 You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1169089 [3] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1169089 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1169089 [3] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Using network IBext_v8 [2025-02-12 
22:46:21,796] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7 You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1169091 [5] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1169091 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1169091 [5] NCCL INFO Bootstrap : Using bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.82<0> p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO ncclCommInitRank comm 0x562db2d0a3b0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 
8f000 commId 0x83b3a9d764c58334 - Init START p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO ncclCommInitRank comm 0x56453f682a30 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 10000 commId 0x83b3a9d764c58334 - Init START p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO ncclCommInitRank comm 0x5572a53ea2d0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 16000 commId 0x83b3a9d764c58334 - Init START p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO ncclCommInitRank comm 0x55c812643e80 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId c6000 commId 0x83b3a9d764c58334 - Init START p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO ncclCommInitRank comm 0x55fde6c73920 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 49000 commId 0x83b3a9d764c58334 - Init START p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO ncclCommInitRank comm 0x55b6e2ac8260 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 4d000 commId 0x83b3a9d764c58334 - Init START p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO ncclCommInitRank comm 0x559426479280 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8a000 commId 0x83b3a9d764c58334 - Init START p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO NVLS multicast support is not available on dev 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. 
p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO NVLS multicast support is not available on dev 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO NVLS multicast support is not available on dev 3 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO NVLS multicast support is not available on dev 2 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO NVLS multicast support is not available on dev 4 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. 
p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO NVLS multicast support is not available on dev 5 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO NVLS multicast support is not available on dev 1 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO comm 0x56453f682a30 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO comm 0x55c812643e80 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO comm 0x562db2d0a3b0 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO comm 0x559426479280 rank 4 nRanks 7 nNodes 1 localRanks 7 localRank 4 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL 
INFO Channel 04/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO comm 0x55b6e2ac8260 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 
1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO comm 0x5572a53ea2d0 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO comm 0x55fde6c73920 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 
3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 
05/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] 
NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 
13/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 
03/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] 
NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO threadThresholds 8/8/64 | 
56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. 
p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1171818 [3] NCCL INFO ncclCommInitRank comm 0x55b6e2ac8260 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 4d000 commId 0x83b3a9d764c58334 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1171564 [0] NCCL INFO ncclCommInitRank comm 0x56453f682a30 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 10000 commId 0x83b3a9d764c58334 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1171563 [2] NCCL INFO ncclCommInitRank comm 0x55fde6c73920 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 49000 commId 0x83b3a9d764c58334 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1171560 [4] NCCL INFO ncclCommInitRank comm 0x559426479280 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8a000 commId 0x83b3a9d764c58334 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1171561 [1] NCCL INFO ncclCommInitRank comm 0x5572a53ea2d0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 16000 commId 0x83b3a9d764c58334 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1171562 [6] NCCL INFO ncclCommInitRank comm 0x55c812643e80 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId c6000 commId 0x83b3a9d764c58334 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1172064 [5] NCCL INFO ncclCommInitRank comm 0x562db2d0a3b0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 8f000 commId 0x83b3a9d764c58334 - Init COMPLETE [2025-02-12 22:46:25,604] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 730, num_elems = 2.44B Loading checkpoint shards: 0%| | 0/2 [00:00 [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] communication_data_type ...... 
None [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] disable_allgather ............ 
False [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] dump_state ................... False [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-02-12 22:46:50,269] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] gradient_predivide_factor .... 
1.0 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] graph_harvesting ............. False [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-02-12 22:46:50,270] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] optimizer_name ............... None [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] optimizer_params ............. None [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] pld_params ................... False [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-02-12 22:46:50,271] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-02-12 22:46:50,272] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-02-12 22:46:50,272] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-02-12 22:46:50,272] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-02-12 22:46:50,272] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-02-12 22:46:50,272] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-02-12 22:46:50,272] [INFO] [config.py:1003:print] train_batch_size ............. 14 [2025-02-12 22:46:50,272] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1 [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] wall_clock_breakdown ......... False [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] world_size ................... 7 [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-02-12 22:46:50,273] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
3 [2025-02-12 22:46:50,273] [INFO] [config.py:989:print_user_config] json = { "fp16": { "enabled": false, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "none", "pin_memory": true }, "offload_param": { "device": "none", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1.000000e+09, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1.000000e+09, "stage3_max_reuse_distance": 1.000000e+09, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 14, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false, "zero_optimization.reduce_bucket_size": 2.359296e+06, "zero_optimization.stage3_param_persistence_threshold": 1.536000e+04, "zero_optimization.stage3_prefetch_bucket_size": 2.123366e+06 } INFO 02-12 22:47:42 config.py:542] This model supports multiple tasks: {'score', 'embed', 'reward', 'classify', 'generate'}. Defaulting to 'generate'. WARNING 02-12 22:47:42 arg_utils.py:1079] --enable-prefix-caching is currently not supported for multimodal models in v0 and has been disabled. 
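Note on the DeepSpeed config printed above: the reported values are consistent with DeepSpeed's documented rule that `train_batch_size` equals `train_micro_batch_size_per_gpu` times `gradient_accumulation_steps` times the number of ranks. A minimal sketch of that check, using only numbers from this log; the helper name `effective_batch_size` is hypothetical and is not a DeepSpeed API:

```python
# Sanity check of the batch-size values reported in the log above.
# `effective_batch_size` is a hypothetical helper, not part of DeepSpeed.

def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """Global batch size consumed per optimizer step."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Values copied from the log: train_micro_batch_size_per_gpu = 1,
# gradient_accumulation_steps = 2, world_size = 7, train_batch_size = 14.
reported_train_batch_size = 14
computed = effective_batch_size(micro_batch_per_gpu=1,
                                grad_accum_steps=2,
                                world_size=7)
assert computed == reported_train_batch_size
```

If any of the three factors changes (for example, a different number of training ranks), `train_batch_size` has to change with it, or DeepSpeed rejects the config at startup.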
INFO 02-12 22:47:42 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/vlm/pretrain_model/Qwen2-VL-2B-Instruct', speculative_config=None, tokenizer='/home/vlm/pretrain_model/Qwen2-VL-2B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:7, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/vlm/pretrain_model/Qwen2-VL-2B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, INFO 02-12 22:47:44 cuda.py:230] Using Flash Attention backend. INFO 02-12 22:47:45 model_runner.py:1110] Starting to load model /home/vlm/pretrain_model/Qwen2-VL-2B-Instruct... 
INFO 02-12 22:47:46 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00 32768). Running this sequence through the model will result in indexing errors WARNING 02-12 23:00:54 profiling.py:187] The context length (32768) of the model is too short to hold the multi-modal embeddings in the worst case (49152 tokens in total, out of which {'image': 16384, 'video': 32768} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`. INFO 02-12 23:00:57 worker.py:267] Memory profiling takes 783.52 seconds INFO 02-12 23:00:57 worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.80) = 63.46GiB INFO 02-12 23:00:57 worker.py:267] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 63.46GiB. INFO 02-12 23:00:58 executor_base.py:110] # CUDA blocks: 148532, # CPU blocks: 9362 INFO 02-12 23:00:58 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 72.53x INFO 02-12 23:01:00 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. 
You can also reduce the `max_num_seqs` as needed to decrease memory usage. Capturing CUDA graph shapes: 0%| | 0/35 [00:004->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO comm 0x7fbb200736d0 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO comm 0x7f1dbc073110 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO comm 0x7f41880746a0 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 
p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 
2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via 
P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 02/0 : 
1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL 
INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 
14/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] 
NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via 
P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 03/0 : 
5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL 
INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO 16 
coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-82:1169087:1208805 [1] NCCL INFO 
ncclCommSplit comm 0x7fbb200736d0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 16000 parent 0x5572a53ea2d0 color -1326228412 key 1 commId 0x3720299aada5ad54 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169089:1208803 [3] NCCL INFO ncclCommSplit comm 0x7fc4880737f0 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 4d000 parent 0x55b6e2ac8260 color -1326228412 key 3 commId 0x3720299aada5ad54 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169092:1208804 [6] NCCL INFO ncclCommSplit comm 0x7f5784073b80 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId c6000 parent 0x55c812643e80 color -1326228412 key 6 commId 0x3720299aada5ad54 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169086:1208809 [0] NCCL INFO ncclCommSplit comm 0x7f5b48072cc0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 10000 parent 0x56453f682a30 color -1326228412 key 0 commId 0x3720299aada5ad54 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169090:1208808 [4] NCCL INFO ncclCommSplit comm 0x7fa2a80730f0 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8a000 parent 0x559426479280 color -1326228412 key 4 commId 0x3720299aada5ad54 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169088:1208806 [2] NCCL INFO ncclCommSplit comm 0x7f1dbc073110 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 49000 parent 0x55fde6c73920 color -1326228412 key 2 commId 0x3720299aada5ad54 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-82:1169091:1208807 [5] NCCL INFO ncclCommSplit comm 0x7f41880746a0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 8f000 parent 0x562db2d0a3b0 color -1326228412 key 5 commId 0x3720299aada5ad54 - Init COMPLETE 0%| | 1/2500 [00:27<19:19:06, 27.83s/it] {'loss': 0.0, 'grad_norm': 60.88710924899898, 'learning_rate': 9.996e-07, 'completion_length': 36.875, 'rewards/accuracy_reward': 0.133928582072258, 'rewards/format_reward': 0.3750000298023224, 'reward': 0.5089285969734192, 'reward_std': 0.46817223727703094, 'kl': 0.0, 'epoch': 0.0} 0%| | 1/2500 [00:28<19:19:06, 27.83s/it] 0%| | 2/2500 [00:40<13:12:25, 19.03s/it] 
{'loss': 0.0003, 'grad_norm': 48.79779062796824, 'learning_rate': 9.992e-07, 'completion_length': 46.08035850524902, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.4017857313156128, 'reward': 0.6964285969734192, 'reward_std': 0.547734409570694, 'kl': 0.0065460205078125, 'epoch': 0.0} 0%| | 2/2500 [00:40<13:12:25, 19.03s/it] 0%| | 3/2500 [00:49<10:06:26, 14.57s/it] {'loss': 0.0009, 'grad_norm': 59.79396424960145, 'learning_rate': 9.988e-07, 'completion_length': 36.34821701049805, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 0.446428582072258, 'reward': 0.7142857611179352, 'reward_std': 0.4725345969200134, 'kl': 0.023193359375, 'epoch': 0.0} 0%| | 3/2500 [00:49<10:06:26, 14.57s/it] 0%| | 4/2500 [01:04<10:06:59, 14.59s/it] {'loss': 0.0016, 'grad_norm': 35.6239283282582, 'learning_rate': 9.983999999999998e-07, 'completion_length': 62.68750190734863, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 0.5535714626312256, 'reward': 0.8125000298023224, 'reward_std': 0.5807228684425354, 'kl': 0.038818359375, 'epoch': 0.0} 0%| | 4/2500 [01:04<10:06:59, 14.59s/it] 0%| | 5/2500 [01:19<10:17:21, 14.85s/it] {'loss': 0.0008, 'grad_norm': 9.646915669699961, 'learning_rate': 9.98e-07, 'completion_length': 88.4464340209961, 'rewards/accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 0.785714328289032, 'reward': 1.0625000596046448, 'reward_std': 0.5277653783559799, 'kl': 0.0206146240234375, 'epoch': 0.0} 0%| | 5/2500 [01:19<10:17:21, 14.85s/it] 0%| | 6/2500 [01:34<10:16:50, 14.84s/it] {'loss': 0.0006, 'grad_norm': 4.306240577829012, 'learning_rate': 9.976e-07, 'completion_length': 80.43750381469727, 'rewards/accuracy_reward': 0.196428582072258, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.0803571939468384, 'reward_std': 0.46401800215244293, 'kl': 0.01409912109375, 'epoch': 0.0} 0%| | 6/2500 [01:34<10:16:50, 14.84s/it] 0%| | 7/2500 [01:49<10:18:28, 14.89s/it] {'loss': 0.0004, 'grad_norm': 
6.177348553944549, 'learning_rate': 9.972e-07, 'completion_length': 76.81250381469727, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.285714328289032, 'reward_std': 0.4333198666572571, 'kl': 0.009735107421875, 'epoch': 0.0} 0%| | 7/2500 [01:49<10:18:28, 14.89s/it] 0%| | 8/2500 [02:02<9:56:09, 14.35s/it] {'loss': 0.0008, 'grad_norm': 15.43748108928438, 'learning_rate': 9.968e-07, 'completion_length': 64.89285850524902, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.1160714626312256, 'reward_std': 0.43693573772907257, 'kl': 0.019775390625, 'epoch': 0.0} 0%| | 8/2500 [02:02<9:56:09, 14.35s/it] 0%| | 9/2500 [02:18<10:09:24, 14.68s/it] {'loss': 0.0007, 'grad_norm': 8.951655066657516, 'learning_rate': 9.964e-07, 'completion_length': 85.68750381469727, 'rewards/accuracy_reward': 0.2589285895228386, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.1428571939468384, 'reward_std': 0.43667128682136536, 'kl': 0.01837158203125, 'epoch': 0.0} 0%| | 9/2500 [02:18<10:09:24, 14.68s/it] 0%| | 10/2500 [02:33<10:13:24, 14.78s/it] {'loss': 0.0012, 'grad_norm': 20.26683458003515, 'learning_rate': 9.959999999999999e-07, 'completion_length': 67.82143020629883, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 0.7321428954601288, 'reward': 1.0000000298023224, 'reward_std': 0.3611467182636261, 'kl': 0.030517578125, 'epoch': 0.0} 0%| | 10/2500 [02:33<10:13:24, 14.78s/it] 0%| | 11/2500 [02:48<10:20:29, 14.96s/it] {'loss': 0.0013, 'grad_norm': 5.956373427441315, 'learning_rate': 9.956e-07, 'completion_length': 72.16964721679688, 'rewards/accuracy_reward': 0.2410714477300644, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.196428656578064, 'reward_std': 0.3119049519300461, 'kl': 0.0318603515625, 'epoch': 0.0} 0%| | 11/2500 [02:48<10:20:29, 14.96s/it] 0%| | 12/2500 [03:03<10:17:15, 14.89s/it] {'loss': 0.0024, 'grad_norm': 4.8767207016933565, 
'learning_rate': 9.952e-07, 'completion_length': 68.6964340209961, 'rewards/accuracy_reward': 0.401785746216774, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.3571429252624512, 'reward_std': 0.4623453915119171, 'kl': 0.0604248046875, 'epoch': 0.0} 0%| | 12/2500 [03:03<10:17:15, 14.89s/it] 1%| | 13/2500 [03:18<10:19:11, 14.94s/it] {'loss': 0.002, 'grad_norm': 5.990055130272151, 'learning_rate': 9.948e-07, 'completion_length': 74.77678680419922, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.321428656578064, 'reward_std': 0.3839495927095413, 'kl': 0.05029296875, 'epoch': 0.01} 1%| | 13/2500 [03:18<10:19:11, 14.94s/it] 1%| | 14/2500 [03:34<10:31:35, 15.24s/it] {'loss': 0.0021, 'grad_norm': 6.1911132914114955, 'learning_rate': 9.944e-07, 'completion_length': 77.30357360839844, 'rewards/accuracy_reward': 0.2232142984867096, 'rewards/format_reward': 0.973214328289032, 'reward': 1.1964285969734192, 'reward_std': 0.38847145438194275, 'kl': 0.05224609375, 'epoch': 0.01} 1%| | 14/2500 [03:34<10:31:35, 15.24s/it] 1%| | 15/2500 [03:48<10:14:33, 14.84s/it] {'loss': 0.0032, 'grad_norm': 5.144672851570093, 'learning_rate': 9.94e-07, 'completion_length': 68.61607360839844, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2946429252624512, 'reward_std': 0.4076070785522461, 'kl': 0.08056640625, 'epoch': 0.01} 1%| | 15/2500 [03:48<10:14:33, 14.84s/it] 1%| | 16/2500 [04:00<9:46:22, 14.16s/it] {'loss': 0.0029, 'grad_norm': 5.514105747515247, 'learning_rate': 9.936e-07, 'completion_length': 66.87500381469727, 'rewards/accuracy_reward': 0.2857143059372902, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2767857909202576, 'reward_std': 0.37783700227737427, 'kl': 0.0721435546875, 'epoch': 0.01} 1%| | 16/2500 [04:00<9:46:22, 14.16s/it] 1%| | 17/2500 [04:13<9:27:37, 13.72s/it] {'loss': 0.0036, 'grad_norm': 5.518139735124, 'learning_rate': 9.931999999999999e-07, 
'completion_length': 63.66964340209961, 'rewards/accuracy_reward': 0.2678571566939354, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.258928656578064, 'reward_std': 0.35771289467811584, 'kl': 0.089599609375, 'epoch': 0.01} 1%| | 17/2500 [04:13<9:27:37, 13.72s/it] 1%| | 18/2500 [04:23<8:42:10, 12.62s/it] {'loss': 0.0032, 'grad_norm': 6.00168839107049, 'learning_rate': 9.928e-07, 'completion_length': 59.44643020629883, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.4517725855112076, 'kl': 0.0791015625, 'epoch': 0.01} 1%| | 18/2500 [04:23<8:42:10, 12.62s/it] 1%| | 19/2500 [04:36<8:39:29, 12.56s/it] {'loss': 0.0036, 'grad_norm': 4.776438920186083, 'learning_rate': 9.923999999999998e-07, 'completion_length': 63.687503814697266, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.2987862080335617, 'kl': 0.09130859375, 'epoch': 0.01} 1%| | 19/2500 [04:36<8:39:29, 12.56s/it] 1%| | 20/2500 [04:46<8:12:46, 11.92s/it] {'loss': 0.004, 'grad_norm': 6.1418656861958105, 'learning_rate': 9.92e-07, 'completion_length': 54.19643211364746, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.285714328289032, 'reward_std': 0.2954969108104706, 'kl': 0.098876953125, 'epoch': 0.01} 1%| | 20/2500 [04:46<8:12:46, 11.92s/it] 1%| | 21/2500 [04:55<7:33:31, 10.98s/it] {'loss': 0.0057, 'grad_norm': 6.531725428148904, 'learning_rate': 9.916e-07, 'completion_length': 48.69643211364746, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.31622885167598724, 'kl': 0.14306640625, 'epoch': 0.01} 1%| | 21/2500 [04:55<7:33:31, 10.98s/it] 1%| | 22/2500 [05:08<7:57:53, 11.57s/it] {'loss': 0.0069, 'grad_norm': 9.24710737435478, 'learning_rate': 9.912e-07, 'completion_length': 54.142860412597656, 'rewards/accuracy_reward': 0.517857164144516, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.508928656578064, 'reward_std': 0.3375798761844635, 'kl': 0.17333984375, 'epoch': 0.01} 1%| | 22/2500 [05:08<7:57:53, 11.57s/it] 1%| | 23/2500 [05:18<7:39:02, 11.12s/it] {'loss': 0.0057, 'grad_norm': 8.718298212070678, 'learning_rate': 9.908e-07, 'completion_length': 46.49107360839844, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.3014822453260422, 'kl': 0.1435546875, 'epoch': 0.01} 1%| | 23/2500 [05:18<7:39:02, 11.12s/it] 1%| | 24/2500 [05:27<7:18:53, 10.64s/it] {'loss': 0.0073, 'grad_norm': 5.274578479254987, 'learning_rate': 9.903999999999999e-07, 'completion_length': 45.80357360839844, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.244989275932312, 'kl': 0.18359375, 'epoch': 0.01} 1%| | 24/2500 [05:27<7:18:53, 10.64s/it] 1%| | 25/2500 [05:40<7:46:17, 11.30s/it] {'loss': 0.0083, 'grad_norm': 8.076964173617347, 'learning_rate': 9.9e-07, 'completion_length': 40.99107360839844, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.535714328289032, 'reward_std': 0.3094206750392914, 'kl': 0.208984375, 'epoch': 0.01} 1%| | 25/2500 [05:40<7:46:17, 11.30s/it] 1%| | 26/2500 [05:49<7:18:22, 10.63s/it] {'loss': 0.0079, 'grad_norm': 5.2043035919428045, 'learning_rate': 9.896e-07, 'completion_length': 38.29464340209961, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.1030978113412857, 'kl': 0.19677734375, 'epoch': 0.01} 1%| | 26/2500 [05:49<7:18:22, 10.63s/it] 1%| | 27/2500 [05:58<6:56:09, 10.10s/it] {'loss': 0.0119, 'grad_norm': 4.867699829620964, 'learning_rate': 9.892e-07, 'completion_length': 34.82142925262451, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 
0.2467949464917183, 'kl': 0.2958984375, 'epoch': 0.01} 1%| | 27/2500 [05:58<6:56:09, 10.10s/it] 1%| | 28/2500 [06:06<6:25:51, 9.37s/it] {'loss': 0.0119, 'grad_norm': 5.79088842149081, 'learning_rate': 9.888e-07, 'completion_length': 30.205358505249023, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 0.30390140414237976, 'kl': 0.2978515625, 'epoch': 0.01} 1%| | 28/2500 [06:06<6:25:51, 9.37s/it] 1%| | 29/2500 [06:15<6:22:46, 9.29s/it] {'loss': 0.0118, 'grad_norm': 6.055256339380625, 'learning_rate': 9.884e-07, 'completion_length': 29.937501907348633, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.23326967656612396, 'kl': 0.2939453125, 'epoch': 0.01} 1%| | 29/2500 [06:15<6:22:46, 9.29s/it] 1%| | 30/2500 [06:23<6:14:02, 9.09s/it] {'loss': 0.0122, 'grad_norm': 16.29313034332932, 'learning_rate': 9.88e-07, 'completion_length': 30.500000953674316, 'rewards/accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410714626312256, 'reward_std': 0.2513168156147003, 'kl': 0.3037109375, 'epoch': 0.01} 1%| | 30/2500 [06:23<6:14:02, 9.09s/it] 1%| | 31/2500 [06:33<6:20:40, 9.25s/it] {'loss': 0.0125, 'grad_norm': 8.08187456838552, 'learning_rate': 9.876e-07, 'completion_length': 28.750000953674316, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 1.0, 'reward': 1.633928656578064, 'reward_std': 0.17104807496070862, 'kl': 0.3125, 'epoch': 0.01} 1%| | 31/2500 [06:33<6:20:40, 9.25s/it] 1%|▏ | 32/2500 [06:42<6:13:00, 9.07s/it] {'loss': 0.0122, 'grad_norm': 6.514716223032737, 'learning_rate': 9.871999999999998e-07, 'completion_length': 31.33928680419922, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.25521960109472275, 'kl': 0.3046875, 'epoch': 0.01} 1%|▏ | 32/2500 [06:42<6:13:00, 9.07s/it] 1%|▏ | 33/2500 [06:50<6:02:47, 
8.82s/it] {'loss': 0.0119, 'grad_norm': 5.622465715077616, 'learning_rate': 9.868e-07, 'completion_length': 29.392858505249023, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.2383904606103897, 'kl': 0.296875, 'epoch': 0.01} 1%|▏ | 33/2500 [06:50<6:02:47, 8.82s/it] 1%|▏ | 34/2500 [06:59<6:05:37, 8.90s/it] {'loss': 0.0114, 'grad_norm': 7.1331695518130385, 'learning_rate': 9.864e-07, 'completion_length': 31.77678680419922, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.3111080676317215, 'kl': 0.28515625, 'epoch': 0.01} 1%|▏ | 34/2500 [06:59<6:05:37, 8.90s/it] 1%|▏ | 35/2500 [07:08<6:01:17, 8.79s/it] {'loss': 0.0129, 'grad_norm': 6.120517593560624, 'learning_rate': 9.86e-07, 'completion_length': 27.446430206298828, 'rewards/accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.794642984867096, 'reward_std': 0.1827620565891266, 'kl': 0.3232421875, 'epoch': 0.01} 1%|▏ | 35/2500 [07:08<6:01:17, 8.79s/it] 1%|▏ | 36/2500 [07:15<5:48:51, 8.49s/it] {'loss': 0.0136, 'grad_norm': 7.6567059551878405, 'learning_rate': 9.856e-07, 'completion_length': 29.125000953674316, 'rewards/accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767857909202576, 'reward_std': 0.23656462132930756, 'kl': 0.33984375, 'epoch': 0.01} 1%|▏ | 36/2500 [07:15<5:48:51, 8.49s/it] 1%|▏ | 37/2500 [07:24<5:47:49, 8.47s/it] {'loss': 0.0114, 'grad_norm': 7.514587538097256, 'learning_rate': 9.852e-07, 'completion_length': 29.65178680419922, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.29339978098869324, 'kl': 0.2841796875, 'epoch': 0.01} 1%|▏ | 37/2500 [07:24<5:47:49, 8.47s/it] 2%|▏ | 38/2500 [07:33<5:53:17, 8.61s/it] {'loss': 0.0135, 'grad_norm': 8.291608106046633, 'learning_rate': 9.847999999999999e-07, 'completion_length': 
29.526787757873535, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.2383904606103897, 'kl': 0.337890625, 'epoch': 0.02} 2%|▏ | 38/2500 [07:33<5:53:17, 8.61s/it] 2%|▏ | 39/2500 [07:41<5:53:08, 8.61s/it] {'loss': 0.0134, 'grad_norm': 9.457829486336536, 'learning_rate': 9.844e-07, 'completion_length': 30.258930206298828, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8125001192092896, 'reward_std': 0.24498365819454193, 'kl': 0.3349609375, 'epoch': 0.02} 2%|▏ | 39/2500 [07:41<5:53:08, 8.61s/it] 2%|▏ | 40/2500 [07:50<5:53:14, 8.62s/it] {'loss': 0.0143, 'grad_norm': 7.904005708182238, 'learning_rate': 9.84e-07, 'completion_length': 28.92857265472412, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.2527948468923569, 'kl': 0.357421875, 'epoch': 0.02} 2%|▏ | 40/2500 [07:50<5:53:14, 8.62s/it] 2%|▏ | 41/2500 [07:58<5:46:59, 8.47s/it] {'loss': 0.014, 'grad_norm': 3.6671427395543397, 'learning_rate': 9.836e-07, 'completion_length': 31.687501907348633, 'rewards/accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.24498367309570312, 'kl': 0.3505859375, 'epoch': 0.02} 2%|▏ | 41/2500 [07:58<5:46:59, 8.47s/it] 2%|▏ | 42/2500 [08:06<5:41:51, 8.34s/it] {'loss': 0.0151, 'grad_norm': 4.273760446594566, 'learning_rate': 9.832e-07, 'completion_length': 31.517858505249023, 'rewards/accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767857909202576, 'reward_std': 0.2248506247997284, 'kl': 0.3759765625, 'epoch': 0.02} 2%|▏ | 42/2500 [08:06<5:41:51, 8.34s/it] 2%|▏ | 43/2500 [08:15<5:46:15, 8.46s/it] {'loss': 0.0142, 'grad_norm': 4.030043198026595, 'learning_rate': 9.828e-07, 'completion_length': 29.705358505249023, 'rewards/accuracy_reward': 0.866071492433548, 'rewards/format_reward': 1.0, 'reward': 
1.8660715222358704, 'reward_std': 0.19178561866283417, 'kl': 0.35546875, 'epoch': 0.02} 2%|▏ | 43/2500 [08:15<5:46:15, 8.46s/it] 2%|▏ | 44/2500 [08:23<5:44:07, 8.41s/it] {'loss': 0.0159, 'grad_norm': 5.7895036959780315, 'learning_rate': 9.824e-07, 'completion_length': 29.48214340209961, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.20350521057844162, 'kl': 0.396484375, 'epoch': 0.02} 2%|▏ | 44/2500 [08:23<5:44:07, 8.41s/it] 2%|▏ | 45/2500 [08:32<5:53:18, 8.63s/it] {'loss': 0.0136, 'grad_norm': 2.2931851987804492, 'learning_rate': 9.819999999999999e-07, 'completion_length': 30.571430206298828, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07124518603086472, 'kl': 0.33984375, 'epoch': 0.02} 2%|▏ | 45/2500 [08:32<5:53:18, 8.63s/it] 2%|▏ | 46/2500 [08:41<5:53:28, 8.64s/it] {'loss': 0.0135, 'grad_norm': 3.9356952261301377, 'learning_rate': 9.816e-07, 'completion_length': 28.250000953674316, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.3369140625, 'epoch': 0.02} 2%|▏ | 46/2500 [08:41<5:53:28, 8.64s/it] 2%|▏ | 47/2500 [08:50<5:56:13, 8.71s/it] {'loss': 0.016, 'grad_norm': 4.059325003746998, 'learning_rate': 9.811999999999998e-07, 'completion_length': 30.580358505249023, 'rewards/accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.14700662717223167, 'kl': 0.3994140625, 'epoch': 0.02} 2%|▏ | 47/2500 [08:50<5:56:13, 8.71s/it] 2%|▏ | 48/2500 [08:59<5:56:34, 8.73s/it] {'loss': 0.0146, 'grad_norm': 3.745011718602996, 'learning_rate': 9.808e-07, 'completion_length': 28.830357551574707, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.36328125, 'epoch': 0.02} 2%|▏ | 
48/2500 [08:59<5:56:34, 8.73s/it] 2%|▏ | 49/2500 [09:06<5:42:38, 8.39s/it] {'loss': 0.017, 'grad_norm': 5.854720787355653, 'learning_rate': 9.804e-07, 'completion_length': 29.919644355773926, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.16323687136173248, 'kl': 0.4248046875, 'epoch': 0.02} 2%|▏ | 49/2500 [09:06<5:42:38, 8.39s/it] 2%|▏ | 50/2500 [09:14<5:36:11, 8.23s/it] {'loss': 0.0153, 'grad_norm': 7.232896253202454, 'learning_rate': 9.8e-07, 'completion_length': 32.70535850524902, 'rewards/accuracy_reward': 0.848214328289032, 'rewards/format_reward': 1.0, 'reward': 1.8482143878936768, 'reward_std': 0.19178562611341476, 'kl': 0.3818359375, 'epoch': 0.02} 2%|▏ | 50/2500 [09:14<5:36:11, 8.23s/it] 2%|▏ | 51/2500 [09:21<5:24:08, 7.94s/it] {'loss': 0.0117, 'grad_norm': 2.455720243055604, 'learning_rate': 9.796e-07, 'completion_length': 33.053571701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2919921875, 'epoch': 0.02} 2%|▏ | 51/2500 [09:21<5:24:08, 7.94s/it] 2%|▏ | 52/2500 [09:29<5:21:40, 7.88s/it] {'loss': 0.0157, 'grad_norm': 7.413320215867539, 'learning_rate': 9.791999999999999e-07, 'completion_length': 31.74107265472412, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.17495086789131165, 'kl': 0.3916015625, 'epoch': 0.02} 2%|▏ | 52/2500 [09:29<5:21:40, 7.88s/it] 2%|▏ | 53/2500 [09:37<5:22:11, 7.90s/it] {'loss': 0.0136, 'grad_norm': 2.248146734224142, 'learning_rate': 9.788e-07, 'completion_length': 30.241073608398438, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.3408203125, 'epoch': 0.02} 2%|▏ | 53/2500 [09:37<5:22:11, 7.90s/it] 2%|▏ | 54/2500 [09:46<5:29:45, 8.09s/it] {'loss': 0.0133, 'grad_norm': 
2.654267133466514, 'learning_rate': 9.784e-07, 'completion_length': 31.196430206298828, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.12175281345844269, 'kl': 0.33203125, 'epoch': 0.02} 2%|▏ | 54/2500 [09:46<5:29:45, 8.09s/it] 2%|▏ | 55/2500 [09:54<5:38:12, 8.30s/it] {'loss': 0.0165, 'grad_norm': 4.011487975717722, 'learning_rate': 9.78e-07, 'completion_length': 29.07142925262451, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.4111328125, 'epoch': 0.02} 2%|▏ | 55/2500 [09:54<5:38:12, 8.30s/it] 2%|▏ | 56/2500 [10:04<5:52:18, 8.65s/it] {'loss': 0.0153, 'grad_norm': 2.105989702501326, 'learning_rate': 9.776e-07, 'completion_length': 32.25892925262451, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.3818359375, 'epoch': 0.02} 2%|▏ | 56/2500 [10:04<5:52:18, 8.65s/it] 2%|▏ | 57/2500 [10:12<5:45:12, 8.48s/it] {'loss': 0.0139, 'grad_norm': 3.10555399659426, 'learning_rate': 9.772e-07, 'completion_length': 33.17857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.12175281345844269, 'kl': 0.3486328125, 'epoch': 0.02} 2%|▏ | 57/2500 [10:12<5:45:12, 8.48s/it] 2%|▏ | 58/2500 [10:20<5:41:49, 8.40s/it] {'loss': 0.015, 'grad_norm': 3.994803576015248, 'learning_rate': 9.768e-07, 'completion_length': 32.16964530944824, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.12444323301315308, 'kl': 0.3740234375, 'epoch': 0.02} 2%|▏ | 58/2500 [10:20<5:41:49, 8.40s/it] 2%|▏ | 59/2500 [10:29<5:45:49, 8.50s/it] {'loss': 0.0143, 'grad_norm': 3.4847253858399565, 'learning_rate': 9.764e-07, 'completion_length': 34.26785850524902, 'rewards/accuracy_reward': 
0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.18397442996501923, 'kl': 0.357421875, 'epoch': 0.02} 2%|▏ | 59/2500 [10:29<5:45:49, 8.50s/it] 2%|▏ | 60/2500 [10:39<6:09:10, 9.08s/it] {'loss': 0.0138, 'grad_norm': 7.846457076520717, 'learning_rate': 9.759999999999998e-07, 'completion_length': 34.08035850524902, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.345703125, 'epoch': 0.02} 2%|▏ | 60/2500 [10:39<6:09:10, 9.08s/it] 2%|▏ | 61/2500 [10:48<6:06:59, 9.03s/it] {'loss': 0.0139, 'grad_norm': 2.8950792226678472, 'learning_rate': 9.756e-07, 'completion_length': 32.500000953674316, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9017858505249023, 'reward_std': 0.05831881985068321, 'kl': 0.3466796875, 'epoch': 0.02} 2%|▏ | 61/2500 [10:48<6:06:59, 9.03s/it] 2%|▏ | 62/2500 [10:57<6:00:30, 8.87s/it] {'loss': 0.0146, 'grad_norm': 3.1360997111715188, 'learning_rate': 9.752e-07, 'completion_length': 34.58928871154785, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0835726410150528, 'kl': 0.365234375, 'epoch': 0.02} 2%|▏ | 62/2500 [10:57<6:00:30, 8.87s/it] 3%|▎ | 63/2500 [11:06<6:03:57, 8.96s/it] {'loss': 0.0132, 'grad_norm': 1.7420239436058473, 'learning_rate': 9.748e-07, 'completion_length': 34.64285850524902, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.06222161278128624, 'kl': 0.3310546875, 'epoch': 0.03} 3%|▎ | 63/2500 [11:06<6:03:57, 8.96s/it] 3%|▎ | 64/2500 [11:15<6:07:00, 9.04s/it] {'loss': 0.0136, 'grad_norm': 1.6233937890275087, 'learning_rate': 9.744e-07, 'completion_length': 32.410715103149414, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 
0.03696779906749725, 'kl': 0.33984375, 'epoch': 0.03} 3%|▎ | 64/2500 [11:15<6:07:00, 9.04s/it] 3%|▎ | 65/2500 [11:24<6:06:13, 9.02s/it] {'loss': 0.0126, 'grad_norm': 2.2916483712906763, 'learning_rate': 9.74e-07, 'completion_length': 36.29464340209961, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.314453125, 'epoch': 0.03} 3%|▎ | 65/2500 [11:24<6:06:13, 9.02s/it] 3%|▎ | 66/2500 [11:38<6:59:39, 10.35s/it] {'loss': 0.0126, 'grad_norm': 5.691253685037339, 'learning_rate': 9.735999999999999e-07, 'completion_length': 41.21428871154785, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375001192092896, 'reward_std': 0.08747542649507523, 'kl': 0.3154296875, 'epoch': 0.03} 3%|▎ | 66/2500 [11:38<6:59:39, 10.35s/it] 3%|▎ | 67/2500 [11:50<7:21:14, 10.88s/it] {'loss': 0.0132, 'grad_norm': 3.5246132641561605, 'learning_rate': 9.731999999999998e-07, 'completion_length': 40.785715103149414, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857909202576, 'reward_std': 0.12626906484365463, 'kl': 0.3291015625, 'epoch': 0.03} 3%|▎ | 67/2500 [11:50<7:21:14, 10.88s/it] 3%|▎ | 68/2500 [11:58<6:53:55, 10.21s/it] {'loss': 0.0116, 'grad_norm': 3.606342521304553, 'learning_rate': 9.728e-07, 'completion_length': 35.75000190734863, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.0739355981349945, 'kl': 0.2900390625, 'epoch': 0.03} 3%|▎ | 68/2500 [11:58<6:53:55, 10.21s/it] 3%|▎ | 69/2500 [12:07<6:38:49, 9.84s/it] {'loss': 0.0141, 'grad_norm': 2.562692900907611, 'learning_rate': 9.724e-07, 'completion_length': 36.17857360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.3525390625, 'epoch': 0.03} 3%|▎ | 
69/2500 [12:07<6:38:49, 9.84s/it] 3%|▎ | 70/2500 [12:15<6:16:52, 9.31s/it] {'loss': 0.0122, 'grad_norm': 8.549036350373534, 'learning_rate': 9.72e-07, 'completion_length': 36.33035850524902, 'rewards/accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.16531942784786224, 'kl': 0.3056640625, 'epoch': 0.03} 3%|▎ | 70/2500 [12:15<6:16:52, 9.31s/it] 3%|▎ | 71/2500 [12:24<6:09:44, 9.13s/it] {'loss': 0.011, 'grad_norm': 5.8143118978216135, 'learning_rate': 9.716e-07, 'completion_length': 43.06250190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.275390625, 'epoch': 0.03} 3%|▎ | 71/2500 [12:24<6:09:44, 9.13s/it] 3%|▎ | 72/2500 [12:33<6:08:15, 9.10s/it] {'loss': 0.0129, 'grad_norm': 2.3775336936798332, 'learning_rate': 9.712e-07, 'completion_length': 36.37500190734863, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.12444883584976196, 'kl': 0.3232421875, 'epoch': 0.03} 3%|▎ | 72/2500 [12:33<6:08:15, 9.10s/it] 3%|▎ | 73/2500 [12:43<6:15:53, 9.29s/it] {'loss': 0.0104, 'grad_norm': 1.3999137293719048, 'learning_rate': 9.707999999999999e-07, 'completion_length': 38.83928680419922, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.259765625, 'epoch': 0.03} 3%|▎ | 73/2500 [12:43<6:15:53, 9.29s/it] 3%|▎ | 74/2500 [12:52<6:14:19, 9.26s/it] {'loss': 0.0097, 'grad_norm': 3.4543934981456337, 'learning_rate': 9.704e-07, 'completion_length': 44.42857360839844, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.09138382971286774, 'kl': 0.2421875, 'epoch': 0.03} 3%|▎ | 74/2500 [12:52<6:14:19, 9.26s/it] 3%|▎ | 75/2500 [13:02<6:20:07, 9.40s/it] {'loss': 0.0108, 'grad_norm': 
2.211869420723148, 'learning_rate': 9.7e-07, 'completion_length': 43.80357360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.10882645845413208, 'kl': 0.271484375, 'epoch': 0.03} 3%|▎ | 75/2500 [13:02<6:20:07, 9.40s/it] 3%|▎ | 76/2500 [13:11<6:14:26, 9.27s/it] {'loss': 0.0116, 'grad_norm': 2.3809881913030866, 'learning_rate': 9.696e-07, 'completion_length': 41.58035850524902, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.09528662264347076, 'kl': 0.2890625, 'epoch': 0.03} 3%|▎ | 76/2500 [13:11<6:14:26, 9.27s/it] 3%|▎ | 77/2500 [13:20<6:12:04, 9.21s/it] {'loss': 0.0081, 'grad_norm': 2.8840231925913735, 'learning_rate': 9.692e-07, 'completion_length': 43.77678680419922, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.20361328125, 'epoch': 0.03} 3%|▎ | 77/2500 [13:20<6:12:04, 9.21s/it] 3%|▎ | 78/2500 [13:30<6:19:22, 9.40s/it] {'loss': 0.0081, 'grad_norm': 0.6282243095100093, 'learning_rate': 9.688e-07, 'completion_length': 46.33035850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2021484375, 'epoch': 0.03} 3%|▎ | 78/2500 [13:30<6:19:22, 9.40s/it] 3%|▎ | 79/2500 [13:42<6:50:31, 10.17s/it] {'loss': 0.0099, 'grad_norm': 2.138024316398507, 'learning_rate': 9.684e-07, 'completion_length': 43.83035850524902, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.05831881985068321, 'kl': 0.2490234375, 'epoch': 0.03} 3%|▎ | 79/2500 [13:42<6:50:31, 10.17s/it] 3%|▎ | 80/2500 [13:51<6:43:25, 10.00s/it] {'loss': 0.0071, 'grad_norm': 3.167750302888888, 'learning_rate': 9.679999999999999e-07, 'completion_length': 50.580360412597656, 'rewards/accuracy_reward': 0.9821428656578064, 
'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.17626953125, 'epoch': 0.03} 3%|▎ | 80/2500 [13:51<6:43:25, 10.00s/it] 3%|▎ | 81/2500 [14:01<6:42:58, 10.00s/it] {'loss': 0.008, 'grad_norm': 2.630463980016531, 'learning_rate': 9.676e-07, 'completion_length': 42.910715103149414, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.201171875, 'epoch': 0.03} 3%|▎ | 81/2500 [14:01<6:42:58, 10.00s/it] 3%|▎ | 82/2500 [14:10<6:33:34, 9.77s/it] {'loss': 0.0088, 'grad_norm': 8.95793685003253, 'learning_rate': 9.671999999999998e-07, 'completion_length': 46.64285850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.21923828125, 'epoch': 0.03} 3%|▎ | 82/2500 [14:10<6:33:34, 9.77s/it] 3%|▎ | 83/2500 [14:20<6:30:59, 9.71s/it] {'loss': 0.0084, 'grad_norm': 2.30980030296194, 'learning_rate': 9.668e-07, 'completion_length': 48.267860412597656, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.09918941557407379, 'kl': 0.21044921875, 'epoch': 0.03} 3%|▎ | 83/2500 [14:20<6:30:59, 9.71s/it] 3%|▎ | 84/2500 [14:29<6:19:43, 9.43s/it] {'loss': 0.0078, 'grad_norm': 2.9533184760768916, 'learning_rate': 9.664e-07, 'completion_length': 49.205360412597656, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.919642984867096, 'reward_std': 0.10821297764778137, 'kl': 0.19384765625, 'epoch': 0.03} 3%|▎ | 84/2500 [14:29<6:19:43, 9.43s/it] 3%|▎ | 85/2500 [14:44<7:25:13, 11.06s/it] {'loss': 0.0077, 'grad_norm': 0.31261205547277776, 'learning_rate': 9.66e-07, 'completion_length': 46.49107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1923828125, 'epoch': 0.03} 3%|▎ | 85/2500 
[14:44<7:25:13, 11.06s/it] 3%|▎ | 86/2500 [14:58<8:05:46, 12.07s/it] {'loss': 0.0078, 'grad_norm': 2.814634092912839, 'learning_rate': 9.656e-07, 'completion_length': 51.00893020629883, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.19580078125, 'epoch': 0.03} 3%|▎ | 86/2500 [14:58<8:05:46, 12.07s/it] 3%|▎ | 87/2500 [15:17<9:32:15, 14.23s/it] {'loss': 0.0084, 'grad_norm': 0.3258107570348934, 'learning_rate': 9.651999999999999e-07, 'completion_length': 50.56250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.208984375, 'epoch': 0.03} 3%|▎ | 87/2500 [15:18<9:32:15, 14.23s/it] 4%|▎ | 88/2500 [15:52<13:40:25, 20.41s/it] {'loss': 0.0079, 'grad_norm': 2.557437672788387, 'learning_rate': 9.647999999999999e-07, 'completion_length': 49.48214530944824, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.09138382971286774, 'kl': 0.1962890625, 'epoch': 0.04} 4%|▎ | 88/2500 [15:52<13:40:25, 20.41s/it] 4%|▎ | 89/2500 [16:07<12:34:00, 18.76s/it] {'loss': 0.0099, 'grad_norm': 0.4672926200486787, 'learning_rate': 9.644e-07, 'completion_length': 47.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.24658203125, 'epoch': 0.04} 4%|▎ | 89/2500 [16:07<12:34:00, 18.76s/it] 4%|▎ | 90/2500 [16:21<11:38:48, 17.40s/it] {'loss': 0.0073, 'grad_norm': 6.1674889955633345, 'learning_rate': 9.64e-07, 'completion_length': 56.64285850524902, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.14970265328884125, 'kl': 0.18310546875, 'epoch': 0.04} 4%|▎ | 90/2500 [16:21<11:38:48, 17.40s/it] 4%|▎ | 91/2500 [17:08<17:34:53, 26.27s/it] {'loss': 0.0099, 'grad_norm': 0.36493710104903954, 'learning_rate': 9.636e-07, 'completion_length': 
48.080360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.24853515625, 'epoch': 0.04} 4%|▎ | 91/2500 [17:08<17:34:53, 26.27s/it] 4%|▎ | 92/2500 [17:39<18:21:52, 27.46s/it] {'loss': 0.0091, 'grad_norm': 3.7439897649287364, 'learning_rate': 9.632e-07, 'completion_length': 51.36607360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553572535514832, 'reward_std': 0.08747542649507523, 'kl': 0.2275390625, 'epoch': 0.04} 4%|▎ | 92/2500 [17:39<18:21:52, 27.46s/it] 4%|▎ | 93/2500 [18:01<17:25:26, 26.06s/it] {'loss': 0.0096, 'grad_norm': 3.491190011775805, 'learning_rate': 9.628e-07, 'completion_length': 52.13393211364746, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0835726335644722, 'kl': 0.2412109375, 'epoch': 0.04} 4%|▎ | 93/2500 [18:01<17:25:26, 26.06s/it] 4%|▍ | 94/2500 [19:05<24:56:30, 37.32s/it] {'loss': 0.01, 'grad_norm': 0.34074788118769633, 'learning_rate': 9.624e-07, 'completion_length': 51.01785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.24951171875, 'epoch': 0.04} 4%|▍ | 94/2500 [19:05<24:56:30, 37.32s/it] 4%|▍ | 95/2500 [19:44<25:20:15, 37.93s/it] {'loss': 0.0074, 'grad_norm': 22.64230567294913, 'learning_rate': 9.619999999999999e-07, 'completion_length': 55.32143211364746, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.12175281345844269, 'kl': 0.1845703125, 'epoch': 0.04} 4%|▍ | 95/2500 [19:44<25:20:15, 37.93s/it] 4%|▍ | 96/2500 [20:07<22:13:28, 33.28s/it] {'loss': 0.0076, 'grad_norm': 3.9003874450048075, 'learning_rate': 9.616e-07, 'completion_length': 52.29464340209961, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.10882645472884178, 'kl': 
0.1904296875, 'epoch': 0.04} 4%|▍ | 96/2500 [20:07<22:13:28, 33.28s/it] 4%|▍ | 97/2500 [20:17<17:42:42, 26.53s/it] {'loss': 0.0067, 'grad_norm': 3.1224413976574183, 'learning_rate': 9.612e-07, 'completion_length': 56.90178871154785, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.16650390625, 'epoch': 0.04} 4%|▍ | 97/2500 [20:17<17:42:42, 26.53s/it] 4%|▍ | 98/2500 [20:28<14:33:15, 21.81s/it] {'loss': 0.0073, 'grad_norm': 2.259040028056323, 'learning_rate': 9.608e-07, 'completion_length': 58.04464530944824, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.1826171875, 'epoch': 0.04} 4%|▍ | 98/2500 [20:28<14:33:15, 21.81s/it] 4%|▍ | 99/2500 [20:45<13:37:14, 20.42s/it] {'loss': 0.0064, 'grad_norm': 4.694727422838043, 'learning_rate': 9.604e-07, 'completion_length': 60.08036231994629, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.883928656578064, 'reward_std': 0.1827620565891266, 'kl': 0.1591796875, 'epoch': 0.04} 4%|▍ | 99/2500 [20:45<13:37:14, 20.42s/it] 4%|▍ | 100/2500 [21:42<20:48:36, 31.22s/it] {'loss': 0.0056, 'grad_norm': 2.4105295492332472, 'learning_rate': 9.6e-07, 'completion_length': 62.142860412597656, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.140625, 'epoch': 0.04} 4%|▍ | 100/2500 [21:42<20:48:36, 31.22s/it] 4%|▍ | 101/2500 [23:35<37:10:22, 55.78s/it] {'loss': 0.0064, 'grad_norm': 5.144689931441991, 'learning_rate': 9.595999999999999e-07, 'completion_length': 58.205360412597656, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.12054043635725975, 'kl': 0.15966796875, 'epoch': 0.04} 4%|▍ | 101/2500 [23:35<37:10:22, 55.78s/it] 
4%|▍ | 102/2500 [23:57<30:30:30, 45.80s/it] {'loss': 0.006, 'grad_norm': 1.077838228600196, 'learning_rate': 9.592e-07, 'completion_length': 62.107147216796875, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.150390625, 'epoch': 0.04} 4%|▍ | 102/2500 [23:57<30:30:30, 45.80s/it] 4%|▍ | 103/2500 [25:01<34:07:37, 51.25s/it] {'loss': 0.0063, 'grad_norm': 3.713594617747095, 'learning_rate': 9.588e-07, 'completion_length': 63.44643211364746, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.919642984867096, 'reward_std': 0.14700662344694138, 'kl': 0.158203125, 'epoch': 0.04} 4%|▍ | 103/2500 [25:01<34:07:37, 51.25s/it] 4%|▍ | 104/2500 [26:50<45:32:45, 68.43s/it] {'loss': 0.0063, 'grad_norm': 2.806150180018411, 'learning_rate': 9.584e-07, 'completion_length': 66.22321701049805, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.14700663834810257, 'kl': 0.15771484375, 'epoch': 0.04} 4%|▍ | 104/2500 [26:50<45:32:45, 68.43s/it] 4%|▍ | 105/2500 [27:37<41:11:33, 61.92s/it] {'loss': 0.006, 'grad_norm': 2.000624591343667, 'learning_rate': 9.58e-07, 'completion_length': 69.21428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.14892578125, 'epoch': 0.04} 4%|▍ | 105/2500 [27:37<41:11:33, 61.92s/it] 4%|▍ | 106/2500 [28:40<41:25:19, 62.29s/it] {'loss': 0.0059, 'grad_norm': 2.6057507768662576, 'learning_rate': 9.576e-07, 'completion_length': 70.5625, 'rewards/accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8750000596046448, 'reward_std': 0.13756756484508514, 'kl': 0.14697265625, 'epoch': 0.04} 4%|▍ | 106/2500 [28:40<41:25:19, 62.29s/it] 4%|▍ | 107/2500 [30:08<46:29:38, 69.95s/it] {'loss': 0.007, 
'grad_norm': 1.7083980714075133, 'learning_rate': 9.572e-07, 'completion_length': 59.03571701049805, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.173828125, 'epoch': 0.04} 4%|▍ | 107/2500 [30:08<46:29:38, 69.95s/it] 4%|▍ | 108/2500 [31:50<52:57:27, 79.70s/it] {'loss': 0.0063, 'grad_norm': 0.5878782755224018, 'learning_rate': 9.567999999999999e-07, 'completion_length': 73.08929061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15869140625, 'epoch': 0.04} 4%|▍ | 108/2500 [31:50<52:57:27, 79.70s/it] 4%|▍ | 109/2500 [32:53<49:40:51, 74.80s/it] {'loss': 0.0057, 'grad_norm': 2.68646390347827, 'learning_rate': 9.564e-07, 'completion_length': 75.63393020629883, 'rewards/accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.17616324126720428, 'kl': 0.1416015625, 'epoch': 0.04} 4%|▍ | 109/2500 [32:54<49:40:51, 74.80s/it] 4%|▍ | 110/2500 [33:46<45:18:16, 68.24s/it] {'loss': 0.0059, 'grad_norm': 1.6508800780541193, 'learning_rate': 9.559999999999998e-07, 'completion_length': 61.312503814697266, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0835726335644722, 'kl': 0.14794921875, 'epoch': 0.04} 4%|▍ | 110/2500 [33:46<45:18:16, 68.24s/it] 4%|▍ | 111/2500 [34:11<36:40:13, 55.26s/it] {'loss': 0.0068, 'grad_norm': 2.4300933169523424, 'learning_rate': 9.556e-07, 'completion_length': 62.28571701049805, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343399360775948, 'kl': 0.169921875, 'epoch': 0.04} 4%|▍ | 111/2500 [34:11<36:40:13, 55.26s/it] 4%|▍ | 112/2500 [34:36<30:38:26, 46.19s/it] {'loss': 0.0067, 'grad_norm': 2.6754478653339375, 'learning_rate': 
9.552e-07, 'completion_length': 64.88393020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1669921875, 'epoch': 0.04} 4%|▍ | 112/2500 [34:36<30:38:26, 46.19s/it] 5%|▍ | 113/2500 [36:26<43:18:46, 65.32s/it] {'loss': 0.0068, 'grad_norm': 2.130023444413891, 'learning_rate': 9.548e-07, 'completion_length': 71.03571701049805, 'rewards/accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.12054043635725975, 'kl': 0.169921875, 'epoch': 0.05} 5%|▍ | 113/2500 [36:26<43:18:46, 65.32s/it] 5%|▍ | 114/2500 [37:39<44:49:35, 67.63s/it] {'loss': 0.0062, 'grad_norm': 1.2865066970075152, 'learning_rate': 9.544e-07, 'completion_length': 68.0535774230957, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375001192092896, 'reward_std': 0.12054043263196945, 'kl': 0.15380859375, 'epoch': 0.05} 5%|▍ | 114/2500 [37:39<44:49:35, 67.63s/it] 5%|▍ | 115/2500 [38:10<37:21:40, 56.39s/it] {'loss': 0.0062, 'grad_norm': 0.2087529960054656, 'learning_rate': 9.539999999999999e-07, 'completion_length': 62.062503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1552734375, 'epoch': 0.05} 5%|▍ | 115/2500 [38:10<37:21:40, 56.39s/it] 5%|▍ | 116/2500 [38:46<33:27:19, 50.52s/it] {'loss': 0.0082, 'grad_norm': 2.319862727595068, 'learning_rate': 9.536e-07, 'completion_length': 50.392860412597656, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.2041015625, 'epoch': 0.05} 5%|▍ | 116/2500 [38:46<33:27:19, 50.52s/it] 5%|▍ | 117/2500 [40:24<42:50:49, 64.73s/it] {'loss': 0.0056, 'grad_norm': 0.9277725797016175, 'learning_rate': 9.532e-07, 'completion_length': 63.35714530944824, 'rewards/accuracy_reward': 0.9642857313156128, 
'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.140380859375, 'epoch': 0.05} 5%|▍ | 117/2500 [40:24<42:50:49, 64.73s/it] 5%|▍ | 118/2500 [42:25<53:52:07, 81.41s/it] {'loss': 0.0082, 'grad_norm': 0.6311981524657682, 'learning_rate': 9.527999999999999e-07, 'completion_length': 56.267860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2060546875, 'epoch': 0.05} 5%|▍ | 118/2500 [42:25<53:52:07, 81.41s/it] 5%|▍ | 119/2500 [43:50<54:37:34, 82.59s/it] {'loss': 0.0074, 'grad_norm': 1.894693510583094, 'learning_rate': 9.524e-07, 'completion_length': 65.0089340209961, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.11394162103533745, 'kl': 0.18408203125, 'epoch': 0.05} 5%|▍ | 119/2500 [43:50<54:37:34, 82.59s/it] 5%|▍ | 120/2500 [44:16<43:28:14, 65.75s/it] {'loss': 0.0081, 'grad_norm': 1.0142262496768302, 'learning_rate': 9.52e-07, 'completion_length': 66.73214721679688, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.2021484375, 'epoch': 0.05} 5%|▍ | 120/2500 [44:16<43:28:14, 65.75s/it] 5%|▍ | 121/2500 [45:14<41:52:02, 63.36s/it] {'loss': 0.0065, 'grad_norm': 0.5801250741286851, 'learning_rate': 9.515999999999999e-07, 'completion_length': 63.29464530944824, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.16259765625, 'epoch': 0.05} 5%|▍ | 121/2500 [45:14<41:52:02, 63.36s/it] 5%|▍ | 122/2500 [45:51<36:37:56, 55.46s/it] {'loss': 0.007, 'grad_norm': 6.833351409702556, 'learning_rate': 9.512e-07, 'completion_length': 61.50000190734863, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.919642984867096, 
'reward_std': 0.14700663089752197, 'kl': 0.17578125, 'epoch': 0.05} 5%|▍ | 122/2500 [45:51<36:37:56, 55.46s/it] 5%|▍ | 123/2500 [46:16<30:33:35, 46.28s/it] {'loss': 0.007, 'grad_norm': 1.276579053243242, 'learning_rate': 9.508e-07, 'completion_length': 63.43750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.17578125, 'epoch': 0.05} 5%|▍ | 123/2500 [46:16<30:33:35, 46.28s/it] 5%|▍ | 124/2500 [46:49<27:58:29, 42.39s/it] {'loss': 0.007, 'grad_norm': 3.9338629722615583, 'learning_rate': 9.503999999999999e-07, 'completion_length': 61.53571701049805, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.17626953125, 'epoch': 0.05} 5%|▍ | 124/2500 [46:49<27:58:29, 42.39s/it] 5%|▌ | 125/2500 [47:39<29:20:26, 44.47s/it] {'loss': 0.0069, 'grad_norm': 2.946475594367466, 'learning_rate': 9.499999999999999e-07, 'completion_length': 64.45536231994629, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.883928656578064, 'reward_std': 0.19239908456802368, 'kl': 0.1728515625, 'epoch': 0.05} 5%|▌ | 125/2500 [47:39<29:20:26, 44.47s/it] 5%|▌ | 126/2500 [48:01<25:00:56, 37.93s/it] {'loss': 0.0062, 'grad_norm': 1.1604657353917895, 'learning_rate': 9.496e-07, 'completion_length': 56.65178871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1552734375, 'epoch': 0.05} 5%|▌ | 126/2500 [48:01<25:00:56, 37.93s/it] 5%|▌ | 127/2500 [48:44<25:49:46, 39.19s/it] {'loss': 0.0073, 'grad_norm': 1.9595416816931035, 'learning_rate': 9.492e-07, 'completion_length': 56.96428871154785, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.09528661891818047, 'kl': 0.1826171875, 
'epoch': 0.05} 5%|▌ | 127/2500 [48:44<25:49:46, 39.19s/it] 5%|▌ | 128/2500 [49:17<24:41:01, 37.46s/it] {'loss': 0.0074, 'grad_norm': 1.3031478802719807, 'learning_rate': 9.487999999999999e-07, 'completion_length': 55.45535850524902, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.07839837670326233, 'kl': 0.1845703125, 'epoch': 0.05} 5%|▌ | 128/2500 [49:17<24:41:01, 37.46s/it] 5%|▌ | 129/2500 [50:02<26:15:13, 39.86s/it] {'loss': 0.0088, 'grad_norm': 1.9964439804590464, 'learning_rate': 9.484e-07, 'completion_length': 53.37500190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.2197265625, 'epoch': 0.05} 5%|▌ | 129/2500 [50:02<26:15:13, 39.86s/it] 5%|▌ | 130/2500 [51:30<35:35:50, 54.07s/it] {'loss': 0.0069, 'grad_norm': 4.727533708047264, 'learning_rate': 9.479999999999999e-07, 'completion_length': 53.36607360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.17236328125, 'epoch': 0.05} 5%|▌ | 130/2500 [51:30<35:35:50, 54.07s/it] 5%|▌ | 131/2500 [52:26<36:04:03, 54.81s/it] {'loss': 0.0091, 'grad_norm': 1.498174635056329, 'learning_rate': 9.475999999999999e-07, 'completion_length': 52.05357360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.10882644355297089, 'kl': 0.22802734375, 'epoch': 0.05} 5%|▌ | 131/2500 [52:26<36:04:03, 54.81s/it] 5%|▌ | 132/2500 [53:35<38:44:40, 58.90s/it] {'loss': 0.0076, 'grad_norm': 1.1428945079644877, 'learning_rate': 9.472e-07, 'completion_length': 46.91964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.19091796875, 
'epoch': 0.05} 5%|▌ | 132/2500 [53:35<38:44:40, 58.90s/it] 5%|▌ | 133/2500 [55:52<54:14:33, 82.50s/it] {'loss': 0.0081, 'grad_norm': 0.5356214971238241, 'learning_rate': 9.468e-07, 'completion_length': 49.86607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.20263671875, 'epoch': 0.05} 5%|▌ | 133/2500 [55:52<54:14:33, 82.50s/it] 5%|▌ | 134/2500 [56:16<42:34:21, 64.78s/it] {'loss': 0.0079, 'grad_norm': 2.9235496316200065, 'learning_rate': 9.464e-07, 'completion_length': 46.80357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.19775390625, 'epoch': 0.05} 5%|▌ | 134/2500 [56:16<42:34:21, 64.78s/it] 5%|▌ | 135/2500 [56:39<34:22:45, 52.33s/it] {'loss': 0.0101, 'grad_norm': 1.7394008173536268, 'learning_rate': 9.459999999999999e-07, 'completion_length': 54.10714530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.25341796875, 'epoch': 0.05} 5%|▌ | 135/2500 [56:39<34:22:45, 52.33s/it] 5%|▌ | 136/2500 [57:04<29:01:41, 44.21s/it] {'loss': 0.0109, 'grad_norm': 6.353410542755421, 'learning_rate': 9.456e-07, 'completion_length': 37.848215103149414, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9017857909202576, 'reward_std': 0.0964989997446537, 'kl': 0.2724609375, 'epoch': 0.05} 5%|▌ | 136/2500 [57:04<29:01:41, 44.21s/it] 5%|▌ | 137/2500 [57:29<25:13:59, 38.44s/it] {'loss': 0.0112, 'grad_norm': 3.217581181089829, 'learning_rate': 9.452e-07, 'completion_length': 40.58035850524902, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.883928656578064, 'reward_std': 0.11394162103533745, 'kl': 0.279296875, 'epoch': 0.05} 5%|▌ | 137/2500 [57:29<25:13:59, 38.44s/it] 6%|▌ | 138/2500 [57:52<22:14:41, 33.90s/it] 
{'loss': 0.0127, 'grad_norm': 4.862289301079396, 'learning_rate': 9.447999999999999e-07, 'completion_length': 32.875000953674316, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.31640625, 'epoch': 0.06} 6%|▌ | 138/2500 [57:52<22:14:41, 33.90s/it] 6%|▌ | 139/2500 [59:43<37:21:36, 56.97s/it] {'loss': 0.0091, 'grad_norm': 3.301946219338089, 'learning_rate': 9.444e-07, 'completion_length': 45.85714530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.22802734375, 'epoch': 0.06} 6%|▌ | 139/2500 [59:43<37:21:36, 56.97s/it] 6%|▌ | 140/2500 [1:00:07<30:45:42, 46.92s/it] {'loss': 0.0084, 'grad_norm': 3.5345896085125457, 'learning_rate': 9.439999999999999e-07, 'completion_length': 47.16964530944824, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553572535514832, 'reward_std': 0.10882645100355148, 'kl': 0.21044921875, 'epoch': 0.06} 6%|▌ | 140/2500 [1:00:07<30:45:42, 46.92s/it] 6%|▌ | 141/2500 [1:00:28<25:40:43, 39.19s/it] {'loss': 0.0087, 'grad_norm': 1.2236096559521739, 'learning_rate': 9.436e-07, 'completion_length': 59.410715103149414, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.216796875, 'epoch': 0.06} 6%|▌ | 141/2500 [1:00:28<25:40:43, 39.19s/it] 6%|▌ | 142/2500 [1:00:52<22:42:32, 34.67s/it] {'loss': 0.0095, 'grad_norm': 6.982163588499465, 'learning_rate': 9.432e-07, 'completion_length': 43.44643020629883, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.14700663089752197, 'kl': 0.23583984375, 'epoch': 0.06} 6%|▌ | 142/2500 [1:00:52<22:42:32, 34.67s/it] 6%|▌ | 143/2500 [1:01:15<20:27:28, 31.25s/it] {'loss': 0.0087, 'grad_norm': 
3.6745881896819674, 'learning_rate': 9.427999999999999e-07, 'completion_length': 45.06250190734863, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.06222161278128624, 'kl': 0.21875, 'epoch': 0.06} 6%|▌ | 143/2500 [1:01:15<20:27:28, 31.25s/it] 6%|▌ | 144/2500 [1:01:45<20:08:12, 30.77s/it] {'loss': 0.0084, 'grad_norm': 8.958322803545965, 'learning_rate': 9.424e-07, 'completion_length': 56.44643020629883, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.13408026099205017, 'kl': 0.208984375, 'epoch': 0.06} 6%|▌ | 144/2500 [1:01:45<20:08:12, 30.77s/it] 6%|▌ | 145/2500 [1:02:11<19:09:15, 29.28s/it] {'loss': 0.0063, 'grad_norm': 1.1859465430757974, 'learning_rate': 9.419999999999999e-07, 'completion_length': 62.23214530944824, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553572535514832, 'reward_std': 0.08747542649507523, 'kl': 0.1572265625, 'epoch': 0.06} 6%|▌ | 145/2500 [1:02:11<19:09:15, 29.28s/it] 6%|▌ | 146/2500 [1:02:35<18:09:35, 27.77s/it] {'loss': 0.0077, 'grad_norm': 0.3113275822075706, 'learning_rate': 9.415999999999999e-07, 'completion_length': 48.062503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1923828125, 'epoch': 0.06} 6%|▌ | 146/2500 [1:02:35<18:09:35, 27.77s/it] 6%|▌ | 147/2500 [1:02:59<17:19:55, 26.52s/it] {'loss': 0.0071, 'grad_norm': 0.3835968018915491, 'learning_rate': 9.412e-07, 'completion_length': 53.25893211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.177734375, 'epoch': 0.06} 6%|▌ | 147/2500 [1:02:59<17:19:55, 26.52s/it] 6%|▌ | 148/2500 [1:04:08<25:46:36, 39.45s/it] {'loss': 0.0097, 'grad_norm': 1.5445353530147403, 'learning_rate': 9.408e-07, 'completion_length': 49.36607360839844, 
'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.2412109375, 'epoch': 0.06} 6%|▌ | 148/2500 [1:04:08<25:46:36, 39.45s/it] 6%|▌ | 149/2500 [1:04:39<24:06:39, 36.92s/it] {'loss': 0.0069, 'grad_norm': 0.23402351740928432, 'learning_rate': 9.403999999999999e-07, 'completion_length': 61.75893211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.17138671875, 'epoch': 0.06} 6%|▌ | 149/2500 [1:04:39<24:06:39, 36.92s/it] 6%|▌ | 150/2500 [1:05:21<25:04:03, 38.40s/it] {'loss': 0.0084, 'grad_norm': 0.3434302221304558, 'learning_rate': 9.399999999999999e-07, 'completion_length': 57.13393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.20947265625, 'epoch': 0.06} 6%|▌ | 150/2500 [1:05:21<25:04:03, 38.40s/it] 6%|▌ | 151/2500 [1:05:50<23:10:23, 35.51s/it] {'loss': 0.0046, 'grad_norm': 0.39755060404337333, 'learning_rate': 9.396e-07, 'completion_length': 69.39286041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114013671875, 'epoch': 0.06} 6%|▌ | 151/2500 [1:05:50<23:10:23, 35.51s/it] 6%|▌ | 152/2500 [1:06:15<21:13:42, 32.55s/it] {'loss': 0.0044, 'grad_norm': 2.584611229091915, 'learning_rate': 9.391999999999999e-07, 'completion_length': 76.46429061889648, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.110595703125, 'epoch': 0.06} 6%|▌ | 152/2500 [1:06:15<21:13:42, 32.55s/it] 6%|▌ | 153/2500 [1:07:29<29:19:14, 44.97s/it] {'loss': 0.0047, 'grad_norm': 1.5750091763742335, 'learning_rate': 9.387999999999999e-07, 'completion_length': 73.00893020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.117431640625, 'epoch': 0.06} 6%|▌ | 153/2500 [1:07:29<29:19:14, 44.97s/it] 6%|▌ | 154/2500 [1:08:55<37:20:16, 57.30s/it] {'loss': 0.0041, 'grad_norm': 0.7672169834139426, 'learning_rate': 9.384e-07, 'completion_length': 79.21428680419922, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.10302734375, 'epoch': 0.06} 6%|▌ | 154/2500 [1:08:56<37:20:16, 57.30s/it] 6%|▌ | 155/2500 [1:09:23<31:31:53, 48.41s/it] {'loss': 0.0047, 'grad_norm': 1.4968494328908397, 'learning_rate': 9.379999999999998e-07, 'completion_length': 72.91964721679688, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.09138382598757744, 'kl': 0.117431640625, 'epoch': 0.06} 6%|▌ | 155/2500 [1:09:23<31:31:53, 48.41s/it] 6%|▌ | 156/2500 [1:10:54<39:46:23, 61.08s/it] {'loss': 0.0049, 'grad_norm': 1.5319169564329498, 'learning_rate': 9.375999999999999e-07, 'completion_length': 83.84822082519531, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.06222161278128624, 'kl': 0.1220703125, 'epoch': 0.06} 6%|▌ | 156/2500 [1:10:54<39:46:23, 61.08s/it] 6%|▋ | 157/2500 [1:13:00<52:24:52, 80.53s/it] {'loss': 0.0051, 'grad_norm': 1.6225000666891893, 'learning_rate': 9.372e-07, 'completion_length': 76.64286041259766, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.07003280520439148, 'kl': 0.127197265625, 'epoch': 0.06} 6%|▋ | 157/2500 [1:13:00<52:24:52, 80.53s/it] 6%|▋ | 158/2500 [1:13:25<41:30:33, 63.81s/it] {'loss': 0.0038, 'grad_norm': 0.24778355997450294, 'learning_rate': 9.368e-07, 'completion_length': 77.67857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.094970703125, 'epoch': 0.06} 6%|▋ | 158/2500 [1:13:25<41:30:33, 63.81s/it] 6%|▋ | 159/2500 [1:13:51<34:10:36, 52.56s/it] {'loss': 0.0054, 'grad_norm': 0.38589582236273456, 'learning_rate': 9.363999999999999e-07, 'completion_length': 78.83036041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.134521484375, 'epoch': 0.06} 6%|▋ | 159/2500 [1:13:51<34:10:36, 52.56s/it] 6%|▋ | 160/2500 [1:14:14<28:31:45, 43.89s/it] {'loss': 0.0041, 'grad_norm': 2.500657037563095, 'learning_rate': 9.36e-07, 'completion_length': 80.8839340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.103759765625, 'epoch': 0.06} 6%|▋ | 160/2500 [1:14:14<28:31:45, 43.89s/it] 6%|▋ | 161/2500 [1:14:39<24:45:52, 38.12s/it] {'loss': 0.0044, 'grad_norm': 2.062576438219868, 'learning_rate': 9.356e-07, 'completion_length': 75.75, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.09528661891818047, 'kl': 0.109130859375, 'epoch': 0.06} 6%|▋ | 161/2500 [1:14:39<24:45:52, 38.12s/it] 6%|▋ | 162/2500 [1:15:08<22:59:33, 35.40s/it] {'loss': 0.0047, 'grad_norm': 2.353687489175249, 'learning_rate': 9.352e-07, 'completion_length': 70.98214721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.117431640625, 'epoch': 0.06} 6%|▋ | 162/2500 [1:15:08<22:59:33, 35.40s/it] 7%|▋ | 163/2500 [1:16:18<29:38:54, 45.67s/it] {'loss': 0.0057, 'grad_norm': 1.7998323730163406, 'learning_rate': 9.347999999999999e-07, 'completion_length': 63.062503814697266, 'rewards/accuracy_reward': 0.830357164144516, 'rewards/format_reward': 1.0, 'reward': 1.8303572535514832, 'reward_std': 0.08747542649507523, 'kl': 0.142822265625, 
'epoch': 0.07} 7%|▋ | 163/2500 [1:16:18<29:38:54, 45.67s/it] 7%|▋ | 164/2500 [1:16:42<25:30:08, 39.30s/it] {'loss': 0.004, 'grad_norm': 4.701779475437639, 'learning_rate': 9.344e-07, 'completion_length': 71.25000381469727, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9017857909202576, 'reward_std': 0.12054044008255005, 'kl': 0.099365234375, 'epoch': 0.07} 7%|▋ | 164/2500 [1:16:42<25:30:08, 39.30s/it] 7%|▋ | 165/2500 [1:17:05<22:14:11, 34.28s/it] {'loss': 0.0042, 'grad_norm': 0.15388148042595406, 'learning_rate': 9.34e-07, 'completion_length': 71.5535774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1044921875, 'epoch': 0.07} 7%|▋ | 165/2500 [1:17:05<22:14:11, 34.28s/it] 7%|▋ | 166/2500 [1:17:32<20:50:08, 32.14s/it] {'loss': 0.0046, 'grad_norm': 1.7243971227325265, 'learning_rate': 9.335999999999999e-07, 'completion_length': 76.49107360839844, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.11376953125, 'epoch': 0.07} 7%|▋ | 166/2500 [1:17:32<20:50:08, 32.14s/it] 7%|▋ | 167/2500 [1:17:58<19:44:01, 30.45s/it] {'loss': 0.0048, 'grad_norm': 0.1487512569800048, 'learning_rate': 9.332e-07, 'completion_length': 69.69643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12109375, 'epoch': 0.07} 7%|▋ | 167/2500 [1:17:59<19:44:01, 30.45s/it] 7%|▋ | 168/2500 [1:18:23<18:32:37, 28.63s/it] {'loss': 0.0039, 'grad_norm': 5.529435312714815, 'learning_rate': 9.327999999999999e-07, 'completion_length': 66.80357551574707, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.09528662264347076, 'kl': 0.0966796875, 'epoch': 0.07} 7%|▋ | 168/2500 [1:18:23<18:32:37, 28.63s/it] 7%|▋ | 169/2500 [1:18:50<18:13:16, 28.14s/it] {'loss': 0.0039, 
'grad_norm': 14.837470566751216, 'learning_rate': 9.324e-07, 'completion_length': 84.80357360839844, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.15933407470583916, 'kl': 0.0966796875, 'epoch': 0.07} 7%|▋ | 169/2500 [1:18:50<18:13:16, 28.14s/it] 7%|▋ | 170/2500 [1:19:17<17:56:23, 27.72s/it] {'loss': 0.0042, 'grad_norm': 1.673482740256251, 'learning_rate': 9.32e-07, 'completion_length': 88.53571701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.10498046875, 'epoch': 0.07} 7%|▋ | 170/2500 [1:19:17<17:56:23, 27.72s/it] 7%|▋ | 171/2500 [1:19:43<17:37:28, 27.24s/it] {'loss': 0.0046, 'grad_norm': 2.8398667124982597, 'learning_rate': 9.315999999999999e-07, 'completion_length': 73.06250381469727, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.10882645472884178, 'kl': 0.11572265625, 'epoch': 0.07} 7%|▋ | 171/2500 [1:19:43<17:37:28, 27.24s/it] 7%|▋ | 172/2500 [1:20:11<17:44:52, 27.45s/it] {'loss': 0.0038, 'grad_norm': 1.937038392444324, 'learning_rate': 9.312e-07, 'completion_length': 81.93750381469727, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.095703125, 'epoch': 0.07} 7%|▋ | 172/2500 [1:20:11<17:44:52, 27.45s/it] 7%|▋ | 173/2500 [1:20:42<18:24:44, 28.49s/it] {'loss': 0.0038, 'grad_norm': 3.012140736461382, 'learning_rate': 9.307999999999999e-07, 'completion_length': 90.84821701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9375000596046448, 'reward_std': 0.1379830539226532, 'kl': 0.095947265625, 'epoch': 0.07} 7%|▋ | 173/2500 [1:20:42<18:24:44, 28.49s/it] 7%|▋ | 174/2500 [1:21:07<17:52:44, 27.67s/it] {'loss': 0.0046, 'grad_norm': 
2.1092135421691816, 'learning_rate': 9.303999999999999e-07, 'completion_length': 78.45536041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11572265625, 'epoch': 0.07} 7%|▋ | 174/2500 [1:21:07<17:52:44, 27.67s/it] 7%|▋ | 175/2500 [1:21:35<17:47:51, 27.56s/it] {'loss': 0.0041, 'grad_norm': 0.9950392900841121, 'learning_rate': 9.3e-07, 'completion_length': 87.53571701049805, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.946428656578064, 'reward_std': 0.13408026099205017, 'kl': 0.101318359375, 'epoch': 0.07} 7%|▋ | 175/2500 [1:21:35<17:47:51, 27.56s/it] 7%|▋ | 176/2500 [1:22:00<17:18:03, 26.80s/it] {'loss': 0.0038, 'grad_norm': 0.18099726826676277, 'learning_rate': 9.296e-07, 'completion_length': 84.34822082519531, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09619140625, 'epoch': 0.07} 7%|▋ | 176/2500 [1:22:00<17:18:03, 26.80s/it] 7%|▋ | 177/2500 [1:22:22<16:30:14, 25.58s/it] {'loss': 0.0044, 'grad_norm': 0.6421798198298444, 'learning_rate': 9.292e-07, 'completion_length': 75.18750381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.109130859375, 'epoch': 0.07} 7%|▋ | 177/2500 [1:22:22<16:30:14, 25.58s/it] 7%|▋ | 178/2500 [1:22:45<15:52:54, 24.62s/it] {'loss': 0.0047, 'grad_norm': 1.0540477610029328, 'learning_rate': 9.287999999999999e-07, 'completion_length': 82.00000381469727, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.06222161278128624, 'kl': 0.118408203125, 'epoch': 0.07} 7%|▋ | 178/2500 [1:22:45<15:52:54, 24.62s/it] 7%|▋ | 179/2500 [1:23:09<15:50:17, 24.57s/it] {'loss': 0.0057, 'grad_norm': 0.9492416890536377, 'learning_rate': 9.284e-07, 
'completion_length': 64.27678871154785, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.1416015625, 'epoch': 0.07} 7%|▋ | 179/2500 [1:23:09<15:50:17, 24.57s/it] 7%|▋ | 180/2500 [1:23:35<15:59:52, 24.82s/it] {'loss': 0.0042, 'grad_norm': 2.2568832967315338, 'learning_rate': 9.28e-07, 'completion_length': 70.42857360839844, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.10431019216775894, 'kl': 0.10498046875, 'epoch': 0.07} 7%|▋ | 180/2500 [1:23:35<15:59:52, 24.82s/it] 7%|▋ | 181/2500 [1:24:00<16:03:43, 24.93s/it] {'loss': 0.0043, 'grad_norm': 0.3525027437996458, 'learning_rate': 9.275999999999999e-07, 'completion_length': 74.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10693359375, 'epoch': 0.07} 7%|▋ | 181/2500 [1:24:00<16:03:43, 24.93s/it] 7%|▋ | 182/2500 [1:24:24<15:55:04, 24.72s/it] {'loss': 0.0048, 'grad_norm': 3.018464285610708, 'learning_rate': 9.272e-07, 'completion_length': 66.42857551574707, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285715222358704, 'reward_std': 0.0835726335644722, 'kl': 0.12109375, 'epoch': 0.07} 7%|▋ | 182/2500 [1:24:24<15:55:04, 24.72s/it] 7%|▋ | 183/2500 [1:25:10<19:59:24, 31.06s/it] {'loss': 0.0056, 'grad_norm': 1.277539175054314, 'learning_rate': 9.268e-07, 'completion_length': 60.017860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.139404296875, 'epoch': 0.07} 7%|▋ | 183/2500 [1:25:10<19:59:24, 31.06s/it] 7%|▋ | 184/2500 [1:26:01<23:48:51, 37.02s/it] {'loss': 0.006, 'grad_norm': 0.3110093812827746, 'learning_rate': 9.263999999999999e-07, 'completion_length': 54.04464530944824, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14990234375, 'epoch': 0.07} 7%|▋ | 184/2500 [1:26:01<23:48:51, 37.02s/it] 7%|▋ | 185/2500 [1:26:41<24:29:57, 38.10s/it] {'loss': 0.0064, 'grad_norm': 1.312240491231449, 'learning_rate': 9.26e-07, 'completion_length': 44.50893020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1611328125, 'epoch': 0.07} 7%|▋ | 185/2500 [1:26:41<24:29:57, 38.10s/it] 7%|▋ | 186/2500 [1:27:23<25:06:36, 39.06s/it] {'loss': 0.0079, 'grad_norm': 0.4541485052829793, 'learning_rate': 9.256e-07, 'completion_length': 41.83035850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1982421875, 'epoch': 0.07} 7%|▋ | 186/2500 [1:27:23<25:06:36, 39.06s/it] 7%|▋ | 187/2500 [1:28:07<26:00:42, 40.49s/it] {'loss': 0.0067, 'grad_norm': 1.6823221605146783, 'learning_rate': 9.251999999999999e-07, 'completion_length': 43.86607360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.16796875, 'epoch': 0.07} 7%|▋ | 187/2500 [1:28:07<26:00:42, 40.49s/it] 8%|▊ | 188/2500 [1:28:37<24:04:12, 37.48s/it] {'loss': 0.0078, 'grad_norm': 0.9541188635362408, 'learning_rate': 9.247999999999999e-07, 'completion_length': 41.63393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1943359375, 'epoch': 0.08} 8%|▊ | 188/2500 [1:28:37<24:04:12, 37.48s/it] 8%|▊ | 189/2500 [1:29:16<24:26:27, 38.07s/it] {'loss': 0.0074, 'grad_norm': 1.8738205444257607, 'learning_rate': 9.244e-07, 'completion_length': 47.49107360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 
0.18505859375, 'epoch': 0.08} 8%|▊ | 189/2500 [1:29:17<24:26:27, 38.07s/it] 8%|▊ | 190/2500 [1:29:46<22:49:21, 35.57s/it] {'loss': 0.0063, 'grad_norm': 1.4714943821525375, 'learning_rate': 9.24e-07, 'completion_length': 61.687503814697266, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8660714626312256, 'reward_std': 0.08747542649507523, 'kl': 0.15673828125, 'epoch': 0.08} 8%|▊ | 190/2500 [1:29:46<22:49:21, 35.57s/it] 8%|▊ | 191/2500 [1:30:08<20:07:32, 31.38s/it] {'loss': 0.0094, 'grad_norm': 1.8085218479098872, 'learning_rate': 9.235999999999999e-07, 'completion_length': 49.65178871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.236328125, 'epoch': 0.08} 8%|▊ | 191/2500 [1:30:08<20:07:32, 31.38s/it] 8%|▊ | 192/2500 [1:30:58<23:38:25, 36.87s/it] {'loss': 0.0057, 'grad_norm': 2.375315642296309, 'learning_rate': 9.232e-07, 'completion_length': 56.616071701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0835726410150528, 'kl': 0.143310546875, 'epoch': 0.08} 8%|▊ | 192/2500 [1:30:58<23:38:25, 36.87s/it] 8%|▊ | 193/2500 [1:31:26<21:58:02, 34.28s/it] {'loss': 0.007, 'grad_norm': 3.569958600684262, 'learning_rate': 9.227999999999999e-07, 'completion_length': 46.09821701049805, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.06613001227378845, 'kl': 0.17431640625, 'epoch': 0.08} 8%|▊ | 193/2500 [1:31:26<21:58:02, 34.28s/it] 8%|▊ | 194/2500 [1:32:03<22:30:42, 35.14s/it] {'loss': 0.0051, 'grad_norm': 2.593989451412725, 'learning_rate': 9.224e-07, 'completion_length': 48.53571701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.1279296875, 'epoch': 
0.08} 8%|▊ | 194/2500 [1:32:03<22:30:42, 35.14s/it] 8%|▊ | 195/2500 [1:32:35<21:56:23, 34.27s/it] {'loss': 0.0061, 'grad_norm': 2.14105011887334, 'learning_rate': 9.22e-07, 'completion_length': 46.48214530944824, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.06222161278128624, 'kl': 0.1533203125, 'epoch': 0.08} 8%|▊ | 195/2500 [1:32:35<21:56:23, 34.27s/it] 8%|▊ | 196/2500 [1:33:11<22:15:29, 34.78s/it] {'loss': 0.0065, 'grad_norm': 1.330679569507152, 'learning_rate': 9.215999999999999e-07, 'completion_length': 58.05357551574707, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.07839837670326233, 'kl': 0.16259765625, 'epoch': 0.08} 8%|▊ | 196/2500 [1:33:11<22:15:29, 34.78s/it] 8%|▊ | 197/2500 [1:33:53<23:33:38, 36.83s/it] {'loss': 0.0057, 'grad_norm': 1.8517681898844076, 'learning_rate': 9.212e-07, 'completion_length': 62.23214340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.14208984375, 'epoch': 0.08} 8%|▊ | 197/2500 [1:33:53<23:33:38, 36.83s/it] 8%|▊ | 198/2500 [1:34:22<22:07:47, 34.61s/it] {'loss': 0.007, 'grad_norm': 1.3651016818175572, 'learning_rate': 9.207999999999999e-07, 'completion_length': 57.71428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1748046875, 'epoch': 0.08} 8%|▊ | 198/2500 [1:34:22<22:07:47, 34.61s/it] 8%|▊ | 199/2500 [1:35:44<31:15:02, 48.89s/it] {'loss': 0.0046, 'grad_norm': 0.3010288232956711, 'learning_rate': 9.203999999999999e-07, 'completion_length': 61.73214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11572265625, 'epoch': 0.08} 8%|▊ | 199/2500 [1:35:44<31:15:02, 48.89s/it] 8%|▊ | 
200/2500 [1:36:47<33:54:49, 53.08s/it] {'loss': 0.006, 'grad_norm': 0.3227258650354649, 'learning_rate': 9.2e-07, 'completion_length': 65.52679061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14990234375, 'epoch': 0.08} 8%|▊ | 200/2500 [1:36:47<33:54:49, 53.08s/it] 8%|▊ | 201/2500 [1:37:59<37:24:12, 58.57s/it] {'loss': 0.0042, 'grad_norm': 0.7754295354344658, 'learning_rate': 9.196e-07, 'completion_length': 72.38393020629883, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.1044921875, 'epoch': 0.08} 8%|▊ | 201/2500 [1:37:59<37:24:12, 58.57s/it] 8%|▊ | 202/2500 [1:38:42<34:33:58, 54.15s/it] {'loss': 0.0069, 'grad_norm': 1.5150611720406313, 'learning_rate': 9.192e-07, 'completion_length': 54.330360412597656, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.05050762742757797, 'kl': 0.171630859375, 'epoch': 0.08} 8%|▊ | 202/2500 [1:38:43<34:33:58, 54.15s/it] 8%|▊ | 203/2500 [1:40:05<40:02:23, 62.75s/it] {'loss': 0.0046, 'grad_norm': 2.921616134865685, 'learning_rate': 9.187999999999999e-07, 'completion_length': 65.10714530944824, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0835726372897625, 'kl': 0.11474609375, 'epoch': 0.08} 8%|▊ | 203/2500 [1:40:05<40:02:23, 62.75s/it] 8%|▊ | 204/2500 [1:41:19<42:05:23, 65.99s/it] {'loss': 0.0071, 'grad_norm': 0.14666298554484236, 'learning_rate': 9.184e-07, 'completion_length': 47.48214340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.17724609375, 'epoch': 0.08} 8%|▊ | 204/2500 [1:41:19<42:05:23, 65.99s/it] 8%|▊ | 205/2500 [1:41:47<34:47:51, 54.58s/it] {'loss': 0.0096, 'grad_norm': 2.331191677273994, 
'learning_rate': 9.18e-07, 'completion_length': 44.142860412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.23876953125, 'epoch': 0.08} 8%|▊ | 205/2500 [1:41:47<34:47:51, 54.58s/it] 8%|▊ | 206/2500 [1:43:01<38:27:10, 60.34s/it] {'loss': 0.0088, 'grad_norm': 5.793344194573152, 'learning_rate': 9.175999999999999e-07, 'completion_length': 47.66964530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.220703125, 'epoch': 0.08} 8%|▊ | 206/2500 [1:43:01<38:27:10, 60.34s/it] 8%|▊ | 207/2500 [1:44:51<48:00:58, 75.39s/it] {'loss': 0.0078, 'grad_norm': 4.745516501633977, 'learning_rate': 9.172e-07, 'completion_length': 47.49107360839844, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9017857313156128, 'reward_std': 0.07576143741607666, 'kl': 0.1953125, 'epoch': 0.08} 8%|▊ | 207/2500 [1:44:51<48:00:58, 75.39s/it] 8%|▊ | 208/2500 [1:45:20<39:05:45, 61.41s/it] {'loss': 0.0063, 'grad_norm': 3.280911187800383, 'learning_rate': 9.168e-07, 'completion_length': 58.90178871154785, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.0901123583316803, 'kl': 0.15771484375, 'epoch': 0.08} 8%|▊ | 208/2500 [1:45:20<39:05:45, 61.41s/it] 8%|▊ | 209/2500 [1:45:48<32:38:25, 51.29s/it] {'loss': 0.0075, 'grad_norm': 0.28202968674107187, 'learning_rate': 9.163999999999999e-07, 'completion_length': 49.73214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1875, 'epoch': 0.08} 8%|▊ | 209/2500 [1:45:48<32:38:25, 51.29s/it] 8%|▊ | 210/2500 [1:46:14<27:52:48, 43.83s/it] {'loss': 0.0068, 'grad_norm': 1.822601634112234, 'learning_rate': 9.16e-07, 'completion_length': 59.53571701049805, 
'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.16943359375, 'epoch': 0.08} 8%|▊ | 210/2500 [1:46:14<27:52:48, 43.83s/it] 8%|▊ | 211/2500 [1:46:52<26:43:29, 42.03s/it] {'loss': 0.0058, 'grad_norm': 3.8929697084691783, 'learning_rate': 9.156e-07, 'completion_length': 65.57143020629883, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.07514797896146774, 'kl': 0.1455078125, 'epoch': 0.08} 8%|▊ | 211/2500 [1:46:52<26:43:29, 42.03s/it] 8%|▊ | 212/2500 [1:48:10<33:40:16, 52.98s/it] {'loss': 0.0065, 'grad_norm': 0.4272258001443233, 'learning_rate': 9.151999999999999e-07, 'completion_length': 73.41964721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.16357421875, 'epoch': 0.08} 8%|▊ | 212/2500 [1:48:10<33:40:16, 52.98s/it] 9%|▊ | 213/2500 [1:48:37<28:40:52, 45.15s/it] {'loss': 0.0055, 'grad_norm': 5.492505252802898, 'learning_rate': 9.147999999999999e-07, 'completion_length': 61.30357360839844, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857909202576, 'reward_std': 0.12565560266375542, 'kl': 0.13671875, 'epoch': 0.09} 9%|▊ | 213/2500 [1:48:37<28:40:52, 45.15s/it] 9%|▊ | 214/2500 [1:49:28<29:42:25, 46.78s/it] {'loss': 0.0061, 'grad_norm': 2.4828080735191476, 'learning_rate': 9.144e-07, 'completion_length': 67.0714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.06613001227378845, 'kl': 0.15234375, 'epoch': 0.09} 9%|▊ | 214/2500 [1:49:28<29:42:25, 46.78s/it] 9%|▊ | 215/2500 [1:50:56<37:40:32, 59.36s/it] {'loss': 0.0051, 'grad_norm': 1.6801191120990706, 'learning_rate': 9.14e-07, 'completion_length': 67.26786041259766, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.126953125, 'epoch': 0.09} 9%|▊ | 215/2500 [1:50:57<37:40:32, 59.36s/it] 9%|▊ | 216/2500 [1:53:09<51:36:45, 81.35s/it] {'loss': 0.0052, 'grad_norm': 0.7395412518333824, 'learning_rate': 9.135999999999999e-07, 'completion_length': 69.90178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1298828125, 'epoch': 0.09} 9%|▊ | 216/2500 [1:53:09<51:36:45, 81.35s/it] 9%|▊ | 217/2500 [1:54:48<54:52:53, 86.54s/it] {'loss': 0.0041, 'grad_norm': 3.236994075693301, 'learning_rate': 9.132e-07, 'completion_length': 74.64286041259766, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.09138382598757744, 'kl': 0.102783203125, 'epoch': 0.09} 9%|▊ | 217/2500 [1:54:48<54:52:53, 86.54s/it] 9%|▊ | 218/2500 [1:56:16<55:13:55, 87.13s/it] {'loss': 0.0064, 'grad_norm': 1.3118202319071064, 'learning_rate': 9.127999999999999e-07, 'completion_length': 71.3839340209961, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.033065006136894226, 'kl': 0.16064453125, 'epoch': 0.09} 9%|▊ | 218/2500 [1:56:16<55:13:55, 87.13s/it] 9%|▉ | 219/2500 [1:56:43<43:45:13, 69.05s/it] {'loss': 0.0061, 'grad_norm': 2.203081677651229, 'learning_rate': 9.123999999999999e-07, 'completion_length': 72.84821701049805, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.15185546875, 'epoch': 0.09} 9%|▉ | 219/2500 [1:56:43<43:45:13, 69.05s/it] 9%|▉ | 220/2500 [1:57:14<36:27:45, 57.57s/it] {'loss': 0.0049, 'grad_norm': 0.8111988065844101, 'learning_rate': 9.12e-07, 'completion_length': 70.26786041259766, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.121337890625, 'epoch': 0.09} 9%|▉ | 220/2500 [1:57:14<36:27:45, 57.57s/it] 9%|▉ | 221/2500 [1:57:39<30:14:53, 47.78s/it] {'loss': 0.006, 'grad_norm': 1.0126181584581695, 'learning_rate': 9.115999999999999e-07, 'completion_length': 73.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14892578125, 'epoch': 0.09} 9%|▉ | 221/2500 [1:57:39<30:14:53, 47.78s/it] 9%|▉ | 222/2500 [1:58:03<25:49:18, 40.81s/it] {'loss': 0.0046, 'grad_norm': 1.7522055875108984, 'learning_rate': 9.112e-07, 'completion_length': 63.125003814697266, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.09138382971286774, 'kl': 0.115234375, 'epoch': 0.09} 9%|▉ | 222/2500 [1:58:03<25:49:18, 40.81s/it] 9%|▉ | 223/2500 [1:58:29<22:52:31, 36.17s/it] {'loss': 0.0062, 'grad_norm': 4.650857453993741, 'learning_rate': 9.108e-07, 'completion_length': 63.63393020629883, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780359625816, 'kl': 0.15380859375, 'epoch': 0.09} 9%|▉ | 223/2500 [1:58:29<22:52:31, 36.17s/it] 9%|▉ | 224/2500 [1:58:55<21:04:30, 33.34s/it] {'loss': 0.0048, 'grad_norm': 0.2161356969911304, 'learning_rate': 9.103999999999999e-07, 'completion_length': 68.33928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.120849609375, 'epoch': 0.09} 9%|▉ | 224/2500 [1:58:56<21:04:30, 33.34s/it] 9%|▉ | 225/2500 [1:59:19<19:15:30, 30.48s/it] {'loss': 0.0061, 'grad_norm': 5.396780966778058, 'learning_rate': 9.1e-07, 'completion_length': 64.35714721679688, 'rewards/accuracy_reward': 0.9375000298023224, 
'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.15185546875, 'epoch': 0.09} 9%|▉ | 225/2500 [1:59:19<19:15:30, 30.48s/it] 9%|▉ | 226/2500 [1:59:44<18:11:07, 28.79s/it] {'loss': 0.0075, 'grad_norm': 0.333806557445624, 'learning_rate': 9.095999999999999e-07, 'completion_length': 51.90178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.187255859375, 'epoch': 0.09} 9%|▉ | 226/2500 [1:59:44<18:11:07, 28.79s/it] 9%|▉ | 227/2500 [2:00:08<17:12:45, 27.26s/it] {'loss': 0.0061, 'grad_norm': 1.6833556593568884, 'learning_rate': 9.092e-07, 'completion_length': 55.39285850524902, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.1513671875, 'epoch': 0.09} 9%|▉ | 227/2500 [2:00:08<17:12:45, 27.26s/it] 9%|▉ | 228/2500 [2:00:34<16:54:36, 26.79s/it] {'loss': 0.0057, 'grad_norm': 0.720718326041349, 'learning_rate': 9.088e-07, 'completion_length': 56.93750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14306640625, 'epoch': 0.09} 9%|▉ | 228/2500 [2:00:34<16:54:36, 26.79s/it] 9%|▉ | 229/2500 [2:00:57<16:20:06, 25.89s/it] {'loss': 0.0085, 'grad_norm': 4.67023467312609, 'learning_rate': 9.084e-07, 'completion_length': 47.91071701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.21240234375, 'epoch': 0.09} 9%|▉ | 229/2500 [2:00:57<16:20:06, 25.89s/it] 9%|▉ | 230/2500 [2:01:25<16:35:32, 26.31s/it] {'loss': 0.0074, 'grad_norm': 2.0754276341490483, 'learning_rate': 9.08e-07, 'completion_length': 56.892860412597656, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285715222358704, 
'reward_std': 0.0835726335644722, 'kl': 0.1845703125, 'epoch': 0.09} 9%|▉ | 230/2500 [2:01:25<16:35:32, 26.31s/it] 9%|▉ | 231/2500 [2:01:50<16:22:26, 25.98s/it] {'loss': 0.0079, 'grad_norm': 0.2651795996669062, 'learning_rate': 9.075999999999999e-07, 'completion_length': 48.94643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.19677734375, 'epoch': 0.09} 9%|▉ | 231/2500 [2:01:50<16:22:26, 25.98s/it] 9%|▉ | 232/2500 [2:02:17<16:36:10, 26.35s/it] {'loss': 0.0077, 'grad_norm': 2.8435089381209995, 'learning_rate': 9.072e-07, 'completion_length': 46.35714530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.193359375, 'epoch': 0.09} 9%|▉ | 232/2500 [2:02:17<16:36:10, 26.35s/it] 9%|▉ | 233/2500 [2:02:54<18:36:43, 29.56s/it] {'loss': 0.0075, 'grad_norm': 1.6416353995949728, 'learning_rate': 9.068e-07, 'completion_length': 52.52678680419922, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780732154846, 'kl': 0.18603515625, 'epoch': 0.09} 9%|▉ | 233/2500 [2:02:54<18:36:43, 29.56s/it] 9%|▉ | 234/2500 [2:03:17<17:19:55, 27.54s/it] {'loss': 0.0083, 'grad_norm': 2.2308899290061457, 'learning_rate': 9.063999999999999e-07, 'completion_length': 43.23214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.20751953125, 'epoch': 0.09} 9%|▉ | 234/2500 [2:03:17<17:19:55, 27.54s/it] 9%|▉ | 235/2500 [2:03:40<16:33:27, 26.32s/it] {'loss': 0.0079, 'grad_norm': 2.49969023333232, 'learning_rate': 9.06e-07, 'completion_length': 42.83928680419922, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.19677734375, 
'epoch': 0.09} 9%|▉ | 235/2500 [2:03:40<16:33:27, 26.32s/it] 9%|▉ | 236/2500 [2:04:04<16:07:15, 25.63s/it] {'loss': 0.0083, 'grad_norm': 1.9869463002806818, 'learning_rate': 9.056e-07, 'completion_length': 45.66964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.20654296875, 'epoch': 0.09} 9%|▉ | 236/2500 [2:04:04<16:07:15, 25.63s/it] 9%|▉ | 237/2500 [2:04:28<15:44:54, 25.05s/it] {'loss': 0.0048, 'grad_norm': 0.1717350028856858, 'learning_rate': 9.051999999999999e-07, 'completion_length': 55.04464530944824, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.12060546875, 'epoch': 0.09} 9%|▉ | 237/2500 [2:04:28<15:44:54, 25.05s/it] 10%|▉ | 238/2500 [2:04:52<15:29:44, 24.66s/it] {'loss': 0.0065, 'grad_norm': 2.5007977737099107, 'learning_rate': 9.048e-07, 'completion_length': 50.44643211364746, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.1630859375, 'epoch': 0.1} 10%|▉ | 238/2500 [2:04:52<15:29:44, 24.66s/it] 10%|▉ | 239/2500 [2:05:18<15:46:08, 25.11s/it] {'loss': 0.005, 'grad_norm': 1.0278440656870242, 'learning_rate': 9.044e-07, 'completion_length': 60.07143020629883, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.126220703125, 'epoch': 0.1} 10%|▉ | 239/2500 [2:05:18<15:46:08, 25.11s/it] 10%|▉ | 240/2500 [2:05:48<16:40:57, 26.57s/it] {'loss': 0.0062, 'grad_norm': 0.7886822749175308, 'learning_rate': 9.039999999999999e-07, 'completion_length': 61.36607360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.15478515625, 
'epoch': 0.1} 10%|▉ | 240/2500 [2:05:48<16:40:57, 26.57s/it] 10%|▉ | 241/2500 [2:06:13<16:27:08, 26.22s/it] {'loss': 0.0048, 'grad_norm': 1.3034823049508135, 'learning_rate': 9.035999999999999e-07, 'completion_length': 57.71428871154785, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.119873046875, 'epoch': 0.1} 10%|▉ | 241/2500 [2:06:13<16:27:08, 26.22s/it] 10%|▉ | 242/2500 [2:06:37<15:59:42, 25.50s/it] {'loss': 0.0053, 'grad_norm': 6.251300059775383, 'learning_rate': 9.032e-07, 'completion_length': 63.51786231994629, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.128351628780365, 'kl': 0.133544921875, 'epoch': 0.1} 10%|▉ | 242/2500 [2:06:37<15:59:42, 25.50s/it] 10%|▉ | 243/2500 [2:07:01<15:39:50, 24.98s/it] {'loss': 0.0049, 'grad_norm': 2.3066328833807415, 'learning_rate': 9.028e-07, 'completion_length': 58.517860412597656, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.12255859375, 'epoch': 0.1} 10%|▉ | 243/2500 [2:07:01<15:39:50, 24.98s/it] 10%|▉ | 244/2500 [2:07:26<15:38:56, 24.97s/it] {'loss': 0.0053, 'grad_norm': 0.18112575367136333, 'learning_rate': 9.023999999999999e-07, 'completion_length': 61.544647216796875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1328125, 'epoch': 0.1} 10%|▉ | 244/2500 [2:07:26<15:38:56, 24.97s/it] 10%|▉ | 245/2500 [2:07:55<16:20:03, 26.08s/it] {'loss': 0.0047, 'grad_norm': 0.7508180349832413, 'learning_rate': 9.02e-07, 'completion_length': 64.73214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.053144559264183044, 'kl': 0.116455078125, 'epoch': 0.1} 10%|▉ | 245/2500 [2:07:55<16:20:03, 26.08s/it] 
10%|▉ | 246/2500 [2:08:49<21:39:39, 34.60s/it] {'loss': 0.0056, 'grad_norm': 2.914577400502199, 'learning_rate': 9.015999999999999e-07, 'completion_length': 78.94643020629883, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.883928656578064, 'reward_std': 0.1297563649713993, 'kl': 0.140869140625, 'epoch': 0.1} 10%|▉ | 246/2500 [2:08:49<21:39:39, 34.60s/it] 10%|▉ | 247/2500 [2:09:16<20:08:41, 32.19s/it] {'loss': 0.0038, 'grad_norm': 2.284407360761322, 'learning_rate': 9.011999999999999e-07, 'completion_length': 74.65178680419922, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.11146337911486626, 'kl': 0.0947265625, 'epoch': 0.1} 10%|▉ | 247/2500 [2:09:16<20:08:41, 32.19s/it] 10%|▉ | 248/2500 [2:09:43<19:09:14, 30.62s/it] {'loss': 0.005, 'grad_norm': 1.8695394639280276, 'learning_rate': 9.008e-07, 'completion_length': 68.73214721679688, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375001192092896, 'reward_std': 0.08747542649507523, 'kl': 0.1259765625, 'epoch': 0.1} 10%|▉ | 248/2500 [2:09:43<19:09:14, 30.62s/it] 10%|▉ | 249/2500 [2:10:06<17:51:55, 28.57s/it] {'loss': 0.0048, 'grad_norm': 2.250401610310281, 'learning_rate': 9.004e-07, 'completion_length': 73.28571701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.119384765625, 'epoch': 0.1} 10%|▉ | 249/2500 [2:10:06<17:51:55, 28.57s/it] 10%|█ | 250/2500 [2:10:33<17:27:19, 27.93s/it] {'loss': 0.0036, 'grad_norm': 2.6877273761780223, 'learning_rate': 9e-07, 'completion_length': 72.2589340209961, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9017857909202576, 'reward_std': 0.1418914645910263, 'kl': 0.08984375, 'epoch': 0.1} 10%|█ | 250/2500 [2:10:33<17:27:19, 27.93s/it] 
10%|█ | 251/2500 [2:11:02<17:38:26, 28.24s/it] {'loss': 0.0051, 'grad_norm': 2.615788672653061, 'learning_rate': 8.995999999999999e-07, 'completion_length': 71.41071701049805, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.07514797896146774, 'kl': 0.1259765625, 'epoch': 0.1} 10%|█ | 251/2500 [2:11:02<17:38:26, 28.24s/it] 10%|█ | 252/2500 [2:11:27<17:03:24, 27.31s/it] {'loss': 0.004, 'grad_norm': 3.085611634381712, 'learning_rate': 8.992e-07, 'completion_length': 81.73214721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.099609375, 'epoch': 0.1} 10%|█ | 252/2500 [2:11:27<17:03:24, 27.31s/it] 10%|█ | 253/2500 [2:11:53<16:46:31, 26.88s/it] {'loss': 0.0043, 'grad_norm': 0.1656059275835936, 'learning_rate': 8.988e-07, 'completion_length': 79.71429061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10791015625, 'epoch': 0.1} 10%|█ | 253/2500 [2:11:53<16:46:31, 26.88s/it] 10%|█ | 254/2500 [2:12:20<16:55:07, 27.12s/it] {'loss': 0.0034, 'grad_norm': 0.47061721597390327, 'learning_rate': 8.983999999999999e-07, 'completion_length': 88.21429061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.084228515625, 'epoch': 0.1} 10%|█ | 254/2500 [2:12:20<16:55:07, 27.12s/it] 10%|█ | 255/2500 [2:12:48<16:54:25, 27.11s/it] {'loss': 0.0049, 'grad_norm': 3.490396161351023, 'learning_rate': 8.98e-07, 'completion_length': 73.38393020629883, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.11272924393415451, 'kl': 0.12353515625, 'epoch': 0.1} 10%|█ | 255/2500 [2:12:48<16:54:25, 27.11s/it] 10%|█ | 256/2500 [2:13:15<16:55:30, 27.15s/it] {'loss': 0.0044, 
'grad_norm': 1.502268489419416, 'learning_rate': 8.975999999999999e-07, 'completion_length': 86.69643020629883, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9285714626312256, 'reward_std': 0.11272925138473511, 'kl': 0.109375, 'epoch': 0.1} 10%|█ | 256/2500 [2:13:15<16:55:30, 27.15s/it] 10%|█ | 257/2500 [2:13:39<16:19:43, 26.21s/it] {'loss': 0.0046, 'grad_norm': 2.3274389837617555, 'learning_rate': 8.972e-07, 'completion_length': 82.14286041259766, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.0964989960193634, 'kl': 0.114013671875, 'epoch': 0.1} 10%|█ | 257/2500 [2:13:39<16:19:43, 26.21s/it] 10%|█ | 258/2500 [2:14:06<16:26:48, 26.41s/it] {'loss': 0.0047, 'grad_norm': 0.6874463055969122, 'learning_rate': 8.968e-07, 'completion_length': 84.16964721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.117431640625, 'epoch': 0.1} 10%|█ | 258/2500 [2:14:06<16:26:48, 26.41s/it] 10%|█ | 259/2500 [2:14:31<16:14:57, 26.10s/it] {'loss': 0.0033, 'grad_norm': 3.9038362484108386, 'learning_rate': 8.963999999999999e-07, 'completion_length': 87.29464721679688, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343399360775948, 'kl': 0.0830078125, 'epoch': 0.1} 10%|█ | 259/2500 [2:14:31<16:14:57, 26.10s/it] 10%|█ | 260/2500 [2:14:58<16:20:32, 26.26s/it] {'loss': 0.0038, 'grad_norm': 1.5457936916109578, 'learning_rate': 8.96e-07, 'completion_length': 79.26786041259766, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.07003280520439148, 'kl': 0.095947265625, 'epoch': 0.1} 10%|█ | 260/2500 [2:14:58<16:20:32, 26.26s/it] 10%|█ | 261/2500 [2:15:22<15:57:11, 25.65s/it] {'loss': 0.0041, 
'grad_norm': 0.1297657634614775, 'learning_rate': 8.955999999999999e-07, 'completion_length': 82.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.102294921875, 'epoch': 0.1} 10%|█ | 261/2500 [2:15:22<15:57:11, 25.65s/it] 10%|█ | 262/2500 [2:15:49<16:10:51, 26.03s/it] {'loss': 0.0036, 'grad_norm': 0.13959128539590127, 'learning_rate': 8.951999999999999e-07, 'completion_length': 88.5714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.091064453125, 'epoch': 0.1} 10%|█ | 262/2500 [2:15:49<16:10:51, 26.03s/it] 11%|█ | 263/2500 [2:16:16<16:23:19, 26.37s/it] {'loss': 0.0035, 'grad_norm': 0.7170021055037441, 'learning_rate': 8.948e-07, 'completion_length': 82.51786041259766, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.087158203125, 'epoch': 0.11} 11%|█ | 263/2500 [2:16:16<16:23:19, 26.37s/it] 11%|█ | 264/2500 [2:16:44<16:43:08, 26.92s/it] {'loss': 0.0039, 'grad_norm': 1.017193588440286, 'learning_rate': 8.944e-07, 'completion_length': 87.54464721679688, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.096435546875, 'epoch': 0.11} 11%|█ | 264/2500 [2:16:44<16:43:08, 26.92s/it] 11%|█ | 265/2500 [2:17:10<16:34:23, 26.70s/it] {'loss': 0.0037, 'grad_norm': 0.5711077684912036, 'learning_rate': 8.939999999999999e-07, 'completion_length': 87.51786422729492, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.093505859375, 'epoch': 0.11} 11%|█ | 265/2500 [2:17:10<16:34:23, 26.70s/it] 11%|█ | 266/2500 [2:17:35<16:12:34, 26.12s/it] {'loss': 0.0038, 'grad_norm': 0.946938599650837, 'learning_rate': 8.935999999999999e-07, 
'completion_length': 81.91071701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.09375, 'epoch': 0.11} 11%|█ | 266/2500 [2:17:35<16:12:34, 26.12s/it] 11%|█ | 267/2500 [2:18:11<18:05:49, 29.18s/it] {'loss': 0.0051, 'grad_norm': 0.09455005405482554, 'learning_rate': 8.932e-07, 'completion_length': 73.95536041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12744140625, 'epoch': 0.11} 11%|█ | 267/2500 [2:18:12<18:05:49, 29.18s/it] 11%|█ | 268/2500 [2:18:37<17:28:37, 28.19s/it] {'loss': 0.0037, 'grad_norm': 0.10881814718049021, 'learning_rate': 8.928e-07, 'completion_length': 76.12500381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.093017578125, 'epoch': 0.11} 11%|█ | 268/2500 [2:18:37<17:28:37, 28.19s/it] 11%|█ | 269/2500 [2:19:02<16:52:47, 27.24s/it] {'loss': 0.0054, 'grad_norm': 0.8141317651543418, 'learning_rate': 8.923999999999999e-07, 'completion_length': 77.65179061889648, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.13427734375, 'epoch': 0.11} 11%|█ | 269/2500 [2:19:02<16:52:47, 27.24s/it] 11%|█ | 270/2500 [2:19:28<16:35:25, 26.78s/it] {'loss': 0.0045, 'grad_norm': 3.70966084350633, 'learning_rate': 8.92e-07, 'completion_length': 72.95536041259766, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.06222161650657654, 'kl': 0.112548828125, 'epoch': 0.11} 11%|█ | 270/2500 [2:19:28<16:35:25, 26.78s/it] 11%|█ | 271/2500 [2:19:53<16:14:21, 26.23s/it] {'loss': 0.0041, 'grad_norm': 1.3461070865127895, 'learning_rate': 8.915999999999999e-07, 'completion_length': 74.08929061889648, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 
'reward': 1.9553572535514832, 'reward_std': 0.07003280520439148, 'kl': 0.103515625, 'epoch': 0.11} 11%|█ | 271/2500 [2:19:53<16:14:21, 26.23s/it] 11%|█ | 272/2500 [2:20:18<15:58:32, 25.81s/it] {'loss': 0.0053, 'grad_norm': 1.1626605184139664, 'learning_rate': 8.911999999999999e-07, 'completion_length': 59.812503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.132080078125, 'epoch': 0.11} 11%|█ | 272/2500 [2:20:18<15:58:32, 25.81s/it] 11%|█ | 273/2500 [2:20:42<15:42:45, 25.40s/it] {'loss': 0.0046, 'grad_norm': 0.7534953547138166, 'learning_rate': 8.908e-07, 'completion_length': 66.79464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.115966796875, 'epoch': 0.11} 11%|█ | 273/2500 [2:20:42<15:42:45, 25.40s/it] 11%|█ | 274/2500 [2:21:07<15:31:01, 25.09s/it] {'loss': 0.0045, 'grad_norm': 1.182829615571783, 'learning_rate': 8.904e-07, 'completion_length': 64.66964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.112060546875, 'epoch': 0.11} 11%|█ | 274/2500 [2:21:07<15:31:01, 25.09s/it] 11%|█ | 275/2500 [2:21:34<15:50:31, 25.63s/it] {'loss': 0.0048, 'grad_norm': 4.748410376185373, 'learning_rate': 8.9e-07, 'completion_length': 63.892860412597656, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.12686798721551895, 'kl': 0.120361328125, 'epoch': 0.11} 11%|█ | 275/2500 [2:21:34<15:50:31, 25.63s/it] 11%|█ | 276/2500 [2:21:58<15:40:10, 25.36s/it] {'loss': 0.0056, 'grad_norm': 2.0189912616283925, 'learning_rate': 8.895999999999999e-07, 'completion_length': 64.48214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 
0.025253813713788986, 'kl': 0.13916015625, 'epoch': 0.11} 11%|█ | 276/2500 [2:21:58<15:40:10, 25.36s/it] 11%|█ | 277/2500 [2:22:25<15:56:38, 25.82s/it] {'loss': 0.0048, 'grad_norm': 0.7247549562342968, 'learning_rate': 8.892e-07, 'completion_length': 77.6160774230957, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.11663764715194702, 'kl': 0.1201171875, 'epoch': 0.11} 11%|█ | 277/2500 [2:22:25<15:56:38, 25.82s/it] 11%|█ | 278/2500 [2:22:52<16:10:25, 26.20s/it] {'loss': 0.0045, 'grad_norm': 4.26082029571387, 'learning_rate': 8.888e-07, 'completion_length': 65.01785850524902, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285715222358704, 'reward_std': 0.11272924020886421, 'kl': 0.11328125, 'epoch': 0.11} 11%|█ | 278/2500 [2:22:52<16:10:25, 26.20s/it] 11%|█ | 279/2500 [2:23:17<15:53:10, 25.75s/it] {'loss': 0.0066, 'grad_norm': 0.23768178639901757, 'learning_rate': 8.883999999999999e-07, 'completion_length': 64.46428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1650390625, 'epoch': 0.11} 11%|█ | 279/2500 [2:23:17<15:53:10, 25.75s/it] 11%|█ | 280/2500 [2:23:41<15:36:26, 25.31s/it] {'loss': 0.0048, 'grad_norm': 1.2585065295434206, 'learning_rate': 8.88e-07, 'completion_length': 61.29464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.120361328125, 'epoch': 0.11} 11%|█ | 280/2500 [2:23:41<15:36:26, 25.31s/it] 11%|█ | 281/2500 [2:24:07<15:44:25, 25.54s/it] {'loss': 0.0045, 'grad_norm': 3.583144376896293, 'learning_rate': 8.875999999999999e-07, 'completion_length': 64.83036041259766, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.10882644355297089, 'kl': 
0.1123046875, 'epoch': 0.11} 11%|█ | 281/2500 [2:24:07<15:44:25, 25.54s/it] 11%|█▏ | 282/2500 [2:24:31<15:18:07, 24.84s/it] {'loss': 0.0078, 'grad_norm': 1.3024627714420967, 'learning_rate': 8.872e-07, 'completion_length': 51.58035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1953125, 'epoch': 0.11} 11%|█▏ | 282/2500 [2:24:31<15:18:07, 24.84s/it] 11%|█▏ | 283/2500 [2:24:57<15:35:25, 25.32s/it] {'loss': 0.0047, 'grad_norm': 0.4826465805552152, 'learning_rate': 8.868e-07, 'completion_length': 63.64285850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.118408203125, 'epoch': 0.11} 11%|█▏ | 283/2500 [2:24:57<15:35:25, 25.32s/it] 11%|█▏ | 284/2500 [2:25:21<15:21:45, 24.96s/it] {'loss': 0.0071, 'grad_norm': 0.28274084816175094, 'learning_rate': 8.863999999999999e-07, 'completion_length': 58.482147216796875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.17822265625, 'epoch': 0.11} 11%|█▏ | 284/2500 [2:25:21<15:21:45, 24.96s/it] 11%|█▏ | 285/2500 [2:25:45<15:14:11, 24.76s/it] {'loss': 0.0068, 'grad_norm': 0.1983025815596479, 'learning_rate': 8.86e-07, 'completion_length': 57.812503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1708984375, 'epoch': 0.11} 11%|█▏ | 285/2500 [2:25:45<15:14:11, 24.76s/it] 11%|█▏ | 286/2500 [2:26:08<14:49:09, 24.10s/it] {'loss': 0.0065, 'grad_norm': 2.7161593375224324, 'learning_rate': 8.856e-07, 'completion_length': 58.28571701049805, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.08747543022036552, 'kl': 0.16357421875, 'epoch': 0.11} 11%|█▏ | 286/2500 [2:26:08<14:49:09, 24.10s/it] 11%|█▏ | 287/2500 
[2:26:31<14:32:24, 23.65s/it] {'loss': 0.0047, 'grad_norm': 1.1644463158969802, 'learning_rate': 8.851999999999999e-07, 'completion_length': 64.94643211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11669921875, 'epoch': 0.11} 11%|█▏ | 287/2500 [2:26:31<14:32:24, 23.65s/it] 12%|█▏ | 288/2500 [2:26:57<15:03:58, 24.52s/it] {'loss': 0.0057, 'grad_norm': 2.791982417916001, 'learning_rate': 8.848e-07, 'completion_length': 72.52678680419922, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.05050762742757797, 'kl': 0.14208984375, 'epoch': 0.12} 12%|█▏ | 288/2500 [2:26:57<15:03:58, 24.52s/it] 12%|█▏ | 289/2500 [2:27:22<15:02:04, 24.48s/it] {'loss': 0.0062, 'grad_norm': 1.0269730163941369, 'learning_rate': 8.844e-07, 'completion_length': 71.73214721679688, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.15576171875, 'epoch': 0.12} 12%|█▏ | 289/2500 [2:27:22<15:02:04, 24.48s/it] 12%|█▏ | 290/2500 [2:27:46<15:00:44, 24.45s/it] {'loss': 0.0056, 'grad_norm': 0.13904997891833676, 'learning_rate': 8.839999999999999e-07, 'completion_length': 67.03571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13916015625, 'epoch': 0.12} 12%|█▏ | 290/2500 [2:27:46<15:00:44, 24.45s/it] 12%|█▏ | 291/2500 [2:28:10<14:56:28, 24.35s/it] {'loss': 0.0046, 'grad_norm': 0.12971767255404043, 'learning_rate': 8.836e-07, 'completion_length': 66.9375, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114501953125, 'epoch': 0.12} 12%|█▏ | 291/2500 [2:28:10<14:56:28, 24.35s/it] 12%|█▏ | 292/2500 [2:28:34<14:56:31, 24.36s/it] {'loss': 0.0035, 'grad_norm': 0.5871174648336142, 'learning_rate': 8.832e-07, 
'completion_length': 73.62500381469727, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.087158203125, 'epoch': 0.12} 12%|█▏ | 292/2500 [2:28:34<14:56:31, 24.36s/it] 12%|█▏ | 293/2500 [2:29:01<15:22:33, 25.08s/it] {'loss': 0.0042, 'grad_norm': 0.1758632540842837, 'learning_rate': 8.827999999999999e-07, 'completion_length': 73.21428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.106201171875, 'epoch': 0.12} 12%|█▏ | 293/2500 [2:29:01<15:22:33, 25.08s/it] 12%|█▏ | 294/2500 [2:29:26<15:13:59, 24.86s/it] {'loss': 0.0041, 'grad_norm': 2.0244095147566803, 'learning_rate': 8.823999999999999e-07, 'completion_length': 79.91071701049805, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343399360775948, 'kl': 0.10205078125, 'epoch': 0.12} 12%|█▏ | 294/2500 [2:29:26<15:13:59, 24.86s/it] 12%|█▏ | 295/2500 [2:29:51<15:20:40, 25.05s/it] {'loss': 0.0034, 'grad_norm': 0.6365049736455792, 'learning_rate': 8.82e-07, 'completion_length': 80.48214721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.086181640625, 'epoch': 0.12} 12%|█▏ | 295/2500 [2:29:51<15:20:40, 25.05s/it] 12%|█▏ | 296/2500 [2:30:15<15:10:40, 24.79s/it] {'loss': 0.004, 'grad_norm': 0.4076399460980237, 'learning_rate': 8.816000000000001e-07, 'completion_length': 71.26786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10009765625, 'epoch': 0.12} 12%|█▏ | 296/2500 [2:30:15<15:10:40, 24.79s/it] 12%|█▏ | 297/2500 [2:30:39<14:54:40, 24.37s/it] {'loss': 0.0041, 'grad_norm': 1.9749954747448086, 'learning_rate': 8.811999999999999e-07, 'completion_length': 69.80357360839844, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.10205078125, 'epoch': 0.12} 12%|█▏ | 297/2500 [2:30:39<14:54:40, 24.37s/it] 12%|█▏ | 298/2500 [2:31:04<15:06:02, 24.69s/it] {'loss': 0.0046, 'grad_norm': 1.2130804553192425, 'learning_rate': 8.808e-07, 'completion_length': 79.91964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.11474609375, 'epoch': 0.12} 12%|█▏ | 298/2500 [2:31:04<15:06:02, 24.69s/it] 12%|█▏ | 299/2500 [2:31:28<15:01:33, 24.58s/it] {'loss': 0.0042, 'grad_norm': 1.2036979699301955, 'learning_rate': 8.804e-07, 'completion_length': 82.2410774230957, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.105224609375, 'epoch': 0.12} 12%|█▏ | 299/2500 [2:31:28<15:01:33, 24.58s/it] 12%|█▏ | 300/2500 [2:31:54<15:09:59, 24.82s/it] {'loss': 0.0048, 'grad_norm': 1.8376396771859653, 'learning_rate': 8.799999999999999e-07, 'completion_length': 72.98214530944824, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0739355981349945, 'kl': 0.119384765625, 'epoch': 0.12} 12%|█▏ | 300/2500 [2:31:54<15:09:59, 24.82s/it] 12%|█▏ | 301/2500 [2:32:53<21:32:14, 35.26s/it] {'loss': 0.0037, 'grad_norm': 1.3939719031438997, 'learning_rate': 8.796e-07, 'completion_length': 79.38393020629883, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.10040179640054703, 'kl': 0.09326171875, 'epoch': 0.12} 12%|█▏ | 301/2500 [2:32:53<21:32:14, 35.26s/it] 12%|█▏ | 302/2500 [2:33:14<18:54:17, 30.96s/it] {'loss': 0.0047, 'grad_norm': 1.7434721320196869, 'learning_rate': 8.792e-07, 'completion_length': 78.18750381469727, 
'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.06613001227378845, 'kl': 0.11767578125, 'epoch': 0.12} 12%|█▏ | 302/2500 [2:33:14<18:54:17, 30.96s/it] 12%|█▏ | 303/2500 [2:33:42<18:18:00, 29.99s/it] {'loss': 0.0054, 'grad_norm': 1.0812841487651026, 'learning_rate': 8.788e-07, 'completion_length': 80.48214721679688, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.11394162103533745, 'kl': 0.135498046875, 'epoch': 0.12} 12%|█▏ | 303/2500 [2:33:42<18:18:00, 29.99s/it] 12%|█▏ | 304/2500 [2:34:23<20:21:49, 33.38s/it] {'loss': 0.0057, 'grad_norm': 1.9163012850642072, 'learning_rate': 8.783999999999999e-07, 'completion_length': 73.20536041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1416015625, 'epoch': 0.12} 12%|█▏ | 304/2500 [2:34:23<20:21:49, 33.38s/it] 12%|█▏ | 305/2500 [2:35:04<21:36:26, 35.44s/it] {'loss': 0.0046, 'grad_norm': 3.7079785467807236, 'learning_rate': 8.78e-07, 'completion_length': 62.294647216796875, 'rewards/accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.1638357900083065, 'kl': 0.115478515625, 'epoch': 0.12} 12%|█▏ | 305/2500 [2:35:04<21:36:26, 35.44s/it] 12%|█▏ | 306/2500 [2:36:11<27:25:30, 45.00s/it] {'loss': 0.0052, 'grad_norm': 0.24112964301356624, 'learning_rate': 8.776e-07, 'completion_length': 75.46429061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.129150390625, 'epoch': 0.12} 12%|█▏ | 306/2500 [2:36:11<27:25:30, 45.00s/it] 12%|█▏ | 307/2500 [2:37:41<35:37:35, 58.48s/it] {'loss': 0.0047, 'grad_norm': 2.2942210207741764, 'learning_rate': 8.771999999999999e-07, 'completion_length': 66.69643020629883, 'rewards/accuracy_reward': 
0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.08747543022036552, 'kl': 0.116455078125, 'epoch': 0.12} 12%|█▏ | 307/2500 [2:37:41<35:37:35, 58.48s/it] 12%|█▏ | 308/2500 [2:39:08<40:50:38, 67.08s/it] {'loss': 0.0058, 'grad_norm': 1.0704044683200122, 'learning_rate': 8.768e-07, 'completion_length': 59.32143211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.14404296875, 'epoch': 0.12} 12%|█▏ | 308/2500 [2:39:08<40:50:38, 67.08s/it] 12%|█▏ | 309/2500 [2:39:34<33:20:24, 54.78s/it] {'loss': 0.0044, 'grad_norm': 1.8555930625250947, 'learning_rate': 8.763999999999999e-07, 'completion_length': 68.43750381469727, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.109130859375, 'epoch': 0.12} 12%|█▏ | 309/2500 [2:39:34<33:20:24, 54.78s/it] 12%|█▏ | 310/2500 [2:40:01<28:10:58, 46.33s/it] {'loss': 0.0051, 'grad_norm': 0.21375355464721119, 'learning_rate': 8.76e-07, 'completion_length': 66.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.126708984375, 'epoch': 0.12} 12%|█▏ | 310/2500 [2:40:01<28:10:58, 46.33s/it] 12%|█▏ | 311/2500 [2:41:21<34:24:58, 56.60s/it] {'loss': 0.0035, 'grad_norm': 1.4427660136442064, 'learning_rate': 8.756e-07, 'completion_length': 86.95536422729492, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.10882644355297089, 'kl': 0.088623046875, 'epoch': 0.12} 12%|█▏ | 311/2500 [2:41:21<34:24:58, 56.60s/it] 12%|█▏ | 312/2500 [2:41:47<28:49:46, 47.43s/it] {'loss': 0.0047, 'grad_norm': 1.4441035043108348, 'learning_rate': 8.751999999999999e-07, 'completion_length': 78.41964340209961, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.116455078125, 'epoch': 0.12} 12%|█▏ | 312/2500 [2:41:47<28:49:46, 47.43s/it] 13%|█▎ | 313/2500 [2:42:12<24:44:37, 40.73s/it] {'loss': 0.0035, 'grad_norm': 0.9034820398720546, 'learning_rate': 8.748e-07, 'completion_length': 75.21428680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.06343398988246918, 'kl': 0.088134765625, 'epoch': 0.13} 13%|█▎ | 313/2500 [2:42:12<24:44:37, 40.73s/it] 13%|█▎ | 314/2500 [2:42:41<22:30:22, 37.06s/it] {'loss': 0.0047, 'grad_norm': 1.1797682114062453, 'learning_rate': 8.743999999999999e-07, 'completion_length': 67.85714530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.118408203125, 'epoch': 0.13} 13%|█▎ | 314/2500 [2:42:41<22:30:22, 37.06s/it] 13%|█▎ | 315/2500 [2:43:06<20:21:40, 33.55s/it] {'loss': 0.0049, 'grad_norm': 1.2237538834815642, 'learning_rate': 8.739999999999999e-07, 'completion_length': 67.40179061889648, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.06222161278128624, 'kl': 0.121826171875, 'epoch': 0.13} 13%|█▎ | 315/2500 [2:43:06<20:21:40, 33.55s/it] 13%|█▎ | 316/2500 [2:43:39<20:18:03, 33.46s/it] {'loss': 0.0044, 'grad_norm': 2.5009442374487794, 'learning_rate': 8.736e-07, 'completion_length': 75.38393020629883, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9017857909202576, 'reward_std': 0.19234000146389008, 'kl': 0.10986328125, 'epoch': 0.13} 13%|█▎ | 316/2500 [2:43:39<20:18:03, 33.46s/it] 13%|█▎ | 317/2500 [2:44:34<24:07:35, 39.79s/it] {'loss': 0.0045, 'grad_norm': 0.8560687399798217, 'learning_rate': 8.732e-07, 'completion_length': 72.27678680419922, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11279296875, 'epoch': 0.13} 13%|█▎ | 317/2500 [2:44:34<24:07:35, 39.79s/it] 13%|█▎ | 318/2500 [2:45:01<21:47:26, 35.95s/it] {'loss': 0.0043, 'grad_norm': 1.027095399453207, 'learning_rate': 8.728e-07, 'completion_length': 76.03571701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.10791015625, 'epoch': 0.13} 13%|█▎ | 318/2500 [2:45:01<21:47:26, 35.95s/it] 13%|█▎ | 319/2500 [2:45:26<19:47:04, 32.66s/it] {'loss': 0.0052, 'grad_norm': 1.9263612217598405, 'learning_rate': 8.723999999999999e-07, 'completion_length': 65.56250190734863, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.12939453125, 'epoch': 0.13} 13%|█▎ | 319/2500 [2:45:26<19:47:04, 32.66s/it] 13%|█▎ | 320/2500 [2:45:50<18:15:12, 30.14s/it] {'loss': 0.0039, 'grad_norm': 0.13684794168502057, 'learning_rate': 8.72e-07, 'completion_length': 64.65178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09814453125, 'epoch': 0.13} 13%|█▎ | 320/2500 [2:45:50<18:15:12, 30.14s/it] 13%|█▎ | 321/2500 [2:46:16<17:26:10, 28.81s/it] {'loss': 0.0041, 'grad_norm': 1.8029464404092486, 'learning_rate': 8.716e-07, 'completion_length': 69.29464721679688, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.1025390625, 'epoch': 0.13} 13%|█▎ | 321/2500 [2:46:16<17:26:10, 28.81s/it] 13%|█▎ | 322/2500 [2:46:43<17:11:20, 28.41s/it] {'loss': 0.004, 'grad_norm': 2.5922143694524573, 'learning_rate': 8.711999999999999e-07, 'completion_length': 78.50000381469727, 'rewards/accuracy_reward': 0.892857164144516, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.8750000596046448, 'reward_std': 0.18276764452457428, 'kl': 0.10107421875, 'epoch': 0.13} 13%|█▎ | 322/2500 [2:46:43<17:11:20, 28.41s/it] 13%|█▎ | 323/2500 [2:47:07<16:16:46, 26.92s/it] {'loss': 0.0038, 'grad_norm': 1.409376532496402, 'learning_rate': 8.708e-07, 'completion_length': 68.43750381469727, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.093994140625, 'epoch': 0.13} 13%|█▎ | 323/2500 [2:47:07<16:16:46, 26.92s/it] 13%|█▎ | 324/2500 [2:47:31<15:50:09, 26.20s/it] {'loss': 0.0048, 'grad_norm': 0.5659677422678797, 'learning_rate': 8.704e-07, 'completion_length': 65.65178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.120361328125, 'epoch': 0.13} 13%|█▎ | 324/2500 [2:47:31<15:50:09, 26.20s/it] 13%|█▎ | 325/2500 [2:47:56<15:35:41, 25.81s/it] {'loss': 0.0063, 'grad_norm': 0.19929332837100577, 'learning_rate': 8.699999999999999e-07, 'completion_length': 64.43750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15869140625, 'epoch': 0.13} 13%|█▎ | 325/2500 [2:47:56<15:35:41, 25.81s/it] 13%|█▎ | 326/2500 [2:48:46<19:55:03, 32.98s/it] {'loss': 0.0048, 'grad_norm': 0.558032762956363, 'learning_rate': 8.696e-07, 'completion_length': 68.46429061889648, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.119384765625, 'epoch': 0.13} 13%|█▎ | 326/2500 [2:48:46<19:55:03, 32.98s/it] 13%|█▎ | 327/2500 [2:49:12<18:43:52, 31.03s/it] {'loss': 0.0042, 'grad_norm': 1.7845020223697259, 'learning_rate': 8.692e-07, 'completion_length': 72.44643211364746, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 
1.9464285969734192, 'reward_std': 0.08868780732154846, 'kl': 0.105224609375, 'epoch': 0.13} 13%|█▎ | 327/2500 [2:49:12<18:43:52, 31.03s/it] 13%|█▎ | 328/2500 [2:49:38<17:47:51, 29.50s/it] {'loss': 0.0054, 'grad_norm': 2.3014899191856597, 'learning_rate': 8.687999999999999e-07, 'completion_length': 61.90178680419922, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.135009765625, 'epoch': 0.13} 13%|█▎ | 328/2500 [2:49:38<17:47:51, 29.50s/it] 13%|█▎ | 329/2500 [2:50:03<16:57:49, 28.13s/it] {'loss': 0.004, 'grad_norm': 0.9435664302943666, 'learning_rate': 8.683999999999999e-07, 'completion_length': 68.26786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.10107421875, 'epoch': 0.13} 13%|█▎ | 329/2500 [2:50:03<16:57:49, 28.13s/it] 13%|█▎ | 330/2500 [2:50:28<16:21:25, 27.14s/it] {'loss': 0.0043, 'grad_norm': 0.7237036879967828, 'learning_rate': 8.68e-07, 'completion_length': 74.12500381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.107177734375, 'epoch': 0.13} 13%|█▎ | 330/2500 [2:50:28<16:21:25, 27.14s/it] 13%|█▎ | 331/2500 [2:50:53<15:58:22, 26.51s/it] {'loss': 0.0048, 'grad_norm': 0.7271205803690467, 'learning_rate': 8.676e-07, 'completion_length': 65.33036041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1201171875, 'epoch': 0.13} 13%|█▎ | 331/2500 [2:50:53<15:58:22, 26.51s/it] 13%|█▎ | 332/2500 [2:51:17<15:29:59, 25.74s/it] {'loss': 0.0038, 'grad_norm': 2.196560675625549, 'learning_rate': 8.671999999999999e-07, 'completion_length': 62.38393020629883, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 
1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.09375, 'epoch': 0.13} 13%|█▎ | 332/2500 [2:51:17<15:29:59, 25.74s/it] 13%|█▎ | 333/2500 [2:51:45<15:49:33, 26.29s/it] {'loss': 0.0043, 'grad_norm': 1.7786209641893005, 'learning_rate': 8.668e-07, 'completion_length': 67.55357360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.12626907229423523, 'kl': 0.107666015625, 'epoch': 0.13} 13%|█▎ | 333/2500 [2:51:45<15:49:33, 26.29s/it] 13%|█▎ | 334/2500 [2:52:07<15:10:47, 25.23s/it] {'loss': 0.0035, 'grad_norm': 1.3098142832795525, 'learning_rate': 8.663999999999999e-07, 'completion_length': 70.8839340209961, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9017857909202576, 'reward_std': 0.053144559264183044, 'kl': 0.088623046875, 'epoch': 0.13} 13%|█▎ | 334/2500 [2:52:07<15:10:47, 25.23s/it] 13%|█▎ | 335/2500 [2:52:33<15:15:48, 25.38s/it] {'loss': 0.0057, 'grad_norm': 0.6544193350510685, 'learning_rate': 8.659999999999999e-07, 'completion_length': 64.99107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1416015625, 'epoch': 0.13} 13%|█▎ | 335/2500 [2:52:33<15:15:48, 25.38s/it] 13%|█▎ | 336/2500 [2:53:01<15:43:40, 26.16s/it] {'loss': 0.0039, 'grad_norm': 0.4878903615254858, 'learning_rate': 8.656e-07, 'completion_length': 77.52678680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.097900390625, 'epoch': 0.13} 13%|█▎ | 336/2500 [2:53:01<15:43:40, 26.16s/it] 13%|█▎ | 337/2500 [2:53:25<15:16:06, 25.41s/it] {'loss': 0.0038, 'grad_norm': 2.6622221097463172, 'learning_rate': 8.651999999999999e-07, 'completion_length': 67.10714721679688, 'rewards/accuracy_reward': 0.9196429252624512, 
'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.09138382598757744, 'kl': 0.095458984375, 'epoch': 0.13} 13%|█▎ | 337/2500 [2:53:25<15:16:06, 25.41s/it] 14%|█▎ | 338/2500 [2:53:49<14:59:12, 24.95s/it] {'loss': 0.0046, 'grad_norm': 3.83683617741428, 'learning_rate': 8.648e-07, 'completion_length': 69.25000381469727, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.11394162476062775, 'kl': 0.115234375, 'epoch': 0.14} 14%|█▎ | 338/2500 [2:53:49<14:59:12, 24.95s/it] 14%|█▎ | 339/2500 [2:54:13<14:54:59, 24.85s/it] {'loss': 0.0047, 'grad_norm': 1.1350663980319378, 'learning_rate': 8.643999999999999e-07, 'completion_length': 68.68750381469727, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.117919921875, 'epoch': 0.14} 14%|█▎ | 339/2500 [2:54:13<14:54:59, 24.85s/it] 14%|█▎ | 340/2500 [2:54:39<15:03:56, 25.11s/it] {'loss': 0.0045, 'grad_norm': 0.9561278875189725, 'learning_rate': 8.639999999999999e-07, 'completion_length': 71.73214340209961, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.111572265625, 'epoch': 0.14} 14%|█▎ | 340/2500 [2:54:39<15:03:56, 25.11s/it] 14%|█▎ | 341/2500 [2:55:05<15:08:31, 25.25s/it] {'loss': 0.0037, 'grad_norm': 0.8628605748729781, 'learning_rate': 8.636e-07, 'completion_length': 77.25000381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.093017578125, 'epoch': 0.14} 14%|█▎ | 341/2500 [2:55:05<15:08:31, 25.25s/it] 14%|█▎ | 342/2500 [2:55:36<16:10:36, 26.99s/it] {'loss': 0.0055, 'grad_norm': 0.6793512443044447, 'learning_rate': 8.632e-07, 'completion_length': 60.66964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13720703125, 'epoch': 0.14} 14%|█▎ | 342/2500 [2:55:36<16:10:36, 26.99s/it] 14%|█▎ | 343/2500 [2:56:26<20:27:09, 34.14s/it] {'loss': 0.0046, 'grad_norm': 0.40843929704587056, 'learning_rate': 8.628e-07, 'completion_length': 70.75000381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.114013671875, 'epoch': 0.14} 14%|█▎ | 343/2500 [2:56:27<20:27:09, 34.14s/it] 14%|█▍ | 344/2500 [2:56:51<18:39:19, 31.15s/it] {'loss': 0.0043, 'grad_norm': 4.882441519957665, 'learning_rate': 8.624e-07, 'completion_length': 73.58036041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.107421875, 'epoch': 0.14} 14%|█▍ | 344/2500 [2:56:51<18:39:19, 31.15s/it] 14%|█▍ | 345/2500 [2:57:14<17:09:24, 28.66s/it] {'loss': 0.0047, 'grad_norm': 2.7957859864876813, 'learning_rate': 8.62e-07, 'completion_length': 57.33928871154785, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343398988246918, 'kl': 0.1171875, 'epoch': 0.14} 14%|█▍ | 345/2500 [2:57:14<17:09:24, 28.66s/it] 14%|█▍ | 346/2500 [2:57:38<16:23:28, 27.39s/it] {'loss': 0.0045, 'grad_norm': 6.399243825176041, 'learning_rate': 8.616e-07, 'completion_length': 61.812503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.113525390625, 'epoch': 0.14} 14%|█▍ | 346/2500 [2:57:38<16:23:28, 27.39s/it] 14%|█▍ | 347/2500 [2:58:02<15:49:39, 26.46s/it] {'loss': 0.0042, 'grad_norm': 0.19537065778603868, 'learning_rate': 8.611999999999999e-07, 'completion_length': 70.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 
'reward': 2.0, 'reward_std': 0.0, 'kl': 0.105712890625, 'epoch': 0.14} 14%|█▍ | 347/2500 [2:58:02<15:49:39, 26.46s/it] 14%|█▍ | 348/2500 [2:58:27<15:32:52, 26.01s/it] {'loss': 0.0039, 'grad_norm': 0.24458518943156482, 'learning_rate': 8.608e-07, 'completion_length': 70.6339340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09716796875, 'epoch': 0.14} 14%|█▍ | 348/2500 [2:58:27<15:32:52, 26.01s/it] 14%|█▍ | 349/2500 [2:58:53<15:28:02, 25.89s/it] {'loss': 0.004, 'grad_norm': 0.1681010468261039, 'learning_rate': 8.604000000000001e-07, 'completion_length': 69.60714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.099365234375, 'epoch': 0.14} 14%|█▍ | 349/2500 [2:58:53<15:28:02, 25.89s/it] 14%|█▍ | 350/2500 [2:59:15<14:52:21, 24.90s/it] {'loss': 0.0043, 'grad_norm': 1.497083119656203, 'learning_rate': 8.599999999999999e-07, 'completion_length': 63.55357360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343398988246918, 'kl': 0.10693359375, 'epoch': 0.14} 14%|█▍ | 350/2500 [2:59:15<14:52:21, 24.90s/it] 14%|█▍ | 351/2500 [2:59:44<15:30:09, 25.97s/it] {'loss': 0.0038, 'grad_norm': 1.6856766694668854, 'learning_rate': 8.596e-07, 'completion_length': 70.49107360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.07576143741607666, 'kl': 0.095947265625, 'epoch': 0.14} 14%|█▍ | 351/2500 [2:59:44<15:30:09, 25.97s/it] 14%|█▍ | 352/2500 [3:00:08<15:09:36, 25.41s/it] {'loss': 0.0033, 'grad_norm': 1.7367946445969893, 'learning_rate': 8.592e-07, 'completion_length': 64.21428680419922, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.06222161278128624, 'kl': 0.08349609375, 'epoch': 0.14} 14%|█▍ | 352/2500 
[3:00:08<15:09:36, 25.41s/it] 14%|█▍ | 353/2500 [3:00:35<15:28:24, 25.95s/it] {'loss': 0.0054, 'grad_norm': 1.177883418385441, 'learning_rate': 8.587999999999999e-07, 'completion_length': 65.52679061889648, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1357421875, 'epoch': 0.14} 14%|█▍ | 353/2500 [3:00:35<15:28:24, 25.95s/it] 14%|█▍ | 354/2500 [3:01:01<15:24:12, 25.84s/it] {'loss': 0.0056, 'grad_norm': 1.6107294093119005, 'learning_rate': 8.584e-07, 'completion_length': 62.33928871154785, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.14013671875, 'epoch': 0.14} 14%|█▍ | 354/2500 [3:01:01<15:24:12, 25.84s/it] 14%|█▍ | 355/2500 [3:01:24<14:56:31, 25.08s/it] {'loss': 0.0043, 'grad_norm': 0.23873109614561924, 'learning_rate': 8.58e-07, 'completion_length': 64.26785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.107666015625, 'epoch': 0.14} 14%|█▍ | 355/2500 [3:01:24<14:56:31, 25.08s/it] 14%|█▍ | 356/2500 [3:01:49<14:51:12, 24.94s/it] {'loss': 0.0044, 'grad_norm': 1.9582917601365866, 'learning_rate': 8.576e-07, 'completion_length': 63.875003814697266, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.111083984375, 'epoch': 0.14} 14%|█▍ | 356/2500 [3:01:49<14:51:12, 24.94s/it] 14%|█▍ | 357/2500 [3:02:13<14:41:53, 24.69s/it] {'loss': 0.0043, 'grad_norm': 0.8950939103057329, 'learning_rate': 8.571999999999999e-07, 'completion_length': 60.080360412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1083984375, 'epoch': 0.14} 14%|█▍ | 357/2500 [3:02:13<14:41:53, 24.69s/it] 14%|█▍ | 358/2500 
[3:02:39<14:53:18, 25.02s/it] {'loss': 0.0049, 'grad_norm': 1.3393416760913466, 'learning_rate': 8.568e-07, 'completion_length': 55.589290618896484, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.12353515625, 'epoch': 0.14} 14%|█▍ | 358/2500 [3:02:39<14:53:18, 25.02s/it] 14%|█▍ | 359/2500 [3:03:03<14:44:45, 24.79s/it] {'loss': 0.0042, 'grad_norm': 0.2235389504298132, 'learning_rate': 8.564e-07, 'completion_length': 56.44643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10595703125, 'epoch': 0.14} 14%|█▍ | 359/2500 [3:03:03<14:44:45, 24.79s/it] 14%|█▍ | 360/2500 [3:03:27<14:32:21, 24.46s/it] {'loss': 0.004, 'grad_norm': 1.1778733320461467, 'learning_rate': 8.559999999999999e-07, 'completion_length': 60.69643020629883, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9017857313156128, 'reward_std': 0.03696779906749725, 'kl': 0.100830078125, 'epoch': 0.14} 14%|█▍ | 360/2500 [3:03:27<14:32:21, 24.46s/it] 14%|█▍ | 361/2500 [3:03:54<15:00:03, 25.25s/it] {'loss': 0.0046, 'grad_norm': 1.91242903075621, 'learning_rate': 8.556e-07, 'completion_length': 58.187503814697266, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.11272923648357391, 'kl': 0.11474609375, 'epoch': 0.14} 14%|█▍ | 361/2500 [3:03:54<15:00:03, 25.25s/it] 14%|█▍ | 362/2500 [3:04:17<14:36:48, 24.61s/it] {'loss': 0.005, 'grad_norm': 0.22405081561351362, 'learning_rate': 8.551999999999999e-07, 'completion_length': 49.14285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.125732421875, 'epoch': 0.14} 14%|█▍ | 362/2500 [3:04:17<14:36:48, 24.61s/it] 15%|█▍ | 363/2500 [3:04:44<15:02:56, 25.35s/it] {'loss': 0.0051, 'grad_norm': 2.0827354475898714, 
'learning_rate': 8.548e-07, 'completion_length': 52.78571701049805, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.12646484375, 'epoch': 0.15} 15%|█▍ | 363/2500 [3:04:44<15:02:56, 25.35s/it] 15%|█▍ | 364/2500 [3:05:10<15:10:17, 25.57s/it] {'loss': 0.0043, 'grad_norm': 1.7051794793228798, 'learning_rate': 8.544e-07, 'completion_length': 56.65178871154785, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.1083984375, 'epoch': 0.15} 15%|█▍ | 364/2500 [3:05:10<15:10:17, 25.57s/it] 15%|█▍ | 365/2500 [3:05:35<15:06:07, 25.46s/it] {'loss': 0.0054, 'grad_norm': 0.18450121890913895, 'learning_rate': 8.539999999999999e-07, 'completion_length': 55.955360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1357421875, 'epoch': 0.15} 15%|█▍ | 365/2500 [3:05:35<15:06:07, 25.46s/it] 15%|█▍ | 366/2500 [3:06:02<15:16:36, 25.77s/it] {'loss': 0.0054, 'grad_norm': 0.6904652245777505, 'learning_rate': 8.536e-07, 'completion_length': 55.937503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13623046875, 'epoch': 0.15} 15%|█▍ | 366/2500 [3:06:02<15:16:36, 25.77s/it] 15%|█▍ | 367/2500 [3:06:26<15:05:28, 25.47s/it] {'loss': 0.0049, 'grad_norm': 1.5522923176594785, 'learning_rate': 8.531999999999999e-07, 'completion_length': 57.955360412597656, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.12158203125, 'epoch': 0.15} 15%|█▍ | 367/2500 [3:06:26<15:05:28, 25.47s/it] 15%|█▍ | 368/2500 [3:06:52<15:06:47, 25.52s/it] {'loss': 0.0046, 'grad_norm': 1.7582919159877572, 'learning_rate': 8.528e-07, 
'completion_length': 55.53571701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.115966796875, 'epoch': 0.15} 15%|█▍ | 368/2500 [3:06:52<15:06:47, 25.52s/it] 15%|█▍ | 369/2500 [3:07:17<15:01:29, 25.38s/it] {'loss': 0.0052, 'grad_norm': 0.49906633626773084, 'learning_rate': 8.524e-07, 'completion_length': 56.48214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.131103515625, 'epoch': 0.15} 15%|█▍ | 369/2500 [3:07:17<15:01:29, 25.38s/it] 15%|█▍ | 370/2500 [3:07:42<14:57:36, 25.28s/it] {'loss': 0.0051, 'grad_norm': 1.2911033929012958, 'learning_rate': 8.52e-07, 'completion_length': 59.25893211364746, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.05831882357597351, 'kl': 0.127197265625, 'epoch': 0.15} 15%|█▍ | 370/2500 [3:07:42<14:57:36, 25.28s/it] 15%|█▍ | 371/2500 [3:08:06<14:45:03, 24.94s/it] {'loss': 0.0047, 'grad_norm': 0.1975232500585094, 'learning_rate': 8.516e-07, 'completion_length': 57.455360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.116943359375, 'epoch': 0.15} 15%|█▍ | 371/2500 [3:08:06<14:45:03, 24.94s/it] 15%|█▍ | 372/2500 [3:08:29<14:19:20, 24.23s/it] {'loss': 0.0054, 'grad_norm': 2.0525091595790625, 'learning_rate': 8.511999999999999e-07, 'completion_length': 55.83928871154785, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.134521484375, 'epoch': 0.15} 15%|█▍ | 372/2500 [3:08:29<14:19:20, 24.23s/it] 15%|█▍ | 373/2500 [3:08:54<14:32:09, 24.60s/it] {'loss': 0.0039, 'grad_norm': 3.1631004906330427, 'learning_rate': 8.508e-07, 'completion_length': 61.82143211364746, 'rewards/accuracy_reward': 
0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.09814453125, 'epoch': 0.15} 15%|█▍ | 373/2500 [3:08:54<14:32:09, 24.60s/it] 15%|█▍ | 374/2500 [3:09:17<14:12:57, 24.07s/it] {'loss': 0.0048, 'grad_norm': 2.790360181879057, 'learning_rate': 8.504e-07, 'completion_length': 67.33036231994629, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.119140625, 'epoch': 0.15} 15%|█▍ | 374/2500 [3:09:17<14:12:57, 24.07s/it] 15%|█▌ | 375/2500 [3:09:46<14:59:02, 25.38s/it] {'loss': 0.0044, 'grad_norm': 2.115313270399408, 'learning_rate': 8.499999999999999e-07, 'completion_length': 63.30357360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.10888671875, 'epoch': 0.15} 15%|█▌ | 375/2500 [3:09:46<14:59:02, 25.38s/it] 15%|█▌ | 376/2500 [3:10:10<14:52:18, 25.21s/it] {'loss': 0.005, 'grad_norm': 12.188374344058309, 'learning_rate': 8.496e-07, 'completion_length': 63.95535850524902, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.06613001227378845, 'kl': 0.125, 'epoch': 0.15} 15%|█▌ | 376/2500 [3:10:10<14:52:18, 25.21s/it] 15%|█▌ | 377/2500 [3:10:39<15:23:19, 26.09s/it] {'loss': 0.0041, 'grad_norm': 1.2819046717026628, 'learning_rate': 8.492e-07, 'completion_length': 74.10714721679688, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857909202576, 'reward_std': 0.07576143741607666, 'kl': 0.101318359375, 'epoch': 0.15} 15%|█▌ | 377/2500 [3:10:39<15:23:19, 26.09s/it] 15%|█▌ | 378/2500 [3:11:05<15:27:51, 26.24s/it] {'loss': 0.0039, 'grad_norm': 0.48428551066178616, 'learning_rate': 8.487999999999999e-07, 'completion_length': 70.77678680419922, 'rewards/accuracy_reward': 
0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.0966796875, 'epoch': 0.15} 15%|█▌ | 378/2500 [3:11:05<15:27:51, 26.24s/it] 15%|█▌ | 379/2500 [3:11:32<15:36:29, 26.49s/it] {'loss': 0.0053, 'grad_norm': 3.5115792590610013, 'learning_rate': 8.484e-07, 'completion_length': 69.98214340209961, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.13232421875, 'epoch': 0.15} 15%|█▌ | 379/2500 [3:11:32<15:36:29, 26.49s/it] 15%|█▌ | 380/2500 [3:12:00<15:49:03, 26.86s/it] {'loss': 0.0044, 'grad_norm': 0.17493476734324626, 'learning_rate': 8.48e-07, 'completion_length': 72.08928680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.111083984375, 'epoch': 0.15} 15%|█▌ | 380/2500 [3:12:00<15:49:03, 26.86s/it] 15%|█▌ | 381/2500 [3:12:25<15:29:11, 26.31s/it] {'loss': 0.0049, 'grad_norm': 1.2213098945783691, 'learning_rate': 8.475999999999999e-07, 'completion_length': 75.13393020629883, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.07003280520439148, 'kl': 0.12158203125, 'epoch': 0.15} 15%|█▌ | 381/2500 [3:12:25<15:29:11, 26.31s/it] 15%|█▌ | 382/2500 [3:13:13<19:14:00, 32.69s/it] {'loss': 0.0058, 'grad_norm': 0.2021434111507671, 'learning_rate': 8.471999999999999e-07, 'completion_length': 63.41071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.144775390625, 'epoch': 0.15} 15%|█▌ | 382/2500 [3:13:13<19:14:00, 32.69s/it] 15%|█▌ | 383/2500 [3:14:29<26:58:23, 45.87s/it] {'loss': 0.005, 'grad_norm': 1.2285597188939534, 'learning_rate': 8.468e-07, 'completion_length': 72.39286041259766, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.946428656578064, 'reward_std': 0.11663764342665672, 'kl': 0.1240234375, 'epoch': 0.15} 15%|█▌ | 383/2500 [3:14:29<26:58:23, 45.87s/it] 15%|█▌ | 384/2500 [3:14:56<23:34:15, 40.10s/it] {'loss': 0.0034, 'grad_norm': 1.1978821689351424, 'learning_rate': 8.464e-07, 'completion_length': 70.84821701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.0849609375, 'epoch': 0.15} 15%|█▌ | 384/2500 [3:14:56<23:34:15, 40.10s/it] 15%|█▌ | 385/2500 [3:15:20<20:50:37, 35.48s/it] {'loss': 0.0048, 'grad_norm': 0.16142156116973314, 'learning_rate': 8.459999999999999e-07, 'completion_length': 61.642860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12109375, 'epoch': 0.15} 15%|█▌ | 385/2500 [3:15:20<20:50:37, 35.48s/it] 15%|█▌ | 386/2500 [3:15:59<21:18:05, 36.27s/it] {'loss': 0.0029, 'grad_norm': 0.15316163647006284, 'learning_rate': 8.456e-07, 'completion_length': 72.7589340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0733642578125, 'epoch': 0.15} 15%|█▌ | 386/2500 [3:15:59<21:18:05, 36.27s/it] 15%|█▌ | 387/2500 [3:16:36<21:31:50, 36.68s/it] {'loss': 0.0039, 'grad_norm': 0.8556057299005498, 'learning_rate': 8.451999999999999e-07, 'completion_length': 66.00893211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.097900390625, 'epoch': 0.15} 15%|█▌ | 387/2500 [3:16:36<21:31:50, 36.68s/it] 16%|█▌ | 388/2500 [3:17:01<19:27:52, 33.18s/it] {'loss': 0.0044, 'grad_norm': 1.224141225182576, 'learning_rate': 8.447999999999999e-07, 'completion_length': 70.40178680419922, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 
0.110107421875, 'epoch': 0.16} 16%|█▌ | 388/2500 [3:17:01<19:27:52, 33.18s/it] 16%|█▌ | 389/2500 [3:17:31<18:51:53, 32.17s/it] {'loss': 0.0051, 'grad_norm': 2.1177988732868815, 'learning_rate': 8.444e-07, 'completion_length': 62.794647216796875, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1279296875, 'epoch': 0.16} 16%|█▌ | 389/2500 [3:17:31<18:51:53, 32.17s/it] 16%|█▌ | 390/2500 [3:18:04<18:55:16, 32.28s/it] {'loss': 0.0047, 'grad_norm': 0.647758189083672, 'learning_rate': 8.439999999999999e-07, 'completion_length': 73.54464721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.117431640625, 'epoch': 0.16} 16%|█▌ | 390/2500 [3:18:04<18:55:16, 32.28s/it] 16%|█▌ | 391/2500 [3:18:37<19:10:31, 32.73s/it] {'loss': 0.0051, 'grad_norm': 4.059334959734946, 'learning_rate': 8.436e-07, 'completion_length': 58.05357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9464285969734192, 'reward_std': 0.09919501841068268, 'kl': 0.126220703125, 'epoch': 0.16} 16%|█▌ | 391/2500 [3:18:37<19:10:31, 32.73s/it] 16%|█▌ | 392/2500 [3:19:18<20:29:32, 35.00s/it] {'loss': 0.0037, 'grad_norm': 0.1966189098395276, 'learning_rate': 8.431999999999999e-07, 'completion_length': 65.08928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09228515625, 'epoch': 0.16} 16%|█▌ | 392/2500 [3:19:18<20:29:32, 35.00s/it] 16%|█▌ | 393/2500 [3:19:43<18:48:58, 32.15s/it] {'loss': 0.005, 'grad_norm': 6.743029505460742, 'learning_rate': 8.428e-07, 'completion_length': 61.85714530944824, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.11663763970136642, 'kl': 0.126220703125, 'epoch': 0.16} 
16%|█▌ | 393/2500 [3:19:43<18:48:58, 32.15s/it] 16%|█▌ | 394/2500 [3:20:07<17:25:51, 29.80s/it] {'loss': 0.003, 'grad_norm': 2.1218184232830053, 'learning_rate': 8.424e-07, 'completion_length': 60.62500190734863, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375001192092896, 'reward_std': 0.0964989960193634, 'kl': 0.0743408203125, 'epoch': 0.16} 16%|█▌ | 394/2500 [3:20:07<17:25:51, 29.80s/it] 16%|█▌ | 395/2500 [3:20:44<18:32:08, 31.70s/it] {'loss': 0.0041, 'grad_norm': 5.811112408877893, 'learning_rate': 8.419999999999999e-07, 'completion_length': 59.74107551574707, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.09918941557407379, 'kl': 0.1025390625, 'epoch': 0.16} 16%|█▌ | 395/2500 [3:20:44<18:32:08, 31.70s/it] 16%|█▌ | 396/2500 [3:21:10<17:34:08, 30.06s/it] {'loss': 0.0047, 'grad_norm': 0.9095515055125446, 'learning_rate': 8.416e-07, 'completion_length': 62.214290618896484, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.117431640625, 'epoch': 0.16} 16%|█▌ | 396/2500 [3:21:10<17:34:08, 30.06s/it] 16%|█▌ | 397/2500 [3:21:38<17:09:53, 29.38s/it] {'loss': 0.004, 'grad_norm': 1.6216999598585626, 'learning_rate': 8.411999999999999e-07, 'completion_length': 65.58036041259766, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.1005859375, 'epoch': 0.16} 16%|█▌ | 397/2500 [3:21:38<17:09:53, 29.38s/it] 16%|█▌ | 398/2500 [3:22:30<21:13:47, 36.36s/it] {'loss': 0.0045, 'grad_norm': 0.17813744675548615, 'learning_rate': 8.408e-07, 'completion_length': 69.97322082519531, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.113525390625, 'epoch': 0.16} 16%|█▌ | 398/2500 [3:22:30<21:13:47, 
36.36s/it] 16%|█▌ | 399/2500 [3:22:56<19:24:35, 33.26s/it] {'loss': 0.0042, 'grad_norm': 22.77852146000886, 'learning_rate': 8.404e-07, 'completion_length': 65.44643020629883, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.07514797896146774, 'kl': 0.105224609375, 'epoch': 0.16} 16%|█▌ | 399/2500 [3:22:56<19:24:35, 33.26s/it] 16%|█▌ | 400/2500 [3:23:20<17:48:06, 30.52s/it] {'loss': 0.0043, 'grad_norm': 3.5639768908165417, 'learning_rate': 8.399999999999999e-07, 'completion_length': 60.67857360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.1379830539226532, 'kl': 0.106689453125, 'epoch': 0.16} 16%|█▌ | 400/2500 [3:23:20<17:48:06, 30.52s/it] 16%|█▌ | 401/2500 [3:24:15<21:56:12, 37.62s/it] {'loss': 0.0042, 'grad_norm': 3.7064281036044977, 'learning_rate': 8.396e-07, 'completion_length': 74.11607360839844, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.12175281718373299, 'kl': 0.104736328125, 'epoch': 0.16} 16%|█▌ | 401/2500 [3:24:15<21:56:12, 37.62s/it] 16%|█▌ | 402/2500 [3:24:37<19:11:25, 32.93s/it] {'loss': 0.0045, 'grad_norm': 1.3462590157097039, 'learning_rate': 8.391999999999999e-07, 'completion_length': 77.21429061889648, 'rewards/accuracy_reward': 0.866071492433548, 'rewards/format_reward': 1.0, 'reward': 1.8660715222358704, 'reward_std': 0.05831881985068321, 'kl': 0.113037109375, 'epoch': 0.16} 16%|█▌ | 402/2500 [3:24:37<19:11:25, 32.93s/it] 16%|█▌ | 403/2500 [3:24:58<17:08:24, 29.43s/it] {'loss': 0.0044, 'grad_norm': 2.132532816103048, 'learning_rate': 8.387999999999999e-07, 'completion_length': 65.28571701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.10986328125, 'epoch': 0.16} 16%|█▌ | 
403/2500 [3:24:58<17:08:24, 29.43s/it] 16%|█▌ | 404/2500 [3:25:21<16:02:12, 27.54s/it] {'loss': 0.0046, 'grad_norm': 0.9371132953235743, 'learning_rate': 8.384e-07, 'completion_length': 61.80357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.115234375, 'epoch': 0.16} 16%|█▌ | 404/2500 [3:25:21<16:02:12, 27.54s/it] 16%|█▌ | 405/2500 [3:25:44<15:18:46, 26.31s/it] {'loss': 0.0031, 'grad_norm': 4.603177747757701, 'learning_rate': 8.38e-07, 'completion_length': 61.267860412597656, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.078125, 'epoch': 0.16} 16%|█▌ | 405/2500 [3:25:44<15:18:46, 26.31s/it] 16%|█▌ | 406/2500 [3:26:19<16:46:48, 28.85s/it] {'loss': 0.0038, 'grad_norm': 2.4988641441134245, 'learning_rate': 8.375999999999999e-07, 'completion_length': 61.20535850524902, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.07003280520439148, 'kl': 0.095703125, 'epoch': 0.16} 16%|█▌ | 406/2500 [3:26:19<16:46:48, 28.85s/it] 16%|█▋ | 407/2500 [3:26:45<16:16:09, 27.98s/it] {'loss': 0.004, 'grad_norm': 2.539944123847337, 'learning_rate': 8.372e-07, 'completion_length': 67.14285850524902, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.07514797896146774, 'kl': 0.100830078125, 'epoch': 0.16} 16%|█▋ | 407/2500 [3:26:45<16:16:09, 27.98s/it] 16%|█▋ | 408/2500 [3:27:09<15:33:15, 26.77s/it] {'loss': 0.0045, 'grad_norm': 4.79129178763252, 'learning_rate': 8.368e-07, 'completion_length': 67.20535850524902, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.112548828125, 'epoch': 0.16} 16%|█▋ | 408/2500 
[3:27:09<15:33:15, 26.77s/it] 16%|█▋ | 409/2500 [3:27:39<16:00:07, 27.55s/it] {'loss': 0.0038, 'grad_norm': 0.6682219188042321, 'learning_rate': 8.363999999999999e-07, 'completion_length': 66.57143211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.09619140625, 'epoch': 0.16} 16%|█▋ | 409/2500 [3:27:39<16:00:07, 27.55s/it] 16%|█▋ | 410/2500 [3:28:03<15:33:01, 26.79s/it] {'loss': 0.0042, 'grad_norm': 12.560241340950725, 'learning_rate': 8.359999999999999e-07, 'completion_length': 61.732147216796875, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.10595703125, 'epoch': 0.16} 16%|█▋ | 410/2500 [3:28:04<15:33:01, 26.79s/it] 16%|█▋ | 411/2500 [3:28:28<15:11:49, 26.19s/it] {'loss': 0.0041, 'grad_norm': 2.6289514795930855, 'learning_rate': 8.356e-07, 'completion_length': 61.15178680419922, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1025390625, 'epoch': 0.16} 16%|█▋ | 411/2500 [3:28:28<15:11:49, 26.19s/it] 16%|█▋ | 412/2500 [3:28:56<15:22:37, 26.51s/it] {'loss': 0.0046, 'grad_norm': 0.3815335497559522, 'learning_rate': 8.352000000000001e-07, 'completion_length': 61.67857551574707, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.114990234375, 'epoch': 0.16} 16%|█▋ | 412/2500 [3:28:56<15:22:37, 26.51s/it] 17%|█▋ | 413/2500 [3:29:20<15:03:08, 25.96s/it] {'loss': 0.0049, 'grad_norm': 2.3893050495005115, 'learning_rate': 8.347999999999999e-07, 'completion_length': 58.88393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 
0.1240234375, 'epoch': 0.17} 17%|█▋ | 413/2500 [3:29:20<15:03:08, 25.96s/it] 17%|█▋ | 414/2500 [3:29:47<15:07:16, 26.10s/it] {'loss': 0.005, 'grad_norm': 1.5966557439765594, 'learning_rate': 8.344e-07, 'completion_length': 66.30357551574707, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.06613001227378845, 'kl': 0.125, 'epoch': 0.17} 17%|█▋ | 414/2500 [3:29:47<15:07:16, 26.10s/it] 17%|█▋ | 415/2500 [3:30:15<15:31:25, 26.80s/it] {'loss': 0.0036, 'grad_norm': 1.1870230169000804, 'learning_rate': 8.34e-07, 'completion_length': 65.61607360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.091064453125, 'epoch': 0.17} 17%|█▋ | 415/2500 [3:30:15<15:31:25, 26.80s/it] 17%|█▋ | 416/2500 [3:30:54<17:36:28, 30.42s/it] {'loss': 0.0048, 'grad_norm': 0.20675671601169773, 'learning_rate': 8.335999999999999e-07, 'completion_length': 60.90178871154785, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.119873046875, 'epoch': 0.17} 17%|█▋ | 416/2500 [3:30:54<17:36:28, 30.42s/it] 17%|█▋ | 417/2500 [3:31:18<16:25:32, 28.39s/it] {'loss': 0.0038, 'grad_norm': 0.1882174455735784, 'learning_rate': 8.332e-07, 'completion_length': 64.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.095703125, 'epoch': 0.17} 17%|█▋ | 417/2500 [3:31:18<16:25:32, 28.39s/it] 17%|█▋ | 418/2500 [3:31:59<18:43:48, 32.39s/it] {'loss': 0.0038, 'grad_norm': 0.1298795214406162, 'learning_rate': 8.328e-07, 'completion_length': 60.080360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.095703125, 'epoch': 0.17} 17%|█▋ | 418/2500 [3:31:59<18:43:48, 32.39s/it] 17%|█▋ | 419/2500 [3:32:41<20:23:23, 35.27s/it] {'loss': 0.0038, 
'grad_norm': 1.0796825404924308, 'learning_rate': 8.324e-07, 'completion_length': 66.52678871154785, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.095703125, 'epoch': 0.17} 17%|█▋ | 419/2500 [3:32:41<20:23:23, 35.27s/it] 17%|█▋ | 420/2500 [3:33:06<18:29:38, 32.01s/it] {'loss': 0.0043, 'grad_norm': 2.3535507547185355, 'learning_rate': 8.319999999999999e-07, 'completion_length': 62.31250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.10791015625, 'epoch': 0.17} 17%|█▋ | 420/2500 [3:33:06<18:29:38, 32.01s/it] 17%|█▋ | 421/2500 [3:33:59<22:05:29, 38.25s/it] {'loss': 0.0039, 'grad_norm': 0.5478836485147217, 'learning_rate': 8.316e-07, 'completion_length': 61.82143020629883, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.097412109375, 'epoch': 0.17} 17%|█▋ | 421/2500 [3:33:59<22:05:29, 38.25s/it] 17%|█▋ | 422/2500 [3:34:23<19:44:10, 34.19s/it] {'loss': 0.0058, 'grad_norm': 1.764732716555874, 'learning_rate': 8.312e-07, 'completion_length': 59.82143020629883, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.14501953125, 'epoch': 0.17} 17%|█▋ | 422/2500 [3:34:23<19:44:10, 34.19s/it] 17%|█▋ | 423/2500 [3:34:48<18:10:25, 31.50s/it] {'loss': 0.0036, 'grad_norm': 3.8698778382758845, 'learning_rate': 8.308e-07, 'completion_length': 62.857147216796875, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.090087890625, 'epoch': 0.17} 17%|█▋ | 423/2500 [3:34:49<18:10:25, 31.50s/it] 17%|█▋ | 424/2500 [3:35:12<16:51:50, 29.24s/it] {'loss': 0.004, 'grad_norm': 
0.6373979301838881, 'learning_rate': 8.304e-07, 'completion_length': 63.705360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.100341796875, 'epoch': 0.17} 17%|█▋ | 424/2500 [3:35:12<16:51:50, 29.24s/it] 17%|█▋ | 425/2500 [3:35:37<16:03:52, 27.87s/it] {'loss': 0.004, 'grad_norm': 0.15827751852023772, 'learning_rate': 8.299999999999999e-07, 'completion_length': 59.48214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09912109375, 'epoch': 0.17} 17%|█▋ | 425/2500 [3:35:37<16:03:52, 27.87s/it] 17%|█▋ | 426/2500 [3:36:01<15:23:27, 26.72s/it] {'loss': 0.004, 'grad_norm': 0.1303777868971204, 'learning_rate': 8.296e-07, 'completion_length': 62.892860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.099609375, 'epoch': 0.17} 17%|█▋ | 426/2500 [3:36:01<15:23:27, 26.72s/it] 17%|█▋ | 427/2500 [3:36:24<14:46:39, 25.66s/it] {'loss': 0.0044, 'grad_norm': 1.7595191675713129, 'learning_rate': 8.292e-07, 'completion_length': 66.65179061889648, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.10888671875, 'epoch': 0.17} 17%|█▋ | 427/2500 [3:36:24<14:46:39, 25.66s/it] 17%|█▋ | 428/2500 [3:36:50<14:49:58, 25.77s/it] {'loss': 0.0041, 'grad_norm': 0.9613314370756899, 'learning_rate': 8.287999999999999e-07, 'completion_length': 67.01786041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1025390625, 'epoch': 0.17} 17%|█▋ | 428/2500 [3:36:50<14:49:58, 25.77s/it] 17%|█▋ | 429/2500 [3:37:14<14:25:59, 25.09s/it] {'loss': 0.0045, 'grad_norm': 3.1889060263155264, 'learning_rate': 8.284e-07, 'completion_length': 64.06250190734863, 
'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.11376953125, 'epoch': 0.17} 17%|█▋ | 429/2500 [3:37:14<14:25:59, 25.09s/it] 17%|█▋ | 430/2500 [3:37:38<14:17:06, 24.84s/it] {'loss': 0.0041, 'grad_norm': 0.14880351995812519, 'learning_rate': 8.28e-07, 'completion_length': 61.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.101318359375, 'epoch': 0.17} 17%|█▋ | 430/2500 [3:37:38<14:17:06, 24.84s/it] 17%|█▋ | 431/2500 [3:38:02<14:08:54, 24.62s/it] {'loss': 0.004, 'grad_norm': 2.7171907049771393, 'learning_rate': 8.275999999999999e-07, 'completion_length': 68.33036041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1005859375, 'epoch': 0.17} 17%|█▋ | 431/2500 [3:38:02<14:08:54, 24.62s/it] 17%|█▋ | 432/2500 [3:38:26<13:58:36, 24.33s/it] {'loss': 0.0035, 'grad_norm': 1.8644519446923526, 'learning_rate': 8.272e-07, 'completion_length': 65.07143211364746, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.11394162103533745, 'kl': 0.088623046875, 'epoch': 0.17} 17%|█▋ | 432/2500 [3:38:26<13:58:36, 24.33s/it] 17%|█▋ | 433/2500 [3:38:52<14:19:18, 24.94s/it] {'loss': 0.0031, 'grad_norm': 0.09805167844132275, 'learning_rate': 8.268e-07, 'completion_length': 71.31250381469727, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.078369140625, 'epoch': 0.17} 17%|█▋ | 433/2500 [3:38:52<14:19:18, 24.94s/it] 17%|█▋ | 434/2500 [3:39:16<14:02:48, 24.48s/it] {'loss': 0.0048, 'grad_norm': 2.242706619965644, 'learning_rate': 8.263999999999999e-07, 'completion_length': 62.83928871154785, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 
1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.11865234375, 'epoch': 0.17} 17%|█▋ | 434/2500 [3:39:16<14:02:48, 24.48s/it] 17%|█▋ | 435/2500 [3:39:40<13:57:16, 24.33s/it] {'loss': 0.0038, 'grad_norm': 2.1244685300188997, 'learning_rate': 8.259999999999999e-07, 'completion_length': 71.2410774230957, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.093994140625, 'epoch': 0.17} 17%|█▋ | 435/2500 [3:39:40<13:57:16, 24.33s/it] 17%|█▋ | 436/2500 [3:40:03<13:44:21, 23.96s/it] {'loss': 0.0046, 'grad_norm': 1.0576953285878574, 'learning_rate': 8.256e-07, 'completion_length': 58.92857360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.115966796875, 'epoch': 0.17} 17%|█▋ | 436/2500 [3:40:03<13:44:21, 23.96s/it] 17%|█▋ | 437/2500 [3:40:26<13:36:46, 23.75s/it] {'loss': 0.0039, 'grad_norm': 0.14243591434458155, 'learning_rate': 8.252000000000001e-07, 'completion_length': 67.98214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.096923828125, 'epoch': 0.17} 17%|█▋ | 437/2500 [3:40:26<13:36:46, 23.75s/it] 18%|█▊ | 438/2500 [3:40:51<13:46:23, 24.05s/it] {'loss': 0.0047, 'grad_norm': 2.3395899896877475, 'learning_rate': 8.247999999999999e-07, 'completion_length': 63.77678871154785, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553572535514832, 'reward_std': 0.09138382971286774, 'kl': 0.11865234375, 'epoch': 0.18} 18%|█▊ | 438/2500 [3:40:51<13:46:23, 24.05s/it] 18%|█▊ | 439/2500 [3:41:18<14:20:35, 25.05s/it] {'loss': 0.0039, 'grad_norm': 1.9865987843945538, 'learning_rate': 8.244e-07, 'completion_length': 77.90178680419922, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 
'reward_std': 0.06343399360775948, 'kl': 0.09765625, 'epoch': 0.18} 18%|█▊ | 439/2500 [3:41:18<14:20:35, 25.05s/it] 18%|█▊ | 440/2500 [3:41:46<14:52:10, 25.99s/it] {'loss': 0.0037, 'grad_norm': 2.210968799800189, 'learning_rate': 8.24e-07, 'completion_length': 67.84821510314941, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.09138382598757744, 'kl': 0.09326171875, 'epoch': 0.18} 18%|█▊ | 440/2500 [3:41:46<14:52:10, 25.99s/it] 18%|█▊ | 441/2500 [3:42:13<14:53:52, 26.05s/it] {'loss': 0.0035, 'grad_norm': 0.16168570592158898, 'learning_rate': 8.235999999999999e-07, 'completion_length': 74.33929061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.086669921875, 'epoch': 0.18} 18%|█▊ | 441/2500 [3:42:13<14:53:52, 26.05s/it] 18%|█▊ | 442/2500 [3:42:41<15:13:49, 26.64s/it] {'loss': 0.0033, 'grad_norm': 1.8746616111751766, 'learning_rate': 8.232e-07, 'completion_length': 64.50000190734863, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.10040178894996643, 'kl': 0.08154296875, 'epoch': 0.18} 18%|█▊ | 442/2500 [3:42:41<15:13:49, 26.64s/it] 18%|█▊ | 443/2500 [3:43:09<15:27:44, 27.06s/it] {'loss': 0.0046, 'grad_norm': 1.2687947860363853, 'learning_rate': 8.228e-07, 'completion_length': 68.62500381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9642857313156128, 'reward_std': 0.05399492755532265, 'kl': 0.114013671875, 'epoch': 0.18} 18%|█▊ | 443/2500 [3:43:09<15:27:44, 27.06s/it] 18%|█▊ | 444/2500 [3:43:35<15:20:01, 26.85s/it] {'loss': 0.0038, 'grad_norm': 1.2537850025844108, 'learning_rate': 8.224e-07, 'completion_length': 69.93750381469727, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.06222161278128624, 'kl': 
0.09423828125, 'epoch': 0.18} 18%|█▊ | 444/2500 [3:43:35<15:20:01, 26.85s/it] 18%|█▊ | 445/2500 [3:44:09<16:30:57, 28.93s/it] {'loss': 0.0032, 'grad_norm': 1.500524004459538, 'learning_rate': 8.219999999999999e-07, 'completion_length': 70.7589340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.080810546875, 'epoch': 0.18} 18%|█▊ | 445/2500 [3:44:09<16:30:57, 28.93s/it] 18%|█▊ | 446/2500 [3:44:34<15:57:51, 27.98s/it] {'loss': 0.0035, 'grad_norm': 0.9481945116913043, 'learning_rate': 8.216e-07, 'completion_length': 78.28571701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.08740234375, 'epoch': 0.18} 18%|█▊ | 446/2500 [3:44:35<15:57:51, 27.98s/it] 18%|█▊ | 447/2500 [3:45:00<15:32:05, 27.24s/it] {'loss': 0.0031, 'grad_norm': 1.264272667433872, 'learning_rate': 8.212e-07, 'completion_length': 77.51786041259766, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.078857421875, 'epoch': 0.18} 18%|█▊ | 447/2500 [3:45:00<15:32:05, 27.24s/it] 18%|█▊ | 448/2500 [3:45:32<16:17:00, 28.57s/it] {'loss': 0.0044, 'grad_norm': 0.5242979599048105, 'learning_rate': 8.207999999999999e-07, 'completion_length': 70.90178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11083984375, 'epoch': 0.18} 18%|█▊ | 448/2500 [3:45:32<16:17:00, 28.57s/it] 18%|█▊ | 449/2500 [3:45:57<15:43:58, 27.62s/it] {'loss': 0.0028, 'grad_norm': 0.7576053898785552, 'learning_rate': 8.204e-07, 'completion_length': 75.64286041259766, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 
0.070556640625, 'epoch': 0.18} 18%|█▊ | 449/2500 [3:45:57<15:43:58, 27.62s/it] 18%|█▊ | 450/2500 [3:46:21<15:04:29, 26.47s/it] {'loss': 0.0032, 'grad_norm': 0.13872658018864753, 'learning_rate': 8.199999999999999e-07, 'completion_length': 69.20536041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.080078125, 'epoch': 0.18} 18%|█▊ | 450/2500 [3:46:21<15:04:29, 26.47s/it] 18%|█▊ | 451/2500 [3:47:02<17:37:38, 30.97s/it] {'loss': 0.0034, 'grad_norm': 0.6743217142931927, 'learning_rate': 8.196e-07, 'completion_length': 85.08036041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.085205078125, 'epoch': 0.18} 18%|█▊ | 451/2500 [3:47:02<17:37:38, 30.97s/it] 18%|█▊ | 452/2500 [3:47:30<17:00:06, 29.89s/it] {'loss': 0.0025, 'grad_norm': 0.6415127019428041, 'learning_rate': 8.192e-07, 'completion_length': 78.52679061889648, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9017857909202576, 'reward_std': 0.07576143741607666, 'kl': 0.0628662109375, 'epoch': 0.18} 18%|█▊ | 452/2500 [3:47:31<17:00:06, 29.89s/it] 18%|█▊ | 453/2500 [3:48:40<23:53:10, 42.01s/it] {'loss': 0.0038, 'grad_norm': 0.6234506149478456, 'learning_rate': 8.187999999999999e-07, 'completion_length': 64.41071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.093994140625, 'epoch': 0.18} 18%|█▊ | 453/2500 [3:48:40<23:53:10, 42.01s/it] 18%|█▊ | 454/2500 [3:49:07<21:15:12, 37.40s/it] {'loss': 0.0041, 'grad_norm': 3.100461560094771, 'learning_rate': 8.184e-07, 'completion_length': 58.85714530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.103515625, 'epoch': 0.18} 18%|█▊ | 
454/2500 [3:49:07<21:15:12, 37.40s/it] 18%|█▊ | 455/2500 [3:49:40<20:32:12, 36.15s/it] {'loss': 0.004, 'grad_norm': 1.107893539100025, 'learning_rate': 8.179999999999999e-07, 'completion_length': 68.42857360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.099609375, 'epoch': 0.18} 18%|█▊ | 455/2500 [3:49:40<20:32:12, 36.15s/it] 18%|█▊ | 456/2500 [3:50:13<20:05:23, 35.38s/it] {'loss': 0.0038, 'grad_norm': 0.1851015618561487, 'learning_rate': 8.175999999999999e-07, 'completion_length': 62.91071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.093994140625, 'epoch': 0.18} 18%|█▊ | 456/2500 [3:50:13<20:05:23, 35.38s/it] 18%|█▊ | 457/2500 [3:50:55<21:10:09, 37.30s/it] {'loss': 0.0035, 'grad_norm': 0.16743926061321282, 'learning_rate': 8.172e-07, 'completion_length': 75.37500381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.087158203125, 'epoch': 0.18} 18%|█▊ | 457/2500 [3:50:55<21:10:09, 37.30s/it] 18%|█▊ | 458/2500 [3:51:33<21:11:59, 37.37s/it] {'loss': 0.0039, 'grad_norm': 6.554757117032259, 'learning_rate': 8.168e-07, 'completion_length': 66.65178680419922, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553572535514832, 'reward_std': 0.08747542649507523, 'kl': 0.096923828125, 'epoch': 0.18} 18%|█▊ | 458/2500 [3:51:33<21:11:59, 37.37s/it] 18%|█▊ | 459/2500 [3:52:52<28:22:07, 50.04s/it] {'loss': 0.004, 'grad_norm': 1.1184095877099098, 'learning_rate': 8.163999999999999e-07, 'completion_length': 59.69643020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.10009765625, 'epoch': 0.18} 18%|█▊ | 459/2500 [3:52:52<28:22:07, 50.04s/it] 18%|█▊ | 460/2500 [3:53:34<26:56:01, 
47.53s/it] {'loss': 0.0039, 'grad_norm': 0.12278477844545456, 'learning_rate': 8.159999999999999e-07, 'completion_length': 66.77679061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0966796875, 'epoch': 0.18} 18%|█▊ | 460/2500 [3:53:34<26:56:01, 47.53s/it] 18%|█▊ | 461/2500 [3:54:02<23:34:35, 41.63s/it] {'loss': 0.0029, 'grad_norm': 0.1237064355724003, 'learning_rate': 8.156e-07, 'completion_length': 69.70536041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.07177734375, 'epoch': 0.18} 18%|█▊ | 461/2500 [3:54:02<23:34:35, 41.63s/it] 18%|█▊ | 462/2500 [3:54:27<20:47:54, 36.74s/it] {'loss': 0.0043, 'grad_norm': 0.13174396415666667, 'learning_rate': 8.152e-07, 'completion_length': 76.13393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.108154296875, 'epoch': 0.18} 18%|█▊ | 462/2500 [3:54:27<20:47:54, 36.74s/it] 19%|█▊ | 463/2500 [3:54:56<19:22:06, 34.23s/it] {'loss': 0.003, 'grad_norm': 0.536804953588463, 'learning_rate': 8.147999999999999e-07, 'completion_length': 65.55357551574707, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.0751953125, 'epoch': 0.19} 19%|█▊ | 463/2500 [3:54:56<19:22:06, 34.23s/it] 19%|█▊ | 464/2500 [3:55:20<17:39:35, 31.23s/it] {'loss': 0.006, 'grad_norm': 0.20562445832460183, 'learning_rate': 8.144e-07, 'completion_length': 64.75000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1513671875, 'epoch': 0.19} 19%|█▊ | 464/2500 [3:55:20<17:39:35, 31.23s/it] 19%|█▊ | 465/2500 [3:56:19<22:25:19, 39.67s/it] {'loss': 0.0024, 'grad_norm': 2.2112528742281694, 'learning_rate': 8.14e-07, 'completion_length': 73.07143020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 
1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.0609130859375, 'epoch': 0.19} 19%|█▊ | 465/2500 [3:56:19<22:25:19, 39.67s/it] 19%|█▊ | 466/2500 [3:56:46<20:09:55, 35.69s/it] {'loss': 0.0039, 'grad_norm': 0.16965070305637567, 'learning_rate': 8.135999999999999e-07, 'completion_length': 65.43750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0966796875, 'epoch': 0.19} 19%|█▊ | 466/2500 [3:56:46<20:09:55, 35.69s/it] 19%|█▊ | 467/2500 [3:57:11<18:24:38, 32.60s/it] {'loss': 0.0035, 'grad_norm': 0.15290273453506906, 'learning_rate': 8.132e-07, 'completion_length': 66.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.087158203125, 'epoch': 0.19} 19%|█▊ | 467/2500 [3:57:11<18:24:38, 32.60s/it] 19%|█▊ | 468/2500 [3:57:36<17:09:15, 30.39s/it] {'loss': 0.0032, 'grad_norm': 2.9674677862006478, 'learning_rate': 8.128e-07, 'completion_length': 68.93750381469727, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.079833984375, 'epoch': 0.19} 19%|█▊ | 468/2500 [3:57:36<17:09:15, 30.39s/it] 19%|█▉ | 469/2500 [3:58:03<16:30:35, 29.26s/it] {'loss': 0.0037, 'grad_norm': 0.7893273410901386, 'learning_rate': 8.123999999999999e-07, 'completion_length': 70.08928871154785, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.033065006136894226, 'kl': 0.09375, 'epoch': 0.19} 19%|█▉ | 469/2500 [3:58:03<16:30:35, 29.26s/it] 19%|█▉ | 470/2500 [3:58:52<19:56:17, 35.36s/it] {'loss': 0.0046, 'grad_norm': 5.517665679212294, 'learning_rate': 8.12e-07, 'completion_length': 69.18750381469727, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.1142578125, 'epoch': 0.19} 
19%|█▉ | 470/2500 [3:58:52<19:56:17, 35.36s/it] 19%|█▉ | 471/2500 [3:59:16<17:59:10, 31.91s/it] {'loss': 0.0035, 'grad_norm': 2.356729591057947, 'learning_rate': 8.116e-07, 'completion_length': 65.20536231994629, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.087890625, 'epoch': 0.19} 19%|█▉ | 471/2500 [3:59:16<17:59:10, 31.91s/it] 19%|█▉ | 472/2500 [3:59:42<16:50:54, 29.91s/it] {'loss': 0.0028, 'grad_norm': 2.1940288740875373, 'learning_rate': 8.112e-07, 'completion_length': 67.02678871154785, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.08747543022036552, 'kl': 0.0689697265625, 'epoch': 0.19} 19%|█▉ | 472/2500 [3:59:42<16:50:54, 29.91s/it] 19%|█▉ | 473/2500 [4:00:06<15:56:22, 28.31s/it] {'loss': 0.0036, 'grad_norm': 0.38876607482676134, 'learning_rate': 8.107999999999999e-07, 'completion_length': 64.91964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.089111328125, 'epoch': 0.19} 19%|█▉ | 473/2500 [4:00:07<15:56:22, 28.31s/it] 19%|█▉ | 474/2500 [4:00:31<15:25:52, 27.42s/it] {'loss': 0.0028, 'grad_norm': 0.1655942933148762, 'learning_rate': 8.104e-07, 'completion_length': 72.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0703125, 'epoch': 0.19} 19%|█▉ | 474/2500 [4:00:31<15:25:52, 27.42s/it] 19%|█▉ | 475/2500 [4:00:59<15:23:04, 27.35s/it] {'loss': 0.0035, 'grad_norm': 1.1985117880477831, 'learning_rate': 8.1e-07, 'completion_length': 71.6875, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.08740234375, 'epoch': 0.19} 19%|█▉ | 475/2500 [4:00:59<15:23:04, 27.35s/it] 19%|█▉ | 476/2500 [4:01:25<15:15:04, 27.13s/it] {'loss': 0.0037, 'grad_norm': 
1.0574639915351298, 'learning_rate': 8.095999999999999e-07, 'completion_length': 71.83036041259766, 'rewards/accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.03696779906749725, 'kl': 0.091796875, 'epoch': 0.19} 19%|█▉ | 476/2500 [4:01:25<15:15:04, 27.13s/it] 19%|█▉ | 477/2500 [4:01:53<15:22:14, 27.35s/it] {'loss': 0.0028, 'grad_norm': 0.5169792247377396, 'learning_rate': 8.092e-07, 'completion_length': 74.19643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.071044921875, 'epoch': 0.19} 19%|█▉ | 477/2500 [4:01:53<15:22:14, 27.35s/it] 19%|█▉ | 478/2500 [4:02:20<15:20:55, 27.33s/it] {'loss': 0.0024, 'grad_norm': 2.322074242257949, 'learning_rate': 8.087999999999999e-07, 'completion_length': 75.8214340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9464285969734192, 'reward_std': 0.07839837670326233, 'kl': 0.059814453125, 'epoch': 0.19} 19%|█▉ | 478/2500 [4:02:20<15:20:55, 27.33s/it] 19%|█▉ | 479/2500 [4:02:46<14:58:14, 26.67s/it] {'loss': 0.0039, 'grad_norm': 0.6498906819141819, 'learning_rate': 8.084e-07, 'completion_length': 71.09821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.097412109375, 'epoch': 0.19} 19%|█▉ | 479/2500 [4:02:46<14:58:14, 26.67s/it] 19%|█▉ | 480/2500 [4:03:10<14:36:27, 26.03s/it] {'loss': 0.0038, 'grad_norm': 3.931285091360658, 'learning_rate': 8.08e-07, 'completion_length': 66.28571701049805, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.06613001227378845, 'kl': 0.095458984375, 'epoch': 0.19} 19%|█▉ | 480/2500 [4:03:10<14:36:27, 26.03s/it] 19%|█▉ | 481/2500 [4:03:38<14:58:25, 26.70s/it] {'loss': 0.0025, 
'grad_norm': 2.6333423524868764, 'learning_rate': 8.075999999999999e-07, 'completion_length': 74.59821701049805, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8928572535514832, 'reward_std': 0.08868780359625816, 'kl': 0.062744140625, 'epoch': 0.19} 19%|█▉ | 481/2500 [4:03:38<14:58:25, 26.70s/it] 19%|█▉ | 482/2500 [4:04:25<18:21:56, 32.76s/it] {'loss': 0.0042, 'grad_norm': 2.6810411459067383, 'learning_rate': 8.072e-07, 'completion_length': 72.08928680419922, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8750001192092896, 'reward_std': 0.09919501841068268, 'kl': 0.104248046875, 'epoch': 0.19} 19%|█▉ | 482/2500 [4:04:25<18:21:56, 32.76s/it] 19%|█▉ | 483/2500 [4:04:54<17:44:56, 31.68s/it] {'loss': 0.0036, 'grad_norm': 0.27780090970104804, 'learning_rate': 8.067999999999999e-07, 'completion_length': 85.6339340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.089111328125, 'epoch': 0.19} 19%|█▉ | 483/2500 [4:04:54<17:44:56, 31.68s/it] 19%|█▉ | 484/2500 [4:05:20<16:38:58, 29.73s/it] {'loss': 0.0027, 'grad_norm': 0.11884728103986246, 'learning_rate': 8.064e-07, 'completion_length': 70.59821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.066650390625, 'epoch': 0.19} 19%|█▉ | 484/2500 [4:05:20<16:38:58, 29.73s/it] 19%|█▉ | 485/2500 [4:06:04<19:08:42, 34.20s/it] {'loss': 0.0025, 'grad_norm': 0.14170916282282645, 'learning_rate': 8.06e-07, 'completion_length': 75.10714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0631103515625, 'epoch': 0.19} 19%|█▉ | 485/2500 [4:06:04<19:08:42, 34.20s/it] 19%|█▉ | 486/2500 [4:06:29<17:31:58, 31.34s/it] {'loss': 0.0023, 'grad_norm': 1.8053001564916222, 'learning_rate': 
8.056e-07, 'completion_length': 70.5535774230957, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.0572509765625, 'epoch': 0.19} 19%|█▉ | 486/2500 [4:06:29<17:31:58, 31.34s/it] 19%|█▉ | 487/2500 [4:06:57<16:55:00, 30.25s/it] {'loss': 0.0029, 'grad_norm': 1.0223939313181518, 'learning_rate': 8.052e-07, 'completion_length': 76.91964340209961, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.073486328125, 'epoch': 0.19} 19%|█▉ | 487/2500 [4:06:57<16:55:00, 30.25s/it] 20%|█▉ | 488/2500 [4:07:33<17:59:09, 32.18s/it] {'loss': 0.003, 'grad_norm': 0.11107353445938922, 'learning_rate': 8.047999999999999e-07, 'completion_length': 74.50893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.07470703125, 'epoch': 0.2} 20%|█▉ | 488/2500 [4:07:33<17:59:09, 32.18s/it] 20%|█▉ | 489/2500 [4:07:59<16:55:50, 30.31s/it] {'loss': 0.0038, 'grad_norm': 0.7948648790085626, 'learning_rate': 8.044e-07, 'completion_length': 68.05357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09619140625, 'epoch': 0.2} 20%|█▉ | 489/2500 [4:07:59<16:55:50, 30.31s/it] 20%|█▉ | 490/2500 [4:08:22<15:39:41, 28.05s/it] {'loss': 0.0036, 'grad_norm': 1.4752273220884955, 'learning_rate': 8.04e-07, 'completion_length': 73.28571701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.08935546875, 'epoch': 0.2} 20%|█▉ | 490/2500 [4:08:22<15:39:41, 28.05s/it] 20%|█▉ | 491/2500 [4:08:48<15:23:28, 27.58s/it] {'loss': 0.0038, 'grad_norm': 3.9379010841854023, 'learning_rate': 8.035999999999999e-07, 'completion_length': 73.20536041259766, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.09619140625, 'epoch': 0.2} 20%|█▉ | 491/2500 [4:08:49<15:23:28, 27.58s/it] 20%|█▉ | 492/2500 [4:09:15<15:10:30, 27.21s/it] {'loss': 0.0038, 'grad_norm': 2.4650249842754453, 'learning_rate': 8.032e-07, 'completion_length': 66.43750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.0947265625, 'epoch': 0.2} 20%|█▉ | 492/2500 [4:09:15<15:10:30, 27.21s/it] 20%|█▉ | 493/2500 [4:09:49<16:20:38, 29.32s/it] {'loss': 0.0038, 'grad_norm': 0.9239926816860806, 'learning_rate': 8.028e-07, 'completion_length': 80.17857360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.095703125, 'epoch': 0.2} 20%|█▉ | 493/2500 [4:09:49<16:20:38, 29.32s/it] 20%|█▉ | 494/2500 [4:10:18<16:13:01, 29.10s/it] {'loss': 0.0039, 'grad_norm': 1.8477398334543036, 'learning_rate': 8.023999999999999e-07, 'completion_length': 77.62500381469727, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.07003280520439148, 'kl': 0.097412109375, 'epoch': 0.2} 20%|█▉ | 494/2500 [4:10:18<16:13:01, 29.10s/it] 20%|█▉ | 495/2500 [4:10:44<15:41:18, 28.17s/it] {'loss': 0.0046, 'grad_norm': 1.4089347027675938, 'learning_rate': 8.02e-07, 'completion_length': 66.5089340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.114990234375, 'epoch': 0.2} 20%|█▉ | 495/2500 [4:10:44<15:41:18, 28.17s/it] 20%|█▉ | 496/2500 [4:11:09<15:08:37, 27.20s/it] {'loss': 0.0022, 'grad_norm': 1.700869910236032, 'learning_rate': 8.016e-07, 'completion_length': 
72.02679061889648, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.0555419921875, 'epoch': 0.2} 20%|█▉ | 496/2500 [4:11:09<15:08:37, 27.20s/it] 20%|█▉ | 497/2500 [4:11:37<15:20:00, 27.56s/it] {'loss': 0.0032, 'grad_norm': 1.3982435403684437, 'learning_rate': 8.012e-07, 'completion_length': 63.83928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.0810546875, 'epoch': 0.2} 20%|█▉ | 497/2500 [4:11:37<15:20:00, 27.56s/it] 20%|█▉ | 498/2500 [4:12:05<15:20:07, 27.58s/it] {'loss': 0.003, 'grad_norm': 0.9966504348526226, 'learning_rate': 8.007999999999999e-07, 'completion_length': 72.68750381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.0758056640625, 'epoch': 0.2} 20%|█▉ | 498/2500 [4:12:05<15:20:07, 27.58s/it] 20%|█▉ | 499/2500 [4:12:30<15:02:20, 27.06s/it] {'loss': 0.0044, 'grad_norm': 0.137798773001816, 'learning_rate': 8.004e-07, 'completion_length': 63.642860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.110107421875, 'epoch': 0.2} 20%|█▉ | 499/2500 [4:12:30<15:02:20, 27.06s/it] 20%|██ | 500/2500 [4:13:05<16:15:09, 29.25s/it] {'loss': 0.0033, 'grad_norm': 4.342121752608402, 'learning_rate': 8e-07, 'completion_length': 69.39286041259766, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07124518603086472, 'kl': 0.083740234375, 'epoch': 0.2} 20%|██ | 500/2500 [4:13:05<16:15:09, 29.25s/it] 20%|██ | 501/2500 [4:14:43<27:44:04, 49.95s/it] {'loss': 0.0034, 'grad_norm': 0.16312256785313006, 'learning_rate': 7.995999999999999e-07, 'completion_length': 64.65178680419922, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.085205078125, 'epoch': 0.2} 20%|██ | 501/2500 [4:14:43<27:44:04, 49.95s/it] 20%|██ | 502/2500 [4:15:05<23:04:08, 41.57s/it] {'loss': 0.0035, 'grad_norm': 3.6833331070605584, 'learning_rate': 7.992e-07, 'completion_length': 66.67857360839844, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.10882645472884178, 'kl': 0.088623046875, 'epoch': 0.2} 20%|██ | 502/2500 [4:15:05<23:04:08, 41.57s/it] 20%|██ | 503/2500 [4:15:58<25:01:09, 45.10s/it] {'loss': 0.0037, 'grad_norm': 0.3296179795176758, 'learning_rate': 7.987999999999999e-07, 'completion_length': 66.06250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.09375, 'epoch': 0.2} 20%|██ | 503/2500 [4:15:58<25:01:09, 45.10s/it] 20%|██ | 504/2500 [4:16:34<23:25:39, 42.25s/it] {'loss': 0.0033, 'grad_norm': 0.16884928491302448, 'learning_rate': 7.984e-07, 'completion_length': 62.87500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.083740234375, 'epoch': 0.2} 20%|██ | 504/2500 [4:16:34<23:25:39, 42.25s/it] 20%|██ | 505/2500 [4:17:03<21:11:39, 38.25s/it] {'loss': 0.0028, 'grad_norm': 0.8486601537194525, 'learning_rate': 7.98e-07, 'completion_length': 66.20536041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.070556640625, 'epoch': 0.2} 20%|██ | 505/2500 [4:17:03<21:11:39, 38.25s/it] 20%|██ | 506/2500 [4:17:41<21:11:17, 38.25s/it] {'loss': 0.0047, 'grad_norm': 3.8242765483666323, 'learning_rate': 7.975999999999999e-07, 'completion_length': 62.714290618896484, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 
0.025253813713788986, 'kl': 0.1162109375, 'epoch': 0.2} 20%|██ | 506/2500 [4:17:41<21:11:17, 38.25s/it] 20%|██ | 507/2500 [4:18:16<20:37:18, 37.25s/it] {'loss': 0.003, 'grad_norm': 0.14490900695109216, 'learning_rate': 7.972e-07, 'completion_length': 69.0535774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.074951171875, 'epoch': 0.2} 20%|██ | 507/2500 [4:18:16<20:37:18, 37.25s/it] 20%|██ | 508/2500 [4:18:44<19:03:36, 34.45s/it] {'loss': 0.0038, 'grad_norm': 0.15593133618278626, 'learning_rate': 7.967999999999999e-07, 'completion_length': 63.92857551574707, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.094970703125, 'epoch': 0.2} 20%|██ | 508/2500 [4:18:44<19:03:36, 34.45s/it] 20%|██ | 509/2500 [4:19:11<17:44:10, 32.07s/it] {'loss': 0.0037, 'grad_norm': 1.344533361480379, 'learning_rate': 7.964e-07, 'completion_length': 60.67857551574707, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.091552734375, 'epoch': 0.2} 20%|██ | 509/2500 [4:19:11<17:44:10, 32.07s/it] 20%|██ | 510/2500 [4:20:42<27:31:00, 49.78s/it] {'loss': 0.0043, 'grad_norm': 2.48516260470734, 'learning_rate': 7.96e-07, 'completion_length': 67.69643020629883, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.08868780359625816, 'kl': 0.10693359375, 'epoch': 0.2} 20%|██ | 510/2500 [4:20:42<27:31:00, 49.78s/it] 20%|██ | 511/2500 [4:21:07<23:25:27, 42.40s/it] {'loss': 0.0032, 'grad_norm': 5.5378596365800465, 'learning_rate': 7.956e-07, 'completion_length': 65.91964721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.079833984375, 'epoch': 0.2} 20%|██ | 511/2500 [4:21:07<23:25:27, 
42.40s/it] 20%|██ | 512/2500 [4:21:35<21:00:14, 38.04s/it] {'loss': 0.0027, 'grad_norm': 2.92403639596813, 'learning_rate': 7.952e-07, 'completion_length': 65.18750190734863, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.068603515625, 'epoch': 0.2} 20%|██ | 512/2500 [4:21:35<21:00:14, 38.04s/it] 21%|██ | 513/2500 [4:22:49<27:04:17, 49.05s/it] {'loss': 0.0033, 'grad_norm': 0.11179488371227797, 'learning_rate': 7.947999999999999e-07, 'completion_length': 64.75000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.083740234375, 'epoch': 0.21} 21%|██ | 513/2500 [4:22:49<27:04:17, 49.05s/it] 21%|██ | 514/2500 [4:24:12<32:36:01, 59.09s/it] {'loss': 0.0048, 'grad_norm': 1.9746965524635134, 'learning_rate': 7.944e-07, 'completion_length': 57.84821701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.121337890625, 'epoch': 0.21} 21%|██ | 514/2500 [4:24:12<32:36:01, 59.09s/it] 21%|██ | 515/2500 [4:25:04<31:26:04, 57.01s/it] {'loss': 0.0036, 'grad_norm': 2.7274946898568158, 'learning_rate': 7.94e-07, 'completion_length': 68.60714721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.0908203125, 'epoch': 0.21} 21%|██ | 515/2500 [4:25:04<31:26:04, 57.01s/it] 21%|██ | 516/2500 [4:25:31<26:26:59, 47.99s/it] {'loss': 0.0034, 'grad_norm': 95.82005134080728, 'learning_rate': 7.935999999999999e-07, 'completion_length': 65.40178680419922, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9017857313156128, 'reward_std': 0.03696779906749725, 'kl': 0.084228515625, 'epoch': 0.21} 21%|██ | 516/2500 [4:25:31<26:26:59, 47.99s/it] 21%|██ | 517/2500 [4:25:57<22:52:23, 
41.52s/it] {'loss': 0.0037, 'grad_norm': 0.8391642806982759, 'learning_rate': 7.932e-07, 'completion_length': 56.58035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.093017578125, 'epoch': 0.21} 21%|██ | 517/2500 [4:25:57<22:52:23, 41.52s/it] 21%|██ | 518/2500 [4:26:21<19:57:54, 36.26s/it] {'loss': 0.0036, 'grad_norm': 1.2518991188064308, 'learning_rate': 7.928e-07, 'completion_length': 68.94643020629883, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.090087890625, 'epoch': 0.21} 21%|██ | 518/2500 [4:26:21<19:57:54, 36.26s/it] 21%|██ | 519/2500 [4:26:48<18:18:40, 33.28s/it] {'loss': 0.0057, 'grad_norm': 0.525625957113775, 'learning_rate': 7.923999999999999e-07, 'completion_length': 67.49107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14306640625, 'epoch': 0.21} 21%|██ | 519/2500 [4:26:48<18:18:40, 33.28s/it] 21%|██ | 520/2500 [4:27:13<16:56:04, 30.79s/it] {'loss': 0.0044, 'grad_norm': 0.16047024435953727, 'learning_rate': 7.92e-07, 'completion_length': 63.535715103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.109619140625, 'epoch': 0.21} 21%|██ | 520/2500 [4:27:13<16:56:04, 30.79s/it] 21%|██ | 521/2500 [4:27:40<16:23:04, 29.80s/it] {'loss': 0.0054, 'grad_norm': 3.099149263028631, 'learning_rate': 7.916e-07, 'completion_length': 57.41964530944824, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.883928656578064, 'reward_std': 0.08747542649507523, 'kl': 0.13427734375, 'epoch': 0.21} 21%|██ | 521/2500 [4:27:40<16:23:04, 29.80s/it] 21%|██ | 522/2500 [4:28:05<15:34:45, 28.35s/it] {'loss': 0.0038, 'grad_norm': 
0.14134937363103026, 'learning_rate': 7.911999999999999e-07, 'completion_length': 68.3035774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.095703125, 'epoch': 0.21} 21%|██ | 522/2500 [4:28:05<15:34:45, 28.35s/it] 21%|██ | 523/2500 [4:28:29<14:51:33, 27.06s/it] {'loss': 0.0045, 'grad_norm': 1.4082220788479936, 'learning_rate': 7.907999999999999e-07, 'completion_length': 54.78571701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.11279296875, 'epoch': 0.21} 21%|██ | 523/2500 [4:28:29<14:51:33, 27.06s/it] 21%|██ | 524/2500 [4:28:57<15:01:20, 27.37s/it] {'loss': 0.0045, 'grad_norm': 0.4749219933479059, 'learning_rate': 7.904e-07, 'completion_length': 60.77678871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.112060546875, 'epoch': 0.21} 21%|██ | 524/2500 [4:28:57<15:01:20, 27.37s/it] 21%|██ | 525/2500 [4:30:17<23:39:01, 43.11s/it] {'loss': 0.0065, 'grad_norm': 0.21872482023714798, 'learning_rate': 7.9e-07, 'completion_length': 57.31250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.16259765625, 'epoch': 0.21} 21%|██ | 525/2500 [4:30:17<23:39:01, 43.11s/it] 21%|██ | 526/2500 [4:30:41<20:29:05, 37.36s/it] {'loss': 0.0053, 'grad_norm': 0.16263405281795412, 'learning_rate': 7.895999999999999e-07, 'completion_length': 54.41071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.133056640625, 'epoch': 0.21} 21%|██ | 526/2500 [4:30:41<20:29:05, 37.36s/it] 21%|██ | 527/2500 [4:31:07<18:39:10, 34.03s/it] {'loss': 0.004, 'grad_norm': 1.1291386536549746, 'learning_rate': 7.892e-07, 'completion_length': 65.92857551574707, 'rewards/accuracy_reward': 
0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.05050762742757797, 'kl': 0.099853515625, 'epoch': 0.21} 21%|██ | 527/2500 [4:31:07<18:39:10, 34.03s/it] 21%|██ | 528/2500 [4:31:34<17:27:55, 31.88s/it] {'loss': 0.0048, 'grad_norm': 1.6236647345088053, 'learning_rate': 7.887999999999999e-07, 'completion_length': 51.660715103149414, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.05831881985068321, 'kl': 0.118896484375, 'epoch': 0.21} 21%|██ | 528/2500 [4:31:34<17:27:55, 31.88s/it] 21%|██ | 529/2500 [4:32:00<16:25:50, 30.01s/it] {'loss': 0.0048, 'grad_norm': 2.3166896694161547, 'learning_rate': 7.883999999999999e-07, 'completion_length': 57.955360412597656, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.06222161278128624, 'kl': 0.1201171875, 'epoch': 0.21} 21%|██ | 529/2500 [4:32:00<16:25:50, 30.01s/it] 21%|██ | 530/2500 [4:32:27<15:53:16, 29.03s/it] {'loss': 0.0063, 'grad_norm': 0.7592782443821764, 'learning_rate': 7.88e-07, 'completion_length': 64.89286041259766, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.15869140625, 'epoch': 0.21} 21%|██ | 530/2500 [4:32:27<15:53:16, 29.03s/it] 21%|██ | 531/2500 [4:33:27<21:02:45, 38.48s/it] {'loss': 0.0047, 'grad_norm': 0.10981941497370051, 'learning_rate': 7.875999999999999e-07, 'completion_length': 62.392860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1171875, 'epoch': 0.21} 21%|██ | 531/2500 [4:33:27<21:02:45, 38.48s/it] 21%|██▏ | 532/2500 [4:34:06<21:07:10, 38.63s/it] {'loss': 0.0059, 'grad_norm': 0.7290466358915926, 'learning_rate': 7.872e-07, 'completion_length': 57.55357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.146240234375, 'epoch': 0.21} 21%|██▏ | 532/2500 [4:34:06<21:07:10, 38.63s/it] 21%|██▏ | 533/2500 [4:34:30<18:42:27, 34.24s/it] {'loss': 0.0051, 'grad_norm': 0.12321088050913767, 'learning_rate': 7.868e-07, 'completion_length': 62.50893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12841796875, 'epoch': 0.21} 21%|██▏ | 533/2500 [4:34:30<18:42:27, 34.24s/it] 21%|██▏ | 534/2500 [4:34:55<17:14:00, 31.56s/it] {'loss': 0.0042, 'grad_norm': 2.0179134740948963, 'learning_rate': 7.864e-07, 'completion_length': 68.26786041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.106201171875, 'epoch': 0.21} 21%|██▏ | 534/2500 [4:34:56<17:14:00, 31.56s/it] 21%|██▏ | 535/2500 [4:35:22<16:27:12, 30.14s/it] {'loss': 0.0046, 'grad_norm': 0.6938877841765441, 'learning_rate': 7.86e-07, 'completion_length': 68.60714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.115234375, 'epoch': 0.21} 21%|██▏ | 535/2500 [4:35:22<16:27:12, 30.14s/it] 21%|██▏ | 536/2500 [4:35:48<15:44:23, 28.85s/it] {'loss': 0.0053, 'grad_norm': 0.7905444997899138, 'learning_rate': 7.855999999999999e-07, 'completion_length': 58.57143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1318359375, 'epoch': 0.21} 21%|██▏ | 536/2500 [4:35:48<15:44:23, 28.85s/it] 21%|██▏ | 537/2500 [4:36:53<21:39:34, 39.72s/it] {'loss': 0.0056, 'grad_norm': 1.288359660001425, 'learning_rate': 7.852e-07, 'completion_length': 61.687503814697266, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 
1.9107143878936768, 'reward_std': 0.05050762742757797, 'kl': 0.1396484375, 'epoch': 0.21} 21%|██▏ | 537/2500 [4:36:53<21:39:34, 39.72s/it] 22%|██▏ | 538/2500 [4:37:20<19:35:41, 35.95s/it] {'loss': 0.005, 'grad_norm': 0.5192213799160317, 'learning_rate': 7.848e-07, 'completion_length': 71.67857360839844, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.05050762742757797, 'kl': 0.124267578125, 'epoch': 0.22} 22%|██▏ | 538/2500 [4:37:20<19:35:41, 35.95s/it] 22%|██▏ | 539/2500 [4:37:46<17:52:00, 32.80s/it] {'loss': 0.0047, 'grad_norm': 0.16278961435877098, 'learning_rate': 7.844e-07, 'completion_length': 57.13393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1181640625, 'epoch': 0.22} 22%|██▏ | 539/2500 [4:37:46<17:52:00, 32.80s/it] 22%|██▏ | 540/2500 [4:38:11<16:33:42, 30.42s/it] {'loss': 0.0058, 'grad_norm': 1.78368861877792, 'learning_rate': 7.84e-07, 'completion_length': 59.517860412597656, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.14404296875, 'epoch': 0.22} 22%|██▏ | 540/2500 [4:38:11<16:33:42, 30.42s/it] 22%|██▏ | 541/2500 [4:38:38<16:05:36, 29.57s/it] {'loss': 0.0046, 'grad_norm': 2.7769973213470247, 'learning_rate': 7.835999999999999e-07, 'completion_length': 66.30357360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.09138382598757744, 'kl': 0.115478515625, 'epoch': 0.22} 22%|██▏ | 541/2500 [4:38:38<16:05:36, 29.57s/it] 22%|██▏ | 542/2500 [4:39:02<15:08:17, 27.83s/it] {'loss': 0.0059, 'grad_norm': 1.659986592438335, 'learning_rate': 7.832e-07, 'completion_length': 61.705360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 
0.025253813713788986, 'kl': 0.14794921875, 'epoch': 0.22} 22%|██▏ | 542/2500 [4:39:02<15:08:17, 27.83s/it] 22%|██▏ | 543/2500 [4:39:27<14:35:12, 26.83s/it] {'loss': 0.0057, 'grad_norm': 0.17014045075092363, 'learning_rate': 7.828e-07, 'completion_length': 57.82143211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.142333984375, 'epoch': 0.22} 22%|██▏ | 543/2500 [4:39:27<14:35:12, 26.83s/it] 22%|██▏ | 544/2500 [4:39:52<14:21:47, 26.44s/it] {'loss': 0.0053, 'grad_norm': 0.14404576754262152, 'learning_rate': 7.823999999999999e-07, 'completion_length': 62.633934020996094, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13330078125, 'epoch': 0.22} 22%|██▏ | 544/2500 [4:39:52<14:21:47, 26.44s/it] 22%|██▏ | 545/2500 [4:40:17<14:04:42, 25.92s/it] {'loss': 0.0056, 'grad_norm': 0.8606780781494646, 'learning_rate': 7.82e-07, 'completion_length': 69.04464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.140625, 'epoch': 0.22} 22%|██▏ | 545/2500 [4:40:17<14:04:42, 25.92s/it] 22%|██▏ | 546/2500 [4:40:42<13:55:29, 25.65s/it] {'loss': 0.0037, 'grad_norm': 0.16981543455419454, 'learning_rate': 7.816e-07, 'completion_length': 58.23214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09228515625, 'epoch': 0.22} 22%|██▏ | 546/2500 [4:40:42<13:55:29, 25.65s/it] 22%|██▏ | 547/2500 [4:41:06<13:44:27, 25.33s/it] {'loss': 0.004, 'grad_norm': 1.7835419050208807, 'learning_rate': 7.811999999999999e-07, 'completion_length': 73.20536041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.06613001227378845, 'kl': 0.10009765625, 'epoch': 0.22} 22%|██▏ | 547/2500 [4:41:06<13:44:27, 25.33s/it] 22%|██▏ | 548/2500 [4:41:33<13:52:21, 
25.59s/it] {'loss': 0.0044, 'grad_norm': 2.457572924195223, 'learning_rate': 7.808e-07, 'completion_length': 74.41964340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1103515625, 'epoch': 0.22} 22%|██▏ | 548/2500 [4:41:33<13:52:21, 25.59s/it] 22%|██▏ | 549/2500 [4:42:02<14:32:00, 26.82s/it] {'loss': 0.0043, 'grad_norm': 0.8305603953894307, 'learning_rate': 7.804e-07, 'completion_length': 77.14286041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.106689453125, 'epoch': 0.22} 22%|██▏ | 549/2500 [4:42:02<14:32:00, 26.82s/it] 22%|██▏ | 550/2500 [4:42:27<14:11:59, 26.21s/it] {'loss': 0.0031, 'grad_norm': 1.909729885668658, 'learning_rate': 7.799999999999999e-07, 'completion_length': 72.33036041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.0780029296875, 'epoch': 0.22} 22%|██▏ | 550/2500 [4:42:27<14:11:59, 26.21s/it] 22%|██▏ | 551/2500 [4:42:52<13:58:31, 25.81s/it] {'loss': 0.0039, 'grad_norm': 1.269432665634723, 'learning_rate': 7.795999999999999e-07, 'completion_length': 82.21429061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.096435546875, 'epoch': 0.22} 22%|██▏ | 551/2500 [4:42:52<13:58:31, 25.81s/it] 22%|██▏ | 552/2500 [4:43:23<14:45:49, 27.28s/it] {'loss': 0.0046, 'grad_norm': 0.33830136490816265, 'learning_rate': 7.792e-07, 'completion_length': 81.0089340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.114013671875, 'epoch': 0.22} 22%|██▏ | 552/2500 [4:43:23<14:45:49, 27.28s/it] 22%|██▏ | 553/2500 
[4:43:49<14:36:50, 27.02s/it] {'loss': 0.0044, 'grad_norm': 9.55224032370448, 'learning_rate': 7.788000000000001e-07, 'completion_length': 71.83036041259766, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857313156128, 'reward_std': 0.05831882357597351, 'kl': 0.111083984375, 'epoch': 0.22} 22%|██▏ | 553/2500 [4:43:49<14:36:50, 27.02s/it] 22%|██▏ | 554/2500 [4:44:15<14:27:26, 26.75s/it] {'loss': 0.0038, 'grad_norm': 0.8653842174161179, 'learning_rate': 7.783999999999999e-07, 'completion_length': 78.19643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09521484375, 'epoch': 0.22} 22%|██▏ | 554/2500 [4:44:15<14:27:26, 26.75s/it] 22%|██▏ | 555/2500 [4:44:41<14:19:57, 26.53s/it] {'loss': 0.0058, 'grad_norm': 1.9844115601993417, 'learning_rate': 7.78e-07, 'completion_length': 67.01786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.14599609375, 'epoch': 0.22} 22%|██▏ | 555/2500 [4:44:41<14:19:57, 26.53s/it] 22%|██▏ | 556/2500 [4:45:22<16:38:10, 30.81s/it] {'loss': 0.0042, 'grad_norm': 2.362728685620734, 'learning_rate': 7.776e-07, 'completion_length': 73.10714721679688, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.13858196884393692, 'kl': 0.106201171875, 'epoch': 0.22} 22%|██▏ | 556/2500 [4:45:22<16:38:10, 30.81s/it] 22%|██▏ | 557/2500 [4:45:49<15:57:28, 29.57s/it] {'loss': 0.0036, 'grad_norm': 1.616118811711699, 'learning_rate': 7.771999999999999e-07, 'completion_length': 76.27678680419922, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375001192092896, 'reward_std': 0.11394161731004715, 'kl': 0.0908203125, 'epoch': 0.22} 22%|██▏ | 557/2500 
[4:45:49<15:57:28, 29.57s/it] 22%|██▏ | 558/2500 [4:46:14<15:16:25, 28.31s/it] {'loss': 0.004, 'grad_norm': 1.3400085464489193, 'learning_rate': 7.768e-07, 'completion_length': 72.83036041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.100341796875, 'epoch': 0.22} 22%|██▏ | 558/2500 [4:46:14<15:16:25, 28.31s/it] 22%|██▏ | 559/2500 [4:46:39<14:41:01, 27.23s/it] {'loss': 0.004, 'grad_norm': 1.8382979410210327, 'learning_rate': 7.764e-07, 'completion_length': 68.25893020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.099853515625, 'epoch': 0.22} 22%|██▏ | 559/2500 [4:46:39<14:41:01, 27.23s/it] 22%|██▏ | 560/2500 [4:47:06<14:43:01, 27.31s/it] {'loss': 0.0042, 'grad_norm': 2.951291477653612, 'learning_rate': 7.76e-07, 'completion_length': 71.67857360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.8928572535514832, 'reward_std': 0.13408026099205017, 'kl': 0.10498046875, 'epoch': 0.22} 22%|██▏ | 560/2500 [4:47:06<14:43:01, 27.31s/it] 22%|██▏ | 561/2500 [4:47:30<14:10:03, 26.30s/it] {'loss': 0.004, 'grad_norm': 0.18927844850314804, 'learning_rate': 7.755999999999999e-07, 'completion_length': 65.15179061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.099853515625, 'epoch': 0.22} 22%|██▏ | 561/2500 [4:47:30<14:10:03, 26.30s/it] 22%|██▏ | 562/2500 [4:47:56<14:00:03, 26.01s/it] {'loss': 0.0042, 'grad_norm': 2.988259260142577, 'learning_rate': 7.752e-07, 'completion_length': 77.9375, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.1379830539226532, 'kl': 0.1044921875, 'epoch': 0.22} 22%|██▏ | 562/2500 [4:47:56<14:00:03, 26.01s/it] 23%|██▎ | 
563/2500 [4:48:21<13:55:50, 25.89s/it] {'loss': 0.0051, 'grad_norm': 2.7587751929548694, 'learning_rate': 7.748e-07, 'completion_length': 62.01785850524902, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.128173828125, 'epoch': 0.23} 23%|██▎ | 563/2500 [4:48:21<13:55:50, 25.89s/it] 23%|██▎ | 564/2500 [4:48:47<13:56:07, 25.91s/it] {'loss': 0.0041, 'grad_norm': 0.12696266396078282, 'learning_rate': 7.743999999999999e-07, 'completion_length': 67.2410774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.102294921875, 'epoch': 0.23} 23%|██▎ | 564/2500 [4:48:47<13:56:07, 25.91s/it] 23%|██▎ | 565/2500 [4:49:15<14:10:38, 26.38s/it] {'loss': 0.0044, 'grad_norm': 0.4216449050062815, 'learning_rate': 7.74e-07, 'completion_length': 62.44643211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.109619140625, 'epoch': 0.23} 23%|██▎ | 565/2500 [4:49:15<14:10:38, 26.38s/it] 23%|██▎ | 566/2500 [4:49:41<14:09:46, 26.36s/it] {'loss': 0.0051, 'grad_norm': 1.6557980899817035, 'learning_rate': 7.735999999999999e-07, 'completion_length': 61.017860412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.12841796875, 'epoch': 0.23} 23%|██▎ | 566/2500 [4:49:41<14:09:46, 26.36s/it] 23%|██▎ | 567/2500 [4:50:07<14:07:25, 26.30s/it] {'loss': 0.0043, 'grad_norm': 2.052184342517326, 'learning_rate': 7.732e-07, 'completion_length': 61.09821701049805, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.06222161650657654, 'kl': 0.107177734375, 'epoch': 0.23} 23%|██▎ | 567/2500 [4:50:07<14:07:25, 26.30s/it] 23%|██▎ | 568/2500 
[4:50:29<13:29:07, 25.13s/it] {'loss': 0.0044, 'grad_norm': 0.8325381782642977, 'learning_rate': 7.728e-07, 'completion_length': 58.607147216796875, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1103515625, 'epoch': 0.23} 23%|██▎ | 568/2500 [4:50:30<13:29:07, 25.13s/it] 23%|██▎ | 569/2500 [4:50:54<13:27:29, 25.09s/it] {'loss': 0.0058, 'grad_norm': 3.380237288863942, 'learning_rate': 7.723999999999999e-07, 'completion_length': 57.09821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.145263671875, 'epoch': 0.23} 23%|██▎ | 569/2500 [4:50:54<13:27:29, 25.09s/it] 23%|██▎ | 570/2500 [4:51:20<13:31:09, 25.22s/it] {'loss': 0.0053, 'grad_norm': 2.2809720353480256, 'learning_rate': 7.72e-07, 'completion_length': 57.07143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1337890625, 'epoch': 0.23} 23%|██▎ | 570/2500 [4:51:20<13:31:09, 25.22s/it] 23%|██▎ | 571/2500 [4:51:55<15:05:54, 28.18s/it] {'loss': 0.0046, 'grad_norm': 2.3531481305054958, 'learning_rate': 7.716e-07, 'completion_length': 63.09821891784668, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.05050762742757797, 'kl': 0.115234375, 'epoch': 0.23} 23%|██▎ | 571/2500 [4:51:55<15:05:54, 28.18s/it] 23%|██▎ | 572/2500 [4:52:20<14:29:56, 27.07s/it] {'loss': 0.0041, 'grad_norm': 1.6866796087406901, 'learning_rate': 7.711999999999999e-07, 'completion_length': 57.294647216796875, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.103759765625, 'epoch': 0.23} 23%|██▎ | 572/2500 [4:52:20<14:29:56, 27.07s/it] 23%|██▎ | 573/2500 
[4:52:47<14:34:09, 27.22s/it] {'loss': 0.0043, 'grad_norm': 3.359131110395958, 'learning_rate': 7.708e-07, 'completion_length': 59.80357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.107421875, 'epoch': 0.23} 23%|██▎ | 573/2500 [4:52:47<14:34:09, 27.22s/it] 23%|██▎ | 574/2500 [4:53:12<14:08:52, 26.44s/it] {'loss': 0.005, 'grad_norm': 0.15302046957676566, 'learning_rate': 7.704e-07, 'completion_length': 51.87500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12451171875, 'epoch': 0.23} 23%|██▎ | 574/2500 [4:53:12<14:08:52, 26.44s/it] 23%|██▎ | 575/2500 [4:53:40<14:28:27, 27.07s/it] {'loss': 0.0052, 'grad_norm': 0.46924758202999817, 'learning_rate': 7.699999999999999e-07, 'completion_length': 56.63393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.130859375, 'epoch': 0.23} 23%|██▎ | 575/2500 [4:53:40<14:28:27, 27.07s/it] 23%|██▎ | 576/2500 [4:54:04<14:00:07, 26.20s/it] {'loss': 0.0068, 'grad_norm': 0.2179166929143585, 'learning_rate': 7.695999999999999e-07, 'completion_length': 53.660715103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.17041015625, 'epoch': 0.23} 23%|██▎ | 576/2500 [4:54:04<14:00:07, 26.20s/it] 23%|██▎ | 577/2500 [4:54:30<13:56:12, 26.09s/it] {'loss': 0.0045, 'grad_norm': 0.17818867885645803, 'learning_rate': 7.692e-07, 'completion_length': 54.23214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.111572265625, 'epoch': 0.23} 23%|██▎ | 577/2500 [4:54:30<13:56:12, 26.09s/it] 23%|██▎ | 578/2500 [4:54:56<13:53:47, 26.03s/it] {'loss': 0.0059, 'grad_norm': 0.7655429385581387, 'learning_rate': 7.688000000000001e-07, 
'completion_length': 52.40178680419922, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.147216796875, 'epoch': 0.23} 23%|██▎ | 578/2500 [4:54:56<13:53:47, 26.03s/it] 23%|██▎ | 579/2500 [4:55:20<13:36:11, 25.49s/it] {'loss': 0.0055, 'grad_norm': 0.9667846367551798, 'learning_rate': 7.683999999999999e-07, 'completion_length': 50.89285850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13671875, 'epoch': 0.23} 23%|██▎ | 579/2500 [4:55:20<13:36:11, 25.49s/it] 23%|██▎ | 580/2500 [4:55:47<13:48:17, 25.88s/it] {'loss': 0.0078, 'grad_norm': 1.694994191987051, 'learning_rate': 7.68e-07, 'completion_length': 41.18750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1943359375, 'epoch': 0.23} 23%|██▎ | 580/2500 [4:55:47<13:48:17, 25.88s/it] 23%|██▎ | 581/2500 [4:56:11<13:26:58, 25.23s/it] {'loss': 0.0061, 'grad_norm': 3.9719061340877255, 'learning_rate': 7.676e-07, 'completion_length': 49.535715103149414, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15234375, 'epoch': 0.23} 23%|██▎ | 581/2500 [4:56:11<13:26:58, 25.23s/it] 23%|██▎ | 582/2500 [4:56:36<13:27:57, 25.27s/it] {'loss': 0.0049, 'grad_norm': 0.2507482288501035, 'learning_rate': 7.671999999999999e-07, 'completion_length': 53.78571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12353515625, 'epoch': 0.23} 23%|██▎ | 582/2500 [4:56:36<13:27:57, 25.27s/it] 23%|██▎ | 583/2500 [4:57:02<13:28:49, 25.32s/it] {'loss': 0.0046, 'grad_norm': 3.8318670442911293, 'learning_rate': 7.668e-07, 'completion_length': 47.21428871154785, 
'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0835726372897625, 'kl': 0.114990234375, 'epoch': 0.23} 23%|██▎ | 583/2500 [4:57:02<13:28:49, 25.32s/it] 23%|██▎ | 584/2500 [4:57:25<13:05:17, 24.59s/it] {'loss': 0.0063, 'grad_norm': 4.541335439396905, 'learning_rate': 7.664e-07, 'completion_length': 48.39285850524902, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.06613001227378845, 'kl': 0.158203125, 'epoch': 0.23} 23%|██▎ | 584/2500 [4:57:25<13:05:17, 24.59s/it] 23%|██▎ | 585/2500 [4:57:51<13:17:33, 24.99s/it] {'loss': 0.0058, 'grad_norm': 9.280877919951552, 'learning_rate': 7.66e-07, 'completion_length': 52.62500190734863, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07124518603086472, 'kl': 0.144287109375, 'epoch': 0.23} 23%|██▎ | 585/2500 [4:57:51<13:17:33, 24.99s/it] 23%|██▎ | 586/2500 [4:58:16<13:23:12, 25.18s/it] {'loss': 0.006, 'grad_norm': 0.22528780616234373, 'learning_rate': 7.655999999999999e-07, 'completion_length': 53.35714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1494140625, 'epoch': 0.23} 23%|██▎ | 586/2500 [4:58:16<13:23:12, 25.18s/it] 23%|██▎ | 587/2500 [4:58:42<13:28:44, 25.37s/it] {'loss': 0.0044, 'grad_norm': 2.7235781905690204, 'learning_rate': 7.652e-07, 'completion_length': 55.910715103149414, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.109130859375, 'epoch': 0.23} 23%|██▎ | 587/2500 [4:58:42<13:28:44, 25.37s/it] 24%|██▎ | 588/2500 [4:59:07<13:29:03, 25.39s/it] {'loss': 0.0057, 'grad_norm': 3.392014285176546, 'learning_rate': 7.648e-07, 'completion_length': 62.99107360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.14306640625, 'epoch': 0.24} 24%|██▎ | 588/2500 [4:59:07<13:29:03, 25.39s/it] 24%|██▎ | 589/2500 [4:59:30<13:06:22, 24.69s/it] {'loss': 0.0047, 'grad_norm': 0.2632634247840357, 'learning_rate': 7.643999999999999e-07, 'completion_length': 56.60714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.116455078125, 'epoch': 0.24} 24%|██▎ | 589/2500 [4:59:30<13:06:22, 24.69s/it] 24%|██▎ | 590/2500 [4:59:59<13:41:31, 25.81s/it] {'loss': 0.0046, 'grad_norm': 0.527852433760217, 'learning_rate': 7.64e-07, 'completion_length': 62.83928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.115478515625, 'epoch': 0.24} 24%|██▎ | 590/2500 [4:59:59<13:41:31, 25.81s/it] 24%|██▎ | 591/2500 [5:00:23<13:26:44, 25.36s/it] {'loss': 0.0046, 'grad_norm': 1.9161505584548035, 'learning_rate': 7.635999999999999e-07, 'completion_length': 61.55357551574707, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.11474609375, 'epoch': 0.24} 24%|██▎ | 591/2500 [5:00:23<13:26:44, 25.36s/it] 24%|██▎ | 592/2500 [5:00:49<13:29:13, 25.45s/it] {'loss': 0.0047, 'grad_norm': 3.3442194472950724, 'learning_rate': 7.632e-07, 'completion_length': 62.67857360839844, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9017858505249023, 'reward_std': 0.1418914571404457, 'kl': 0.1181640625, 'epoch': 0.24} 24%|██▎ | 592/2500 [5:00:49<13:29:13, 25.45s/it] 24%|██▎ | 593/2500 [5:01:13<13:19:37, 25.16s/it] {'loss': 0.0049, 'grad_norm': 0.2958624215974155, 'learning_rate': 7.628e-07, 'completion_length': 60.42857551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.122314453125, 'epoch': 0.24} 24%|██▎ | 593/2500 [5:01:13<13:19:37, 25.16s/it] 24%|██▍ | 594/2500 [5:01:58<16:24:13, 30.98s/it] {'loss': 0.0038, 'grad_norm': 1.0149679400214195, 'learning_rate': 7.623999999999999e-07, 'completion_length': 68.31250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.094482421875, 'epoch': 0.24} 24%|██▍ | 594/2500 [5:01:58<16:24:13, 30.98s/it] 24%|██▍ | 595/2500 [5:02:50<19:48:47, 37.44s/it] {'loss': 0.0052, 'grad_norm': 2.560559056660854, 'learning_rate': 7.62e-07, 'completion_length': 66.30357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.12890625, 'epoch': 0.24} 24%|██▍ | 595/2500 [5:02:50<19:48:47, 37.44s/it] 24%|██▍ | 596/2500 [5:03:15<17:48:07, 33.66s/it] {'loss': 0.0037, 'grad_norm': 2.565767660721495, 'learning_rate': 7.616e-07, 'completion_length': 63.45535850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.09326171875, 'epoch': 0.24} 24%|██▍ | 596/2500 [5:03:15<17:48:07, 33.66s/it] 24%|██▍ | 597/2500 [5:03:49<17:46:43, 33.63s/it] {'loss': 0.0037, 'grad_norm': 0.47101859138341157, 'learning_rate': 7.611999999999999e-07, 'completion_length': 70.80357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09375, 'epoch': 0.24} 24%|██▍ | 597/2500 [5:03:49<17:46:43, 33.63s/it] 24%|██▍ | 598/2500 [5:04:39<20:23:44, 38.60s/it] {'loss': 0.0044, 'grad_norm': 0.1547000633209554, 'learning_rate': 7.608e-07, 'completion_length': 69.2589340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.109130859375, 'epoch': 0.24} 24%|██▍ | 
598/2500 [5:04:39<20:23:44, 38.60s/it] 24%|██▍ | 599/2500 [5:05:22<21:08:19, 40.03s/it] {'loss': 0.0055, 'grad_norm': 0.3712925742623604, 'learning_rate': 7.604e-07, 'completion_length': 60.517860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.136962890625, 'epoch': 0.24} 24%|██▍ | 599/2500 [5:05:22<21:08:19, 40.03s/it] 24%|██▍ | 600/2500 [5:06:44<27:39:11, 52.40s/it] {'loss': 0.0046, 'grad_norm': 1.1265697041457758, 'learning_rate': 7.599999999999999e-07, 'completion_length': 61.84821701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.114501953125, 'epoch': 0.24} 24%|██▍ | 600/2500 [5:06:44<27:39:11, 52.40s/it] 24%|██▍ | 601/2500 [5:08:21<34:42:09, 65.79s/it] {'loss': 0.0041, 'grad_norm': 1.3594564940580809, 'learning_rate': 7.596e-07, 'completion_length': 69.04464721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.103515625, 'epoch': 0.24} 24%|██▍ | 601/2500 [5:08:21<34:42:09, 65.79s/it] 24%|██▍ | 602/2500 [5:08:42<27:37:58, 52.41s/it] {'loss': 0.0035, 'grad_norm': 3.7907638088287827, 'learning_rate': 7.592e-07, 'completion_length': 70.68750381469727, 'rewards/accuracy_reward': 0.8482142984867096, 'rewards/format_reward': 0.973214328289032, 'reward': 1.821428656578064, 'reward_std': 0.150909423828125, 'kl': 0.088134765625, 'epoch': 0.24} 24%|██▍ | 602/2500 [5:08:42<27:37:58, 52.41s/it] 24%|██▍ | 603/2500 [5:09:04<22:47:12, 43.24s/it] {'loss': 0.0071, 'grad_norm': 3.315263889278843, 'learning_rate': 7.588e-07, 'completion_length': 54.77678871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.17626953125, 'epoch': 0.24} 24%|██▍ | 603/2500 
[5:09:04<22:47:12, 43.24s/it] 24%|██▍ | 604/2500 [5:09:54<23:52:25, 45.33s/it] {'loss': 0.0042, 'grad_norm': 0.7423300036910584, 'learning_rate': 7.583999999999999e-07, 'completion_length': 59.65178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.10595703125, 'epoch': 0.24} 24%|██▍ | 604/2500 [5:09:54<23:52:25, 45.33s/it] 24%|██▍ | 605/2500 [5:10:20<20:45:01, 39.42s/it] {'loss': 0.0035, 'grad_norm': 1.4919278409001422, 'learning_rate': 7.58e-07, 'completion_length': 67.7410774230957, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857313156128, 'reward_std': 0.07576143741607666, 'kl': 0.087158203125, 'epoch': 0.24} 24%|██▍ | 605/2500 [5:10:20<20:45:01, 39.42s/it] 24%|██▍ | 606/2500 [5:10:46<18:41:55, 35.54s/it] {'loss': 0.0053, 'grad_norm': 2.990763541618017, 'learning_rate': 7.576000000000001e-07, 'completion_length': 59.82143020629883, 'rewards/accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.08747542649507523, 'kl': 0.1337890625, 'epoch': 0.24} 24%|██▍ | 606/2500 [5:10:46<18:41:55, 35.54s/it] 24%|██▍ | 607/2500 [5:11:11<17:03:39, 32.45s/it] {'loss': 0.005, 'grad_norm': 1.1004390457093745, 'learning_rate': 7.571999999999999e-07, 'completion_length': 53.73214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1259765625, 'epoch': 0.24} 24%|██▍ | 607/2500 [5:11:11<17:03:39, 32.45s/it] 24%|██▍ | 608/2500 [5:11:36<15:48:22, 30.08s/it] {'loss': 0.0043, 'grad_norm': 1.6536172315052673, 'learning_rate': 7.568e-07, 'completion_length': 58.07143211364746, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.06222161650657654, 'kl': 0.107666015625, 
'epoch': 0.24} 24%|██▍ | 608/2500 [5:11:36<15:48:22, 30.08s/it] 24%|██▍ | 609/2500 [5:12:01<15:01:00, 28.59s/it] {'loss': 0.004, 'grad_norm': 0.1783298857166465, 'learning_rate': 7.564e-07, 'completion_length': 57.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.100830078125, 'epoch': 0.24} 24%|██▍ | 609/2500 [5:12:01<15:01:00, 28.59s/it] 24%|██▍ | 610/2500 [5:12:29<14:56:34, 28.46s/it] {'loss': 0.0059, 'grad_norm': 0.13775926953153556, 'learning_rate': 7.559999999999999e-07, 'completion_length': 54.98214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14794921875, 'epoch': 0.24} 24%|██▍ | 610/2500 [5:12:29<14:56:34, 28.46s/it] 24%|██▍ | 611/2500 [5:12:53<14:08:33, 26.95s/it] {'loss': 0.0049, 'grad_norm': 0.1875298606132924, 'learning_rate': 7.556e-07, 'completion_length': 48.11607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.121826171875, 'epoch': 0.24} 24%|██▍ | 611/2500 [5:12:53<14:08:33, 26.95s/it] 24%|██▍ | 612/2500 [5:13:23<14:41:04, 28.00s/it] {'loss': 0.0046, 'grad_norm': 0.14793213552548454, 'learning_rate': 7.552e-07, 'completion_length': 50.41964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1142578125, 'epoch': 0.24} 24%|██▍ | 612/2500 [5:13:23<14:41:04, 28.00s/it] 25%|██▍ | 613/2500 [5:13:48<14:10:57, 27.06s/it] {'loss': 0.005, 'grad_norm': 1.2816644337388996, 'learning_rate': 7.548e-07, 'completion_length': 57.08928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1259765625, 'epoch': 0.25} 25%|██▍ | 613/2500 [5:13:48<14:10:57, 27.06s/it] 25%|██▍ | 614/2500 [5:14:36<17:28:29, 33.36s/it] {'loss': 0.0067, 'grad_norm': 0.28953130700803736, 'learning_rate': 
7.543999999999999e-07, 'completion_length': 54.34821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1669921875, 'epoch': 0.25} 25%|██▍ | 614/2500 [5:14:36<17:28:29, 33.36s/it] 25%|██▍ | 615/2500 [5:15:02<16:23:02, 31.29s/it] {'loss': 0.0054, 'grad_norm': 0.8063414942636685, 'learning_rate': 7.54e-07, 'completion_length': 57.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13427734375, 'epoch': 0.25} 25%|██▍ | 615/2500 [5:15:02<16:23:02, 31.29s/it] 25%|██▍ | 616/2500 [5:15:30<15:50:28, 30.27s/it] {'loss': 0.004, 'grad_norm': 4.663891676028994, 'learning_rate': 7.536e-07, 'completion_length': 57.54464530944824, 'rewards/accuracy_reward': 0.866071492433548, 'rewards/format_reward': 1.0, 'reward': 1.8660715222358704, 'reward_std': 0.10882644727826118, 'kl': 0.100830078125, 'epoch': 0.25} 25%|██▍ | 616/2500 [5:15:30<15:50:28, 30.27s/it] 25%|██▍ | 617/2500 [5:16:17<18:27:03, 35.28s/it] {'loss': 0.0067, 'grad_norm': 1.1850224782440202, 'learning_rate': 7.531999999999999e-07, 'completion_length': 47.21428680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.16748046875, 'epoch': 0.25} 25%|██▍ | 617/2500 [5:16:17<18:27:03, 35.28s/it] 25%|██▍ | 618/2500 [5:17:01<19:43:34, 37.73s/it] {'loss': 0.0053, 'grad_norm': 7.042580374184345, 'learning_rate': 7.528e-07, 'completion_length': 56.13393211364746, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.0739355981349945, 'kl': 0.13330078125, 'epoch': 0.25} 25%|██▍ | 618/2500 [5:17:01<19:43:34, 37.73s/it] 25%|██▍ | 619/2500 [5:18:24<26:48:32, 51.31s/it] {'loss': 0.0059, 'grad_norm': 1.3413771019300678, 'learning_rate': 7.523999999999999e-07, 'completion_length': 
51.63393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14794921875, 'epoch': 0.25} 25%|██▍ | 619/2500 [5:18:24<26:48:32, 51.31s/it] 25%|██▍ | 620/2500 [5:19:05<25:15:31, 48.37s/it] {'loss': 0.0045, 'grad_norm': 0.8140219877877168, 'learning_rate': 7.52e-07, 'completion_length': 63.09821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.1123046875, 'epoch': 0.25} 25%|██▍ | 620/2500 [5:19:05<25:15:31, 48.37s/it] 25%|██▍ | 621/2500 [5:19:29<21:25:13, 41.04s/it] {'loss': 0.0054, 'grad_norm': 0.9077838956874532, 'learning_rate': 7.516e-07, 'completion_length': 57.71428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.13623046875, 'epoch': 0.25} 25%|██▍ | 621/2500 [5:19:29<21:25:13, 41.04s/it] 25%|██▍ | 622/2500 [5:19:55<19:01:37, 36.47s/it] {'loss': 0.0064, 'grad_norm': 2.1548235208801496, 'learning_rate': 7.511999999999999e-07, 'completion_length': 53.91071701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.16064453125, 'epoch': 0.25} 25%|██▍ | 622/2500 [5:19:55<19:01:37, 36.47s/it] 25%|██▍ | 623/2500 [5:20:21<17:20:11, 33.25s/it] {'loss': 0.0059, 'grad_norm': 1.7168488933993564, 'learning_rate': 7.508e-07, 'completion_length': 59.51785850524902, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.14697265625, 'epoch': 0.25} 25%|██▍ | 623/2500 [5:20:21<17:20:11, 33.25s/it] 25%|██▍ | 624/2500 [5:20:46<16:06:43, 30.92s/it] {'loss': 0.0049, 'grad_norm': 1.8678191304140386, 'learning_rate': 
7.503999999999999e-07, 'completion_length': 54.82143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.123046875, 'epoch': 0.25} 25%|██▍ | 624/2500 [5:20:46<16:06:43, 30.92s/it] 25%|██▌ | 625/2500 [5:21:10<14:58:52, 28.76s/it] {'loss': 0.0062, 'grad_norm': 2.2269951325204995, 'learning_rate': 7.5e-07, 'completion_length': 58.60714530944824, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.1552734375, 'epoch': 0.25} 25%|██▌ | 625/2500 [5:21:10<14:58:52, 28.76s/it] 25%|██▌ | 626/2500 [5:21:39<14:59:25, 28.80s/it] {'loss': 0.0054, 'grad_norm': 0.9904007624455599, 'learning_rate': 7.496e-07, 'completion_length': 60.98214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13427734375, 'epoch': 0.25} 25%|██▌ | 626/2500 [5:21:39<14:59:25, 28.80s/it] 25%|██▌ | 627/2500 [5:22:09<15:11:41, 29.21s/it] {'loss': 0.0054, 'grad_norm': 0.9855475564622558, 'learning_rate': 7.492e-07, 'completion_length': 58.99107360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.13525390625, 'epoch': 0.25} 25%|██▌ | 627/2500 [5:22:09<15:11:41, 29.21s/it] 25%|██▌ | 628/2500 [5:23:08<19:52:04, 38.21s/it] {'loss': 0.0049, 'grad_norm': 1.3414599840033918, 'learning_rate': 7.488e-07, 'completion_length': 66.06250190734863, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9464285969734192, 'reward_std': 0.08868780732154846, 'kl': 0.12353515625, 'epoch': 0.25} 25%|██▌ | 628/2500 [5:23:08<19:52:04, 38.21s/it] 25%|██▌ | 629/2500 [5:23:33<17:48:31, 34.27s/it] {'loss': 0.0056, 'grad_norm': 
4.811416428157867, 'learning_rate': 7.483999999999999e-07, 'completion_length': 55.61607360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.1396484375, 'epoch': 0.25} 25%|██▌ | 629/2500 [5:23:33<17:48:31, 34.27s/it] 25%|██▌ | 630/2500 [5:23:58<16:19:05, 31.41s/it] {'loss': 0.0046, 'grad_norm': 0.20845219153883682, 'learning_rate': 7.48e-07, 'completion_length': 62.47321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11474609375, 'epoch': 0.25} 25%|██▌ | 630/2500 [5:23:58<16:19:05, 31.41s/it] 25%|██▌ | 631/2500 [5:24:23<15:15:13, 29.38s/it] {'loss': 0.0042, 'grad_norm': 0.5758096341805021, 'learning_rate': 7.476e-07, 'completion_length': 60.017860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.105712890625, 'epoch': 0.25} 25%|██▌ | 631/2500 [5:24:23<15:15:13, 29.38s/it] 25%|██▌ | 632/2500 [5:25:13<18:33:20, 35.76s/it] {'loss': 0.0059, 'grad_norm': 3.873902793334282, 'learning_rate': 7.471999999999999e-07, 'completion_length': 58.982147216796875, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1484375, 'epoch': 0.25} 25%|██▌ | 632/2500 [5:25:13<18:33:20, 35.76s/it] 25%|██▌ | 633/2500 [5:26:30<24:57:35, 48.13s/it] {'loss': 0.005, 'grad_norm': 0.15814986974555478, 'learning_rate': 7.468e-07, 'completion_length': 62.696434020996094, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.124267578125, 'epoch': 0.25} 25%|██▌ | 633/2500 [5:26:30<24:57:35, 48.13s/it] 25%|██▌ | 634/2500 [5:26:59<21:56:35, 42.33s/it] {'loss': 0.0048, 'grad_norm': 0.19039228098271718, 'learning_rate': 7.464e-07, 'completion_length': 67.25000190734863, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.119873046875, 'epoch': 0.25} 25%|██▌ | 634/2500 [5:26:59<21:56:35, 42.33s/it] 25%|██▌ | 635/2500 [5:28:36<30:24:04, 58.68s/it] {'loss': 0.0054, 'grad_norm': 0.45358444701198386, 'learning_rate': 7.459999999999999e-07, 'completion_length': 69.08036041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.13427734375, 'epoch': 0.25} 25%|██▌ | 635/2500 [5:28:36<30:24:04, 58.68s/it] 25%|██▌ | 636/2500 [5:29:02<25:20:08, 48.93s/it] {'loss': 0.0045, 'grad_norm': 0.1534715824886015, 'learning_rate': 7.456e-07, 'completion_length': 67.84821891784668, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11328125, 'epoch': 0.25} 25%|██▌ | 636/2500 [5:29:02<25:20:08, 48.93s/it] 25%|██▌ | 637/2500 [5:29:26<21:22:16, 41.30s/it] {'loss': 0.0048, 'grad_norm': 0.8626790099276872, 'learning_rate': 7.452e-07, 'completion_length': 59.02678680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.119384765625, 'epoch': 0.25} 25%|██▌ | 637/2500 [5:29:26<21:22:16, 41.30s/it] 26%|██▌ | 638/2500 [5:29:50<18:47:40, 36.34s/it] {'loss': 0.0061, 'grad_norm': 2.2862942770157457, 'learning_rate': 7.447999999999999e-07, 'completion_length': 58.39285850524902, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.1533203125, 'epoch': 0.26} 26%|██▌ | 638/2500 [5:29:50<18:47:40, 36.34s/it] 26%|██▌ | 639/2500 [5:30:14<16:47:21, 32.48s/it] {'loss': 0.0044, 'grad_norm': 1.0242841300273438, 'learning_rate': 7.443999999999999e-07, 'completion_length': 61.51786231994629, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1103515625, 'epoch': 0.26} 26%|██▌ | 639/2500 [5:30:14<16:47:21, 32.48s/it] 26%|██▌ | 640/2500 [5:30:39<15:36:11, 30.20s/it] {'loss': 0.0051, 'grad_norm': 1.551794069198712, 'learning_rate': 7.44e-07, 'completion_length': 67.66071701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.12646484375, 'epoch': 0.26} 26%|██▌ | 640/2500 [5:30:39<15:36:11, 30.20s/it] 26%|██▌ | 641/2500 [5:31:03<14:38:17, 28.35s/it] {'loss': 0.0045, 'grad_norm': 2.14086155865173, 'learning_rate': 7.436e-07, 'completion_length': 65.41964721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.111572265625, 'epoch': 0.26} 26%|██▌ | 641/2500 [5:31:03<14:38:17, 28.35s/it] 26%|██▌ | 642/2500 [5:31:25<13:42:49, 26.57s/it] {'loss': 0.0037, 'grad_norm': 0.7552818091707122, 'learning_rate': 7.431999999999999e-07, 'completion_length': 67.84821701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.091796875, 'epoch': 0.26} 26%|██▌ | 642/2500 [5:31:25<13:42:49, 26.57s/it] 26%|██▌ | 643/2500 [5:31:49<13:15:36, 25.71s/it] {'loss': 0.0042, 'grad_norm': 0.27647863934776895, 'learning_rate': 7.428e-07, 'completion_length': 63.205360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1044921875, 'epoch': 0.26} 26%|██▌ | 643/2500 [5:31:49<13:15:36, 25.71s/it] 26%|██▌ | 644/2500 [5:32:12<12:53:14, 25.00s/it] {'loss': 0.0042, 'grad_norm': 0.22880301730487995, 'learning_rate': 7.423999999999999e-07, 'completion_length': 71.8035774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 
2.0, 'reward_std': 0.0, 'kl': 0.105712890625, 'epoch': 0.26} 26%|██▌ | 644/2500 [5:32:12<12:53:14, 25.00s/it] 26%|██▌ | 645/2500 [5:32:37<12:53:33, 25.02s/it] {'loss': 0.004, 'grad_norm': 0.9094067925507875, 'learning_rate': 7.42e-07, 'completion_length': 64.46429061889648, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.1005859375, 'epoch': 0.26} 26%|██▌ | 645/2500 [5:32:37<12:53:33, 25.02s/it] 26%|██▌ | 646/2500 [5:33:03<13:03:06, 25.34s/it] {'loss': 0.0046, 'grad_norm': 0.14589315489588278, 'learning_rate': 7.416e-07, 'completion_length': 70.53571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.116455078125, 'epoch': 0.26} 26%|██▌ | 646/2500 [5:33:03<13:03:06, 25.34s/it] 26%|██▌ | 647/2500 [5:33:29<13:02:44, 25.35s/it] {'loss': 0.0029, 'grad_norm': 0.7834657246542958, 'learning_rate': 7.411999999999999e-07, 'completion_length': 65.04464721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.07373046875, 'epoch': 0.26} 26%|██▌ | 647/2500 [5:33:29<13:02:44, 25.35s/it] 26%|██▌ | 648/2500 [5:33:57<13:26:03, 26.11s/it] {'loss': 0.0041, 'grad_norm': 1.296456136674511, 'learning_rate': 7.408e-07, 'completion_length': 67.90178871154785, 'rewards/accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.8839285969734192, 'reward_std': 0.03696779906749725, 'kl': 0.101318359375, 'epoch': 0.26} 26%|██▌ | 648/2500 [5:33:57<13:26:03, 26.11s/it] 26%|██▌ | 649/2500 [5:34:49<17:28:29, 33.99s/it] {'loss': 0.0026, 'grad_norm': 0.1782115793973299, 'learning_rate': 7.403999999999999e-07, 'completion_length': 66.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0662841796875, 'epoch': 0.26} 26%|██▌ | 649/2500 
[5:34:49<17:28:29, 33.99s/it] 26%|██▌ | 650/2500 [5:35:14<16:06:44, 31.35s/it] {'loss': 0.004, 'grad_norm': 1.8355818141490254, 'learning_rate': 7.4e-07, 'completion_length': 73.91072082519531, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9375001192092896, 'reward_std': 0.12054043263196945, 'kl': 0.098876953125, 'epoch': 0.26} 26%|██▌ | 650/2500 [5:35:14<16:06:44, 31.35s/it] 26%|██▌ | 651/2500 [5:35:46<16:11:41, 31.53s/it] {'loss': 0.0052, 'grad_norm': 1.9568671892152125, 'learning_rate': 7.396e-07, 'completion_length': 62.68750190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.1298828125, 'epoch': 0.26} 26%|██▌ | 651/2500 [5:35:46<16:11:41, 31.53s/it] 26%|██▌ | 652/2500 [5:36:12<15:17:22, 29.79s/it] {'loss': 0.0042, 'grad_norm': 1.0401119469433762, 'learning_rate': 7.392e-07, 'completion_length': 57.58928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.106201171875, 'epoch': 0.26} 26%|██▌ | 652/2500 [5:36:12<15:17:22, 29.79s/it] 26%|██▌ | 653/2500 [5:36:37<14:32:55, 28.36s/it] {'loss': 0.0054, 'grad_norm': 0.8434335291802756, 'learning_rate': 7.388e-07, 'completion_length': 63.44643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13427734375, 'epoch': 0.26} 26%|██▌ | 653/2500 [5:36:37<14:32:55, 28.36s/it] 26%|██▌ | 654/2500 [5:37:02<14:05:32, 27.48s/it] {'loss': 0.0051, 'grad_norm': 0.8157657629027566, 'learning_rate': 7.383999999999999e-07, 'completion_length': 68.11607551574707, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.12646484375, 
'epoch': 0.26} 26%|██▌ | 654/2500 [5:37:02<14:05:32, 27.48s/it] 26%|██▌ | 655/2500 [5:37:27<13:35:32, 26.52s/it] {'loss': 0.0042, 'grad_norm': 2.4162049929254734, 'learning_rate': 7.38e-07, 'completion_length': 69.47322082519531, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.11146338284015656, 'kl': 0.1044921875, 'epoch': 0.26} 26%|██▌ | 655/2500 [5:37:27<13:35:32, 26.52s/it] 26%|██▌ | 656/2500 [5:37:52<13:30:00, 26.36s/it] {'loss': 0.0035, 'grad_norm': 0.9536904607974351, 'learning_rate': 7.376e-07, 'completion_length': 70.29464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.087158203125, 'epoch': 0.26} 26%|██▌ | 656/2500 [5:37:53<13:30:00, 26.36s/it] 26%|██▋ | 657/2500 [5:38:40<16:43:32, 32.67s/it] {'loss': 0.0034, 'grad_norm': 0.7458288431783882, 'learning_rate': 7.371999999999999e-07, 'completion_length': 77.91964721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.086181640625, 'epoch': 0.26} 26%|██▋ | 657/2500 [5:38:40<16:43:32, 32.67s/it] 26%|██▋ | 658/2500 [5:39:07<15:52:19, 31.02s/it] {'loss': 0.0046, 'grad_norm': 2.033085579969769, 'learning_rate': 7.368e-07, 'completion_length': 68.54464721679688, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.11474609375, 'epoch': 0.26} 26%|██▋ | 658/2500 [5:39:07<15:52:19, 31.02s/it] 26%|██▋ | 659/2500 [5:39:31<14:43:39, 28.80s/it] {'loss': 0.0047, 'grad_norm': 0.89021375369969, 'learning_rate': 7.364000000000001e-07, 'completion_length': 66.99107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 
'kl': 0.116455078125, 'epoch': 0.26} 26%|██▋ | 659/2500 [5:39:31<14:43:39, 28.80s/it] 26%|██▋ | 660/2500 [5:39:56<14:10:10, 27.72s/it] {'loss': 0.0037, 'grad_norm': 0.17399719520599327, 'learning_rate': 7.359999999999999e-07, 'completion_length': 64.90178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.091552734375, 'epoch': 0.26} 26%|██▋ | 660/2500 [5:39:56<14:10:10, 27.72s/it] 26%|██▋ | 661/2500 [5:40:21<13:47:03, 26.98s/it] {'loss': 0.0037, 'grad_norm': 0.8528232165228836, 'learning_rate': 7.356e-07, 'completion_length': 65.87500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.093505859375, 'epoch': 0.26} 26%|██▋ | 661/2500 [5:40:21<13:47:03, 26.98s/it] 26%|██▋ | 662/2500 [5:40:48<13:44:36, 26.92s/it] {'loss': 0.0039, 'grad_norm': 1.0334677866446897, 'learning_rate': 7.352e-07, 'completion_length': 71.20536041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.0986328125, 'epoch': 0.26} 26%|██▋ | 662/2500 [5:40:48<13:44:36, 26.92s/it] 27%|██▋ | 663/2500 [5:41:13<13:27:40, 26.38s/it] {'loss': 0.0031, 'grad_norm': 0.1892740300767693, 'learning_rate': 7.347999999999999e-07, 'completion_length': 68.81250381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.07666015625, 'epoch': 0.27} 27%|██▋ | 663/2500 [5:41:13<13:27:40, 26.38s/it] 27%|██▋ | 664/2500 [5:41:36<12:58:41, 25.45s/it] {'loss': 0.0055, 'grad_norm': 3.753827920892067, 'learning_rate': 7.344e-07, 'completion_length': 54.84821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13818359375, 'epoch': 0.27} 27%|██▋ | 664/2500 [5:41:36<12:58:41, 
25.45s/it] 27%|██▋ | 665/2500 [5:42:00<12:43:01, 24.95s/it] {'loss': 0.0046, 'grad_norm': 0.17092375683105815, 'learning_rate': 7.34e-07, 'completion_length': 56.267860412597656, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.115966796875, 'epoch': 0.27} 27%|██▋ | 665/2500 [5:42:00<12:43:01, 24.95s/it] 27%|██▋ | 666/2500 [5:42:25<12:42:07, 24.93s/it] {'loss': 0.004, 'grad_norm': 4.6565003907268805, 'learning_rate': 7.336e-07, 'completion_length': 57.40178871154785, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.098876953125, 'epoch': 0.27} 27%|██▋ | 666/2500 [5:42:25<12:42:07, 24.93s/it] 27%|██▋ | 667/2500 [5:42:51<12:55:34, 25.39s/it] {'loss': 0.0049, 'grad_norm': 0.20018676438889577, 'learning_rate': 7.331999999999999e-07, 'completion_length': 59.642860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.123046875, 'epoch': 0.27} 27%|██▋ | 667/2500 [5:42:52<12:55:34, 25.39s/it] 27%|██▋ | 668/2500 [5:43:17<12:55:36, 25.40s/it] {'loss': 0.0044, 'grad_norm': 0.2199783179513218, 'learning_rate': 7.328e-07, 'completion_length': 61.312503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10986328125, 'epoch': 0.27} 27%|██▋ | 668/2500 [5:43:17<12:55:36, 25.40s/it] 27%|██▋ | 669/2500 [5:43:40<12:36:49, 24.80s/it] {'loss': 0.0052, 'grad_norm': 1.892082590439097, 'learning_rate': 7.324e-07, 'completion_length': 47.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.129150390625, 'epoch': 0.27} 27%|██▋ | 669/2500 [5:43:40<12:36:49, 24.80s/it] 27%|██▋ | 670/2500 [5:44:04<12:24:35, 24.41s/it] {'loss': 0.0047, 'grad_norm': 0.17928767150267452, 
'learning_rate': 7.319999999999999e-07, 'completion_length': 61.80357551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.116455078125, 'epoch': 0.27} 27%|██▋ | 670/2500 [5:44:04<12:24:35, 24.41s/it] 27%|██▋ | 671/2500 [5:44:27<12:16:42, 24.17s/it] {'loss': 0.0051, 'grad_norm': 3.133905028204787, 'learning_rate': 7.316e-07, 'completion_length': 54.36607360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.12646484375, 'epoch': 0.27} 27%|██▋ | 671/2500 [5:44:27<12:16:42, 24.17s/it] 27%|██▋ | 672/2500 [5:44:52<12:20:04, 24.29s/it] {'loss': 0.0042, 'grad_norm': 2.3915673148193277, 'learning_rate': 7.311999999999999e-07, 'completion_length': 58.91071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1044921875, 'epoch': 0.27} 27%|██▋ | 672/2500 [5:44:52<12:20:04, 24.29s/it] 27%|██▋ | 673/2500 [5:45:17<12:24:20, 24.44s/it] {'loss': 0.0047, 'grad_norm': 3.041520493091361, 'learning_rate': 7.308e-07, 'completion_length': 53.57143020629883, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.03818017989397049, 'kl': 0.117919921875, 'epoch': 0.27} 27%|██▋ | 673/2500 [5:45:17<12:24:20, 24.44s/it] 27%|██▋ | 674/2500 [5:45:42<12:28:05, 24.58s/it] {'loss': 0.0051, 'grad_norm': 1.4460636954504973, 'learning_rate': 7.304e-07, 'completion_length': 59.17857360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.12841796875, 'epoch': 0.27} 27%|██▋ | 674/2500 [5:45:42<12:28:05, 24.58s/it] 27%|██▋ | 675/2500 [5:46:06<12:23:51, 24.46s/it] {'loss': 0.0046, 'grad_norm': 0.17724832163100174, 'learning_rate': 7.3e-07, 'completion_length': 
60.86607360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.114013671875, 'epoch': 0.27} 27%|██▋ | 675/2500 [5:46:06<12:23:51, 24.46s/it] 27%|██▋ | 676/2500 [5:46:34<12:56:59, 25.56s/it] {'loss': 0.0051, 'grad_norm': 0.5399730864496779, 'learning_rate': 7.296e-07, 'completion_length': 66.6964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.12646484375, 'epoch': 0.27} 27%|██▋ | 676/2500 [5:46:34<12:56:59, 25.56s/it] 27%|██▋ | 677/2500 [5:47:01<13:14:10, 26.14s/it] {'loss': 0.0027, 'grad_norm': 2.5959905797474017, 'learning_rate': 7.291999999999999e-07, 'completion_length': 65.57143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.066162109375, 'epoch': 0.27} 27%|██▋ | 677/2500 [5:47:01<13:14:10, 26.14s/it] 27%|██▋ | 678/2500 [5:47:28<13:14:58, 26.18s/it] {'loss': 0.0041, 'grad_norm': 0.95527415816714, 'learning_rate': 7.288e-07, 'completion_length': 67.16964721679688, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.10205078125, 'epoch': 0.27} 27%|██▋ | 678/2500 [5:47:28<13:14:58, 26.18s/it] 27%|██▋ | 679/2500 [5:47:58<13:56:16, 27.55s/it] {'loss': 0.005, 'grad_norm': 1.697278600115378, 'learning_rate': 7.284e-07, 'completion_length': 61.75893211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.125, 'epoch': 0.27} 27%|██▋ | 679/2500 [5:47:58<13:56:16, 27.55s/it] 27%|██▋ | 680/2500 [5:48:22<13:21:54, 26.44s/it] {'loss': 0.0042, 'grad_norm': 0.9274813026961827, 'learning_rate': 7.28e-07, 'completion_length': 65.12500190734863, 'rewards/accuracy_reward': 
0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.104248046875, 'epoch': 0.27} 27%|██▋ | 680/2500 [5:48:22<13:21:54, 26.44s/it] 27%|██▋ | 681/2500 [5:48:50<13:28:22, 26.66s/it] {'loss': 0.0053, 'grad_norm': 0.14037967347578012, 'learning_rate': 7.276e-07, 'completion_length': 57.000003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1328125, 'epoch': 0.27} 27%|██▋ | 681/2500 [5:48:50<13:28:22, 26.66s/it] 27%|██▋ | 682/2500 [5:49:15<13:13:28, 26.19s/it] {'loss': 0.0054, 'grad_norm': 0.12561981097961797, 'learning_rate': 7.271999999999999e-07, 'completion_length': 63.33928871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.134521484375, 'epoch': 0.27} 27%|██▋ | 682/2500 [5:49:15<13:13:28, 26.19s/it] 27%|██▋ | 683/2500 [5:49:40<13:02:05, 25.83s/it] {'loss': 0.0048, 'grad_norm': 0.13348700621131854, 'learning_rate': 7.268e-07, 'completion_length': 61.51785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.119873046875, 'epoch': 0.27} 27%|██▋ | 683/2500 [5:49:40<13:02:05, 25.83s/it] 27%|██▋ | 684/2500 [5:50:05<13:00:44, 25.80s/it] {'loss': 0.0047, 'grad_norm': 2.0684513097346997, 'learning_rate': 7.264e-07, 'completion_length': 57.40178871154785, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.117431640625, 'epoch': 0.27} 27%|██▋ | 684/2500 [5:50:05<13:00:44, 25.80s/it] 27%|██▋ | 685/2500 [5:50:32<13:05:02, 25.95s/it] {'loss': 0.0046, 'grad_norm': 1.3591367037376905, 'learning_rate': 7.259999999999999e-07, 'completion_length': 64.19643020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 
0.114501953125, 'epoch': 0.27} 27%|██▋ | 685/2500 [5:50:32<13:05:02, 25.95s/it] 27%|██▋ | 686/2500 [5:50:56<12:53:17, 25.58s/it] {'loss': 0.005, 'grad_norm': 1.8168190327946303, 'learning_rate': 7.256e-07, 'completion_length': 61.32143211364746, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.1259765625, 'epoch': 0.27} 27%|██▋ | 686/2500 [5:50:56<12:53:17, 25.58s/it] 27%|██▋ | 687/2500 [5:51:21<12:49:20, 25.46s/it] {'loss': 0.005, 'grad_norm': 3.170830096490732, 'learning_rate': 7.252e-07, 'completion_length': 65.94643211364746, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553572535514832, 'reward_std': 0.10882645100355148, 'kl': 0.125, 'epoch': 0.27} 27%|██▋ | 687/2500 [5:51:22<12:49:20, 25.46s/it] 28%|██▊ | 688/2500 [5:51:47<12:48:27, 25.45s/it] {'loss': 0.0048, 'grad_norm': 0.14412859555750135, 'learning_rate': 7.247999999999999e-07, 'completion_length': 62.19643211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.120361328125, 'epoch': 0.28} 28%|██▊ | 688/2500 [5:51:47<12:48:27, 25.45s/it] 28%|██▊ | 689/2500 [5:52:12<12:43:59, 25.31s/it] {'loss': 0.005, 'grad_norm': 0.20692283269554704, 'learning_rate': 7.244e-07, 'completion_length': 63.839290618896484, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.125732421875, 'epoch': 0.28} 28%|██▊ | 689/2500 [5:52:12<12:43:59, 25.31s/it] 28%|██▊ | 690/2500 [5:52:37<12:45:08, 25.36s/it] {'loss': 0.005, 'grad_norm': 2.786552290315813, 'learning_rate': 7.24e-07, 'completion_length': 59.29464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9553572535514832, 'reward_std': 0.10365218669176102, 'kl': 0.125244140625, 'epoch': 0.28} 28%|██▊ | 690/2500 [5:52:37<12:45:08, 25.36s/it] 28%|██▊ | 691/2500 
[5:53:02<12:35:15, 25.05s/it] {'loss': 0.0051, 'grad_norm': 1.12228676537341, 'learning_rate': 7.235999999999999e-07, 'completion_length': 60.50893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1279296875, 'epoch': 0.28} 28%|██▊ | 691/2500 [5:53:02<12:35:15, 25.05s/it] 28%|██▊ | 692/2500 [5:53:30<13:05:15, 26.06s/it] {'loss': 0.0049, 'grad_norm': 0.11763101496910745, 'learning_rate': 7.231999999999999e-07, 'completion_length': 66.50893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12109375, 'epoch': 0.28} 28%|██▊ | 692/2500 [5:53:30<13:05:15, 26.06s/it] 28%|██▊ | 693/2500 [5:53:56<13:06:32, 26.12s/it] {'loss': 0.0054, 'grad_norm': 2.5760956693524935, 'learning_rate': 7.228e-07, 'completion_length': 63.526790618896484, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.13623046875, 'epoch': 0.28} 28%|██▊ | 693/2500 [5:53:56<13:06:32, 26.12s/it] 28%|██▊ | 694/2500 [5:54:27<13:43:51, 27.37s/it] {'loss': 0.0049, 'grad_norm': 0.11149626472444206, 'learning_rate': 7.224e-07, 'completion_length': 64.39285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12255859375, 'epoch': 0.28} 28%|██▊ | 694/2500 [5:54:27<13:43:51, 27.37s/it] 28%|██▊ | 695/2500 [5:54:55<13:53:57, 27.72s/it] {'loss': 0.0046, 'grad_norm': 3.0576745707449837, 'learning_rate': 7.219999999999999e-07, 'completion_length': 59.46428871154785, 'rewards/accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8750000596046448, 'reward_std': 0.10040178522467613, 'kl': 0.115234375, 'epoch': 0.28} 28%|██▊ | 695/2500 [5:54:55<13:53:57, 27.72s/it] 28%|██▊ | 696/2500 [5:55:20<13:22:50, 26.70s/it] {'loss': 0.0047, 'grad_norm': 
0.8600266968951289, 'learning_rate': 7.216e-07, 'completion_length': 59.03571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.116455078125, 'epoch': 0.28} 28%|██▊ | 696/2500 [5:55:20<13:22:50, 26.70s/it] 28%|██▊ | 697/2500 [5:55:49<13:49:07, 27.59s/it] {'loss': 0.0039, 'grad_norm': 1.7769188101752917, 'learning_rate': 7.211999999999999e-07, 'completion_length': 61.87500190734863, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.919642984867096, 'reward_std': 0.12054043263196945, 'kl': 0.09814453125, 'epoch': 0.28} 28%|██▊ | 697/2500 [5:55:49<13:49:07, 27.59s/it] 28%|██▊ | 698/2500 [5:56:18<14:03:03, 28.07s/it] {'loss': 0.0058, 'grad_norm': 1.3916020318794073, 'learning_rate': 7.207999999999999e-07, 'completion_length': 66.71428871154785, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.14404296875, 'epoch': 0.28} 28%|██▊ | 698/2500 [5:56:18<14:03:03, 28.07s/it] 28%|██▊ | 699/2500 [5:56:47<14:05:56, 28.18s/it] {'loss': 0.0049, 'grad_norm': 0.10477488922519493, 'learning_rate': 7.204e-07, 'completion_length': 59.71428871154785, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.12255859375, 'epoch': 0.28} 28%|██▊ | 699/2500 [5:56:47<14:05:56, 28.18s/it] 28%|██▊ | 700/2500 [5:57:13<13:46:15, 27.54s/it] {'loss': 0.0047, 'grad_norm': 0.12522542136406223, 'learning_rate': 7.2e-07, 'completion_length': 54.285715103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1181640625, 'epoch': 0.28} 28%|██▊ | 700/2500 [5:57:13<13:46:15, 27.54s/it] 28%|██▊ | 701/2500 [5:58:09<18:04:19, 36.16s/it] {'loss': 0.0059, 'grad_norm': 0.15767969867626036, 'learning_rate': 7.196e-07, 
'completion_length': 59.330360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1474609375, 'epoch': 0.28} 28%|██▊ | 701/2500 [5:58:09<18:04:19, 36.16s/it] 28%|██▊ | 702/2500 [5:59:03<20:45:37, 41.57s/it] {'loss': 0.0068, 'grad_norm': 0.6159710487603166, 'learning_rate': 7.191999999999999e-07, 'completion_length': 58.82143211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.16943359375, 'epoch': 0.28} 28%|██▊ | 702/2500 [5:59:03<20:45:37, 41.57s/it] 28%|██▊ | 703/2500 [5:59:38<19:39:42, 39.39s/it] {'loss': 0.0058, 'grad_norm': 0.47835112841188115, 'learning_rate': 7.188e-07, 'completion_length': 60.65178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1455078125, 'epoch': 0.28} 28%|██▊ | 703/2500 [5:59:38<19:39:42, 39.39s/it] 28%|██▊ | 704/2500 [6:00:44<23:43:13, 47.55s/it] {'loss': 0.0052, 'grad_norm': 2.3990739056197175, 'learning_rate': 7.184e-07, 'completion_length': 60.49107551574707, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.12890625, 'epoch': 0.28} 28%|██▊ | 704/2500 [6:00:45<23:43:13, 47.55s/it] 28%|██▊ | 705/2500 [6:01:29<23:17:07, 46.70s/it] {'loss': 0.0049, 'grad_norm': 0.3006778996339289, 'learning_rate': 7.179999999999999e-07, 'completion_length': 63.35714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.121337890625, 'epoch': 0.28} 28%|██▊ | 705/2500 [6:01:29<23:17:07, 46.70s/it] 28%|██▊ | 706/2500 [6:02:52<28:45:22, 57.70s/it] {'loss': 0.0066, 'grad_norm': 1.2289723603261604, 'learning_rate': 7.176e-07, 'completion_length': 55.21428871154785, 'rewards/accuracy_reward': 
0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1650390625, 'epoch': 0.28} 28%|██▊ | 706/2500 [6:02:52<28:45:22, 57.70s/it] 28%|██▊ | 707/2500 [6:03:24<24:51:43, 49.92s/it] {'loss': 0.0063, 'grad_norm': 0.15223981974012094, 'learning_rate': 7.171999999999999e-07, 'completion_length': 60.205360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15771484375, 'epoch': 0.28} 28%|██▊ | 707/2500 [6:03:24<24:51:43, 49.92s/it] 28%|██▊ | 708/2500 [6:03:55<22:02:38, 44.28s/it] {'loss': 0.0054, 'grad_norm': 2.3002303065428924, 'learning_rate': 7.168e-07, 'completion_length': 48.348215103149414, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.134765625, 'epoch': 0.28} 28%|██▊ | 708/2500 [6:03:55<22:02:38, 44.28s/it] 28%|██▊ | 709/2500 [6:04:21<19:16:28, 38.74s/it] {'loss': 0.0049, 'grad_norm': 4.860088291833584, 'learning_rate': 7.164e-07, 'completion_length': 58.03571701049805, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.12353515625, 'epoch': 0.28} 28%|██▊ | 709/2500 [6:04:21<19:16:28, 38.74s/it] 28%|██▊ | 710/2500 [6:05:06<20:12:06, 40.63s/it] {'loss': 0.0048, 'grad_norm': 2.156242275918807, 'learning_rate': 7.159999999999999e-07, 'completion_length': 56.87500190734863, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0835726372897625, 'kl': 0.119873046875, 'epoch': 0.28} 28%|██▊ | 710/2500 [6:05:06<20:12:06, 40.63s/it] 28%|██▊ | 711/2500 [6:05:31<17:51:14, 35.93s/it] {'loss': 0.0048, 'grad_norm': 11.010305420993749, 'learning_rate': 7.156e-07, 'completion_length': 57.25000190734863, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 
'reward': 1.946428656578064, 'reward_std': 0.09528662264347076, 'kl': 0.119873046875, 'epoch': 0.28} 28%|██▊ | 711/2500 [6:05:31<17:51:14, 35.93s/it] 28%|██▊ | 712/2500 [6:05:55<16:03:51, 32.34s/it] {'loss': 0.0044, 'grad_norm': 0.13735734686469905, 'learning_rate': 7.151999999999999e-07, 'completion_length': 59.625003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.109130859375, 'epoch': 0.28} 28%|██▊ | 712/2500 [6:05:55<16:03:51, 32.34s/it] 29%|██▊ | 713/2500 [6:06:20<15:01:07, 30.26s/it] {'loss': 0.0049, 'grad_norm': 2.0218015882567726, 'learning_rate': 7.147999999999999e-07, 'completion_length': 63.017860412597656, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.07003280520439148, 'kl': 0.1220703125, 'epoch': 0.29} 29%|██▊ | 713/2500 [6:06:20<15:01:07, 30.26s/it] 29%|██▊ | 714/2500 [6:07:44<22:55:42, 46.22s/it] {'loss': 0.0055, 'grad_norm': 1.6381690397396451, 'learning_rate': 7.144e-07, 'completion_length': 58.50000190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.13818359375, 'epoch': 0.29} 29%|██▊ | 714/2500 [6:07:44<22:55:42, 46.22s/it] 29%|██▊ | 715/2500 [6:08:11<20:02:06, 40.41s/it] {'loss': 0.0048, 'grad_norm': 0.8307594351558061, 'learning_rate': 7.14e-07, 'completion_length': 59.63393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.119873046875, 'epoch': 0.29} 29%|██▊ | 715/2500 [6:08:11<20:02:06, 40.41s/it] 29%|██▊ | 716/2500 [6:08:34<17:25:03, 35.15s/it] {'loss': 0.0042, 'grad_norm': 1.7764014069957963, 'learning_rate': 7.135999999999999e-07, 'completion_length': 61.16964530944824, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 
1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.106201171875, 'epoch': 0.29} 29%|██▊ | 716/2500 [6:08:34<17:25:03, 35.15s/it] 29%|██▊ | 717/2500 [6:10:08<26:11:00, 52.87s/it] {'loss': 0.0045, 'grad_norm': 2.746201480469909, 'learning_rate': 7.131999999999999e-07, 'completion_length': 59.562503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.112548828125, 'epoch': 0.29} 29%|██▊ | 717/2500 [6:10:08<26:11:00, 52.87s/it] 29%|██▊ | 718/2500 [6:11:26<29:51:46, 60.33s/it] {'loss': 0.0069, 'grad_norm': 1.6247720135552595, 'learning_rate': 7.128e-07, 'completion_length': 57.71428871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.17333984375, 'epoch': 0.29} 29%|██▊ | 718/2500 [6:11:26<29:51:46, 60.33s/it] 29%|██▉ | 719/2500 [6:12:50<33:26:51, 67.61s/it] {'loss': 0.0037, 'grad_norm': 0.792925479143896, 'learning_rate': 7.124e-07, 'completion_length': 68.16964530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.091796875, 'epoch': 0.29} 29%|██▉ | 719/2500 [6:12:50<33:26:51, 67.61s/it] 29%|██▉ | 720/2500 [6:13:17<27:26:59, 55.52s/it] {'loss': 0.0042, 'grad_norm': 0.27946481999823314, 'learning_rate': 7.119999999999999e-07, 'completion_length': 62.88393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10595703125, 'epoch': 0.29} 29%|██▉ | 720/2500 [6:13:17<27:26:59, 55.52s/it] 29%|██▉ | 721/2500 [6:13:43<23:02:56, 46.64s/it] {'loss': 0.0037, 'grad_norm': 2.36184899813257, 'learning_rate': 7.116e-07, 'completion_length': 67.65178680419922, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.919642984867096, 'reward_std': 
0.05831881985068321, 'kl': 0.093017578125, 'epoch': 0.29} 29%|██▉ | 721/2500 [6:13:43<23:02:56, 46.64s/it] 29%|██▉ | 722/2500 [6:14:10<20:00:42, 40.52s/it] {'loss': 0.0048, 'grad_norm': 1.9269800987083996, 'learning_rate': 7.112000000000001e-07, 'completion_length': 68.48214340209961, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.11962890625, 'epoch': 0.29} 29%|██▉ | 722/2500 [6:14:10<20:00:42, 40.52s/it] 29%|██▉ | 723/2500 [6:15:21<24:35:27, 49.82s/it] {'loss': 0.0051, 'grad_norm': 0.17027886357309258, 'learning_rate': 7.107999999999999e-07, 'completion_length': 66.94643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12646484375, 'epoch': 0.29} 29%|██▉ | 723/2500 [6:15:21<24:35:27, 49.82s/it] 29%|██▉ | 724/2500 [6:15:46<20:49:30, 42.21s/it] {'loss': 0.0041, 'grad_norm': 1.8710583209960927, 'learning_rate': 7.104e-07, 'completion_length': 62.76785850524902, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.1025390625, 'epoch': 0.29} 29%|██▉ | 724/2500 [6:15:46<20:49:30, 42.21s/it] 29%|██▉ | 725/2500 [6:16:10<18:07:36, 36.76s/it] {'loss': 0.005, 'grad_norm': 3.344489091531492, 'learning_rate': 7.1e-07, 'completion_length': 67.81250190734863, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.124755859375, 'epoch': 0.29} 29%|██▉ | 725/2500 [6:16:10<18:07:36, 36.76s/it] 29%|██▉ | 726/2500 [6:16:33<16:11:54, 32.87s/it] {'loss': 0.0034, 'grad_norm': 0.15680679615581553, 'learning_rate': 7.096e-07, 'completion_length': 65.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.083984375, 'epoch': 0.29} 29%|██▉ | 726/2500 [6:16:33<16:11:54, 
32.87s/it] 29%|██▉ | 727/2500 [6:17:02<15:36:12, 31.68s/it] {'loss': 0.0033, 'grad_norm': 0.9984738133318263, 'learning_rate': 7.092e-07, 'completion_length': 78.66071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.08203125, 'epoch': 0.29} 29%|██▉ | 727/2500 [6:17:02<15:36:12, 31.68s/it] 29%|██▉ | 728/2500 [6:17:27<14:30:01, 29.46s/it] {'loss': 0.0035, 'grad_norm': 0.12884553577312838, 'learning_rate': 7.088e-07, 'completion_length': 70.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.087890625, 'epoch': 0.29} 29%|██▉ | 728/2500 [6:17:27<14:30:01, 29.46s/it] 29%|██▉ | 729/2500 [6:18:39<20:50:21, 42.36s/it] {'loss': 0.004, 'grad_norm': 0.7808553469074817, 'learning_rate': 7.084e-07, 'completion_length': 63.06250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.10009765625, 'epoch': 0.29} 29%|██▉ | 729/2500 [6:18:39<20:50:21, 42.36s/it] 29%|██▉ | 730/2500 [6:19:05<18:20:31, 37.31s/it] {'loss': 0.003, 'grad_norm': 2.108447121592364, 'learning_rate': 7.079999999999999e-07, 'completion_length': 67.36607551574707, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.073974609375, 'epoch': 0.29} 29%|██▉ | 730/2500 [6:19:05<18:20:31, 37.31s/it] 29%|██▉ | 731/2500 [6:19:30<16:33:17, 33.69s/it] {'loss': 0.0062, 'grad_norm': 0.8980266101040206, 'learning_rate': 7.076e-07, 'completion_length': 62.34821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.15478515625, 'epoch': 0.29} 29%|██▉ | 731/2500 [6:19:30<16:33:17, 33.69s/it] 29%|██▉ | 732/2500 [6:19:55<15:15:45, 
31.08s/it] {'loss': 0.0037, 'grad_norm': 0.4245103126703503, 'learning_rate': 7.072e-07, 'completion_length': 61.65178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.093994140625, 'epoch': 0.29} 29%|██▉ | 732/2500 [6:19:55<15:15:45, 31.08s/it] 29%|██▉ | 733/2500 [6:20:19<14:13:50, 28.99s/it] {'loss': 0.0036, 'grad_norm': 1.683354664506075, 'learning_rate': 7.068e-07, 'completion_length': 64.56250381469727, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.090576171875, 'epoch': 0.29} 29%|██▉ | 733/2500 [6:20:19<14:13:50, 28.99s/it] 29%|██▉ | 734/2500 [6:20:43<13:33:57, 27.65s/it] {'loss': 0.005, 'grad_norm': 0.14656889823174146, 'learning_rate': 7.064e-07, 'completion_length': 64.59821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.124755859375, 'epoch': 0.29} 29%|██▉ | 734/2500 [6:20:43<13:33:57, 27.65s/it] 29%|██▉ | 735/2500 [6:21:06<12:51:04, 26.21s/it] {'loss': 0.0051, 'grad_norm': 1.9219577410759259, 'learning_rate': 7.059999999999999e-07, 'completion_length': 60.571434020996094, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.12841796875, 'epoch': 0.29} 29%|██▉ | 735/2500 [6:21:06<12:51:04, 26.21s/it] 29%|██▉ | 736/2500 [6:21:31<12:38:39, 25.80s/it] {'loss': 0.0041, 'grad_norm': 2.052336495937886, 'learning_rate': 7.056e-07, 'completion_length': 62.28571701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.1015625, 'epoch': 0.29} 29%|██▉ | 736/2500 [6:21:31<12:38:39, 25.80s/it] 29%|██▉ | 737/2500 [6:22:14<15:09:42, 30.96s/it] {'loss': 0.0055, 
'grad_norm': 0.9627891100171809, 'learning_rate': 7.052e-07, 'completion_length': 58.07143211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13818359375, 'epoch': 0.29} 29%|██▉ | 737/2500 [6:22:14<15:09:42, 30.96s/it] 30%|██▉ | 738/2500 [6:22:42<14:39:43, 29.96s/it] {'loss': 0.0045, 'grad_norm': 1.1600340418696717, 'learning_rate': 7.047999999999999e-07, 'completion_length': 68.08928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11328125, 'epoch': 0.3} 30%|██▉ | 738/2500 [6:22:42<14:39:43, 29.96s/it] 30%|██▉ | 739/2500 [6:23:08<14:09:12, 28.93s/it] {'loss': 0.0039, 'grad_norm': 1.0871145214576794, 'learning_rate': 7.044e-07, 'completion_length': 63.000003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09765625, 'epoch': 0.3} 30%|██▉ | 739/2500 [6:23:08<14:09:12, 28.93s/it] 30%|██▉ | 740/2500 [6:23:37<14:10:45, 29.00s/it] {'loss': 0.0031, 'grad_norm': 0.453192064113605, 'learning_rate': 7.04e-07, 'completion_length': 64.22321510314941, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.076904296875, 'epoch': 0.3} 30%|██▉ | 740/2500 [6:23:37<14:10:45, 29.00s/it] 30%|██▉ | 741/2500 [6:24:04<13:45:50, 28.17s/it] {'loss': 0.0047, 'grad_norm': 2.1608442516053588, 'learning_rate': 7.035999999999999e-07, 'completion_length': 61.06250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.116455078125, 'epoch': 0.3} 30%|██▉ | 741/2500 [6:24:04<13:45:50, 28.17s/it] 30%|██▉ | 742/2500 [6:24:30<13:26:49, 27.54s/it] 
{'loss': 0.0048, 'grad_norm': 1.516036792314347, 'learning_rate': 7.032e-07, 'completion_length': 64.71428680419922, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.118896484375, 'epoch': 0.3} 30%|██▉ | 742/2500 [6:24:30<13:26:49, 27.54s/it] 30%|██▉ | 743/2500 [6:24:55<13:10:05, 26.98s/it] {'loss': 0.004, 'grad_norm': 1.8065862311602652, 'learning_rate': 7.028e-07, 'completion_length': 60.36607551574707, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.100830078125, 'epoch': 0.3} 30%|██▉ | 743/2500 [6:24:55<13:10:05, 26.98s/it] 30%|██▉ | 744/2500 [6:25:25<13:35:13, 27.86s/it] {'loss': 0.0046, 'grad_norm': 3.074827723258825, 'learning_rate': 7.024e-07, 'completion_length': 74.23214721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.115966796875, 'epoch': 0.3} 30%|██▉ | 744/2500 [6:25:25<13:35:13, 27.86s/it] 30%|██▉ | 745/2500 [6:26:27<18:34:25, 38.10s/it] {'loss': 0.0042, 'grad_norm': 0.4428733242337692, 'learning_rate': 7.019999999999999e-07, 'completion_length': 69.44643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1044921875, 'epoch': 0.3} 30%|██▉ | 745/2500 [6:26:27<18:34:25, 38.10s/it] 30%|██▉ | 746/2500 [6:27:02<18:01:02, 36.98s/it] {'loss': 0.0039, 'grad_norm': 0.5408382177592187, 'learning_rate': 7.016e-07, 'completion_length': 70.37500381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.096435546875, 'epoch': 0.3} 30%|██▉ | 746/2500 [6:27:02<18:01:02, 36.98s/it] 30%|██▉ | 747/2500 [6:27:28<16:29:48, 
33.88s/it] {'loss': 0.0032, 'grad_norm': 0.7912454773155423, 'learning_rate': 7.012000000000001e-07, 'completion_length': 67.33036041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.080322265625, 'epoch': 0.3} 30%|██▉ | 747/2500 [6:27:28<16:29:48, 33.88s/it] 30%|██▉ | 748/2500 [6:27:56<15:36:56, 32.09s/it] {'loss': 0.005, 'grad_norm': 1.4498614953408027, 'learning_rate': 7.007999999999999e-07, 'completion_length': 64.53571891784668, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.125, 'epoch': 0.3} 30%|██▉ | 748/2500 [6:27:56<15:36:56, 32.09s/it] 30%|██▉ | 749/2500 [6:28:41<17:24:41, 35.80s/it] {'loss': 0.0034, 'grad_norm': 2.7791547858373526, 'learning_rate': 7.004e-07, 'completion_length': 72.98214721679688, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.0849609375, 'epoch': 0.3} 30%|██▉ | 749/2500 [6:28:41<17:24:41, 35.80s/it] 30%|███ | 750/2500 [6:29:05<15:39:59, 32.23s/it] {'loss': 0.0039, 'grad_norm': 0.6644301611230392, 'learning_rate': 7e-07, 'completion_length': 59.62500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.098388671875, 'epoch': 0.3} 30%|███ | 750/2500 [6:29:05<15:39:59, 32.23s/it] 30%|███ | 751/2500 [6:29:31<14:52:24, 30.61s/it] {'loss': 0.0042, 'grad_norm': 0.21981079707994297, 'learning_rate': 6.995999999999999e-07, 'completion_length': 68.18750381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10595703125, 'epoch': 0.3} 30%|███ | 751/2500 [6:29:31<14:52:24, 30.61s/it] 30%|███ | 752/2500 [6:29:58<14:13:04, 29.28s/it] {'loss': 0.0046, 'grad_norm': 
1.4612330387429615, 'learning_rate': 6.992e-07, 'completion_length': 68.9285774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.11474609375, 'epoch': 0.3} 30%|███ | 752/2500 [6:29:58<14:13:04, 29.28s/it] 30%|███ | 753/2500 [6:30:25<13:52:54, 28.61s/it] {'loss': 0.0038, 'grad_norm': 0.18109351401787313, 'learning_rate': 6.988e-07, 'completion_length': 67.78571701049805, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.09619140625, 'epoch': 0.3} 30%|███ | 753/2500 [6:30:25<13:52:54, 28.61s/it] 30%|███ | 754/2500 [6:30:50<13:22:31, 27.58s/it] {'loss': 0.0045, 'grad_norm': 2.004744299648545, 'learning_rate': 6.984e-07, 'completion_length': 62.67857551574707, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.08747543022036552, 'kl': 0.113037109375, 'epoch': 0.3} 30%|███ | 754/2500 [6:30:50<13:22:31, 27.58s/it] 30%|███ | 755/2500 [6:31:15<13:01:17, 26.86s/it] {'loss': 0.0053, 'grad_norm': 1.7456380210447326, 'learning_rate': 6.979999999999999e-07, 'completion_length': 61.000003814697266, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.08747542649507523, 'kl': 0.13330078125, 'epoch': 0.3} 30%|███ | 755/2500 [6:31:15<13:01:17, 26.86s/it] 30%|███ | 756/2500 [6:31:39<12:31:31, 25.86s/it] {'loss': 0.0039, 'grad_norm': 0.1744274583585548, 'learning_rate': 6.976e-07, 'completion_length': 59.473215103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09765625, 'epoch': 0.3} 30%|███ | 756/2500 [6:31:39<12:31:31, 25.86s/it] 30%|███ | 757/2500 [6:32:04<12:24:14, 25.62s/it] {'loss': 0.0032, 'grad_norm': 0.162053729640565, 'learning_rate': 6.972e-07, 
'completion_length': 69.82143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08056640625, 'epoch': 0.3} 30%|███ | 757/2500 [6:32:04<12:24:14, 25.62s/it] 30%|███ | 758/2500 [6:32:28<12:10:25, 25.16s/it] {'loss': 0.0037, 'grad_norm': 0.49823151463384296, 'learning_rate': 6.967999999999999e-07, 'completion_length': 66.35714340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.09326171875, 'epoch': 0.3} 30%|███ | 758/2500 [6:32:28<12:10:25, 25.16s/it] 30%|███ | 759/2500 [6:32:55<12:28:38, 25.80s/it] {'loss': 0.0044, 'grad_norm': 2.2765712436957073, 'learning_rate': 6.964e-07, 'completion_length': 67.20535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.109375, 'epoch': 0.3} 30%|███ | 759/2500 [6:32:55<12:28:38, 25.80s/it] 30%|███ | 760/2500 [6:33:18<12:07:50, 25.10s/it] {'loss': 0.0047, 'grad_norm': 0.923063109352983, 'learning_rate': 6.959999999999999e-07, 'completion_length': 57.76785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.117431640625, 'epoch': 0.3} 30%|███ | 760/2500 [6:33:18<12:07:50, 25.10s/it] 30%|███ | 761/2500 [6:34:20<17:23:57, 36.02s/it] {'loss': 0.0037, 'grad_norm': 2.9539255870705032, 'learning_rate': 6.956e-07, 'completion_length': 63.142860412597656, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.093017578125, 'epoch': 0.3} 30%|███ | 761/2500 [6:34:20<17:23:57, 36.02s/it] 30%|███ | 762/2500 [6:35:11<19:30:07, 40.40s/it] {'loss': 0.0046, 'grad_norm': 14.129238326155193, 'learning_rate': 6.952e-07, 'completion_length': 
64.26785850524902, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.11572265625, 'epoch': 0.3} 30%|███ | 762/2500 [6:35:11<19:30:07, 40.40s/it] 31%|███ | 763/2500 [6:35:37<17:26:31, 36.15s/it] {'loss': 0.0053, 'grad_norm': 0.8205470780104892, 'learning_rate': 6.947999999999999e-07, 'completion_length': 61.78571701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.131591796875, 'epoch': 0.31} 31%|███ | 763/2500 [6:35:37<17:26:31, 36.15s/it] 31%|███ | 764/2500 [6:36:02<15:49:24, 32.81s/it] {'loss': 0.0035, 'grad_norm': 4.883083202849835, 'learning_rate': 6.944e-07, 'completion_length': 62.73214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.08642578125, 'epoch': 0.31} 31%|███ | 764/2500 [6:36:02<15:49:24, 32.81s/it] 31%|███ | 765/2500 [6:36:26<14:31:31, 30.14s/it] {'loss': 0.005, 'grad_norm': 0.18162317631043776, 'learning_rate': 6.939999999999999e-07, 'completion_length': 59.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.125, 'epoch': 0.31} 31%|███ | 765/2500 [6:36:26<14:31:31, 30.14s/it] 31%|███ | 766/2500 [6:36:53<14:02:07, 29.14s/it] {'loss': 0.0044, 'grad_norm': 0.15767009469663826, 'learning_rate': 6.935999999999999e-07, 'completion_length': 57.732147216796875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10986328125, 'epoch': 0.31} 31%|███ | 766/2500 [6:36:53<14:02:07, 29.14s/it] 31%|███ | 767/2500 [6:37:17<13:25:19, 27.88s/it] {'loss': 0.0059, 'grad_norm': 0.981471252583831, 'learning_rate': 6.932e-07, 'completion_length': 57.79464530944824, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.03818017989397049, 'kl': 0.148193359375, 'epoch': 0.31} 31%|███ | 767/2500 [6:37:17<13:25:19, 27.88s/it] 31%|███ | 768/2500 [6:37:42<12:58:24, 26.97s/it] {'loss': 0.0041, 'grad_norm': 1.9433982921274136, 'learning_rate': 6.928e-07, 'completion_length': 58.90178871154785, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.103759765625, 'epoch': 0.31} 31%|███ | 768/2500 [6:37:42<12:58:24, 26.97s/it] 31%|███ | 769/2500 [6:38:07<12:35:29, 26.19s/it] {'loss': 0.0058, 'grad_norm': 1.4022627907153538, 'learning_rate': 6.924e-07, 'completion_length': 56.65178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14404296875, 'epoch': 0.31} 31%|███ | 769/2500 [6:38:07<12:35:29, 26.19s/it] 31%|███ | 770/2500 [6:38:32<12:28:18, 25.95s/it] {'loss': 0.0055, 'grad_norm': 0.8793527948442387, 'learning_rate': 6.919999999999999e-07, 'completion_length': 64.21428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13671875, 'epoch': 0.31} 31%|███ | 770/2500 [6:38:32<12:28:18, 25.95s/it] 31%|███ | 771/2500 [6:38:57<12:19:04, 25.65s/it] {'loss': 0.0045, 'grad_norm': 1.8302925776532697, 'learning_rate': 6.916e-07, 'completion_length': 64.41964530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.11376953125, 'epoch': 0.31} 31%|███ | 771/2500 [6:38:57<12:19:04, 25.65s/it] 31%|███ | 772/2500 [6:39:22<12:11:27, 25.40s/it] {'loss': 0.0047, 'grad_norm': 0.9305465915628754, 'learning_rate': 6.912e-07, 'completion_length': 66.3839340209961, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11669921875, 'epoch': 0.31} 31%|███ | 772/2500 [6:39:22<12:11:27, 25.40s/it] 31%|███ | 773/2500 [6:39:49<12:30:00, 26.06s/it] {'loss': 0.0054, 'grad_norm': 0.16375534943889233, 'learning_rate': 6.907999999999999e-07, 'completion_length': 61.17857551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1357421875, 'epoch': 0.31} 31%|███ | 773/2500 [6:39:49<12:30:00, 26.06s/it] 31%|███ | 774/2500 [6:40:15<12:24:31, 25.88s/it] {'loss': 0.0058, 'grad_norm': 1.5329108491184111, 'learning_rate': 6.904e-07, 'completion_length': 61.50893020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.14501953125, 'epoch': 0.31} 31%|███ | 774/2500 [6:40:15<12:24:31, 25.88s/it] 31%|███ | 775/2500 [6:40:42<12:30:20, 26.10s/it] {'loss': 0.0061, 'grad_norm': 0.6012201837851958, 'learning_rate': 6.9e-07, 'completion_length': 63.28571891784668, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1513671875, 'epoch': 0.31} 31%|███ | 775/2500 [6:40:42<12:30:20, 26.10s/it] 31%|███ | 776/2500 [6:41:17<13:54:27, 29.04s/it] {'loss': 0.0041, 'grad_norm': 3.989162327747306, 'learning_rate': 6.895999999999999e-07, 'completion_length': 65.44643211364746, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.102294921875, 'epoch': 0.31} 31%|███ | 776/2500 [6:41:17<13:54:27, 29.04s/it] 31%|███ | 777/2500 [6:41:52<14:45:23, 30.83s/it] {'loss': 0.0047, 'grad_norm': 0.18845498362087923, 'learning_rate': 6.892e-07, 'completion_length': 58.07143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 
2.0, 'reward_std': 0.0, 'kl': 0.11669921875, 'epoch': 0.31} 31%|███ | 777/2500 [6:41:53<14:45:23, 30.83s/it] 31%|███ | 778/2500 [6:42:28<15:21:32, 32.11s/it] {'loss': 0.0035, 'grad_norm': 1.354780314317606, 'learning_rate': 6.888e-07, 'completion_length': 65.25893211364746, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.05050762742757797, 'kl': 0.088134765625, 'epoch': 0.31} 31%|███ | 778/2500 [6:42:28<15:21:32, 32.11s/it] 31%|███ | 779/2500 [6:42:52<14:12:09, 29.71s/it] {'loss': 0.0053, 'grad_norm': 0.3015856305602608, 'learning_rate': 6.883999999999999e-07, 'completion_length': 58.01785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1328125, 'epoch': 0.31} 31%|███ | 779/2500 [6:42:52<14:12:09, 29.71s/it] 31%|███ | 780/2500 [6:43:18<13:44:23, 28.76s/it] {'loss': 0.0054, 'grad_norm': 1.9976051988620063, 'learning_rate': 6.879999999999999e-07, 'completion_length': 67.59821701049805, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.0835726335644722, 'kl': 0.134521484375, 'epoch': 0.31} 31%|███ | 780/2500 [6:43:18<13:44:23, 28.76s/it] 31%|███ | 781/2500 [6:43:48<13:50:15, 28.98s/it] {'loss': 0.0055, 'grad_norm': 1.5638187082277153, 'learning_rate': 6.876e-07, 'completion_length': 80.99107360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9285715222358704, 'reward_std': 0.16197100281715393, 'kl': 0.13720703125, 'epoch': 0.31} 31%|███ | 781/2500 [6:43:48<13:50:15, 28.98s/it] 31%|███▏ | 782/2500 [6:44:15<13:38:02, 28.57s/it] {'loss': 0.0055, 'grad_norm': 13.547473518611211, 'learning_rate': 6.872e-07, 'completion_length': 64.59821891784668, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 
0.025253813713788986, 'kl': 0.138427734375, 'epoch': 0.31} 31%|███▏ | 782/2500 [6:44:15<13:38:02, 28.57s/it] 31%|███▏ | 783/2500 [6:44:40<13:03:44, 27.39s/it] {'loss': 0.0054, 'grad_norm': 0.1608236667984219, 'learning_rate': 6.867999999999999e-07, 'completion_length': 59.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13525390625, 'epoch': 0.31} 31%|███▏ | 783/2500 [6:44:40<13:03:44, 27.39s/it] 31%|███▏ | 784/2500 [6:45:05<12:44:35, 26.73s/it] {'loss': 0.0056, 'grad_norm': 0.7369644097318124, 'learning_rate': 6.864e-07, 'completion_length': 60.32143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.139892578125, 'epoch': 0.31} 31%|███▏ | 784/2500 [6:45:05<12:44:35, 26.73s/it] 31%|███▏ | 785/2500 [6:45:37<13:25:40, 28.19s/it] {'loss': 0.0061, 'grad_norm': 1.2974629762241439, 'learning_rate': 6.86e-07, 'completion_length': 61.58928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.152587890625, 'epoch': 0.31} 31%|███▏ | 785/2500 [6:45:37<13:25:40, 28.19s/it] 31%|███▏ | 786/2500 [6:46:07<13:42:57, 28.81s/it] {'loss': 0.004, 'grad_norm': 0.8476828756734063, 'learning_rate': 6.855999999999999e-07, 'completion_length': 67.31250381469727, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.099365234375, 'epoch': 0.31} 31%|███▏ | 786/2500 [6:46:07<13:42:57, 28.81s/it] 31%|███▏ | 787/2500 [6:46:34<13:24:44, 28.19s/it] {'loss': 0.0063, 'grad_norm': 1.7996421249147887, 'learning_rate': 6.852e-07, 'completion_length': 60.40178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 
0.025253813713788986, 'kl': 0.15673828125, 'epoch': 0.31} 31%|███▏ | 787/2500 [6:46:34<13:24:44, 28.19s/it] 32%|███▏ | 788/2500 [6:47:00<13:06:58, 27.58s/it] {'loss': 0.0057, 'grad_norm': 0.5402292751642965, 'learning_rate': 6.847999999999999e-07, 'completion_length': 51.10714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14208984375, 'epoch': 0.32} 32%|███▏ | 788/2500 [6:47:00<13:06:58, 27.58s/it] 32%|███▏ | 789/2500 [6:47:25<12:47:48, 26.92s/it] {'loss': 0.0048, 'grad_norm': 0.14477573613588474, 'learning_rate': 6.844e-07, 'completion_length': 63.29464530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11865234375, 'epoch': 0.32} 32%|███▏ | 789/2500 [6:47:25<12:47:48, 26.92s/it] 32%|███▏ | 790/2500 [6:47:52<12:46:04, 26.88s/it] {'loss': 0.0047, 'grad_norm': 0.17118370702020752, 'learning_rate': 6.84e-07, 'completion_length': 56.96428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1171875, 'epoch': 0.32} 32%|███▏ | 790/2500 [6:47:52<12:46:04, 26.88s/it] 32%|███▏ | 791/2500 [6:48:15<12:11:08, 25.67s/it] {'loss': 0.0045, 'grad_norm': 1.5086095692346602, 'learning_rate': 6.836e-07, 'completion_length': 56.99107360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.11328125, 'epoch': 0.32} 32%|███▏ | 791/2500 [6:48:15<12:11:08, 25.67s/it] 32%|███▏ | 792/2500 [6:48:40<12:02:09, 25.37s/it] {'loss': 0.0051, 'grad_norm': 0.15061518503005208, 'learning_rate': 6.832e-07, 'completion_length': 66.20536041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12841796875, 'epoch': 0.32} 32%|███▏ | 792/2500 [6:48:40<12:02:09, 25.37s/it] 32%|███▏ | 793/2500 
[6:49:07<12:18:25, 25.96s/it] {'loss': 0.0053, 'grad_norm': 0.14903503836267445, 'learning_rate': 6.827999999999999e-07, 'completion_length': 67.10714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13232421875, 'epoch': 0.32} 32%|███▏ | 793/2500 [6:49:07<12:18:25, 25.96s/it] 32%|███▏ | 794/2500 [6:49:31<12:00:13, 25.33s/it] {'loss': 0.006, 'grad_norm': 2.1684886922860596, 'learning_rate': 6.824e-07, 'completion_length': 58.169647216796875, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.14990234375, 'epoch': 0.32} 32%|███▏ | 794/2500 [6:49:31<12:00:13, 25.33s/it] 32%|███▏ | 795/2500 [6:50:06<13:25:06, 28.33s/it] {'loss': 0.0048, 'grad_norm': 1.884103617123289, 'learning_rate': 6.82e-07, 'completion_length': 63.55357360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.119384765625, 'epoch': 0.32} 32%|███▏ | 795/2500 [6:50:06<13:25:06, 28.33s/it] 32%|███▏ | 796/2500 [6:50:31<12:52:51, 27.21s/it] {'loss': 0.0064, 'grad_norm': 2.6187714598224328, 'learning_rate': 6.816e-07, 'completion_length': 59.50000190734863, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.1611328125, 'epoch': 0.32} 32%|███▏ | 796/2500 [6:50:31<12:52:51, 27.21s/it] 32%|███▏ | 797/2500 [6:50:56<12:37:47, 26.70s/it] {'loss': 0.0051, 'grad_norm': 1.480639330752583, 'learning_rate': 6.812e-07, 'completion_length': 60.830360412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.0835726410150528, 'kl': 0.127197265625, 'epoch': 0.32} 32%|███▏ | 797/2500 [6:50:56<12:37:47, 26.70s/it] 32%|███▏ | 798/2500 [6:51:23<12:36:20, 
26.66s/it] {'loss': 0.0051, 'grad_norm': 2.3879504665163913, 'learning_rate': 6.807999999999999e-07, 'completion_length': 62.267860412597656, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.12646484375, 'epoch': 0.32} 32%|███▏ | 798/2500 [6:51:23<12:36:20, 26.66s/it] 32%|███▏ | 799/2500 [6:51:47<12:16:59, 26.00s/it] {'loss': 0.0047, 'grad_norm': 0.22930016424474128, 'learning_rate': 6.804e-07, 'completion_length': 66.41071510314941, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11767578125, 'epoch': 0.32} 32%|███▏ | 799/2500 [6:51:47<12:16:59, 26.00s/it] 32%|███▏ | 800/2500 [6:52:12<12:06:22, 25.64s/it] {'loss': 0.0045, 'grad_norm': 0.7602012028252689, 'learning_rate': 6.800000000000001e-07, 'completion_length': 71.46429061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11279296875, 'epoch': 0.32} 32%|███▏ | 800/2500 [6:52:12<12:06:22, 25.64s/it] 32%|███▏ | 801/2500 [6:53:22<18:22:11, 38.92s/it] {'loss': 0.0054, 'grad_norm': 0.8365508944886197, 'learning_rate': 6.795999999999999e-07, 'completion_length': 69.23214721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.134033203125, 'epoch': 0.32} 32%|███▏ | 801/2500 [6:53:22<18:22:11, 38.92s/it] 32%|███▏ | 802/2500 [6:53:43<15:51:35, 33.63s/it] {'loss': 0.0045, 'grad_norm': 1.0186062235629063, 'learning_rate': 6.792e-07, 'completion_length': 59.86607360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.111572265625, 'epoch': 0.32} 32%|███▏ | 802/2500 [6:53:43<15:51:35, 33.63s/it] 32%|███▏ | 803/2500 
[6:54:06<14:16:06, 30.27s/it] {'loss': 0.0048, 'grad_norm': 0.2387581992621646, 'learning_rate': 6.788e-07, 'completion_length': 61.29464530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.120361328125, 'epoch': 0.32} 32%|███▏ | 803/2500 [6:54:06<14:16:06, 30.27s/it] 32%|███▏ | 804/2500 [6:54:30<13:23:00, 28.41s/it] {'loss': 0.0049, 'grad_norm': 0.15780634000342392, 'learning_rate': 6.783999999999999e-07, 'completion_length': 57.71428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.121337890625, 'epoch': 0.32} 32%|███▏ | 804/2500 [6:54:30<13:23:00, 28.41s/it] 32%|███▏ | 805/2500 [6:54:57<13:16:20, 28.19s/it] {'loss': 0.004, 'grad_norm': 0.6678190548414402, 'learning_rate': 6.78e-07, 'completion_length': 71.98214721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.099609375, 'epoch': 0.32} 32%|███▏ | 805/2500 [6:54:57<13:16:20, 28.19s/it] 32%|███▏ | 806/2500 [6:55:21<12:41:07, 26.96s/it] {'loss': 0.0038, 'grad_norm': 0.16846029427717377, 'learning_rate': 6.776e-07, 'completion_length': 69.0714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.095703125, 'epoch': 0.32} 32%|███▏ | 806/2500 [6:55:21<12:41:07, 26.96s/it] 32%|███▏ | 807/2500 [6:55:47<12:31:37, 26.64s/it] {'loss': 0.0048, 'grad_norm': 3.084908362848447, 'learning_rate': 6.772e-07, 'completion_length': 67.12500381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1201171875, 'epoch': 0.32} 32%|███▏ | 807/2500 [6:55:47<12:31:37, 26.64s/it] 32%|███▏ | 808/2500 [6:56:13<12:20:19, 26.25s/it] {'loss': 0.0048, 'grad_norm': 0.7359650602521352, 'learning_rate': 6.767999999999999e-07, 'completion_length': 
72.30357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.12060546875, 'epoch': 0.32} 32%|███▏ | 808/2500 [6:56:13<12:20:19, 26.25s/it] 32%|███▏ | 809/2500 [6:56:40<12:31:02, 26.65s/it] {'loss': 0.0055, 'grad_norm': 0.5842733179786399, 'learning_rate': 6.764e-07, 'completion_length': 61.55357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.137939453125, 'epoch': 0.32} 32%|███▏ | 809/2500 [6:56:40<12:31:02, 26.65s/it] 32%|███▏ | 810/2500 [6:57:08<12:40:04, 26.99s/it] {'loss': 0.0047, 'grad_norm': 1.4920955574816692, 'learning_rate': 6.76e-07, 'completion_length': 69.16071891784668, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.118896484375, 'epoch': 0.32} 32%|███▏ | 810/2500 [6:57:08<12:40:04, 26.99s/it] 32%|███▏ | 811/2500 [6:57:52<15:02:43, 32.07s/it] {'loss': 0.0074, 'grad_norm': 1.7785870697989208, 'learning_rate': 6.755999999999999e-07, 'completion_length': 63.508934020996094, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.18505859375, 'epoch': 0.32} 32%|███▏ | 811/2500 [6:57:52<15:02:43, 32.07s/it] 32%|███▏ | 812/2500 [6:58:30<15:49:33, 33.75s/it] {'loss': 0.0049, 'grad_norm': 2.3903478529601787, 'learning_rate': 6.752e-07, 'completion_length': 69.08928680419922, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9464285969734192, 'reward_std': 0.06222161650657654, 'kl': 0.12353515625, 'epoch': 0.32} 32%|███▏ | 812/2500 [6:58:30<15:49:33, 33.75s/it] 33%|███▎ | 813/2500 [6:59:36<20:19:55, 43.39s/it] {'loss': 0.0039, 'grad_norm': 2.1957351767031876, 
'learning_rate': 6.747999999999999e-07, 'completion_length': 64.18750381469727, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.098388671875, 'epoch': 0.33} 33%|███▎ | 813/2500 [6:59:36<20:19:55, 43.39s/it] 33%|███▎ | 814/2500 [7:00:08<18:46:05, 40.07s/it] {'loss': 0.0049, 'grad_norm': 1.188879369864058, 'learning_rate': 6.744e-07, 'completion_length': 60.31250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1220703125, 'epoch': 0.33} 33%|███▎ | 814/2500 [7:00:08<18:46:05, 40.07s/it] 33%|███▎ | 815/2500 [7:01:20<23:13:05, 49.61s/it] {'loss': 0.0046, 'grad_norm': 0.11759195253984088, 'learning_rate': 6.74e-07, 'completion_length': 67.08929061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114990234375, 'epoch': 0.33} 33%|███▎ | 815/2500 [7:01:20<23:13:05, 49.61s/it] 33%|███▎ | 816/2500 [7:01:56<21:23:10, 45.72s/it] {'loss': 0.0046, 'grad_norm': 0.12005846624977162, 'learning_rate': 6.736e-07, 'completion_length': 58.16964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114990234375, 'epoch': 0.33} 33%|███▎ | 816/2500 [7:01:56<21:23:10, 45.72s/it] 33%|███▎ | 817/2500 [7:02:28<19:27:09, 41.61s/it] {'loss': 0.0041, 'grad_norm': 1.6600528798448524, 'learning_rate': 6.732e-07, 'completion_length': 69.04464721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.101806640625, 'epoch': 0.33} 33%|███▎ | 817/2500 [7:02:28<19:27:09, 41.61s/it] 33%|███▎ | 818/2500 [7:03:26<21:42:45, 46.47s/it] {'loss': 0.0052, 'grad_norm': 0.1224978739783324, 'learning_rate': 6.727999999999999e-07, 'completion_length': 
56.84821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.130615234375, 'epoch': 0.33} 33%|███▎ | 818/2500 [7:03:26<21:42:45, 46.47s/it] 33%|███▎ | 819/2500 [7:03:52<18:46:26, 40.21s/it] {'loss': 0.0041, 'grad_norm': 0.15487242087633213, 'learning_rate': 6.724e-07, 'completion_length': 64.58036231994629, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1025390625, 'epoch': 0.33} 33%|███▎ | 819/2500 [7:03:52<18:46:26, 40.21s/it] 33%|███▎ | 820/2500 [7:04:29<18:18:59, 39.25s/it] {'loss': 0.0052, 'grad_norm': 0.17462847359182745, 'learning_rate': 6.72e-07, 'completion_length': 53.41964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13037109375, 'epoch': 0.33} 33%|███▎ | 820/2500 [7:04:29<18:18:59, 39.25s/it] 33%|███▎ | 821/2500 [7:05:05<17:50:14, 38.25s/it] {'loss': 0.0046, 'grad_norm': 3.4732128761689784, 'learning_rate': 6.716e-07, 'completion_length': 60.81250190734863, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.115234375, 'epoch': 0.33} 33%|███▎ | 821/2500 [7:05:05<17:50:14, 38.25s/it] 33%|███▎ | 822/2500 [7:06:06<21:03:04, 45.16s/it] {'loss': 0.004, 'grad_norm': 0.13399594959206818, 'learning_rate': 6.712e-07, 'completion_length': 56.41964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.099853515625, 'epoch': 0.33} 33%|███▎ | 822/2500 [7:06:06<21:03:04, 45.16s/it] 33%|███▎ | 823/2500 [7:06:33<18:30:15, 39.72s/it] {'loss': 0.0042, 'grad_norm': 1.3657322403662955, 'learning_rate': 6.707999999999999e-07, 'completion_length': 57.830360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.104248046875, 
'epoch': 0.33} 33%|███▎ | 823/2500 [7:06:33<18:30:15, 39.72s/it] 33%|███▎ | 824/2500 [7:07:10<18:05:05, 38.85s/it] {'loss': 0.0054, 'grad_norm': 0.7715752621753393, 'learning_rate': 6.704e-07, 'completion_length': 57.74107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.134521484375, 'epoch': 0.33} 33%|███▎ | 824/2500 [7:07:10<18:05:05, 38.85s/it] 33%|███▎ | 825/2500 [7:07:36<16:14:12, 34.90s/it] {'loss': 0.0037, 'grad_norm': 1.1959155835622868, 'learning_rate': 6.7e-07, 'completion_length': 64.92857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.093017578125, 'epoch': 0.33} 33%|███▎ | 825/2500 [7:07:36<16:14:12, 34.90s/it] 33%|███▎ | 826/2500 [7:08:17<17:11:28, 36.97s/it] {'loss': 0.0052, 'grad_norm': 1.13212138595896, 'learning_rate': 6.695999999999999e-07, 'completion_length': 63.28571891784668, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13134765625, 'epoch': 0.33} 33%|███▎ | 826/2500 [7:08:17<17:11:28, 36.97s/it] 33%|███▎ | 827/2500 [7:08:45<15:51:24, 34.12s/it] {'loss': 0.0048, 'grad_norm': 1.3398216297259917, 'learning_rate': 6.692e-07, 'completion_length': 66.43750381469727, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375000596046448, 'reward_std': 0.08747543022036552, 'kl': 0.12109375, 'epoch': 0.33} 33%|███▎ | 827/2500 [7:08:45<15:51:24, 34.12s/it] 33%|███▎ | 828/2500 [7:09:11<14:41:07, 31.62s/it] {'loss': 0.0061, 'grad_norm': 3.1824784747669033, 'learning_rate': 6.688e-07, 'completion_length': 61.339290618896484, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 
0.1533203125, 'epoch': 0.33} 33%|███▎ | 828/2500 [7:09:11<14:41:07, 31.62s/it] 33%|███▎ | 829/2500 [7:09:42<14:38:32, 31.55s/it] {'loss': 0.0051, 'grad_norm': 2.7856000654979036, 'learning_rate': 6.683999999999999e-07, 'completion_length': 67.41964721679688, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9375000596046448, 'reward_std': 0.154159814119339, 'kl': 0.128662109375, 'epoch': 0.33} 33%|███▎ | 829/2500 [7:09:42<14:38:32, 31.55s/it] 33%|███▎ | 830/2500 [7:10:09<13:56:57, 30.07s/it] {'loss': 0.0045, 'grad_norm': 1.4503536216138682, 'learning_rate': 6.68e-07, 'completion_length': 68.7410774230957, 'rewards/accuracy_reward': 0.8660714328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.8392857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.113525390625, 'epoch': 0.33} 33%|███▎ | 830/2500 [7:10:09<13:56:57, 30.07s/it] 33%|███▎ | 831/2500 [7:11:15<18:55:52, 40.83s/it] {'loss': 0.004, 'grad_norm': 1.5653309205486856, 'learning_rate': 6.676e-07, 'completion_length': 70.93750381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.100341796875, 'epoch': 0.33} 33%|███▎ | 831/2500 [7:11:15<18:55:52, 40.83s/it] 33%|███▎ | 832/2500 [7:11:44<17:16:52, 37.30s/it] {'loss': 0.0045, 'grad_norm': 1.6048994859588943, 'learning_rate': 6.671999999999999e-07, 'completion_length': 74.40178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.11181640625, 'epoch': 0.33} 33%|███▎ | 832/2500 [7:11:44<17:16:52, 37.30s/it] 33%|███▎ | 833/2500 [7:12:25<17:49:43, 38.50s/it] {'loss': 0.0032, 'grad_norm': 1.0466028329210277, 'learning_rate': 6.667999999999999e-07, 'completion_length': 77.25893020629883, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 
1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.079345703125, 'epoch': 0.33} 33%|███▎ | 833/2500 [7:12:25<17:49:43, 38.50s/it] 33%|███▎ | 834/2500 [7:12:51<16:07:24, 34.84s/it] {'loss': 0.0039, 'grad_norm': 1.0486006418905482, 'learning_rate': 6.664e-07, 'completion_length': 63.098215103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09765625, 'epoch': 0.33} 33%|███▎ | 834/2500 [7:12:51<16:07:24, 34.84s/it] 33%|███▎ | 835/2500 [7:13:25<16:01:24, 34.65s/it] {'loss': 0.0039, 'grad_norm': 1.7487605876234034, 'learning_rate': 6.66e-07, 'completion_length': 80.9910774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.9553571939468384, 'reward_std': 0.12626906856894493, 'kl': 0.097412109375, 'epoch': 0.33} 33%|███▎ | 835/2500 [7:13:25<16:01:24, 34.65s/it] 33%|███▎ | 836/2500 [7:14:14<17:58:37, 38.89s/it] {'loss': 0.0053, 'grad_norm': 2.3171507958882014, 'learning_rate': 6.655999999999999e-07, 'completion_length': 86.32143020629883, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.9017857909202576, 'reward_std': 0.2133290022611618, 'kl': 0.132568359375, 'epoch': 0.33} 33%|███▎ | 836/2500 [7:14:14<17:58:37, 38.89s/it] 33%|███▎ | 837/2500 [7:14:59<18:44:45, 40.58s/it] {'loss': 0.004, 'grad_norm': 0.6422922355146586, 'learning_rate': 6.652e-07, 'completion_length': 74.37500381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.099853515625, 'epoch': 0.33} 33%|███▎ | 837/2500 [7:14:59<18:44:45, 40.58s/it] 34%|███▎ | 838/2500 [7:15:42<19:06:36, 41.39s/it] {'loss': 0.0042, 'grad_norm': 0.7575255485368768, 'learning_rate': 6.647999999999999e-07, 'completion_length': 72.52678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.10400390625, 'epoch': 0.34} 34%|███▎ | 838/2500 [7:15:42<19:06:36, 41.39s/it] 34%|███▎ | 839/2500 [7:16:10<17:12:37, 37.30s/it] {'loss': 0.0039, 'grad_norm': 0.19955956558634858, 'learning_rate': 6.643999999999999e-07, 'completion_length': 67.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.096435546875, 'epoch': 0.34} 34%|███▎ | 839/2500 [7:16:10<17:12:37, 37.30s/it] 34%|███▎ | 840/2500 [7:16:36<15:39:16, 33.95s/it] {'loss': 0.0038, 'grad_norm': 1.2214137341974458, 'learning_rate': 6.64e-07, 'completion_length': 71.9910774230957, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.094970703125, 'epoch': 0.34} 34%|███▎ | 840/2500 [7:16:36<15:39:16, 33.95s/it] 34%|███▎ | 841/2500 [7:17:00<14:21:13, 31.15s/it] {'loss': 0.005, 'grad_norm': 9.541738619232063, 'learning_rate': 6.636e-07, 'completion_length': 69.38393020629883, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.12548828125, 'epoch': 0.34} 34%|███▎ | 841/2500 [7:17:00<14:21:13, 31.15s/it] 34%|███▎ | 842/2500 [7:17:26<13:34:48, 29.49s/it] {'loss': 0.0047, 'grad_norm': 1.0893137056983593, 'learning_rate': 6.632e-07, 'completion_length': 65.14286041259766, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.1162109375, 'epoch': 0.34} 34%|███▎ | 842/2500 [7:17:26<13:34:48, 29.49s/it] 34%|███▎ | 843/2500 [7:17:51<12:53:28, 28.01s/it] {'loss': 0.0038, 'grad_norm': 0.37135786771695023, 'learning_rate': 6.627999999999999e-07, 'completion_length': 71.22321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 
'kl': 0.0947265625, 'epoch': 0.34} 34%|███▎ | 843/2500 [7:17:51<12:53:28, 28.01s/it] 34%|███▍ | 844/2500 [7:18:18<12:45:41, 27.74s/it] {'loss': 0.0047, 'grad_norm': 2.7337423362149655, 'learning_rate': 6.624e-07, 'completion_length': 68.51785850524902, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.07636035978794098, 'kl': 0.118408203125, 'epoch': 0.34} 34%|███▍ | 844/2500 [7:18:18<12:45:41, 27.74s/it] 34%|███▍ | 845/2500 [7:18:44<12:28:39, 27.14s/it] {'loss': 0.0045, 'grad_norm': 0.3018884845245033, 'learning_rate': 6.62e-07, 'completion_length': 66.83928871154785, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.111328125, 'epoch': 0.34} 34%|███▍ | 845/2500 [7:18:44<12:28:39, 27.14s/it] 34%|███▍ | 846/2500 [7:19:11<12:30:35, 27.23s/it] {'loss': 0.0057, 'grad_norm': 0.5358998196719567, 'learning_rate': 6.615999999999999e-07, 'completion_length': 73.4464340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14208984375, 'epoch': 0.34} 34%|███▍ | 846/2500 [7:19:11<12:30:35, 27.23s/it] 34%|███▍ | 847/2500 [7:19:36<12:14:35, 26.66s/it] {'loss': 0.0048, 'grad_norm': 0.7958765265547122, 'learning_rate': 6.612e-07, 'completion_length': 65.51786041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.119140625, 'epoch': 0.34} 34%|███▍ | 847/2500 [7:19:36<12:14:35, 26.66s/it] 34%|███▍ | 848/2500 [7:20:04<12:26:28, 27.11s/it] {'loss': 0.0048, 'grad_norm': 0.5039590845357279, 'learning_rate': 6.608e-07, 'completion_length': 72.44643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 
0.119873046875, 'epoch': 0.34} 34%|███▍ | 848/2500 [7:20:04<12:26:28, 27.11s/it] 34%|███▍ | 849/2500 [7:20:29<12:01:27, 26.22s/it] {'loss': 0.006, 'grad_norm': 1.4752051776509703, 'learning_rate': 6.604e-07, 'completion_length': 57.88393020629883, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.150390625, 'epoch': 0.34} 34%|███▍ | 849/2500 [7:20:29<12:01:27, 26.22s/it] 34%|███▍ | 850/2500 [7:20:55<12:00:04, 26.18s/it] {'loss': 0.005, 'grad_norm': 1.8041924617113616, 'learning_rate': 6.6e-07, 'completion_length': 66.36607360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343398988246918, 'kl': 0.124755859375, 'epoch': 0.34} 34%|███▍ | 850/2500 [7:20:55<12:00:04, 26.18s/it] 34%|███▍ | 851/2500 [7:21:19<11:46:14, 25.70s/it] {'loss': 0.005, 'grad_norm': 0.2627694874550169, 'learning_rate': 6.595999999999999e-07, 'completion_length': 68.96428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.126220703125, 'epoch': 0.34} 34%|███▍ | 851/2500 [7:21:19<11:46:14, 25.70s/it] 34%|███▍ | 852/2500 [7:21:45<11:46:55, 25.74s/it] {'loss': 0.0047, 'grad_norm': 0.21319730433693704, 'learning_rate': 6.592e-07, 'completion_length': 66.34821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1181640625, 'epoch': 0.34} 34%|███▍ | 852/2500 [7:21:45<11:46:55, 25.74s/it] 34%|███▍ | 853/2500 [7:22:15<12:19:31, 26.94s/it] {'loss': 0.0039, 'grad_norm': 0.23219560745439924, 'learning_rate': 6.588e-07, 'completion_length': 60.71428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0986328125, 'epoch': 0.34} 34%|███▍ | 853/2500 [7:22:15<12:19:31, 26.94s/it] 34%|███▍ | 854/2500 [7:22:41<12:15:12, 26.80s/it] {'loss': 0.0045, 
'grad_norm': 0.24386416688177193, 'learning_rate': 6.583999999999999e-07, 'completion_length': 69.2589340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11328125, 'epoch': 0.34} 34%|███▍ | 854/2500 [7:22:41<12:15:12, 26.80s/it] 34%|███▍ | 855/2500 [7:23:06<11:56:04, 26.12s/it] {'loss': 0.0061, 'grad_norm': 1.8291585723727952, 'learning_rate': 6.58e-07, 'completion_length': 63.008934020996094, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.11272925138473511, 'kl': 0.15087890625, 'epoch': 0.34} 34%|███▍ | 855/2500 [7:23:06<11:56:04, 26.12s/it] 34%|███▍ | 856/2500 [7:23:36<12:25:16, 27.20s/it] {'loss': 0.0047, 'grad_norm': 0.48218393711031693, 'learning_rate': 6.576e-07, 'completion_length': 72.88393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.116943359375, 'epoch': 0.34} 34%|███▍ | 856/2500 [7:23:36<12:25:16, 27.20s/it] 34%|███▍ | 857/2500 [7:24:02<12:21:59, 27.10s/it] {'loss': 0.0049, 'grad_norm': 1.5508112159442078, 'learning_rate': 6.571999999999999e-07, 'completion_length': 66.48214340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.07839837670326233, 'kl': 0.12109375, 'epoch': 0.34} 34%|███▍ | 857/2500 [7:24:02<12:21:59, 27.10s/it] 34%|███▍ | 858/2500 [7:24:28<12:07:43, 26.59s/it] {'loss': 0.0051, 'grad_norm': 0.9396746409462742, 'learning_rate': 6.568e-07, 'completion_length': 57.09821701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.127197265625, 'epoch': 0.34} 34%|███▍ | 858/2500 [7:24:28<12:07:43, 26.59s/it] 34%|███▍ | 859/2500 [7:24:54<12:05:52, 26.54s/it] {'loss': 0.0061, 
'grad_norm': 0.9697190217894573, 'learning_rate': 6.564e-07, 'completion_length': 55.06250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1533203125, 'epoch': 0.34} 34%|███▍ | 859/2500 [7:24:54<12:05:52, 26.54s/it] 34%|███▍ | 860/2500 [7:25:47<15:42:06, 34.47s/it] {'loss': 0.0051, 'grad_norm': 1.7132546446623973, 'learning_rate': 6.56e-07, 'completion_length': 65.81250190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.127197265625, 'epoch': 0.34} 34%|███▍ | 860/2500 [7:25:47<15:42:06, 34.47s/it] 34%|███▍ | 861/2500 [7:26:16<14:56:10, 32.81s/it] {'loss': 0.0051, 'grad_norm': 2.752128520297034, 'learning_rate': 6.555999999999999e-07, 'completion_length': 60.38393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.126708984375, 'epoch': 0.34} 34%|███▍ | 861/2500 [7:26:16<14:56:10, 32.81s/it] 34%|███▍ | 862/2500 [7:26:39<13:34:05, 29.82s/it] {'loss': 0.0059, 'grad_norm': 1.3346345313357422, 'learning_rate': 6.552e-07, 'completion_length': 55.10714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1474609375, 'epoch': 0.34} 34%|███▍ | 862/2500 [7:26:39<13:34:05, 29.82s/it] 35%|███▍ | 863/2500 [7:27:05<13:03:13, 28.71s/it] {'loss': 0.0045, 'grad_norm': 0.8508474394205572, 'learning_rate': 6.548000000000001e-07, 'completion_length': 63.20535850524902, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.05050762742757797, 'kl': 0.113037109375, 'epoch': 0.35} 35%|███▍ | 863/2500 [7:27:05<13:03:13, 28.71s/it] 35%|███▍ | 864/2500 
[7:27:34<13:02:13, 28.69s/it] {'loss': 0.0054, 'grad_norm': 0.8565091033710359, 'learning_rate': 6.543999999999999e-07, 'completion_length': 70.43750381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.13525390625, 'epoch': 0.35} 35%|███▍ | 864/2500 [7:27:34<13:02:13, 28.69s/it] 35%|███▍ | 865/2500 [7:27:59<12:30:37, 27.55s/it] {'loss': 0.0045, 'grad_norm': 0.21783547251960014, 'learning_rate': 6.54e-07, 'completion_length': 60.43750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.113525390625, 'epoch': 0.35} 35%|███▍ | 865/2500 [7:27:59<12:30:37, 27.55s/it] 35%|███▍ | 866/2500 [7:28:23<12:05:14, 26.63s/it] {'loss': 0.0063, 'grad_norm': 1.0564077614374394, 'learning_rate': 6.536e-07, 'completion_length': 60.48214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15625, 'epoch': 0.35} 35%|███▍ | 866/2500 [7:28:23<12:05:14, 26.63s/it] 35%|███▍ | 867/2500 [7:28:49<12:02:45, 26.56s/it] {'loss': 0.0063, 'grad_norm': 1.3421406979237318, 'learning_rate': 6.531999999999999e-07, 'completion_length': 52.76785850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.15673828125, 'epoch': 0.35} 35%|███▍ | 867/2500 [7:28:50<12:02:45, 26.56s/it] 35%|███▍ | 868/2500 [7:29:56<17:26:59, 38.49s/it] {'loss': 0.0047, 'grad_norm': 0.19597178297431617, 'learning_rate': 6.528e-07, 'completion_length': 64.66964721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.117431640625, 'epoch': 0.35} 35%|███▍ | 868/2500 [7:29:56<17:26:59, 38.49s/it] 35%|███▍ | 869/2500 [7:30:24<15:59:09, 35.28s/it] {'loss': 0.0056, 'grad_norm': 
0.5980780952162853, 'learning_rate': 6.524e-07, 'completion_length': 64.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.139892578125, 'epoch': 0.35} 35%|███▍ | 869/2500 [7:30:24<15:59:09, 35.28s/it] 35%|███▍ | 870/2500 [7:30:48<14:30:12, 32.03s/it] {'loss': 0.0046, 'grad_norm': 0.21845737137125695, 'learning_rate': 6.52e-07, 'completion_length': 52.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1142578125, 'epoch': 0.35} 35%|███▍ | 870/2500 [7:30:48<14:30:12, 32.03s/it] 35%|███▍ | 871/2500 [7:31:14<13:40:50, 30.23s/it] {'loss': 0.0049, 'grad_norm': 2.1322004701438417, 'learning_rate': 6.515999999999999e-07, 'completion_length': 55.60714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.12158203125, 'epoch': 0.35} 35%|███▍ | 871/2500 [7:31:14<13:40:50, 30.23s/it] 35%|███▍ | 872/2500 [7:31:40<13:04:31, 28.91s/it] {'loss': 0.0045, 'grad_norm': 0.252762217065911, 'learning_rate': 6.512e-07, 'completion_length': 58.90178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11181640625, 'epoch': 0.35} 35%|███▍ | 872/2500 [7:31:40<13:04:31, 28.91s/it] 35%|███▍ | 873/2500 [7:32:05<12:35:05, 27.85s/it] {'loss': 0.0055, 'grad_norm': 1.1103045982406516, 'learning_rate': 6.508e-07, 'completion_length': 62.25893211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.13720703125, 'epoch': 0.35} 35%|███▍ | 873/2500 [7:32:05<12:35:05, 27.85s/it] 35%|███▍ | 874/2500 [7:32:29<12:04:47, 26.75s/it] {'loss': 0.0056, 'grad_norm': 0.24146524935269953, 'learning_rate': 6.504e-07, 'completion_length': 
60.88393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13916015625, 'epoch': 0.35} 35%|███▍ | 874/2500 [7:32:30<12:04:47, 26.75s/it] 35%|███▌ | 875/2500 [7:32:53<11:37:33, 25.76s/it] {'loss': 0.0051, 'grad_norm': 0.13500565614111226, 'learning_rate': 6.5e-07, 'completion_length': 59.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12841796875, 'epoch': 0.35} 35%|███▌ | 875/2500 [7:32:53<11:37:33, 25.76s/it] 35%|███▌ | 876/2500 [7:33:16<11:19:21, 25.10s/it] {'loss': 0.0044, 'grad_norm': 0.13305853331011977, 'learning_rate': 6.495999999999999e-07, 'completion_length': 56.75000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.109375, 'epoch': 0.35} 35%|███▌ | 876/2500 [7:33:17<11:19:21, 25.10s/it] 35%|███▌ | 877/2500 [7:33:40<11:08:59, 24.73s/it] {'loss': 0.0042, 'grad_norm': 2.1860790679112845, 'learning_rate': 6.492e-07, 'completion_length': 53.84821701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.106201171875, 'epoch': 0.35} 35%|███▌ | 877/2500 [7:33:40<11:08:59, 24.73s/it] 35%|███▌ | 878/2500 [7:34:06<11:18:24, 25.10s/it] {'loss': 0.0038, 'grad_norm': 0.8986350267367716, 'learning_rate': 6.488e-07, 'completion_length': 66.72322082519531, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.095703125, 'epoch': 0.35} 35%|███▌ | 878/2500 [7:34:06<11:18:24, 25.10s/it] 35%|███▌ | 879/2500 [7:34:46<13:16:02, 29.46s/it] {'loss': 0.0047, 'grad_norm': 0.5799814678621245, 'learning_rate': 6.483999999999999e-07, 'completion_length': 59.91964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.116943359375, 'epoch': 0.35} 35%|███▌ | 879/2500 [7:34:46<13:16:02, 29.46s/it] 35%|███▌ | 880/2500 [7:35:52<18:13:41, 40.51s/it] {'loss': 0.0042, 'grad_norm': 2.0630396900780417, 'learning_rate': 6.48e-07, 'completion_length': 63.60714530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.10400390625, 'epoch': 0.35} 35%|███▌ | 880/2500 [7:35:52<18:13:41, 40.51s/it] 35%|███▌ | 881/2500 [7:36:18<16:16:22, 36.18s/it] {'loss': 0.0044, 'grad_norm': 2.641310260404875, 'learning_rate': 6.476e-07, 'completion_length': 58.16071891784668, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.110595703125, 'epoch': 0.35} 35%|███▌ | 881/2500 [7:36:18<16:16:22, 36.18s/it] 35%|███▌ | 882/2500 [7:36:45<14:56:34, 33.25s/it] {'loss': 0.0052, 'grad_norm': 1.0634429560311234, 'learning_rate': 6.471999999999999e-07, 'completion_length': 63.625003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.131103515625, 'epoch': 0.35} 35%|███▌ | 882/2500 [7:36:45<14:56:34, 33.25s/it] 35%|███▌ | 883/2500 [7:37:12<14:06:05, 31.39s/it] {'loss': 0.0052, 'grad_norm': 0.1677864301430674, 'learning_rate': 6.468e-07, 'completion_length': 53.96428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.130859375, 'epoch': 0.35} 35%|███▌ | 883/2500 [7:37:12<14:06:05, 31.39s/it] 35%|███▌ | 884/2500 [7:37:39<13:28:57, 30.04s/it] {'loss': 0.0046, 'grad_norm': 1.1824322507342155, 'learning_rate': 6.464e-07, 'completion_length': 67.41071891784668, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 
'reward_std': 0.05050762742757797, 'kl': 0.115478515625, 'epoch': 0.35} 35%|███▌ | 884/2500 [7:37:39<13:28:57, 30.04s/it] 35%|███▌ | 885/2500 [7:38:03<12:41:48, 28.30s/it] {'loss': 0.0048, 'grad_norm': 0.1431115346205113, 'learning_rate': 6.46e-07, 'completion_length': 59.34821891784668, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12060546875, 'epoch': 0.35} 35%|███▌ | 885/2500 [7:38:03<12:41:48, 28.30s/it] 35%|███▌ | 886/2500 [7:38:28<12:14:48, 27.32s/it] {'loss': 0.004, 'grad_norm': 11.682250562759954, 'learning_rate': 6.455999999999999e-07, 'completion_length': 59.17857551574707, 'rewards/accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.03696779906749725, 'kl': 0.100830078125, 'epoch': 0.35} 35%|███▌ | 886/2500 [7:38:28<12:14:48, 27.32s/it] 35%|███▌ | 887/2500 [7:39:13<14:35:32, 32.57s/it] {'loss': 0.0055, 'grad_norm': 3.3379284019159536, 'learning_rate': 6.452e-07, 'completion_length': 65.65179061889648, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.138671875, 'epoch': 0.35} 35%|███▌ | 887/2500 [7:39:13<14:35:32, 32.57s/it] 36%|███▌ | 888/2500 [7:39:39<13:40:58, 30.56s/it] {'loss': 0.0051, 'grad_norm': 2.296269976401159, 'learning_rate': 6.448000000000001e-07, 'completion_length': 55.04464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.12646484375, 'epoch': 0.36} 36%|███▌ | 888/2500 [7:39:39<13:40:58, 30.56s/it] 36%|███▌ | 889/2500 [7:40:34<16:58:12, 37.92s/it] {'loss': 0.0049, 'grad_norm': 2.556028845602254, 'learning_rate': 6.443999999999999e-07, 'completion_length': 57.830360412597656, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 
0.0739355981349945, 'kl': 0.122314453125, 'epoch': 0.36} 36%|███▌ | 889/2500 [7:40:34<16:58:12, 37.92s/it] 36%|███▌ | 890/2500 [7:41:47<21:40:40, 48.47s/it] {'loss': 0.0041, 'grad_norm': 2.673138741573796, 'learning_rate': 6.44e-07, 'completion_length': 57.04464530944824, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.07124518603086472, 'kl': 0.1025390625, 'epoch': 0.36} 36%|███▌ | 890/2500 [7:41:47<21:40:40, 48.47s/it] 36%|███▌ | 891/2500 [7:42:12<18:32:39, 41.49s/it] {'loss': 0.0052, 'grad_norm': 1.1316297413597334, 'learning_rate': 6.436e-07, 'completion_length': 60.50000190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.130859375, 'epoch': 0.36} 36%|███▌ | 891/2500 [7:42:12<18:32:39, 41.49s/it] 36%|███▌ | 892/2500 [7:42:43<17:10:41, 38.46s/it] {'loss': 0.0059, 'grad_norm': 4.493520655022079, 'learning_rate': 6.431999999999999e-07, 'completion_length': 57.80357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.148681640625, 'epoch': 0.36} 36%|███▌ | 892/2500 [7:42:43<17:10:41, 38.46s/it] 36%|███▌ | 893/2500 [7:43:14<16:06:52, 36.10s/it] {'loss': 0.0062, 'grad_norm': 0.5108638262628092, 'learning_rate': 6.428e-07, 'completion_length': 65.81250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15478515625, 'epoch': 0.36} 36%|███▌ | 893/2500 [7:43:14<16:06:52, 36.10s/it] 36%|███▌ | 894/2500 [7:43:44<15:17:15, 34.27s/it] {'loss': 0.0041, 'grad_norm': 2.19042144449741, 'learning_rate': 6.424e-07, 'completion_length': 67.73214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 
1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.103271484375, 'epoch': 0.36} 36%|███▌ | 894/2500 [7:43:44<15:17:15, 34.27s/it] 36%|███▌ | 895/2500 [7:44:58<20:34:55, 46.17s/it] {'loss': 0.005, 'grad_norm': 0.15550033589077356, 'learning_rate': 6.42e-07, 'completion_length': 53.36607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.123779296875, 'epoch': 0.36} 36%|███▌ | 895/2500 [7:44:58<20:34:55, 46.17s/it] 36%|███▌ | 896/2500 [7:45:28<18:25:28, 41.35s/it] {'loss': 0.0039, 'grad_norm': 0.5913079688610526, 'learning_rate': 6.415999999999999e-07, 'completion_length': 67.54464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.0986328125, 'epoch': 0.36} 36%|███▌ | 896/2500 [7:45:28<18:25:28, 41.35s/it] 36%|███▌ | 897/2500 [7:45:56<16:40:28, 37.45s/it] {'loss': 0.0046, 'grad_norm': 0.15756758075218696, 'learning_rate': 6.412e-07, 'completion_length': 58.65178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114501953125, 'epoch': 0.36} 36%|███▌ | 897/2500 [7:45:56<16:40:28, 37.45s/it] 36%|███▌ | 898/2500 [7:46:45<18:06:44, 40.70s/it] {'loss': 0.0049, 'grad_norm': 2.9778318117908933, 'learning_rate': 6.408e-07, 'completion_length': 73.19643020629883, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857909202576, 'reward_std': 0.07576143741607666, 'kl': 0.122314453125, 'epoch': 0.36} 36%|███▌ | 898/2500 [7:46:45<18:06:44, 40.70s/it] 36%|███▌ | 899/2500 [7:47:20<17:20:31, 39.00s/it] {'loss': 0.0036, 'grad_norm': 1.070106149266456, 'learning_rate': 6.403999999999999e-07, 'completion_length': 80.2410774230957, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553571939468384, 'reward_std': 0.12626906856894493, 
'kl': 0.0888671875, 'epoch': 0.36} 36%|███▌ | 899/2500 [7:47:20<17:20:31, 39.00s/it] 36%|███▌ | 900/2500 [7:48:20<20:09:50, 45.37s/it] {'loss': 0.0045, 'grad_norm': 9.877648961936078, 'learning_rate': 6.4e-07, 'completion_length': 59.36607360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.07003280520439148, 'kl': 0.111572265625, 'epoch': 0.36} 36%|███▌ | 900/2500 [7:48:20<20:09:50, 45.37s/it] 36%|███▌ | 901/2500 [7:49:15<21:25:56, 48.25s/it] {'loss': 0.0056, 'grad_norm': 0.9227049999671003, 'learning_rate': 6.395999999999999e-07, 'completion_length': 65.93750381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14013671875, 'epoch': 0.36} 36%|███▌ | 901/2500 [7:49:15<21:25:56, 48.25s/it] 36%|███▌ | 902/2500 [7:49:39<18:10:31, 40.95s/it] {'loss': 0.0052, 'grad_norm': 0.45560790840021725, 'learning_rate': 6.392e-07, 'completion_length': 71.02678680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.13037109375, 'epoch': 0.36} 36%|███▌ | 902/2500 [7:49:39<18:10:31, 40.95s/it] 36%|███▌ | 903/2500 [7:50:25<18:49:01, 42.42s/it] {'loss': 0.0038, 'grad_norm': 1.2110006116905296, 'learning_rate': 6.388e-07, 'completion_length': 73.00000381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09423828125, 'epoch': 0.36} 36%|███▌ | 903/2500 [7:50:25<18:49:01, 42.42s/it] 36%|███▌ | 904/2500 [7:52:05<26:28:54, 59.73s/it] {'loss': 0.0046, 'grad_norm': 1.1365644261631116, 'learning_rate': 6.383999999999999e-07, 'completion_length': 68.73214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 
'reward_std': 0.025253813713788986, 'kl': 0.1142578125, 'epoch': 0.36} 36%|███▌ | 904/2500 [7:52:05<26:28:54, 59.73s/it] 36%|███▌ | 905/2500 [7:52:31<22:01:38, 49.72s/it] {'loss': 0.0044, 'grad_norm': 1.2176353767679848, 'learning_rate': 6.38e-07, 'completion_length': 63.54464340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.10986328125, 'epoch': 0.36} 36%|███▌ | 905/2500 [7:52:31<22:01:38, 49.72s/it] 36%|███▌ | 906/2500 [7:52:57<18:47:39, 42.45s/it] {'loss': 0.0054, 'grad_norm': 3.5140728135418553, 'learning_rate': 6.375999999999999e-07, 'completion_length': 57.57143020629883, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.12444323301315308, 'kl': 0.1357421875, 'epoch': 0.36} 36%|███▌ | 906/2500 [7:52:57<18:47:39, 42.45s/it] 36%|███▋ | 907/2500 [7:53:55<20:55:57, 47.31s/it] {'loss': 0.0042, 'grad_norm': 4.784207585497283, 'learning_rate': 6.371999999999999e-07, 'completion_length': 58.15178680419922, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.128351628780365, 'kl': 0.10400390625, 'epoch': 0.36} 36%|███▋ | 907/2500 [7:53:55<20:55:57, 47.31s/it] 36%|███▋ | 908/2500 [7:55:26<26:41:35, 60.36s/it] {'loss': 0.0042, 'grad_norm': 2.724261343495715, 'learning_rate': 6.368e-07, 'completion_length': 69.10714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.10882644355297089, 'kl': 0.104736328125, 'epoch': 0.36} 36%|███▋ | 908/2500 [7:55:26<26:41:35, 60.36s/it] 36%|███▋ | 909/2500 [7:55:52<22:05:16, 49.98s/it] {'loss': 0.0047, 'grad_norm': 0.1354870814783343, 'learning_rate': 6.364e-07, 'completion_length': 58.90178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.1171875, 'epoch': 0.36} 36%|███▋ | 909/2500 [7:55:52<22:05:16, 49.98s/it] 36%|███▋ | 910/2500 [7:56:16<18:40:34, 42.29s/it] {'loss': 0.0058, 'grad_norm': 0.19457849431586305, 'learning_rate': 6.36e-07, 'completion_length': 63.43750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14501953125, 'epoch': 0.36} 36%|███▋ | 910/2500 [7:56:16<18:40:34, 42.29s/it] 36%|███▋ | 911/2500 [7:56:44<16:41:42, 37.82s/it] {'loss': 0.0046, 'grad_norm': 2.992991478854065, 'learning_rate': 6.356e-07, 'completion_length': 70.66964530944824, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.8928571939468384, 'reward_std': 0.15152288228273392, 'kl': 0.115966796875, 'epoch': 0.36} 36%|███▋ | 911/2500 [7:56:44<16:41:42, 37.82s/it] 36%|███▋ | 912/2500 [7:57:10<15:14:06, 34.54s/it] {'loss': 0.0043, 'grad_norm': 1.4898153919906603, 'learning_rate': 6.352e-07, 'completion_length': 73.16964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.107666015625, 'epoch': 0.36} 36%|███▋ | 912/2500 [7:57:10<15:14:06, 34.54s/it] 37%|███▋ | 913/2500 [7:57:39<14:22:36, 32.61s/it] {'loss': 0.0043, 'grad_norm': 1.0115120652259315, 'learning_rate': 6.348e-07, 'completion_length': 65.84821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.108154296875, 'epoch': 0.37} 37%|███▋ | 913/2500 [7:57:39<14:22:36, 32.61s/it] 37%|███▋ | 914/2500 [7:58:05<13:29:16, 30.62s/it] {'loss': 0.0052, 'grad_norm': 0.1529053127407988, 'learning_rate': 6.343999999999999e-07, 'completion_length': 69.09821510314941, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13134765625, 'epoch': 0.37} 
37%|███▋ | 914/2500 [7:58:05<13:29:16, 30.62s/it] 37%|███▋ | 915/2500 [7:58:30<12:45:49, 28.99s/it] {'loss': 0.0034, 'grad_norm': 0.64482733141901, 'learning_rate': 6.34e-07, 'completion_length': 68.44643211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.08544921875, 'epoch': 0.37} 37%|███▋ | 915/2500 [7:58:30<12:45:49, 28.99s/it] 37%|███▋ | 916/2500 [7:58:56<12:27:33, 28.32s/it] {'loss': 0.005, 'grad_norm': 2.6353492607268953, 'learning_rate': 6.336000000000001e-07, 'completion_length': 66.22321891784668, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780732154846, 'kl': 0.125244140625, 'epoch': 0.37} 37%|███▋ | 916/2500 [7:58:57<12:27:33, 28.32s/it] 37%|███▋ | 917/2500 [7:59:24<12:20:44, 28.08s/it] {'loss': 0.0048, 'grad_norm': 2.821003934388065, 'learning_rate': 6.331999999999999e-07, 'completion_length': 67.98214721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.11962890625, 'epoch': 0.37} 37%|███▋ | 917/2500 [7:59:24<12:20:44, 28.08s/it] 37%|███▋ | 918/2500 [7:59:52<12:15:40, 27.90s/it] {'loss': 0.0051, 'grad_norm': 2.3407398048802266, 'learning_rate': 6.328e-07, 'completion_length': 73.97321701049805, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9285715222358704, 'reward_std': 0.13756756484508514, 'kl': 0.128173828125, 'epoch': 0.37} 37%|███▋ | 918/2500 [7:59:52<12:15:40, 27.90s/it] 37%|███▋ | 919/2500 [8:00:17<11:54:56, 27.13s/it] {'loss': 0.005, 'grad_norm': 1.5428456742519536, 'learning_rate': 6.324e-07, 'completion_length': 67.85714721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9017857909202576, 
'reward_std': 0.07576144114136696, 'kl': 0.125, 'epoch': 0.37} 37%|███▋ | 919/2500 [8:00:17<11:54:56, 27.13s/it] 37%|███▋ | 920/2500 [8:00:43<11:44:38, 26.76s/it] {'loss': 0.0048, 'grad_norm': 1.0674355254916708, 'learning_rate': 6.319999999999999e-07, 'completion_length': 68.90179061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11865234375, 'epoch': 0.37} 37%|███▋ | 920/2500 [8:00:43<11:44:38, 26.76s/it] 37%|███▋ | 921/2500 [8:01:16<12:34:57, 28.69s/it] {'loss': 0.0056, 'grad_norm': 2.7295636028649644, 'learning_rate': 6.316e-07, 'completion_length': 67.1964340209961, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.8928572535514832, 'reward_std': 0.1289059966802597, 'kl': 0.139404296875, 'epoch': 0.37} 37%|███▋ | 921/2500 [8:01:16<12:34:57, 28.69s/it] 37%|███▋ | 922/2500 [8:03:09<23:36:28, 53.86s/it] {'loss': 0.0041, 'grad_norm': 1.4541006882254155, 'learning_rate': 6.312e-07, 'completion_length': 60.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.103515625, 'epoch': 0.37} 37%|███▋ | 922/2500 [8:03:09<23:36:28, 53.86s/it] 37%|███▋ | 923/2500 [8:04:50<29:48:20, 68.04s/it] {'loss': 0.0041, 'grad_norm': 1.0702662535077685, 'learning_rate': 6.308e-07, 'completion_length': 67.08928871154785, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.1015625, 'epoch': 0.37} 37%|███▋ | 923/2500 [8:04:50<29:48:20, 68.04s/it] 37%|███▋ | 924/2500 [8:06:19<32:31:53, 74.31s/it] {'loss': 0.0058, 'grad_norm': 1.3726193722137168, 'learning_rate': 6.303999999999999e-07, 'completion_length': 60.70535850524902, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 1.0, 'reward': 
1.910714328289032, 'reward_std': 0.033065006136894226, 'kl': 0.14501953125, 'epoch': 0.37} 37%|███▋ | 924/2500 [8:06:19<32:31:53, 74.31s/it] 37%|███▋ | 925/2500 [8:06:44<26:05:01, 59.62s/it] {'loss': 0.0044, 'grad_norm': 0.16068953332844366, 'learning_rate': 6.3e-07, 'completion_length': 60.21428871154785, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.111083984375, 'epoch': 0.37} 37%|███▋ | 925/2500 [8:06:44<26:05:01, 59.62s/it] 37%|███▋ | 926/2500 [8:07:10<21:38:05, 49.48s/it] {'loss': 0.0048, 'grad_norm': 1.678682325171181, 'learning_rate': 6.296e-07, 'completion_length': 72.07143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.120849609375, 'epoch': 0.37} 37%|███▋ | 926/2500 [8:07:10<21:38:05, 49.48s/it] 37%|███▋ | 927/2500 [8:07:40<19:07:02, 43.75s/it] {'loss': 0.0048, 'grad_norm': 1.1243938600416756, 'learning_rate': 6.291999999999999e-07, 'completion_length': 72.25000381469727, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.8750001192092896, 'reward_std': 0.17941363155841827, 'kl': 0.119384765625, 'epoch': 0.37} 37%|███▋ | 927/2500 [8:07:40<19:07:02, 43.75s/it] 37%|███▋ | 928/2500 [8:08:05<16:37:53, 38.09s/it] {'loss': 0.0043, 'grad_norm': 0.7418957822315123, 'learning_rate': 6.288e-07, 'completion_length': 67.85714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.107421875, 'epoch': 0.37} 37%|███▋ | 928/2500 [8:08:05<16:37:53, 38.09s/it] 37%|███▋ | 929/2500 [8:08:30<14:58:15, 34.31s/it] {'loss': 0.0041, 'grad_norm': 2.0222564157604417, 'learning_rate': 6.283999999999999e-07, 'completion_length': 66.71429061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.101806640625, 'epoch': 0.37} 37%|███▋ | 929/2500 [8:08:31<14:58:15, 34.31s/it] 37%|███▋ | 930/2500 [8:08:58<14:04:25, 32.27s/it] {'loss': 0.006, 'grad_norm': 1.741246506883032, 'learning_rate': 6.28e-07, 'completion_length': 66.88393020629883, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.946428656578064, 'reward_std': 0.08868780359625816, 'kl': 0.15087890625, 'epoch': 0.37} 37%|███▋ | 930/2500 [8:08:58<14:04:25, 32.27s/it] 37%|███▋ | 931/2500 [8:09:23<13:05:47, 30.05s/it] {'loss': 0.004, 'grad_norm': 2.2117320130462437, 'learning_rate': 6.276e-07, 'completion_length': 53.955360412597656, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.10882644355297089, 'kl': 0.10009765625, 'epoch': 0.37} 37%|███▋ | 931/2500 [8:09:23<13:05:47, 30.05s/it] 37%|███▋ | 932/2500 [8:09:50<12:41:19, 29.13s/it] {'loss': 0.0046, 'grad_norm': 1.2543687831060557, 'learning_rate': 6.271999999999999e-07, 'completion_length': 60.50893211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.11572265625, 'epoch': 0.37} 37%|███▋ | 932/2500 [8:09:50<12:41:19, 29.13s/it] 37%|███▋ | 933/2500 [8:10:14<11:57:47, 27.48s/it] {'loss': 0.0058, 'grad_norm': 0.7895903924560996, 'learning_rate': 6.268e-07, 'completion_length': 56.830360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14599609375, 'epoch': 0.37} 37%|███▋ | 933/2500 [8:10:14<11:57:47, 27.48s/it] 37%|███▋ | 934/2500 [8:10:41<11:59:13, 27.56s/it] {'loss': 0.0043, 'grad_norm': 0.8425701444893682, 'learning_rate': 6.263999999999999e-07, 'completion_length': 60.63393211364746, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1083984375, 'epoch': 0.37} 37%|███▋ | 934/2500 [8:10:41<11:59:13, 27.56s/it] 37%|███▋ | 935/2500 [8:11:05<11:27:52, 26.37s/it] {'loss': 0.0046, 'grad_norm': 1.4829124827195783, 'learning_rate': 6.26e-07, 'completion_length': 60.62500190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.114013671875, 'epoch': 0.37} 37%|███▋ | 935/2500 [8:11:05<11:27:52, 26.37s/it] 37%|███▋ | 936/2500 [8:11:30<11:14:03, 25.86s/it] {'loss': 0.0053, 'grad_norm': 0.6992063161540576, 'learning_rate': 6.256e-07, 'completion_length': 65.60714721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.132568359375, 'epoch': 0.37} 37%|███▋ | 936/2500 [8:11:30<11:14:03, 25.86s/it] 37%|███▋ | 937/2500 [8:11:54<10:59:42, 25.32s/it] {'loss': 0.0072, 'grad_norm': 16.957167159533284, 'learning_rate': 6.252e-07, 'completion_length': 48.00000190734863, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.1787109375, 'epoch': 0.37} 37%|███▋ | 937/2500 [8:11:54<10:59:42, 25.32s/it] 38%|███▊ | 938/2500 [8:12:20<11:08:13, 25.67s/it] {'loss': 0.0045, 'grad_norm': 0.740243827204451, 'learning_rate': 6.248e-07, 'completion_length': 74.83928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.11376953125, 'epoch': 0.38} 38%|███▊ | 938/2500 [8:12:20<11:08:13, 25.67s/it] 38%|███▊ | 939/2500 [8:12:44<10:54:43, 25.17s/it] {'loss': 0.0059, 'grad_norm': 3.9612053358447117, 'learning_rate': 6.243999999999999e-07, 'completion_length': 
56.60714530944824, 'rewards/accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.1030978113412857, 'kl': 0.14892578125, 'epoch': 0.38} 38%|███▊ | 939/2500 [8:12:44<10:54:43, 25.17s/it] 38%|███▊ | 940/2500 [8:13:11<11:07:14, 25.66s/it] {'loss': 0.0054, 'grad_norm': 0.8185203437673767, 'learning_rate': 6.24e-07, 'completion_length': 65.88393211364746, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857313156128, 'reward_std': 0.07576143741607666, 'kl': 0.134033203125, 'epoch': 0.38} 38%|███▊ | 940/2500 [8:13:11<11:07:14, 25.66s/it] 38%|███▊ | 941/2500 [8:13:44<12:01:47, 27.78s/it] {'loss': 0.005, 'grad_norm': 0.8367399264617345, 'learning_rate': 6.236e-07, 'completion_length': 70.29464721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.12548828125, 'epoch': 0.38} 38%|███▊ | 941/2500 [8:13:44<12:01:47, 27.78s/it] 38%|███▊ | 942/2500 [8:14:08<11:38:43, 26.91s/it] {'loss': 0.0045, 'grad_norm': 0.8352433731989793, 'learning_rate': 6.231999999999999e-07, 'completion_length': 63.562503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11181640625, 'epoch': 0.38} 38%|███▊ | 942/2500 [8:14:08<11:38:43, 26.91s/it] 38%|███▊ | 943/2500 [8:14:34<11:24:29, 26.38s/it] {'loss': 0.006, 'grad_norm': 1.503057266243059, 'learning_rate': 6.228e-07, 'completion_length': 55.35714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1494140625, 'epoch': 0.38} 38%|███▊ | 943/2500 [8:14:34<11:24:29, 26.38s/it] 38%|███▊ | 944/2500 [8:15:00<11:22:43, 26.33s/it] {'loss': 0.0062, 'grad_norm': 1.0662950122335517, 'learning_rate': 6.224e-07, 
'completion_length': 56.017860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.15576171875, 'epoch': 0.38} 38%|███▊ | 944/2500 [8:15:00<11:22:43, 26.33s/it] 38%|███▊ | 945/2500 [8:15:24<11:02:44, 25.57s/it] {'loss': 0.0048, 'grad_norm': 1.483760438212402, 'learning_rate': 6.219999999999999e-07, 'completion_length': 57.41071701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.119140625, 'epoch': 0.38} 38%|███▊ | 945/2500 [8:15:24<11:02:44, 25.57s/it] 38%|███▊ | 946/2500 [8:15:49<10:57:13, 25.38s/it] {'loss': 0.004, 'grad_norm': 1.2423460710406407, 'learning_rate': 6.216e-07, 'completion_length': 58.973215103149414, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.099365234375, 'epoch': 0.38} 38%|███▊ | 946/2500 [8:15:49<10:57:13, 25.38s/it] 38%|███▊ | 947/2500 [8:16:11<10:35:41, 24.56s/it] {'loss': 0.0051, 'grad_norm': 1.8655975013176427, 'learning_rate': 6.212e-07, 'completion_length': 50.642860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.127197265625, 'epoch': 0.38} 38%|███▊ | 947/2500 [8:16:11<10:35:41, 24.56s/it] 38%|███▊ | 948/2500 [8:16:38<10:49:10, 25.10s/it] {'loss': 0.0049, 'grad_norm': 2.8353115410434064, 'learning_rate': 6.208e-07, 'completion_length': 57.92857360839844, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.07514797896146774, 'kl': 0.123046875, 'epoch': 0.38} 38%|███▊ | 948/2500 [8:16:38<10:49:10, 25.10s/it] 38%|███▊ | 949/2500 [8:17:01<10:39:35, 24.74s/it] {'loss': 0.0046, 'grad_norm': 0.27148137633241964, 
'learning_rate': 6.203999999999999e-07, 'completion_length': 57.85714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115234375, 'epoch': 0.38} 38%|███▊ | 949/2500 [8:17:01<10:39:35, 24.74s/it] 38%|███▊ | 950/2500 [8:17:29<11:03:43, 25.69s/it] {'loss': 0.0057, 'grad_norm': 2.098124050628664, 'learning_rate': 6.2e-07, 'completion_length': 48.43750190734863, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.14111328125, 'epoch': 0.38} 38%|███▊ | 950/2500 [8:17:29<11:03:43, 25.69s/it] 38%|███▊ | 951/2500 [8:18:10<12:59:56, 30.21s/it] {'loss': 0.0047, 'grad_norm': 2.8449079770789467, 'learning_rate': 6.196e-07, 'completion_length': 57.05357360839844, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8660715222358704, 'reward_std': 0.08747542649507523, 'kl': 0.11767578125, 'epoch': 0.38} 38%|███▊ | 951/2500 [8:18:10<12:59:56, 30.21s/it] 38%|███▊ | 952/2500 [8:18:35<12:16:55, 28.56s/it] {'loss': 0.006, 'grad_norm': 2.6188994933189558, 'learning_rate': 6.191999999999999e-07, 'completion_length': 54.28571701049805, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780359625816, 'kl': 0.14892578125, 'epoch': 0.38} 38%|███▊ | 952/2500 [8:18:35<12:16:55, 28.56s/it] 38%|███▊ | 953/2500 [8:19:00<11:48:42, 27.49s/it] {'loss': 0.0049, 'grad_norm': 1.0677717579582278, 'learning_rate': 6.188e-07, 'completion_length': 63.80357551574707, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.122314453125, 'epoch': 0.38} 38%|███▊ | 953/2500 [8:19:00<11:48:42, 27.49s/it] 38%|███▊ | 954/2500 [8:19:23<11:15:27, 26.21s/it] {'loss': 0.0054, 'grad_norm': 0.20032437201046596, 
'learning_rate': 6.183999999999999e-07, 'completion_length': 60.48214530944824, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.13623046875, 'epoch': 0.38} 38%|███▊ | 954/2500 [8:19:23<11:15:27, 26.21s/it] 38%|███▊ | 955/2500 [8:19:50<11:17:40, 26.32s/it] {'loss': 0.0051, 'grad_norm': 3.759941972758284, 'learning_rate': 6.18e-07, 'completion_length': 64.52679061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.127197265625, 'epoch': 0.38} 38%|███▊ | 955/2500 [8:19:50<11:17:40, 26.32s/it] 38%|███▊ | 956/2500 [8:20:17<11:25:51, 26.65s/it] {'loss': 0.0051, 'grad_norm': 1.1112562983886016, 'learning_rate': 6.176e-07, 'completion_length': 64.90178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.12744140625, 'epoch': 0.38} 38%|███▊ | 956/2500 [8:20:17<11:25:51, 26.65s/it] 38%|███▊ | 957/2500 [8:20:41<11:05:27, 25.88s/it] {'loss': 0.0057, 'grad_norm': 1.9241925122597139, 'learning_rate': 6.172e-07, 'completion_length': 59.44643211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14306640625, 'epoch': 0.38} 38%|███▊ | 957/2500 [8:20:41<11:05:27, 25.88s/it] 38%|███▊ | 958/2500 [8:21:11<11:32:55, 26.96s/it] {'loss': 0.0057, 'grad_norm': 1.3438667074434567, 'learning_rate': 6.168e-07, 'completion_length': 66.17857551574707, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.1435546875, 'epoch': 0.38} 38%|███▊ | 958/2500 [8:21:11<11:32:55, 26.96s/it] 38%|███▊ | 959/2500 [8:21:37<11:28:18, 26.80s/it] {'loss': 0.0058, 'grad_norm': 
1.2873789689997455, 'learning_rate': 6.163999999999999e-07, 'completion_length': 61.53571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.14453125, 'epoch': 0.38} 38%|███▊ | 959/2500 [8:21:37<11:28:18, 26.80s/it] 38%|███▊ | 960/2500 [8:22:06<11:42:43, 27.38s/it] {'loss': 0.0052, 'grad_norm': 0.930240961738296, 'learning_rate': 6.16e-07, 'completion_length': 60.00000190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13037109375, 'epoch': 0.38} 38%|███▊ | 960/2500 [8:22:06<11:42:43, 27.38s/it] 38%|███▊ | 961/2500 [8:22:28<11:04:39, 25.91s/it] {'loss': 0.0048, 'grad_norm': 2.361216558937507, 'learning_rate': 6.156e-07, 'completion_length': 53.81250190734863, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.118896484375, 'epoch': 0.38} 38%|███▊ | 961/2500 [8:22:28<11:04:39, 25.91s/it] 38%|███▊ | 962/2500 [8:22:53<10:53:50, 25.51s/it] {'loss': 0.0046, 'grad_norm': 3.2838760474798847, 'learning_rate': 6.152e-07, 'completion_length': 56.72321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.114013671875, 'epoch': 0.38} 38%|███▊ | 962/2500 [8:22:53<10:53:50, 25.51s/it] 39%|███▊ | 963/2500 [8:23:19<10:55:56, 25.61s/it] {'loss': 0.0045, 'grad_norm': 4.453444047117041, 'learning_rate': 6.148e-07, 'completion_length': 61.02678680419922, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.8928571939468384, 'reward_std': 0.10101525112986565, 'kl': 0.113525390625, 'epoch': 0.39} 39%|███▊ | 963/2500 [8:23:19<10:55:56, 25.61s/it] 39%|███▊ | 964/2500 [8:23:46<11:07:45, 
26.08s/it] {'loss': 0.006, 'grad_norm': 1.4687970826903765, 'learning_rate': 6.143999999999999e-07, 'completion_length': 61.59821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.1513671875, 'epoch': 0.39} 39%|███▊ | 964/2500 [8:23:46<11:07:45, 26.08s/it] 39%|███▊ | 965/2500 [8:24:11<11:00:46, 25.83s/it] {'loss': 0.0049, 'grad_norm': 1.9189566413349934, 'learning_rate': 6.14e-07, 'completion_length': 66.13393211364746, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.1379830539226532, 'kl': 0.12255859375, 'epoch': 0.39} 39%|███▊ | 965/2500 [8:24:11<11:00:46, 25.83s/it] 39%|███▊ | 966/2500 [8:24:40<11:22:34, 26.70s/it] {'loss': 0.0043, 'grad_norm': 2.0590926586336256, 'learning_rate': 6.136e-07, 'completion_length': 68.52678680419922, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9017857909202576, 'reward_std': 0.17104806005954742, 'kl': 0.10791015625, 'epoch': 0.39} 39%|███▊ | 966/2500 [8:24:40<11:22:34, 26.70s/it] 39%|███▊ | 967/2500 [8:25:06<11:16:43, 26.49s/it] {'loss': 0.0043, 'grad_norm': 2.4016645280789226, 'learning_rate': 6.131999999999999e-07, 'completion_length': 60.06250190734863, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.1083984375, 'epoch': 0.39} 39%|███▊ | 967/2500 [8:25:06<11:16:43, 26.49s/it] 39%|███▊ | 968/2500 [8:25:42<12:29:28, 29.35s/it] {'loss': 0.0051, 'grad_norm': 0.8630819454064934, 'learning_rate': 6.128e-07, 'completion_length': 55.71428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.127685546875, 'epoch': 0.39} 39%|███▊ | 968/2500 [8:25:42<12:29:28, 
29.35s/it] 39%|███▉ | 969/2500 [8:26:11<12:30:46, 29.42s/it] {'loss': 0.0052, 'grad_norm': 2.734936540704025, 'learning_rate': 6.124000000000001e-07, 'completion_length': 82.31250381469727, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.9017858505249023, 'reward_std': 0.19588638097047806, 'kl': 0.129150390625, 'epoch': 0.39} 39%|███▉ | 969/2500 [8:26:11<12:30:46, 29.42s/it] 39%|███▉ | 970/2500 [8:26:39<12:13:31, 28.77s/it] {'loss': 0.0047, 'grad_norm': 0.704974031530837, 'learning_rate': 6.119999999999999e-07, 'completion_length': 63.937503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11669921875, 'epoch': 0.39} 39%|███▉ | 970/2500 [8:26:39<12:13:31, 28.77s/it] 39%|███▉ | 971/2500 [8:27:05<11:57:21, 28.15s/it] {'loss': 0.0055, 'grad_norm': 2.8245541988967022, 'learning_rate': 6.116e-07, 'completion_length': 61.00893211364746, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553571939468384, 'reward_std': 0.06343399360775948, 'kl': 0.13720703125, 'epoch': 0.39} 39%|███▉ | 971/2500 [8:27:05<11:57:21, 28.15s/it] 39%|███▉ | 972/2500 [8:27:30<11:32:27, 27.19s/it] {'loss': 0.0057, 'grad_norm': 0.8024609870321046, 'learning_rate': 6.112e-07, 'completion_length': 61.67857360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14208984375, 'epoch': 0.39} 39%|███▉ | 972/2500 [8:27:30<11:32:27, 27.19s/it] 39%|███▉ | 973/2500 [8:27:57<11:27:20, 27.01s/it] {'loss': 0.0054, 'grad_norm': 1.1203308670690286, 'learning_rate': 6.107999999999999e-07, 'completion_length': 63.07143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.134033203125, 
'epoch': 0.39} 39%|███▉ | 973/2500 [8:27:57<11:27:20, 27.01s/it] 39%|███▉ | 974/2500 [8:28:24<11:29:25, 27.11s/it] {'loss': 0.0062, 'grad_norm': 2.2377766429368706, 'learning_rate': 6.104e-07, 'completion_length': 71.36607360839844, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9107143878936768, 'reward_std': 0.18290093541145325, 'kl': 0.1552734375, 'epoch': 0.39} 39%|███▉ | 974/2500 [8:28:24<11:29:25, 27.11s/it] 39%|███▉ | 975/2500 [8:28:54<11:52:36, 28.04s/it] {'loss': 0.0059, 'grad_norm': 2.2540620023078577, 'learning_rate': 6.1e-07, 'completion_length': 74.43750381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.9375000596046448, 'reward_std': 0.15933407843112946, 'kl': 0.14892578125, 'epoch': 0.39} 39%|███▉ | 975/2500 [8:28:55<11:52:36, 28.04s/it] 39%|███▉ | 976/2500 [8:29:18<11:20:08, 26.78s/it] {'loss': 0.0052, 'grad_norm': 2.318353489026865, 'learning_rate': 6.096e-07, 'completion_length': 58.267860412597656, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.05831881985068321, 'kl': 0.130615234375, 'epoch': 0.39} 39%|███▉ | 976/2500 [8:29:18<11:20:08, 26.78s/it] 39%|███▉ | 977/2500 [8:29:44<11:10:52, 26.43s/it] {'loss': 0.0065, 'grad_norm': 2.7714681786910833, 'learning_rate': 6.091999999999999e-07, 'completion_length': 59.49107360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9464285969734192, 'reward_std': 0.11272923648357391, 'kl': 0.16259765625, 'epoch': 0.39} 39%|███▉ | 977/2500 [8:29:44<11:10:52, 26.43s/it] 39%|███▉ | 978/2500 [8:30:09<11:01:11, 26.07s/it] {'loss': 0.0068, 'grad_norm': 1.6939581533156634, 'learning_rate': 6.088e-07, 'completion_length': 60.75893020629883, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 
1.8928572535514832, 'reward_std': 0.10101525485515594, 'kl': 0.17041015625, 'epoch': 0.39} 39%|███▉ | 978/2500 [8:30:09<11:01:11, 26.07s/it] 39%|███▉ | 979/2500 [8:30:39<11:29:21, 27.19s/it] {'loss': 0.0077, 'grad_norm': 2.0948722048544264, 'learning_rate': 6.084000000000001e-07, 'completion_length': 80.77679061889648, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9196429252624512, 'reward_std': 0.1575138419866562, 'kl': 0.19287109375, 'epoch': 0.39} 39%|███▉ | 979/2500 [8:30:39<11:29:21, 27.19s/it] 39%|███▉ | 980/2500 [8:31:08<11:40:18, 27.64s/it] {'loss': 0.0058, 'grad_norm': 1.6904145570187035, 'learning_rate': 6.079999999999999e-07, 'completion_length': 74.48214721679688, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9375000596046448, 'reward_std': 0.1541598103940487, 'kl': 0.14501953125, 'epoch': 0.39} 39%|███▉ | 980/2500 [8:31:08<11:40:18, 27.64s/it] 39%|███▉ | 981/2500 [8:31:33<11:20:39, 26.89s/it] {'loss': 0.0073, 'grad_norm': 0.31394347828914293, 'learning_rate': 6.076e-07, 'completion_length': 51.58928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.18212890625, 'epoch': 0.39} 39%|███▉ | 981/2500 [8:31:33<11:20:39, 26.89s/it] 39%|███▉ | 982/2500 [8:31:59<11:18:51, 26.83s/it] {'loss': 0.0049, 'grad_norm': 1.2826870491218854, 'learning_rate': 6.072e-07, 'completion_length': 65.33036231994629, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.123046875, 'epoch': 0.39} 39%|███▉ | 982/2500 [8:31:59<11:18:51, 26.83s/it] 39%|███▉ | 983/2500 [8:32:27<11:21:54, 26.97s/it] {'loss': 0.0051, 'grad_norm': 1.2967461965476037, 'learning_rate': 6.068e-07, 'completion_length': 62.53571701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 
1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.126708984375, 'epoch': 0.39} 39%|███▉ | 983/2500 [8:32:27<11:21:54, 26.97s/it] 39%|███▉ | 984/2500 [8:32:58<11:54:37, 28.28s/it] {'loss': 0.0065, 'grad_norm': 1.5402279884181764, 'learning_rate': 6.064e-07, 'completion_length': 56.00893211364746, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.1630859375, 'epoch': 0.39} 39%|███▉ | 984/2500 [8:32:58<11:54:37, 28.28s/it] 39%|███▉ | 985/2500 [8:33:22<11:22:03, 27.01s/it] {'loss': 0.0052, 'grad_norm': 0.24117945269883534, 'learning_rate': 6.06e-07, 'completion_length': 55.03571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.131103515625, 'epoch': 0.39} 39%|███▉ | 985/2500 [8:33:22<11:22:03, 27.01s/it] 39%|███▉ | 986/2500 [8:33:50<11:30:22, 27.36s/it] {'loss': 0.0101, 'grad_norm': 4.183150170872358, 'learning_rate': 6.056e-07, 'completion_length': 58.87500190734863, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.8660715222358704, 'reward_std': 0.13671720027923584, 'kl': 0.2509765625, 'epoch': 0.39} 39%|███▉ | 986/2500 [8:33:50<11:30:22, 27.36s/it] 39%|███▉ | 987/2500 [8:34:23<12:08:56, 28.91s/it] {'loss': 0.0065, 'grad_norm': 2.278573341196277, 'learning_rate': 6.051999999999999e-07, 'completion_length': 59.330360412597656, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.883928656578064, 'reward_std': 0.1767766997218132, 'kl': 0.1630859375, 'epoch': 0.39} 39%|███▉ | 987/2500 [8:34:23<12:08:56, 28.91s/it] 40%|███▉ | 988/2500 [8:34:50<11:53:37, 28.32s/it] {'loss': 0.0063, 'grad_norm': 0.38313719675993746, 'learning_rate': 6.048e-07, 'completion_length': 47.21428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.158203125, 'epoch': 0.4} 40%|███▉ | 988/2500 [8:34:50<11:53:37, 28.32s/it] 40%|███▉ | 989/2500 [8:35:15<11:27:41, 27.31s/it] {'loss': 0.007, 'grad_norm': 1.7658286138058512, 'learning_rate': 6.044e-07, 'completion_length': 48.21428871154785, 'rewards/accuracy_reward': 0.8660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.8660714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.174560546875, 'epoch': 0.4} 40%|███▉ | 989/2500 [8:35:15<11:27:41, 27.31s/it] 40%|███▉ | 990/2500 [8:35:45<11:46:08, 28.06s/it] {'loss': 0.0103, 'grad_norm': 3.5404373643496534, 'learning_rate': 6.04e-07, 'completion_length': 81.60714340209961, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.8392857909202576, 'reward_std': 0.3290572017431259, 'kl': 0.2568359375, 'epoch': 0.4} 40%|███▉ | 990/2500 [8:35:45<11:46:08, 28.06s/it] 40%|███▉ | 991/2500 [8:36:11<11:34:14, 27.60s/it] {'loss': 0.0072, 'grad_norm': 1.9299979875623454, 'learning_rate': 6.036e-07, 'completion_length': 58.83928871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.1806640625, 'epoch': 0.4} 40%|███▉ | 991/2500 [8:36:11<11:34:14, 27.60s/it] 40%|███▉ | 992/2500 [8:36:40<11:41:28, 27.91s/it] {'loss': 0.0082, 'grad_norm': 4.0524586376641665, 'learning_rate': 6.031999999999999e-07, 'completion_length': 65.25000381469727, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9375000596046448, 'reward_std': 0.15933407098054886, 'kl': 0.20458984375, 'epoch': 0.4} 40%|███▉ | 992/2500 [8:36:40<11:41:28, 27.91s/it] 40%|███▉ | 993/2500 [8:37:08<11:47:01, 28.15s/it] {'loss': 0.0107, 'grad_norm': 2.1936735824694744, 'learning_rate': 6.028e-07, 'completion_length': 79.77678680419922, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.9107143878936768, 
'reward_std': 0.18276765197515488, 'kl': 0.26806640625, 'epoch': 0.4} 40%|███▉ | 993/2500 [8:37:08<11:47:01, 28.15s/it] 40%|███▉ | 994/2500 [8:37:39<12:04:46, 28.88s/it] {'loss': 0.0131, 'grad_norm': 4.466283050352255, 'learning_rate': 6.024e-07, 'completion_length': 79.76786041259766, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.8125001192092896, 'reward_std': 0.39788323640823364, 'kl': 0.328125, 'epoch': 0.4} 40%|███▉ | 994/2500 [8:37:39<12:04:46, 28.88s/it] 40%|███▉ | 995/2500 [8:38:06<11:48:36, 28.25s/it] {'loss': 0.013, 'grad_norm': 4.272679117500504, 'learning_rate': 6.019999999999999e-07, 'completion_length': 69.02678871154785, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9375000596046448, 'reward': 1.9017857909202576, 'reward_std': 0.22028982266783714, 'kl': 0.3232421875, 'epoch': 0.4} 40%|███▉ | 995/2500 [8:38:06<11:48:36, 28.25s/it] 40%|███▉ | 996/2500 [8:38:34<11:47:20, 28.22s/it] {'loss': 0.0081, 'grad_norm': 2.4419211136190624, 'learning_rate': 6.016e-07, 'completion_length': 68.69643211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9553571939468384, 'reward_std': 0.12626907229423523, 'kl': 0.203125, 'epoch': 0.4} 40%|███▉ | 996/2500 [8:38:34<11:47:20, 28.22s/it] 40%|███▉ | 997/2500 [8:39:04<11:57:38, 28.65s/it] {'loss': 0.0089, 'grad_norm': 4.55707468074196, 'learning_rate': 6.012e-07, 'completion_length': 84.7589340209961, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.955357164144516, 'reward': 1.8928571939468384, 'reward_std': 0.30304576456546783, 'kl': 0.22265625, 'epoch': 0.4} 40%|███▉ | 997/2500 [8:39:04<11:57:38, 28.65s/it] 40%|███▉ | 998/2500 [8:39:46<13:42:55, 32.87s/it] {'loss': 0.0116, 'grad_norm': 3.0694576796557294, 'learning_rate': 6.007999999999999e-07, 'completion_length': 88.5714340209961, 'rewards/accuracy_reward': 0.9464285969734192, 
'rewards/format_reward': 0.9553571939468384, 'reward': 1.9017857909202576, 'reward_std': 0.2254640758037567, 'kl': 0.2900390625, 'epoch': 0.4} 40%|███▉ | 998/2500 [8:39:46<13:42:55, 32.87s/it] 40%|███▉ | 999/2500 [8:40:43<16:38:54, 39.93s/it] {'loss': 0.0081, 'grad_norm': 2.8259091036197828, 'learning_rate': 6.004e-07, 'completion_length': 76.83929061889648, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9107143878936768, 'reward_std': 0.25253813713788986, 'kl': 0.20166015625, 'epoch': 0.4} 40%|███▉ | 999/2500 [8:40:43<16:38:54, 39.93s/it] 40%|████ | 1000/2500 [8:41:47<19:42:49, 47.31s/it] {'loss': 0.0133, 'grad_norm': 5.419277302160128, 'learning_rate': 6e-07, 'completion_length': 80.26786041259766, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.821428656578064, 'reward_std': 0.38661840558052063, 'kl': 0.3330078125, 'epoch': 0.4} 40%|████ | 1000/2500 [8:41:47<19:42:49, 47.31s/it] 40%|████ | 1001/2500 [8:43:26<26:09:41, 62.83s/it] {'loss': 0.01, 'grad_norm': 4.5767291468165245, 'learning_rate': 5.995999999999999e-07, 'completion_length': 71.05357551574707, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.8660714626312256, 'reward_std': 0.15008661895990372, 'kl': 0.2490234375, 'epoch': 0.4} 40%|████ | 1001/2500 [8:43:26<26:09:41, 62.83s/it] 40%|████ | 1002/2500 [8:44:26<25:43:43, 61.83s/it] {'loss': 0.0121, 'grad_norm': 7.80662263999901, 'learning_rate': 5.991999999999999e-07, 'completion_length': 89.27679061889648, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.9017858505249023, 'reward_std': 0.24290671199560165, 'kl': 0.3017578125, 'epoch': 0.4} 40%|████ | 1002/2500 [8:44:26<25:43:43, 61.83s/it] 40%|████ | 1003/2500 [8:46:23<32:33:45, 78.31s/it] {'loss': 0.0123, 'grad_norm': 4.203616512454418, 'learning_rate': 5.988e-07, 'completion_length': 
86.35714721679688, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.9017857909202576, 'reward_std': 0.27779194712638855, 'kl': 0.30859375, 'epoch': 0.4} 40%|████ | 1003/2500 [8:46:23<32:33:45, 78.31s/it] 40%|████ | 1004/2500 [8:47:01<27:33:48, 66.33s/it] {'loss': 0.0071, 'grad_norm': 2.143373350992914, 'learning_rate': 5.984000000000001e-07, 'completion_length': 78.92857360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.12626906856894493, 'kl': 0.17724609375, 'epoch': 0.4} 40%|████ | 1004/2500 [8:47:01<27:33:48, 66.33s/it] 40%|████ | 1005/2500 [8:47:49<25:17:08, 60.89s/it] {'loss': 0.008, 'grad_norm': 1.3193226413185783, 'learning_rate': 5.979999999999999e-07, 'completion_length': 55.49107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.20068359375, 'epoch': 0.4} 40%|████ | 1005/2500 [8:47:49<25:17:08, 60.89s/it] 40%|████ | 1006/2500 [8:48:37<23:41:54, 57.11s/it] {'loss': 0.0166, 'grad_norm': 4.99174192046027, 'learning_rate': 5.976e-07, 'completion_length': 62.71428871154785, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.8928571939468384, 'reward_std': 0.20532545074820518, 'kl': 0.4140625, 'epoch': 0.4} 40%|████ | 1006/2500 [8:48:37<23:41:54, 57.11s/it] 40%|████ | 1007/2500 [8:49:56<26:21:10, 63.54s/it] {'loss': 0.0117, 'grad_norm': 2.1693355314909644, 'learning_rate': 5.972e-07, 'completion_length': 79.73214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9464285969734192, 'reward_std': 0.0901123583316803, 'kl': 0.29345703125, 'epoch': 0.4} 40%|████ | 1007/2500 [8:49:56<26:21:10, 63.54s/it] 40%|████ | 1008/2500 [8:50:25<22:01:28, 53.14s/it] {'loss': 0.0138, 'grad_norm': 
3.1340781858367546, 'learning_rate': 5.967999999999999e-07, 'completion_length': 69.83929061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.9375000596046448, 'reward_std': 0.15933407098054886, 'kl': 0.345703125, 'epoch': 0.4} 40%|████ | 1008/2500 [8:50:25<22:01:28, 53.14s/it] 40%|████ | 1009/2500 [8:51:07<20:38:50, 49.85s/it] {'loss': 0.0124, 'grad_norm': 1.5060495269080432, 'learning_rate': 5.964e-07, 'completion_length': 70.83929061889648, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9464285969734192, 'reward_std': 0.06331466138362885, 'kl': 0.31103515625, 'epoch': 0.4} 40%|████ | 1009/2500 [8:51:07<20:38:50, 49.85s/it] 40%|████ | 1010/2500 [8:51:31<17:23:06, 42.00s/it] {'loss': 0.0215, 'grad_norm': 19.197259302534096, 'learning_rate': 5.96e-07, 'completion_length': 65.02678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.53955078125, 'epoch': 0.4} 40%|████ | 1010/2500 [8:51:31<17:23:06, 42.00s/it] 40%|████ | 1011/2500 [8:52:05<16:22:29, 39.59s/it] {'loss': 0.0106, 'grad_norm': 1.8310961724966794, 'learning_rate': 5.956e-07, 'completion_length': 73.29464340209961, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.26513671875, 'epoch': 0.4} 40%|████ | 1011/2500 [8:52:05<16:22:29, 39.59s/it] 40%|████ | 1012/2500 [8:53:18<20:33:04, 49.72s/it] {'loss': 0.0098, 'grad_norm': 1.201757164319766, 'learning_rate': 5.951999999999999e-07, 'completion_length': 77.24107360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9553571939468384, 'reward_std': 0.06543753296136856, 'kl': 0.24365234375, 'epoch': 0.4} 40%|████ | 1012/2500 [8:53:18<20:33:04, 49.72s/it] 41%|████ | 
1013/2500 [8:53:57<19:11:52, 46.48s/it] {'loss': 0.0083, 'grad_norm': 3.2091853578040546, 'learning_rate': 5.948e-07, 'completion_length': 64.32143020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.20849609375, 'epoch': 0.41} 41%|████ | 1013/2500 [8:53:57<19:11:52, 46.48s/it] 41%|████ | 1014/2500 [8:54:34<18:00:59, 43.65s/it] {'loss': 0.0057, 'grad_norm': 1.109522289655612, 'learning_rate': 5.944e-07, 'completion_length': 54.57143020629883, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.14306640625, 'epoch': 0.41} 41%|████ | 1014/2500 [8:54:34<18:00:59, 43.65s/it] 41%|████ | 1015/2500 [8:55:29<19:28:09, 47.20s/it] {'loss': 0.0052, 'grad_norm': 0.410835093941121, 'learning_rate': 5.939999999999999e-07, 'completion_length': 67.04464530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13037109375, 'epoch': 0.41} 41%|████ | 1015/2500 [8:55:30<19:28:09, 47.20s/it] 41%|████ | 1016/2500 [8:56:10<18:35:46, 45.11s/it] {'loss': 0.007, 'grad_norm': 0.4007159674134899, 'learning_rate': 5.936e-07, 'completion_length': 53.92857551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.17529296875, 'epoch': 0.41} 41%|████ | 1016/2500 [8:56:10<18:35:46, 45.11s/it] 41%|████ | 1017/2500 [8:56:33<15:53:25, 38.57s/it] {'loss': 0.0065, 'grad_norm': 0.4456998744659921, 'learning_rate': 5.931999999999999e-07, 'completion_length': 71.09821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.16162109375, 'epoch': 0.41} 41%|████ | 1017/2500 [8:56:33<15:53:25, 38.57s/it] 41%|████ | 1018/2500 [8:56:59<14:19:29, 34.80s/it] {'loss': 0.0095, 'grad_norm': 2.1669230309964282, 'learning_rate': 
5.928e-07, 'completion_length': 68.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.23779296875, 'epoch': 0.41} 41%|████ | 1018/2500 [8:56:59<14:19:29, 34.80s/it] 41%|████ | 1019/2500 [8:57:23<12:57:42, 31.51s/it] {'loss': 0.004, 'grad_norm': 0.43983532687982607, 'learning_rate': 5.924e-07, 'completion_length': 63.42857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10009765625, 'epoch': 0.41} 41%|████ | 1019/2500 [8:57:23<12:57:42, 31.51s/it] 41%|████ | 1020/2500 [8:57:47<12:06:24, 29.45s/it] {'loss': 0.0058, 'grad_norm': 0.5506228010089141, 'learning_rate': 5.919999999999999e-07, 'completion_length': 59.47321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1455078125, 'epoch': 0.41} 41%|████ | 1020/2500 [8:57:47<12:06:24, 29.45s/it] 41%|████ | 1021/2500 [8:58:12<11:29:05, 27.95s/it] {'loss': 0.0048, 'grad_norm': 0.25690363555039314, 'learning_rate': 5.916e-07, 'completion_length': 66.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.119384765625, 'epoch': 0.41} 41%|████ | 1021/2500 [8:58:12<11:29:05, 27.95s/it] 41%|████ | 1022/2500 [8:59:11<15:19:18, 37.32s/it] {'loss': 0.013, 'grad_norm': 2.894795819711248, 'learning_rate': 5.911999999999999e-07, 'completion_length': 68.50000381469727, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9017858505249023, 'reward_std': 0.18849068135023117, 'kl': 0.32373046875, 'epoch': 0.41} 41%|████ | 1022/2500 [8:59:11<15:19:18, 37.32s/it] 41%|████ | 1023/2500 [8:59:39<14:09:33, 34.51s/it] {'loss': 0.0068, 'grad_norm': 2.1504348114121243, 'learning_rate': 5.907999999999999e-07, 'completion_length': 66.50893020629883, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.17138671875, 'epoch': 0.41} 41%|████ | 1023/2500 [8:59:39<14:09:33, 34.51s/it] 41%|████ | 1024/2500 [9:00:11<13:47:27, 33.64s/it] {'loss': 0.035, 'grad_norm': 4.405608740731122, 'learning_rate': 5.904e-07, 'completion_length': 87.1964340209961, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.857142984867096, 'reward_std': 0.28123997896909714, 'kl': 0.8740234375, 'epoch': 0.41} 41%|████ | 1024/2500 [9:00:11<13:47:27, 33.64s/it] 41%|████ | 1025/2500 [9:01:04<16:11:28, 39.52s/it] {'loss': 0.034, 'grad_norm': 4.679243125697534, 'learning_rate': 5.9e-07, 'completion_length': 86.9910774230957, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.910714328289032, 'reward': 1.8303571939468384, 'reward_std': 0.3143351972103119, 'kl': 0.8505859375, 'epoch': 0.41} 41%|████ | 1025/2500 [9:01:04<16:11:28, 39.52s/it] 41%|████ | 1026/2500 [9:01:38<15:33:32, 38.00s/it] {'loss': 0.0291, 'grad_norm': 4.032300111420013, 'learning_rate': 5.896e-07, 'completion_length': 69.90179061889648, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.8660714626312256, 'reward_std': 0.20101575553417206, 'kl': 0.7294921875, 'epoch': 0.41} 41%|████ | 1026/2500 [9:01:38<15:33:32, 38.00s/it] 41%|████ | 1027/2500 [9:02:46<19:13:10, 46.97s/it] {'loss': 0.007, 'grad_norm': 3.9985120576532034, 'learning_rate': 5.891999999999999e-07, 'completion_length': 59.705360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.17626953125, 'epoch': 0.41} 41%|████ | 1027/2500 [9:02:46<19:13:10, 46.97s/it] 41%|████ | 1028/2500 [9:03:15<17:01:27, 41.64s/it] {'loss': 0.0091, 'grad_norm': 1.5566752327095084, 'learning_rate': 5.888e-07, 'completion_length': 65.8839340209961, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.228515625, 'epoch': 0.41} 41%|████ | 1028/2500 [9:03:15<17:01:27, 41.64s/it] 41%|████ | 1029/2500 [9:03:41<15:02:16, 36.80s/it] {'loss': 0.0054, 'grad_norm': 1.259840787908024, 'learning_rate': 5.884000000000001e-07, 'completion_length': 69.38393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.135986328125, 'epoch': 0.41} 41%|████ | 1029/2500 [9:03:41<15:02:16, 36.80s/it] 41%|████ | 1030/2500 [9:04:04<13:18:28, 32.59s/it] {'loss': 0.0044, 'grad_norm': 0.46880107119473297, 'learning_rate': 5.879999999999999e-07, 'completion_length': 59.80357551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.109375, 'epoch': 0.41} 41%|████ | 1030/2500 [9:04:04<13:18:28, 32.59s/it] 41%|████ | 1031/2500 [9:04:27<12:09:22, 29.79s/it] {'loss': 0.0065, 'grad_norm': 2.560420732689184, 'learning_rate': 5.876e-07, 'completion_length': 53.83928680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.162109375, 'epoch': 0.41} 41%|████ | 1031/2500 [9:04:27<12:09:22, 29.79s/it] 41%|████▏ | 1032/2500 [9:05:00<12:31:58, 30.73s/it] {'loss': 0.0091, 'grad_norm': 1.5046440905372946, 'learning_rate': 5.872000000000001e-07, 'completion_length': 61.57143211364746, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.053144559264183044, 'kl': 0.22802734375, 'epoch': 0.41} 41%|████▏ | 1032/2500 [9:05:00<12:31:58, 30.73s/it] 41%|████▏ | 1033/2500 [9:05:26<11:58:01, 29.37s/it] {'loss': 0.0118, 'grad_norm': 1.194962926362951, 'learning_rate': 5.867999999999999e-07, 
'completion_length': 63.562503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.295166015625, 'epoch': 0.41} 41%|████▏ | 1033/2500 [9:05:26<11:58:01, 29.37s/it] 41%|████▏ | 1034/2500 [9:05:50<11:17:20, 27.72s/it] {'loss': 0.0057, 'grad_norm': 0.3028421085954815, 'learning_rate': 5.864e-07, 'completion_length': 58.07143211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1416015625, 'epoch': 0.41} 41%|████▏ | 1034/2500 [9:05:50<11:17:20, 27.72s/it] 41%|████▏ | 1035/2500 [9:06:15<10:58:51, 26.98s/it] {'loss': 0.0051, 'grad_norm': 1.196697889662903, 'learning_rate': 5.86e-07, 'completion_length': 60.17857551574707, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.1259765625, 'epoch': 0.41} 41%|████▏ | 1035/2500 [9:06:15<10:58:51, 26.98s/it] 41%|████▏ | 1036/2500 [9:06:41<10:48:57, 26.60s/it] {'loss': 0.0045, 'grad_norm': 0.2291831464576, 'learning_rate': 5.856e-07, 'completion_length': 69.22321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11328125, 'epoch': 0.41} 41%|████▏ | 1036/2500 [9:06:41<10:48:57, 26.60s/it] 41%|████▏ | 1037/2500 [9:07:18<12:03:07, 29.66s/it] {'loss': 0.0054, 'grad_norm': 0.22751645694671727, 'learning_rate': 5.852e-07, 'completion_length': 58.53571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13623046875, 'epoch': 0.41} 41%|████▏ | 1037/2500 [9:07:18<12:03:07, 29.66s/it] 42%|████▏ | 1038/2500 [9:07:43<11:32:59, 28.44s/it] {'loss': 0.0054, 'grad_norm': 0.23929059116819773, 'learning_rate': 5.848e-07, 'completion_length': 68.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.1337890625, 'epoch': 0.42} 42%|████▏ | 1038/2500 [9:07:43<11:32:59, 28.44s/it] 42%|████▏ | 1039/2500 [9:08:08<11:07:51, 27.43s/it] {'loss': 0.006, 'grad_norm': 1.8250751205511417, 'learning_rate': 5.844e-07, 'completion_length': 67.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.149169921875, 'epoch': 0.42} 42%|████▏ | 1039/2500 [9:08:08<11:07:51, 27.43s/it] 42%|████▏ | 1040/2500 [9:08:34<10:56:42, 26.99s/it] {'loss': 0.0041, 'grad_norm': 2.0502571981723032, 'learning_rate': 5.839999999999999e-07, 'completion_length': 71.02679061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.10302734375, 'epoch': 0.42} 42%|████▏ | 1040/2500 [9:08:34<10:56:42, 26.99s/it] 42%|████▏ | 1041/2500 [9:09:25<13:48:21, 34.07s/it] {'loss': 0.0086, 'grad_norm': 1.4491379182018833, 'learning_rate': 5.836e-07, 'completion_length': 65.85714340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21484375, 'epoch': 0.42} 42%|████▏ | 1041/2500 [9:09:25<13:48:21, 34.07s/it] 42%|████▏ | 1042/2500 [9:09:50<12:38:59, 31.23s/it] {'loss': 0.0046, 'grad_norm': 0.1902642412852541, 'learning_rate': 5.832e-07, 'completion_length': 61.937503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115234375, 'epoch': 0.42} 42%|████▏ | 1042/2500 [9:09:50<12:38:59, 31.23s/it] 42%|████▏ | 1043/2500 [9:10:14<11:49:23, 29.21s/it] {'loss': 0.006, 'grad_norm': 0.4025971433449985, 'learning_rate': 5.828e-07, 'completion_length': 70.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14990234375, 'epoch': 0.42} 42%|████▏ | 
1043/2500 [9:10:14<11:49:23, 29.21s/it] 42%|████▏ | 1044/2500 [9:10:38<11:13:00, 27.73s/it] {'loss': 0.0055, 'grad_norm': 0.945888905320002, 'learning_rate': 5.824e-07, 'completion_length': 63.17857360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.137451171875, 'epoch': 0.42} 42%|████▏ | 1044/2500 [9:10:38<11:13:00, 27.73s/it] 42%|████▏ | 1045/2500 [9:11:07<11:16:47, 27.91s/it] {'loss': 0.0194, 'grad_norm': 2.20316459492339, 'learning_rate': 5.819999999999999e-07, 'completion_length': 69.84821701049805, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.1289059966802597, 'kl': 0.4853515625, 'epoch': 0.42} 42%|████▏ | 1045/2500 [9:11:07<11:16:47, 27.91s/it] 42%|████▏ | 1046/2500 [9:11:32<11:01:02, 27.28s/it] {'loss': 0.0094, 'grad_norm': 1.5824876205798817, 'learning_rate': 5.816e-07, 'completion_length': 66.08928871154785, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017858505249023, 'reward_std': 0.07576144114136696, 'kl': 0.234375, 'epoch': 0.42} 42%|████▏ | 1046/2500 [9:11:33<11:01:02, 27.28s/it] 42%|████▏ | 1047/2500 [9:12:19<13:20:18, 33.05s/it] {'loss': 0.0057, 'grad_norm': 3.281600554001859, 'learning_rate': 5.812e-07, 'completion_length': 63.06250190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.1416015625, 'epoch': 0.42} 42%|████▏ | 1047/2500 [9:12:19<13:20:18, 33.05s/it] 42%|████▏ | 1048/2500 [9:12:56<13:47:39, 34.20s/it] {'loss': 0.0048, 'grad_norm': 0.34170968547017727, 'learning_rate': 5.807999999999999e-07, 'completion_length': 63.812503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11962890625, 
'epoch': 0.42} 42%|████▏ | 1048/2500 [9:12:56<13:47:39, 34.20s/it] 42%|████▏ | 1049/2500 [9:13:22<12:45:13, 31.64s/it] {'loss': 0.019, 'grad_norm': 4.029474855813039, 'learning_rate': 5.804e-07, 'completion_length': 64.01786041259766, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8660715222358704, 'reward_std': 0.1767766922712326, 'kl': 0.4736328125, 'epoch': 0.42} 42%|████▏ | 1049/2500 [9:13:22<12:45:13, 31.64s/it] 42%|████▏ | 1050/2500 [9:13:49<12:15:35, 30.44s/it] {'loss': 0.0194, 'grad_norm': 2.134505772657693, 'learning_rate': 5.8e-07, 'completion_length': 67.3125, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8660715222358704, 'reward_std': 0.1767766922712326, 'kl': 0.486328125, 'epoch': 0.42} 42%|████▏ | 1050/2500 [9:13:49<12:15:35, 30.44s/it] 42%|████▏ | 1051/2500 [9:14:16<11:48:15, 29.33s/it] {'loss': 0.0256, 'grad_norm': 4.987837489374531, 'learning_rate': 5.796e-07, 'completion_length': 63.75893211364746, 'rewards/accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 0.955357164144516, 'reward': 1.8392857909202576, 'reward_std': 0.2847747504711151, 'kl': 0.638671875, 'epoch': 0.42} 42%|████▏ | 1051/2500 [9:14:16<11:48:15, 29.33s/it] 42%|████▏ | 1052/2500 [9:14:41<11:17:19, 28.07s/it] {'loss': 0.0347, 'grad_norm': 4.727004825840619, 'learning_rate': 5.792e-07, 'completion_length': 62.65178871154785, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.8750000596046448, 'reward_std': 0.3186681419610977, 'kl': 0.8671875, 'epoch': 0.42} 42%|████▏ | 1052/2500 [9:14:41<11:17:19, 28.07s/it] 42%|████▏ | 1053/2500 [9:15:09<11:12:47, 27.90s/it] {'loss': 0.0237, 'grad_norm': 3.727417199409506, 'learning_rate': 5.788e-07, 'completion_length': 67.31250381469727, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.973214328289032, 'reward': 1.8660715222358704, 'reward_std': 
0.1541598215699196, 'kl': 0.591796875, 'epoch': 0.42} 42%|████▏ | 1053/2500 [9:15:09<11:12:47, 27.90s/it] 42%|████▏ | 1054/2500 [9:15:34<10:51:30, 27.03s/it] {'loss': 0.0049, 'grad_norm': 1.634624455856327, 'learning_rate': 5.784e-07, 'completion_length': 67.33929061889648, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.123291015625, 'epoch': 0.42} 42%|████▏ | 1054/2500 [9:15:34<10:51:30, 27.03s/it] 42%|████▏ | 1055/2500 [9:15:59<10:42:37, 26.68s/it] {'loss': 0.0075, 'grad_norm': 2.6003801756575813, 'learning_rate': 5.779999999999999e-07, 'completion_length': 63.84821701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.187255859375, 'epoch': 0.42} 42%|████▏ | 1055/2500 [9:15:59<10:42:37, 26.68s/it] 42%|████▏ | 1056/2500 [9:16:40<12:19:16, 30.72s/it] {'loss': 0.0122, 'grad_norm': 1.9373431420047231, 'learning_rate': 5.776e-07, 'completion_length': 63.55357360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.3046875, 'epoch': 0.42} 42%|████▏ | 1056/2500 [9:16:40<12:19:16, 30.72s/it] 42%|████▏ | 1057/2500 [9:17:07<11:52:22, 29.62s/it] {'loss': 0.0066, 'grad_norm': 4.5080497248169555, 'learning_rate': 5.772000000000001e-07, 'completion_length': 60.03571701049805, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.08868780732154846, 'kl': 0.16552734375, 'epoch': 0.42} 42%|████▏ | 1057/2500 [9:17:07<11:52:22, 29.62s/it] 42%|████▏ | 1058/2500 [9:17:29<11:01:57, 27.54s/it] {'loss': 0.0052, 'grad_norm': 5.505637738261707, 'learning_rate': 5.767999999999999e-07, 'completion_length': 58.53571701049805, 'rewards/accuracy_reward': 0.9375000298023224, 
'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.130859375, 'epoch': 0.42} 42%|████▏ | 1058/2500 [9:17:29<11:01:57, 27.54s/it] 42%|████▏ | 1059/2500 [9:17:54<10:40:26, 26.67s/it] {'loss': 0.0091, 'grad_norm': 2.1942156561487014, 'learning_rate': 5.764e-07, 'completion_length': 67.05357551574707, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.226318359375, 'epoch': 0.42} 42%|████▏ | 1059/2500 [9:17:54<10:40:26, 26.67s/it] 42%|████▏ | 1060/2500 [9:18:17<10:14:45, 25.61s/it] {'loss': 0.005, 'grad_norm': 0.2625564268852831, 'learning_rate': 5.76e-07, 'completion_length': 57.67857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.124267578125, 'epoch': 0.42} 42%|████▏ | 1060/2500 [9:18:17<10:14:45, 25.61s/it] 42%|████▏ | 1061/2500 [9:18:43<10:16:35, 25.71s/it] {'loss': 0.0053, 'grad_norm': 0.41051868408263964, 'learning_rate': 5.755999999999999e-07, 'completion_length': 61.41964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.132568359375, 'epoch': 0.42} 42%|████▏ | 1061/2500 [9:18:43<10:16:35, 25.71s/it] 42%|████▏ | 1062/2500 [9:19:07<10:04:02, 25.20s/it] {'loss': 0.004, 'grad_norm': 1.3363440469018701, 'learning_rate': 5.752e-07, 'completion_length': 67.48214721679688, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.099609375, 'epoch': 0.42} 42%|████▏ | 1062/2500 [9:19:07<10:04:02, 25.20s/it] 43%|████▎ | 1063/2500 [9:19:30<9:50:31, 24.66s/it] {'loss': 0.0039, 'grad_norm': 0.4648526490236585, 'learning_rate': 5.748e-07, 'completion_length': 62.13393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.0986328125, 'epoch': 0.43} 43%|████▎ | 1063/2500 [9:19:30<9:50:31, 24.66s/it] 43%|████▎ | 1064/2500 [9:20:51<16:30:00, 41.37s/it] {'loss': 0.0044, 'grad_norm': 1.3040377358158484, 'learning_rate': 5.744e-07, 'completion_length': 58.07143020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.110107421875, 'epoch': 0.43} 43%|████▎ | 1064/2500 [9:20:51<16:30:00, 41.37s/it] 43%|████▎ | 1065/2500 [9:22:36<24:05:52, 60.45s/it] {'loss': 0.0045, 'grad_norm': 0.23914096781681896, 'learning_rate': 5.739999999999999e-07, 'completion_length': 59.49107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.113525390625, 'epoch': 0.43} 43%|████▎ | 1065/2500 [9:22:36<24:05:52, 60.45s/it] 43%|████▎ | 1066/2500 [9:24:29<30:25:44, 76.39s/it] {'loss': 0.0042, 'grad_norm': 0.2505700717263739, 'learning_rate': 5.736e-07, 'completion_length': 63.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10498046875, 'epoch': 0.43} 43%|████▎ | 1066/2500 [9:24:29<30:25:44, 76.39s/it] 43%|████▎ | 1067/2500 [9:26:35<36:16:32, 91.13s/it] {'loss': 0.0056, 'grad_norm': 1.6944523383514507, 'learning_rate': 5.732e-07, 'completion_length': 63.25893020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.139892578125, 'epoch': 0.43} 43%|████▎ | 1067/2500 [9:26:35<36:16:32, 91.13s/it] 43%|████▎ | 1068/2500 [9:26:59<28:17:35, 71.13s/it] {'loss': 0.0041, 'grad_norm': 0.23725678756919094, 'learning_rate': 5.727999999999999e-07, 'completion_length': 64.00000381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1015625, 'epoch': 0.43} 43%|████▎ | 1068/2500 [9:26:59<28:17:35, 71.13s/it] 43%|████▎ | 1069/2500 
[9:27:23<22:37:12, 56.91s/it] {'loss': 0.0065, 'grad_norm': 0.7396214262244374, 'learning_rate': 5.724e-07, 'completion_length': 60.86607551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1630859375, 'epoch': 0.43} 43%|████▎ | 1069/2500 [9:27:23<22:37:12, 56.91s/it] 43%|████▎ | 1070/2500 [9:27:47<18:42:47, 47.11s/it] {'loss': 0.0156, 'grad_norm': 2.8225287863871276, 'learning_rate': 5.719999999999999e-07, 'completion_length': 64.50000381469727, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9375000596046448, 'reward_std': 0.154159814119339, 'kl': 0.3896484375, 'epoch': 0.43} 43%|████▎ | 1070/2500 [9:27:47<18:42:47, 47.11s/it] 43%|████▎ | 1071/2500 [9:28:15<16:23:43, 41.30s/it] {'loss': 0.0087, 'grad_norm': 1.4921474160827084, 'learning_rate': 5.716e-07, 'completion_length': 69.66964721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21630859375, 'epoch': 0.43} 43%|████▎ | 1071/2500 [9:28:15<16:23:43, 41.30s/it] 43%|████▎ | 1072/2500 [9:28:45<14:59:14, 37.78s/it] {'loss': 0.0455, 'grad_norm': 5.620689120250642, 'learning_rate': 5.712e-07, 'completion_length': 80.81250381469727, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.9017857909202576, 'reward_std': 0.2603493183851242, 'kl': 1.13671875, 'epoch': 0.43} 43%|████▎ | 1072/2500 [9:28:45<14:59:14, 37.78s/it] 43%|████▎ | 1073/2500 [9:29:09<13:25:07, 33.85s/it] {'loss': 0.0536, 'grad_norm': 5.7002939364053145, 'learning_rate': 5.707999999999999e-07, 'completion_length': 67.02678680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.8928571939468384, 'reward_std': 0.30304574966430664, 'kl': 1.33984375, 'epoch': 0.43} 43%|████▎ | 1073/2500 [9:29:09<13:25:07, 
33.85s/it] 43%|████▎ | 1074/2500 [9:29:35<12:27:06, 31.44s/it] {'loss': 0.0587, 'grad_norm': 5.8431964116227055, 'learning_rate': 5.704e-07, 'completion_length': 70.70536041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.8928571939468384, 'reward_std': 0.20532545447349548, 'kl': 1.46875, 'epoch': 0.43} 43%|████▎ | 1074/2500 [9:29:35<12:27:06, 31.44s/it] 43%|████▎ | 1075/2500 [9:29:59<11:32:51, 29.17s/it] {'loss': 0.0254, 'grad_norm': 2.9059592500071796, 'learning_rate': 5.699999999999999e-07, 'completion_length': 59.76785850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9553572535514832, 'reward_std': 0.10365218669176102, 'kl': 0.634765625, 'epoch': 0.43} 43%|████▎ | 1075/2500 [9:29:59<11:32:51, 29.17s/it] 43%|████▎ | 1076/2500 [9:30:23<10:54:01, 27.56s/it] {'loss': 0.005, 'grad_norm': 0.25421230649242815, 'learning_rate': 5.696e-07, 'completion_length': 68.75, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.125, 'epoch': 0.43} 43%|████▎ | 1076/2500 [9:30:23<10:54:01, 27.56s/it] 43%|████▎ | 1077/2500 [9:30:46<10:22:05, 26.23s/it] {'loss': 0.0089, 'grad_norm': 1.329653442266628, 'learning_rate': 5.692e-07, 'completion_length': 69.25000381469727, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.221923828125, 'epoch': 0.43} 43%|████▎ | 1077/2500 [9:30:46<10:22:05, 26.23s/it] 43%|████▎ | 1078/2500 [9:31:09<9:57:54, 25.23s/it] {'loss': 0.0038, 'grad_norm': 0.24599088537726335, 'learning_rate': 5.688e-07, 'completion_length': 65.09821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.094482421875, 'epoch': 0.43} 43%|████▎ | 1078/2500 [9:31:09<9:57:54, 25.23s/it] 43%|████▎ | 1079/2500 [9:31:32<9:45:15, 24.71s/it] 
{'loss': 0.0046, 'grad_norm': 0.16902560739249575, 'learning_rate': 5.684e-07, 'completion_length': 74.02679061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115478515625, 'epoch': 0.43} 43%|████▎ | 1079/2500 [9:31:32<9:45:15, 24.71s/it] 43%|████▎ | 1080/2500 [9:31:56<9:36:10, 24.35s/it] {'loss': 0.0038, 'grad_norm': 1.1949351665940167, 'learning_rate': 5.679999999999999e-07, 'completion_length': 68.29464340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.095703125, 'epoch': 0.43} 43%|████▎ | 1080/2500 [9:31:56<9:36:10, 24.35s/it] 43%|████▎ | 1081/2500 [9:32:19<9:27:51, 24.01s/it] {'loss': 0.0046, 'grad_norm': 0.17267079478117628, 'learning_rate': 5.676e-07, 'completion_length': 69.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11376953125, 'epoch': 0.43} 43%|████▎ | 1081/2500 [9:32:19<9:27:51, 24.01s/it] 43%|████▎ | 1082/2500 [9:32:44<9:36:52, 24.41s/it] {'loss': 0.0046, 'grad_norm': 1.3223203304655997, 'learning_rate': 5.672e-07, 'completion_length': 60.63393211364746, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.03818017989397049, 'kl': 0.115966796875, 'epoch': 0.43} 43%|████▎ | 1082/2500 [9:32:44<9:36:52, 24.41s/it] 43%|████▎ | 1083/2500 [9:33:10<9:47:15, 24.87s/it] {'loss': 0.0051, 'grad_norm': 1.0005918868751764, 'learning_rate': 5.667999999999999e-07, 'completion_length': 65.15178871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.126953125, 'epoch': 0.43} 43%|████▎ | 1083/2500 [9:33:10<9:47:15, 24.87s/it] 43%|████▎ | 1084/2500 [9:33:34<9:38:18, 24.50s/it] {'loss': 0.0077, 'grad_norm': 1.0520887019635279, 'learning_rate': 
5.664e-07, 'completion_length': 67.20535850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.19189453125, 'epoch': 0.43} 43%|████▎ | 1084/2500 [9:33:34<9:38:18, 24.50s/it] 43%|████▎ | 1085/2500 [9:33:59<9:41:24, 24.65s/it] {'loss': 0.0062, 'grad_norm': 1.3436018318696081, 'learning_rate': 5.66e-07, 'completion_length': 67.53571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.155517578125, 'epoch': 0.43} 43%|████▎ | 1085/2500 [9:33:59<9:41:24, 24.65s/it] 43%|████▎ | 1086/2500 [9:34:23<9:33:42, 24.34s/it] {'loss': 0.0043, 'grad_norm': 0.20306323757345984, 'learning_rate': 5.655999999999999e-07, 'completion_length': 59.60714530944824, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.107666015625, 'epoch': 0.43} 43%|████▎ | 1086/2500 [9:34:23<9:33:42, 24.34s/it] 43%|████▎ | 1087/2500 [9:34:47<9:32:09, 24.30s/it] {'loss': 0.0031, 'grad_norm': 1.0554231557665743, 'learning_rate': 5.652e-07, 'completion_length': 67.6339340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.07861328125, 'epoch': 0.43} 43%|████▎ | 1087/2500 [9:34:47<9:32:09, 24.30s/it] 44%|████▎ | 1088/2500 [9:35:11<9:28:15, 24.15s/it] {'loss': 0.0052, 'grad_norm': 0.22993974665577688, 'learning_rate': 5.648e-07, 'completion_length': 55.580360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13037109375, 'epoch': 0.44} 44%|████▎ | 1088/2500 [9:35:11<9:28:15, 24.15s/it] 44%|████▎ | 1089/2500 [9:35:35<9:30:41, 24.27s/it] {'loss': 0.0039, 'grad_norm': 1.720573818527011, 'learning_rate': 5.643999999999999e-07, 'completion_length': 63.89285850524902, 
'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.09716796875, 'epoch': 0.44} 44%|████▎ | 1089/2500 [9:35:35<9:30:41, 24.27s/it] 44%|████▎ | 1090/2500 [9:36:00<9:32:45, 24.37s/it] {'loss': 0.0036, 'grad_norm': 0.1613602260408707, 'learning_rate': 5.639999999999999e-07, 'completion_length': 63.33928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09033203125, 'epoch': 0.44} 44%|████▎ | 1090/2500 [9:36:00<9:32:45, 24.37s/it] 44%|████▎ | 1091/2500 [9:36:22<9:16:34, 23.70s/it] {'loss': 0.0076, 'grad_norm': 5.507221830320927, 'learning_rate': 5.636e-07, 'completion_length': 59.00000190734863, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.8928572535514832, 'reward_std': 0.10101525485515594, 'kl': 0.19091796875, 'epoch': 0.44} 44%|████▎ | 1091/2500 [9:36:22<9:16:34, 23.70s/it] 44%|████▎ | 1092/2500 [9:36:46<9:18:03, 23.78s/it] {'loss': 0.0115, 'grad_norm': 3.4811370579742023, 'learning_rate': 5.632e-07, 'completion_length': 56.60714530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.2880859375, 'epoch': 0.44} 44%|████▎ | 1092/2500 [9:36:46<9:18:03, 23.78s/it] 44%|████▎ | 1093/2500 [9:37:11<9:24:43, 24.08s/it] {'loss': 0.0052, 'grad_norm': 0.8715077908046609, 'learning_rate': 5.627999999999999e-07, 'completion_length': 63.03571891784668, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.131103515625, 'epoch': 0.44} 44%|████▎ | 1093/2500 [9:37:11<9:24:43, 24.08s/it] 44%|████▍ | 1094/2500 [9:37:34<9:20:47, 23.93s/it] {'loss': 0.0072, 'grad_norm': 1.9198348993846304, 'learning_rate': 5.624e-07, 'completion_length': 
61.61607360839844, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8839285969734192, 'reward_std': 0.10365219414234161, 'kl': 0.180419921875, 'epoch': 0.44} 44%|████▍ | 1094/2500 [9:37:34<9:20:47, 23.93s/it] 44%|████▍ | 1095/2500 [9:37:56<9:08:41, 23.43s/it] {'loss': 0.0068, 'grad_norm': 1.3071877728698678, 'learning_rate': 5.620000000000001e-07, 'completion_length': 52.35714340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.16943359375, 'epoch': 0.44} 44%|████▍ | 1095/2500 [9:37:57<9:08:41, 23.43s/it] 44%|████▍ | 1096/2500 [9:38:20<9:08:57, 23.46s/it] {'loss': 0.0898, 'grad_norm': 10.686850301678016, 'learning_rate': 5.615999999999999e-07, 'completion_length': 53.77678871154785, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.8392857909202576, 'reward_std': 0.37698137760162354, 'kl': 2.25, 'epoch': 0.44} 44%|████▍ | 1096/2500 [9:38:20<9:08:57, 23.46s/it] 44%|████▍ | 1097/2500 [9:38:44<9:12:27, 23.63s/it] {'loss': 0.0945, 'grad_norm': 12.65129677513678, 'learning_rate': 5.612e-07, 'completion_length': 49.65178680419922, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.8035715222358704, 'reward_std': 0.4431114196777344, 'kl': 2.36328125, 'epoch': 0.44} 44%|████▍ | 1097/2500 [9:38:44<9:12:27, 23.63s/it] 44%|████▍ | 1098/2500 [9:39:09<9:23:49, 24.13s/it] {'loss': 0.0631, 'grad_norm': 8.549700278103536, 'learning_rate': 5.608e-07, 'completion_length': 55.61607360839844, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.7232143878936768, 'reward_std': 0.2716331481933594, 'kl': 1.578125, 'epoch': 0.44} 44%|████▍ | 1098/2500 [9:39:09<9:23:49, 24.13s/it] 44%|████▍ | 1099/2500 [9:39:32<9:16:08, 23.82s/it] {'loss': 0.017, 'grad_norm': 
3.1831249557212398, 'learning_rate': 5.604e-07, 'completion_length': 52.46428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.423828125, 'epoch': 0.44} 44%|████▍ | 1099/2500 [9:39:32<9:16:08, 23.82s/it] 44%|████▍ | 1100/2500 [9:39:56<9:10:29, 23.59s/it] {'loss': 0.0191, 'grad_norm': 2.1267332306923055, 'learning_rate': 5.6e-07, 'completion_length': 50.20535850524902, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9375000596046448, 'reward_std': 0.11594516038894653, 'kl': 0.47802734375, 'epoch': 0.44} 44%|████▍ | 1100/2500 [9:39:56<9:10:29, 23.59s/it] 44%|████▍ | 1101/2500 [9:41:06<14:35:04, 37.53s/it] {'loss': 0.0059, 'grad_norm': 1.2292381126898921, 'learning_rate': 5.596e-07, 'completion_length': 51.85714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.146484375, 'epoch': 0.44} 44%|████▍ | 1101/2500 [9:41:06<14:35:04, 37.53s/it] 44%|████▍ | 1102/2500 [9:41:27<12:38:57, 32.57s/it] {'loss': 0.0064, 'grad_norm': 0.36957410537705254, 'learning_rate': 5.592e-07, 'completion_length': 50.81250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.16064453125, 'epoch': 0.44} 44%|████▍ | 1102/2500 [9:41:27<12:38:57, 32.57s/it] 44%|████▍ | 1103/2500 [9:41:48<11:18:28, 29.14s/it] {'loss': 0.0048, 'grad_norm': 4.001414562394128, 'learning_rate': 5.588e-07, 'completion_length': 56.00893020629883, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.0739355981349945, 'kl': 0.120361328125, 'epoch': 0.44} 44%|████▍ | 1103/2500 [9:41:48<11:18:28, 29.14s/it] 44%|████▍ | 1104/2500 [9:42:11<10:35:06, 27.30s/it] {'loss': 0.0048, 'grad_norm': 0.25504273465439636, 
'learning_rate': 5.584e-07, 'completion_length': 59.63393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.119384765625, 'epoch': 0.44} 44%|████▍ | 1104/2500 [9:42:11<10:35:06, 27.30s/it] 44%|████▍ | 1105/2500 [9:42:34<10:03:40, 25.96s/it] {'loss': 0.0054, 'grad_norm': 0.2623987443122544, 'learning_rate': 5.58e-07, 'completion_length': 53.66964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.136474609375, 'epoch': 0.44} 44%|████▍ | 1105/2500 [9:42:34<10:03:40, 25.96s/it] 44%|████▍ | 1106/2500 [9:42:56<9:41:23, 25.02s/it] {'loss': 0.0055, 'grad_norm': 0.2651821729922054, 'learning_rate': 5.576e-07, 'completion_length': 55.81250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.136962890625, 'epoch': 0.44} 44%|████▍ | 1106/2500 [9:42:56<9:41:23, 25.02s/it] 44%|████▍ | 1107/2500 [9:43:21<9:37:08, 24.86s/it] {'loss': 0.0042, 'grad_norm': 0.16925854732808518, 'learning_rate': 5.572e-07, 'completion_length': 63.142860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10546875, 'epoch': 0.44} 44%|████▍ | 1107/2500 [9:43:21<9:37:08, 24.86s/it] 44%|████▍ | 1108/2500 [9:43:45<9:31:07, 24.62s/it] {'loss': 0.0045, 'grad_norm': 0.19652024478756397, 'learning_rate': 5.567999999999999e-07, 'completion_length': 58.625003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.113525390625, 'epoch': 0.44} 44%|████▍ | 1108/2500 [9:43:45<9:31:07, 24.62s/it] 44%|████▍ | 1109/2500 [9:44:08<9:19:46, 24.15s/it] {'loss': 0.0042, 'grad_norm': 0.16758676329909755, 'learning_rate': 5.564e-07, 'completion_length': 61.24107551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.103759765625, 'epoch': 0.44} 44%|████▍ | 
1109/2500 [9:44:08<9:19:46, 24.15s/it] 44%|████▍ | 1110/2500 [9:44:32<9:19:36, 24.16s/it] {'loss': 0.0042, 'grad_norm': 0.26014539637106165, 'learning_rate': 5.560000000000001e-07, 'completion_length': 63.83928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.105712890625, 'epoch': 0.44} 44%|████▍ | 1110/2500 [9:44:32<9:19:36, 24.16s/it] 44%|████▍ | 1111/2500 [9:44:55<9:12:33, 23.87s/it] {'loss': 0.0044, 'grad_norm': 0.1951782357615883, 'learning_rate': 5.555999999999999e-07, 'completion_length': 55.58928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.109130859375, 'epoch': 0.44} 44%|████▍ | 1111/2500 [9:44:55<9:12:33, 23.87s/it] 44%|████▍ | 1112/2500 [9:45:19<9:10:15, 23.79s/it] {'loss': 0.0031, 'grad_norm': 0.17196239357637283, 'learning_rate': 5.552e-07, 'completion_length': 61.84821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.076904296875, 'epoch': 0.44} 44%|████▍ | 1112/2500 [9:45:19<9:10:15, 23.79s/it] 45%|████▍ | 1113/2500 [9:45:44<9:21:33, 24.29s/it] {'loss': 0.0039, 'grad_norm': 0.15556023330277755, 'learning_rate': 5.548e-07, 'completion_length': 62.90178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09814453125, 'epoch': 0.45} 45%|████▍ | 1113/2500 [9:45:44<9:21:33, 24.29s/it] 45%|████▍ | 1114/2500 [9:46:09<9:23:20, 24.39s/it] {'loss': 0.0052, 'grad_norm': 0.30284766088484216, 'learning_rate': 5.543999999999999e-07, 'completion_length': 59.85714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.129638671875, 'epoch': 0.45} 45%|████▍ | 1114/2500 [9:46:09<9:23:20, 24.39s/it] 45%|████▍ | 1115/2500 [9:46:33<9:21:45, 24.34s/it] {'loss': 0.0071, 'grad_norm': 4.352311203084353, 'learning_rate': 5.54e-07, 'completion_length': 
67.15179061889648, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375001192092896, 'reward_std': 0.12054043263196945, 'kl': 0.176513671875, 'epoch': 0.45} 45%|████▍ | 1115/2500 [9:46:33<9:21:45, 24.34s/it] 45%|████▍ | 1116/2500 [9:46:58<9:22:05, 24.37s/it] {'loss': 0.0051, 'grad_norm': 2.811598907944986, 'learning_rate': 5.536e-07, 'completion_length': 66.24107360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553572535514832, 'reward_std': 0.07003280520439148, 'kl': 0.12841796875, 'epoch': 0.45} 45%|████▍ | 1116/2500 [9:46:58<9:22:05, 24.37s/it] 45%|████▍ | 1117/2500 [9:47:21<9:17:58, 24.21s/it] {'loss': 0.0041, 'grad_norm': 0.16632271206099777, 'learning_rate': 5.532e-07, 'completion_length': 70.65179061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10302734375, 'epoch': 0.45} 45%|████▍ | 1117/2500 [9:47:21<9:17:58, 24.21s/it] 45%|████▍ | 1118/2500 [9:47:51<9:54:54, 25.83s/it] {'loss': 0.0108, 'grad_norm': 0.9568532157274621, 'learning_rate': 5.527999999999999e-07, 'completion_length': 65.69643020629883, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.27099609375, 'epoch': 0.45} 45%|████▍ | 1118/2500 [9:47:51<9:54:54, 25.83s/it] 45%|████▍ | 1119/2500 [9:48:13<9:24:46, 24.54s/it] {'loss': 0.0037, 'grad_norm': 2.8703913595679915, 'learning_rate': 5.524e-07, 'completion_length': 63.99107551574707, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.092041015625, 'epoch': 0.45} 45%|████▍ | 1119/2500 [9:48:13<9:24:46, 24.54s/it] 45%|████▍ | 1120/2500 [9:48:38<9:28:41, 24.73s/it] {'loss': 0.0116, 'grad_norm': 1.5251275813876668, 'learning_rate': 5.520000000000001e-07, 
'completion_length': 62.60714530944824, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.0835726335644722, 'kl': 0.2890625, 'epoch': 0.45} 45%|████▍ | 1120/2500 [9:48:38<9:28:41, 24.73s/it] 45%|████▍ | 1121/2500 [9:49:02<9:28:11, 24.72s/it] {'loss': 0.0138, 'grad_norm': 2.220244986387916, 'learning_rate': 5.515999999999999e-07, 'completion_length': 71.4375, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9285714626312256, 'reward_std': 0.2020305097103119, 'kl': 0.3447265625, 'epoch': 0.45} 45%|████▍ | 1121/2500 [9:49:03<9:28:11, 24.72s/it] 45%|████▍ | 1122/2500 [9:49:26<9:19:29, 24.36s/it] {'loss': 0.0063, 'grad_norm': 0.5841657586263724, 'learning_rate': 5.512e-07, 'completion_length': 67.05357551574707, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.15771484375, 'epoch': 0.45} 45%|████▍ | 1122/2500 [9:49:26<9:19:29, 24.36s/it] 45%|████▍ | 1123/2500 [9:49:51<9:22:58, 24.53s/it] {'loss': 0.0124, 'grad_norm': 3.611722782689208, 'learning_rate': 5.508e-07, 'completion_length': 68.1964340209961, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9196429252624512, 'reward_std': 0.1379830539226532, 'kl': 0.312255859375, 'epoch': 0.45} 45%|████▍ | 1123/2500 [9:49:51<9:22:58, 24.53s/it] 45%|████▍ | 1124/2500 [9:50:16<9:28:54, 24.81s/it] {'loss': 0.03, 'grad_norm': 3.0701305646349857, 'learning_rate': 5.504e-07, 'completion_length': 68.02679061889648, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.955357164144516, 'reward': 1.8928572535514832, 'reward_std': 0.22936688363552094, 'kl': 0.748046875, 'epoch': 0.45} 45%|████▍ | 1124/2500 [9:50:16<9:28:54, 24.81s/it] 45%|████▌ | 1125/2500 [9:50:41<9:29:56, 24.87s/it] {'loss': 0.0113, 'grad_norm': 2.3927414093848167, 
'learning_rate': 5.5e-07, 'completion_length': 58.85714530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.283203125, 'epoch': 0.45} 45%|████▌ | 1125/2500 [9:50:41<9:29:56, 24.87s/it] 45%|████▌ | 1126/2500 [9:51:06<9:30:05, 24.89s/it] {'loss': 0.0186, 'grad_norm': 1.5539852687913371, 'learning_rate': 5.496e-07, 'completion_length': 63.41964340209961, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.462890625, 'epoch': 0.45} 45%|████▌ | 1126/2500 [9:51:06<9:30:05, 24.89s/it] 45%|████▌ | 1127/2500 [9:51:31<9:27:46, 24.81s/it] {'loss': 0.024, 'grad_norm': 2.8890804845840354, 'learning_rate': 5.492e-07, 'completion_length': 61.00000190734863, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.8750000596046448, 'reward_std': 0.20021028071641922, 'kl': 0.59765625, 'epoch': 0.45} 45%|████▌ | 1127/2500 [9:51:31<9:27:46, 24.81s/it] 45%|████▌ | 1128/2500 [9:51:55<9:21:04, 24.54s/it] {'loss': 0.0076, 'grad_norm': 1.3266073309386222, 'learning_rate': 5.487999999999999e-07, 'completion_length': 60.785715103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.19091796875, 'epoch': 0.45} 45%|████▌ | 1128/2500 [9:51:55<9:21:04, 24.54s/it] 45%|████▌ | 1129/2500 [9:52:17<9:04:49, 23.84s/it] {'loss': 0.0221, 'grad_norm': 1.8924328583515657, 'learning_rate': 5.484e-07, 'completion_length': 58.78571701049805, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.08747542649507523, 'kl': 0.552490234375, 'epoch': 0.45} 45%|████▌ | 1129/2500 [9:52:17<9:04:49, 23.84s/it] 45%|████▌ | 1130/2500 
[9:52:41<9:06:12, 23.92s/it] {'loss': 0.0176, 'grad_norm': 2.111461459760318, 'learning_rate': 5.48e-07, 'completion_length': 57.61607360839844, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.973214328289032, 'reward': 1.8750001192092896, 'reward_std': 0.15152288228273392, 'kl': 0.43896484375, 'epoch': 0.45} 45%|████▌ | 1130/2500 [9:52:41<9:06:12, 23.92s/it] 45%|████▌ | 1131/2500 [9:53:07<9:17:41, 24.44s/it] {'loss': 0.0707, 'grad_norm': 3.605248679761875, 'learning_rate': 5.476e-07, 'completion_length': 60.455360412597656, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.8750000596046448, 'reward_std': 0.2960512936115265, 'kl': 1.767578125, 'epoch': 0.45} 45%|████▌ | 1131/2500 [9:53:07<9:17:41, 24.44s/it] 45%|████▌ | 1132/2500 [9:53:29<9:04:13, 23.87s/it] {'loss': 0.0561, 'grad_norm': 4.354826357738132, 'learning_rate': 5.472e-07, 'completion_length': 59.16071701049805, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.883928656578064, 'reward_std': 0.25852909684181213, 'kl': 1.40234375, 'epoch': 0.45} 45%|████▌ | 1132/2500 [9:53:29<9:04:13, 23.87s/it] 45%|████▌ | 1133/2500 [9:53:53<8:58:48, 23.65s/it] {'loss': 0.025, 'grad_norm': 3.5532740100834506, 'learning_rate': 5.467999999999999e-07, 'completion_length': 56.74107360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.6279296875, 'epoch': 0.45} 45%|████▌ | 1133/2500 [9:53:53<8:58:48, 23.65s/it] 45%|████▌ | 1134/2500 [9:54:15<8:51:06, 23.33s/it] {'loss': 0.0114, 'grad_norm': 1.645112462681555, 'learning_rate': 5.464e-07, 'completion_length': 64.08035850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.28564453125, 'epoch': 0.45} 45%|████▌ | 1134/2500 
[9:54:15<8:51:06, 23.33s/it] 45%|████▌ | 1135/2500 [9:54:41<9:06:29, 24.02s/it] {'loss': 0.0349, 'grad_norm': 8.27583951927808, 'learning_rate': 5.46e-07, 'completion_length': 58.09821701049805, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.1379830539226532, 'kl': 0.8720703125, 'epoch': 0.45} 45%|████▌ | 1135/2500 [9:54:41<9:06:29, 24.02s/it] 45%|████▌ | 1136/2500 [9:55:04<9:00:59, 23.80s/it] {'loss': 0.0249, 'grad_norm': 2.4204799538465838, 'learning_rate': 5.455999999999999e-07, 'completion_length': 56.36607551574707, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.07576143741607666, 'kl': 0.6220703125, 'epoch': 0.45} 45%|████▌ | 1136/2500 [9:55:04<9:00:59, 23.80s/it] 45%|████▌ | 1137/2500 [9:55:28<9:00:05, 23.78s/it] {'loss': 0.0084, 'grad_norm': 1.9124059926782107, 'learning_rate': 5.452e-07, 'completion_length': 60.60714530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.05399492755532265, 'kl': 0.2099609375, 'epoch': 0.45} 45%|████▌ | 1137/2500 [9:55:28<9:00:05, 23.78s/it] 46%|████▌ | 1138/2500 [9:55:52<9:05:42, 24.04s/it] {'loss': 0.0074, 'grad_norm': 0.876758750139393, 'learning_rate': 5.448e-07, 'completion_length': 54.18750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1845703125, 'epoch': 0.46} 46%|████▌ | 1138/2500 [9:55:52<9:05:42, 24.04s/it] 46%|████▌ | 1139/2500 [9:56:17<9:06:13, 24.08s/it] {'loss': 0.007, 'grad_norm': 3.5295386043112815, 'learning_rate': 5.443999999999999e-07, 'completion_length': 66.60714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.17578125, 'epoch': 0.46} 46%|████▌ | 1139/2500 
[9:56:17<9:06:13, 24.08s/it] 46%|████▌ | 1140/2500 [9:56:45<9:36:58, 25.45s/it] {'loss': 0.0157, 'grad_norm': 2.280847793739515, 'learning_rate': 5.44e-07, 'completion_length': 62.33928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3935546875, 'epoch': 0.46} 46%|████▌ | 1140/2500 [9:56:45<9:36:58, 25.45s/it] 46%|████▌ | 1141/2500 [9:57:17<10:18:35, 27.31s/it] {'loss': 0.0051, 'grad_norm': 1.9383763373103449, 'learning_rate': 5.436e-07, 'completion_length': 65.62500381469727, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.127197265625, 'epoch': 0.46} 46%|████▌ | 1141/2500 [9:57:17<10:18:35, 27.31s/it] 46%|████▌ | 1142/2500 [9:57:56<11:36:08, 30.76s/it] {'loss': 0.0058, 'grad_norm': 1.7881541908073766, 'learning_rate': 5.431999999999999e-07, 'completion_length': 61.339290618896484, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.14404296875, 'epoch': 0.46} 46%|████▌ | 1142/2500 [9:57:56<11:36:08, 30.76s/it] 46%|████▌ | 1143/2500 [9:58:22<11:03:11, 29.32s/it] {'loss': 0.0032, 'grad_norm': 1.7296569261342751, 'learning_rate': 5.427999999999999e-07, 'completion_length': 61.68750190734863, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.0810546875, 'epoch': 0.46} 46%|████▌ | 1143/2500 [9:58:22<11:03:11, 29.32s/it] 46%|████▌ | 1144/2500 [9:58:56<11:33:23, 30.68s/it] {'loss': 0.0047, 'grad_norm': 1.051911576854851, 'learning_rate': 5.424e-07, 'completion_length': 62.23214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.118896484375, 
'epoch': 0.46} 46%|████▌ | 1144/2500 [9:58:56<11:33:23, 30.68s/it] 46%|████▌ | 1145/2500 [9:59:19<10:43:01, 28.47s/it] {'loss': 0.0056, 'grad_norm': 0.2939433181488314, 'learning_rate': 5.420000000000001e-07, 'completion_length': 62.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14013671875, 'epoch': 0.46} 46%|████▌ | 1145/2500 [9:59:19<10:43:01, 28.47s/it] 46%|████▌ | 1146/2500 [9:59:43<10:11:52, 27.11s/it] {'loss': 0.005, 'grad_norm': 0.28075708747183536, 'learning_rate': 5.415999999999999e-07, 'completion_length': 60.86607360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.1240234375, 'epoch': 0.46} 46%|████▌ | 1146/2500 [9:59:43<10:11:52, 27.11s/it] 46%|████▌ | 1147/2500 [10:00:06<9:45:32, 25.97s/it] {'loss': 0.0054, 'grad_norm': 0.2052805380392644, 'learning_rate': 5.412e-07, 'completion_length': 66.26786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.134765625, 'epoch': 0.46} 46%|████▌ | 1147/2500 [10:00:06<9:45:32, 25.97s/it] 46%|████▌ | 1148/2500 [10:00:31<9:36:55, 25.60s/it] {'loss': 0.0046, 'grad_norm': 0.26528619193754255, 'learning_rate': 5.408e-07, 'completion_length': 61.223215103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115966796875, 'epoch': 0.46} 46%|████▌ | 1148/2500 [10:00:31<9:36:55, 25.60s/it] 46%|████▌ | 1149/2500 [10:01:08<10:57:26, 29.20s/it] {'loss': 0.0041, 'grad_norm': 2.1822126859170825, 'learning_rate': 5.403999999999999e-07, 'completion_length': 60.53571701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.1025390625, 'epoch': 0.46} 46%|████▌ | 1149/2500 [10:01:08<10:57:26, 29.20s/it] 46%|████▌ | 1150/2500 [10:01:37<10:52:28, 
29.00s/it] {'loss': 0.004, 'grad_norm': 0.21458433002614868, 'learning_rate': 5.4e-07, 'completion_length': 59.40178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.099853515625, 'epoch': 0.46} 46%|████▌ | 1150/2500 [10:01:37<10:52:28, 29.00s/it] 46%|████▌ | 1151/2500 [10:02:02<10:22:09, 27.67s/it] {'loss': 0.0041, 'grad_norm': 0.19223275769608458, 'learning_rate': 5.396e-07, 'completion_length': 61.05357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.101318359375, 'epoch': 0.46} 46%|████▌ | 1151/2500 [10:02:02<10:22:09, 27.67s/it] 46%|████▌ | 1152/2500 [10:02:30<10:24:08, 27.78s/it] {'loss': 0.0048, 'grad_norm': 0.20875679146985326, 'learning_rate': 5.392e-07, 'completion_length': 59.10714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.120361328125, 'epoch': 0.46} 46%|████▌ | 1152/2500 [10:02:30<10:24:08, 27.78s/it] 46%|████▌ | 1153/2500 [10:03:26<13:36:58, 36.39s/it] {'loss': 0.0077, 'grad_norm': 1.14706019961318, 'learning_rate': 5.387999999999999e-07, 'completion_length': 63.61607551574707, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.05050762742757797, 'kl': 0.193359375, 'epoch': 0.46} 46%|████▌ | 1153/2500 [10:03:26<13:36:58, 36.39s/it] 46%|████▌ | 1154/2500 [10:03:55<12:46:55, 34.19s/it] {'loss': 0.0049, 'grad_norm': 0.23853457799010566, 'learning_rate': 5.384e-07, 'completion_length': 60.10714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.122802734375, 'epoch': 0.46} 46%|████▌ | 1154/2500 [10:03:55<12:46:55, 34.19s/it] 46%|████▌ | 1155/2500 [10:04:44<14:26:21, 38.65s/it] {'loss': 0.0045, 'grad_norm': 0.18001955022279567, 'learning_rate': 5.38e-07, 'completion_length': 62.45536231994629, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11328125, 'epoch': 0.46} 46%|████▌ | 1155/2500 [10:04:44<14:26:21, 38.65s/it] 46%|████▌ | 1156/2500 [10:05:09<12:56:25, 34.66s/it] {'loss': 0.0043, 'grad_norm': 0.19997520978966102, 'learning_rate': 5.375999999999999e-07, 'completion_length': 60.812503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.108154296875, 'epoch': 0.46} 46%|████▌ | 1156/2500 [10:05:10<12:56:25, 34.66s/it] 46%|████▋ | 1157/2500 [10:05:36<12:03:26, 32.32s/it] {'loss': 0.0049, 'grad_norm': 2.5476179479705174, 'learning_rate': 5.372e-07, 'completion_length': 55.625003814697266, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.123779296875, 'epoch': 0.46} 46%|████▋ | 1157/2500 [10:05:36<12:03:26, 32.32s/it] 46%|████▋ | 1158/2500 [10:06:03<11:26:58, 30.71s/it] {'loss': 0.0086, 'grad_norm': 1.9013146831150354, 'learning_rate': 5.368e-07, 'completion_length': 55.71428871154785, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.8928571939468384, 'reward_std': 0.10101525112986565, 'kl': 0.2158203125, 'epoch': 0.46} 46%|████▋ | 1158/2500 [10:06:03<11:26:58, 30.71s/it] 46%|████▋ | 1159/2500 [10:06:35<11:34:32, 31.08s/it] {'loss': 0.0058, 'grad_norm': 1.1649559994063166, 'learning_rate': 5.364e-07, 'completion_length': 54.92857551574707, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14404296875, 'epoch': 0.46} 46%|████▋ | 1159/2500 [10:06:35<11:34:32, 31.08s/it] 46%|████▋ | 1160/2500 [10:07:02<11:01:59, 29.64s/it] {'loss': 0.0055, 'grad_norm': 0.2941780266584491, 'learning_rate': 5.36e-07, 'completion_length': 56.74107360839844, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.136474609375, 'epoch': 0.46} 46%|████▋ | 1160/2500 [10:07:02<11:01:59, 29.64s/it] 46%|████▋ | 1161/2500 [10:07:24<10:16:08, 27.61s/it] {'loss': 0.0048, 'grad_norm': 0.8347155923334072, 'learning_rate': 5.355999999999999e-07, 'completion_length': 59.58928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.119140625, 'epoch': 0.46} 46%|████▋ | 1161/2500 [10:07:24<10:16:08, 27.61s/it] 46%|████▋ | 1162/2500 [10:07:49<9:52:50, 26.58s/it] {'loss': 0.0071, 'grad_norm': 1.7948974011741958, 'learning_rate': 5.352e-07, 'completion_length': 58.21428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.177734375, 'epoch': 0.46} 46%|████▋ | 1162/2500 [10:07:49<9:52:50, 26.58s/it] 47%|████▋ | 1163/2500 [10:08:24<10:53:32, 29.33s/it] {'loss': 0.0115, 'grad_norm': 1.1339700647002968, 'learning_rate': 5.348e-07, 'completion_length': 61.20535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.28662109375, 'epoch': 0.47} 47%|████▋ | 1163/2500 [10:08:24<10:53:32, 29.33s/it] 47%|████▋ | 1164/2500 [10:08:58<11:19:56, 30.54s/it] {'loss': 0.0048, 'grad_norm': 0.24214571898534132, 'learning_rate': 5.343999999999999e-07, 'completion_length': 59.41071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.120361328125, 'epoch': 0.47} 47%|████▋ | 1164/2500 [10:08:58<11:19:56, 30.54s/it] 47%|████▋ | 1165/2500 [10:09:23<10:45:56, 29.03s/it] {'loss': 0.0103, 'grad_norm': 2.3621121373572693, 'learning_rate': 5.34e-07, 'completion_length': 60.24107551574707, 'rewards/accuracy_reward': 0.9821428656578064, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.2568359375, 'epoch': 0.47} 47%|████▋ | 1165/2500 [10:09:23<10:45:56, 29.03s/it] 47%|████▋ | 1166/2500 [10:09:50<10:28:07, 28.25s/it] {'loss': 0.0085, 'grad_norm': 0.9956646288985781, 'learning_rate': 5.336e-07, 'completion_length': 57.312503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21142578125, 'epoch': 0.47} 47%|████▋ | 1166/2500 [10:09:50<10:28:07, 28.25s/it] 47%|████▋ | 1167/2500 [10:10:14<10:03:49, 27.18s/it] {'loss': 0.0053, 'grad_norm': 1.1743369222996691, 'learning_rate': 5.331999999999999e-07, 'completion_length': 58.72321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.1318359375, 'epoch': 0.47} 47%|████▋ | 1167/2500 [10:10:14<10:03:49, 27.18s/it] 47%|████▋ | 1168/2500 [10:10:39<9:46:10, 26.40s/it] {'loss': 0.0054, 'grad_norm': 1.9964252530618465, 'learning_rate': 5.328e-07, 'completion_length': 53.87500190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.1337890625, 'epoch': 0.47} 47%|████▋ | 1168/2500 [10:10:39<9:46:10, 26.40s/it] 47%|████▋ | 1169/2500 [10:11:02<9:24:09, 25.43s/it] {'loss': 0.0051, 'grad_norm': 1.8113516848986166, 'learning_rate': 5.324e-07, 'completion_length': 49.50000190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1279296875, 'epoch': 0.47} 47%|████▋ | 1169/2500 [10:11:02<9:24:09, 25.43s/it] 47%|████▋ | 1170/2500 [10:11:26<9:11:24, 24.88s/it] {'loss': 0.0045, 'grad_norm': 1.0060330352163063, 'learning_rate': 5.32e-07, 
'completion_length': 56.67857551574707, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.11328125, 'epoch': 0.47} 47%|████▋ | 1170/2500 [10:11:26<9:11:24, 24.88s/it] 47%|████▋ | 1171/2500 [10:11:50<9:08:01, 24.74s/it] {'loss': 0.0053, 'grad_norm': 0.3649626057573215, 'learning_rate': 5.315999999999999e-07, 'completion_length': 49.26785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1318359375, 'epoch': 0.47} 47%|████▋ | 1171/2500 [10:11:50<9:08:01, 24.74s/it] 47%|████▋ | 1172/2500 [10:12:13<8:55:32, 24.20s/it] {'loss': 0.0057, 'grad_norm': 0.3357391516673942, 'learning_rate': 5.312e-07, 'completion_length': 55.23214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1416015625, 'epoch': 0.47} 47%|████▋ | 1172/2500 [10:12:13<8:55:32, 24.20s/it] 47%|████▋ | 1173/2500 [10:12:37<8:54:25, 24.16s/it] {'loss': 0.0105, 'grad_norm': 1.5242315891930953, 'learning_rate': 5.308000000000001e-07, 'completion_length': 56.69643211364746, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.11663764715194702, 'kl': 0.2646484375, 'epoch': 0.47} 47%|████▋ | 1173/2500 [10:12:37<8:54:25, 24.16s/it] 47%|████▋ | 1174/2500 [10:13:03<9:05:34, 24.69s/it] {'loss': 0.0048, 'grad_norm': 0.24622274933197666, 'learning_rate': 5.303999999999999e-07, 'completion_length': 54.23214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.120849609375, 'epoch': 0.47} 47%|████▋ | 1174/2500 [10:13:03<9:05:34, 24.69s/it] 47%|████▋ | 1175/2500 [10:13:28<9:09:39, 24.89s/it] {'loss': 0.0056, 'grad_norm': 0.34090332770348375, 'learning_rate': 5.3e-07, 'completion_length': 51.98214530944824, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.138671875, 'epoch': 0.47} 47%|████▋ | 1175/2500 [10:13:28<9:09:39, 24.89s/it] 47%|████▋ | 1176/2500 [10:13:55<9:17:42, 25.27s/it] {'loss': 0.0043, 'grad_norm': 2.491341985984314, 'learning_rate': 5.296e-07, 'completion_length': 55.35714530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.108642578125, 'epoch': 0.47} 47%|████▋ | 1176/2500 [10:13:55<9:17:42, 25.27s/it] 47%|████▋ | 1177/2500 [10:14:18<9:05:42, 24.75s/it] {'loss': 0.0048, 'grad_norm': 1.631802394463759, 'learning_rate': 5.292e-07, 'completion_length': 56.22321701049805, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.06222161278128624, 'kl': 0.119140625, 'epoch': 0.47} 47%|████▋ | 1177/2500 [10:14:18<9:05:42, 24.75s/it] 47%|████▋ | 1178/2500 [10:14:42<8:58:29, 24.44s/it] {'loss': 0.0051, 'grad_norm': 0.2881816823076559, 'learning_rate': 5.288e-07, 'completion_length': 55.43750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.127197265625, 'epoch': 0.47} 47%|████▋ | 1178/2500 [10:14:42<8:58:29, 24.44s/it] 47%|████▋ | 1179/2500 [10:15:06<8:55:09, 24.31s/it] {'loss': 0.0036, 'grad_norm': 0.6053080153556231, 'learning_rate': 5.284e-07, 'completion_length': 55.89285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0908203125, 'epoch': 0.47} 47%|████▋ | 1179/2500 [10:15:06<8:55:09, 24.31s/it] 47%|████▋ | 1180/2500 [10:15:30<8:53:56, 24.27s/it] {'loss': 0.0113, 'grad_norm': 2.5747152289775306, 'learning_rate': 5.28e-07, 'completion_length': 59.642860412597656, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9107143878936768, 'reward_std': 0.10101525485515594, 'kl': 0.2822265625, 
'epoch': 0.47} 47%|████▋ | 1180/2500 [10:15:30<8:53:56, 24.27s/it] 47%|████▋ | 1181/2500 [10:15:54<8:50:11, 24.12s/it] {'loss': 0.0032, 'grad_norm': 10.5215250999288, 'learning_rate': 5.275999999999999e-07, 'completion_length': 54.61607360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07124518603086472, 'kl': 0.08056640625, 'epoch': 0.47} 47%|████▋ | 1181/2500 [10:15:54<8:50:11, 24.12s/it] 47%|████▋ | 1182/2500 [10:16:24<9:29:54, 25.94s/it] {'loss': 0.0051, 'grad_norm': 0.24559058521790475, 'learning_rate': 5.272e-07, 'completion_length': 60.34821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12646484375, 'epoch': 0.47} 47%|████▋ | 1182/2500 [10:16:24<9:29:54, 25.94s/it] 47%|████▋ | 1183/2500 [10:16:53<9:50:38, 26.91s/it] {'loss': 0.0052, 'grad_norm': 0.45916003904474945, 'learning_rate': 5.268e-07, 'completion_length': 61.75893211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13037109375, 'epoch': 0.47} 47%|████▋ | 1183/2500 [10:16:53<9:50:38, 26.91s/it] 47%|████▋ | 1184/2500 [10:17:18<9:34:30, 26.19s/it] {'loss': 0.0051, 'grad_norm': 22.537694666472813, 'learning_rate': 5.264e-07, 'completion_length': 51.64285850524902, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.12744140625, 'epoch': 0.47} 47%|████▋ | 1184/2500 [10:17:18<9:34:30, 26.19s/it] 47%|████▋ | 1185/2500 [10:17:42<9:22:19, 25.66s/it] {'loss': 0.0033, 'grad_norm': 1.348210362487115, 'learning_rate': 5.26e-07, 'completion_length': 64.76786041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.083251953125, 'epoch': 0.47} 47%|████▋ | 1185/2500 [10:17:42<9:22:19, 25.66s/it] 47%|████▋ | 
1186/2500 [10:18:10<9:34:14, 26.22s/it] {'loss': 0.0072, 'grad_norm': 1.1743567440684, 'learning_rate': 5.255999999999999e-07, 'completion_length': 57.60714530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.18115234375, 'epoch': 0.47} 47%|████▋ | 1186/2500 [10:18:10<9:34:14, 26.22s/it] 47%|████▋ | 1187/2500 [10:18:34<9:24:13, 25.78s/it] {'loss': 0.008, 'grad_norm': 0.6535430601314691, 'learning_rate': 5.252e-07, 'completion_length': 53.91071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2001953125, 'epoch': 0.47} 47%|████▋ | 1187/2500 [10:18:34<9:24:13, 25.78s/it] 48%|████▊ | 1188/2500 [10:18:59<9:14:50, 25.37s/it] {'loss': 0.0074, 'grad_norm': 2.8560913164340054, 'learning_rate': 5.248e-07, 'completion_length': 53.69643020629883, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.1845703125, 'epoch': 0.48} 48%|████▊ | 1188/2500 [10:18:59<9:14:50, 25.37s/it] 48%|████▊ | 1189/2500 [10:19:23<9:04:15, 24.91s/it] {'loss': 0.0066, 'grad_norm': 1.3466842064959543, 'learning_rate': 5.243999999999999e-07, 'completion_length': 56.32143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.164306640625, 'epoch': 0.48} 48%|████▊ | 1189/2500 [10:19:23<9:04:15, 24.91s/it] 48%|████▊ | 1190/2500 [10:19:47<8:59:53, 24.73s/it] {'loss': 0.0112, 'grad_norm': 2.3311716177348716, 'learning_rate': 5.24e-07, 'completion_length': 61.73214530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.28076171875, 'epoch': 0.48} 48%|████▊ | 1190/2500 [10:19:47<8:59:53, 
24.73s/it] 48%|████▊ | 1191/2500 [10:20:13<9:09:04, 25.17s/it] {'loss': 0.0086, 'grad_norm': 4.878616731482891, 'learning_rate': 5.236e-07, 'completion_length': 64.76786041259766, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.08274422958493233, 'kl': 0.21435546875, 'epoch': 0.48} 48%|████▊ | 1191/2500 [10:20:13<9:09:04, 25.17s/it] 48%|████▊ | 1192/2500 [10:20:40<9:19:49, 25.68s/it] {'loss': 0.0047, 'grad_norm': 0.28277661827676387, 'learning_rate': 5.232e-07, 'completion_length': 60.78571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1171875, 'epoch': 0.48} 48%|████▊ | 1192/2500 [10:20:40<9:19:49, 25.68s/it] 48%|████▊ | 1193/2500 [10:21:09<9:41:14, 26.68s/it] {'loss': 0.0044, 'grad_norm': 0.392460453784723, 'learning_rate': 5.228e-07, 'completion_length': 62.59821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1103515625, 'epoch': 0.48} 48%|████▊ | 1193/2500 [10:21:09<9:41:14, 26.68s/it] 48%|████▊ | 1194/2500 [10:21:35<9:36:33, 26.49s/it] {'loss': 0.0083, 'grad_norm': 2.0580762178350285, 'learning_rate': 5.224e-07, 'completion_length': 62.00000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.20654296875, 'epoch': 0.48} 48%|████▊ | 1194/2500 [10:21:35<9:36:33, 26.49s/it] 48%|████▊ | 1195/2500 [10:21:59<9:18:11, 25.66s/it] {'loss': 0.0072, 'grad_norm': 0.657983868350389, 'learning_rate': 5.22e-07, 'completion_length': 56.88393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.180419921875, 'epoch': 0.48} 48%|████▊ | 1195/2500 [10:21:59<9:18:11, 25.66s/it] 48%|████▊ | 1196/2500 [10:22:24<9:15:31, 25.56s/it] {'loss': 0.0043, 'grad_norm': 0.38540762819638025, 
'learning_rate': 5.215999999999999e-07, 'completion_length': 65.87500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10693359375, 'epoch': 0.48} 48%|████▊ | 1196/2500 [10:22:24<9:15:31, 25.56s/it] 48%|████▊ | 1197/2500 [10:22:51<9:27:12, 26.12s/it] {'loss': 0.0107, 'grad_norm': 2.6063169910073434, 'learning_rate': 5.212e-07, 'completion_length': 61.267860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.26708984375, 'epoch': 0.48} 48%|████▊ | 1197/2500 [10:22:51<9:27:12, 26.12s/it] 48%|████▊ | 1198/2500 [10:23:18<9:31:12, 26.32s/it] {'loss': 0.0263, 'grad_norm': 2.9101051875645934, 'learning_rate': 5.208000000000001e-07, 'completion_length': 58.28571701049805, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.919642984867096, 'reward_std': 0.1664527878165245, 'kl': 0.6611328125, 'epoch': 0.48} 48%|████▊ | 1198/2500 [10:23:18<9:31:12, 26.32s/it] 48%|████▊ | 1199/2500 [10:23:44<9:25:25, 26.08s/it] {'loss': 0.0137, 'grad_norm': 4.771949111846585, 'learning_rate': 5.203999999999999e-07, 'completion_length': 68.41964530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.11663763970136642, 'kl': 0.3427734375, 'epoch': 0.48} 48%|████▊ | 1199/2500 [10:23:44<9:25:25, 26.08s/it] 48%|████▊ | 1200/2500 [10:24:19<10:27:44, 28.97s/it] {'loss': 0.0545, 'grad_norm': 7.805054078456885, 'learning_rate': 5.2e-07, 'completion_length': 64.37500190734863, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.8571429252624512, 'reward_std': 0.3465588837862015, 'kl': 1.361328125, 'epoch': 0.48} 48%|████▊ | 1200/2500 [10:24:20<10:27:44, 28.97s/it] 48%|████▊ | 1201/2500 [10:25:59<18:03:30, 50.05s/it] {'loss': 
0.1391, 'grad_norm': 10.402111210101747, 'learning_rate': 5.196e-07, 'completion_length': 61.767860412597656, 'rewards/accuracy_reward': 0.830357164144516, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.6517857909202576, 'reward_std': 0.6326810717582703, 'kl': 3.4765625, 'epoch': 0.48} 48%|████▊ | 1201/2500 [10:25:59<18:03:30, 50.05s/it] 48%|████▊ | 1202/2500 [10:26:15<14:25:55, 40.03s/it] {'loss': 0.4419, 'grad_norm': 23.037262666372463, 'learning_rate': 5.191999999999999e-07, 'completion_length': 51.69643020629883, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.6071428656578064, 'reward': 1.2500000596046448, 'reward_std': 0.9134338796138763, 'kl': 11.0625, 'epoch': 0.48} 48%|████▊ | 1202/2500 [10:26:15<14:25:55, 40.03s/it] 48%|████▊ | 1203/2500 [10:26:54<14:18:28, 39.71s/it] {'loss': 0.6065, 'grad_norm': 41.06419941036406, 'learning_rate': 5.188e-07, 'completion_length': 48.17857360839844, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.5535714626312256, 'reward': 1.1517857909202576, 'reward_std': 0.9471865296363831, 'kl': 15.125, 'epoch': 0.48} 48%|████▊ | 1203/2500 [10:26:54<14:18:28, 39.71s/it] 48%|████▊ | 1204/2500 [10:27:17<12:29:21, 34.69s/it] {'loss': 0.7362, 'grad_norm': 53.77522898663975, 'learning_rate': 5.184e-07, 'completion_length': 50.88393211364746, 'rewards/accuracy_reward': 0.5178571939468384, 'rewards/format_reward': 0.5089285969734192, 'reward': 1.0267857909202576, 'reward_std': 0.9603662490844727, 'kl': 18.4375, 'epoch': 0.48} 48%|████▊ | 1204/2500 [10:27:17<12:29:21, 34.69s/it] 48%|████▊ | 1205/2500 [10:27:27<9:50:01, 27.34s/it] {'loss': 0.9244, 'grad_norm': 69.53231478968354, 'learning_rate': 5.18e-07, 'completion_length': 46.62500190734863, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 0.4375000149011612, 'reward': 0.910714328289032, 'reward_std': 0.9192624688148499, 'kl': 23.125, 'epoch': 0.48} 48%|████▊ | 1205/2500 [10:27:28<9:50:01, 27.34s/it] 
48%|████▊ | 1206/2500 [10:27:41<8:19:29, 23.16s/it] {'loss': 0.6527, 'grad_norm': 49.93227431682476, 'learning_rate': 5.175999999999999e-07, 'completion_length': 48.09821701049805, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.5446428954601288, 'reward': 1.0803571939468384, 'reward_std': 0.9474038779735565, 'kl': 16.34375, 'epoch': 0.48} 48%|████▊ | 1206/2500 [10:27:41<8:19:29, 23.16s/it] 48%|████▊ | 1207/2500 [10:27:52<6:59:49, 19.48s/it] {'loss': 0.6044, 'grad_norm': 34.2656594074802, 'learning_rate': 5.172e-07, 'completion_length': 51.19643020629883, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.5000000298023224, 'reward': 1.0714285969734192, 'reward_std': 0.89887934923172, 'kl': 15.125, 'epoch': 0.48} 48%|████▊ | 1207/2500 [10:27:52<6:59:49, 19.48s/it] 48%|████▊ | 1208/2500 [10:28:38<9:49:08, 27.36s/it] {'loss': 0.2995, 'grad_norm': 16.08038287902917, 'learning_rate': 5.168e-07, 'completion_length': 53.78571701049805, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.6875000298023224, 'reward': 1.4375000596046448, 'reward_std': 0.8053743839263916, 'kl': 7.46875, 'epoch': 0.48} 48%|████▊ | 1208/2500 [10:28:38<9:49:08, 27.36s/it] 48%|████▊ | 1209/2500 [10:29:01<9:20:49, 26.06s/it] {'loss': 0.2185, 'grad_norm': 21.46206249804205, 'learning_rate': 5.163999999999999e-07, 'completion_length': 59.91071701049805, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.6250000596046448, 'reward_std': 0.7173814475536346, 'kl': 5.46875, 'epoch': 0.48} 48%|████▊ | 1209/2500 [10:29:01<9:20:49, 26.06s/it] 48%|████▊ | 1210/2500 [10:29:13<7:54:28, 22.07s/it] {'loss': 0.1317, 'grad_norm': 12.614757216533084, 'learning_rate': 5.16e-07, 'completion_length': 74.83036041259766, 'rewards/accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 0.8125000298023224, 'reward': 1.6428571939468384, 'reward_std': 0.6684167087078094, 'kl': 3.2890625, 'epoch': 
0.48} 48%|████▊ | 1210/2500 [10:29:13<7:54:28, 22.07s/it] 48%|████▊ | 1211/2500 [10:29:37<8:05:27, 22.60s/it] {'loss': 0.1042, 'grad_norm': 18.694018694688644, 'learning_rate': 5.155999999999999e-07, 'completion_length': 74.4910774230957, 'rewards/accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.7767858505249023, 'reward_std': 0.45538534224033356, 'kl': 2.6015625, 'epoch': 0.48} 48%|████▊ | 1211/2500 [10:29:37<8:05:27, 22.60s/it] 48%|████▊ | 1212/2500 [10:30:04<8:30:21, 23.77s/it] {'loss': 0.0274, 'grad_norm': 5.177610723039158, 'learning_rate': 5.152e-07, 'completion_length': 80.55357360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.955357164144516, 'reward': 1.910714328289032, 'reward_std': 0.18276765942573547, 'kl': 0.688232421875, 'epoch': 0.48} 48%|████▊ | 1212/2500 [10:30:04<8:30:21, 23.77s/it] 49%|████▊ | 1213/2500 [10:30:28<8:31:15, 23.83s/it] {'loss': 0.016, 'grad_norm': 4.8859568279259635, 'learning_rate': 5.148e-07, 'completion_length': 82.41964340209961, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.9107143878936768, 'reward_std': 0.21765290200710297, 'kl': 0.400390625, 'epoch': 0.49} 49%|████▊ | 1213/2500 [10:30:28<8:31:15, 23.83s/it] 49%|████▊ | 1214/2500 [10:31:09<10:25:37, 29.19s/it] {'loss': 0.0279, 'grad_norm': 3.835236552644168, 'learning_rate': 5.143999999999999e-07, 'completion_length': 86.1160774230957, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.6943359375, 'epoch': 0.49} 49%|████▊ | 1214/2500 [10:31:09<10:25:37, 29.19s/it] 49%|████▊ | 1215/2500 [10:31:25<8:59:58, 25.21s/it] {'loss': 0.0244, 'grad_norm': 4.804369635432823, 'learning_rate': 5.14e-07, 'completion_length': 82.78572082519531, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.955357164144516, 'reward': 
1.9285715222358704, 'reward_std': 0.16197101771831512, 'kl': 0.611328125, 'epoch': 0.49} 49%|████▊ | 1215/2500 [10:31:25<8:59:58, 25.21s/it] 49%|████▊ | 1216/2500 [10:31:37<7:30:31, 21.05s/it] {'loss': 0.0162, 'grad_norm': 3.2651525941913775, 'learning_rate': 5.135999999999999e-07, 'completion_length': 82.87500381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.40625, 'epoch': 0.49} 49%|████▊ | 1216/2500 [10:31:37<7:30:31, 21.05s/it] 49%|████▊ | 1217/2500 [10:31:49<6:36:17, 18.53s/it] {'loss': 0.006, 'grad_norm': 2.90514219464385, 'learning_rate': 5.132e-07, 'completion_length': 85.85714721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.15087890625, 'epoch': 0.49} 49%|████▊ | 1217/2500 [10:31:49<6:36:17, 18.53s/it] 49%|████▊ | 1218/2500 [10:32:30<8:55:47, 25.08s/it] {'loss': 0.0112, 'grad_norm': 1.9749459375521667, 'learning_rate': 5.128e-07, 'completion_length': 86.8214340209961, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.08747542649507523, 'kl': 0.279052734375, 'epoch': 0.49} 49%|████▊ | 1218/2500 [10:32:30<8:55:47, 25.08s/it] 49%|████▉ | 1219/2500 [10:33:33<12:58:27, 36.46s/it] {'loss': 0.0038, 'grad_norm': 1.0962171935361626, 'learning_rate': 5.124e-07, 'completion_length': 88.49107360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.09619140625, 'epoch': 0.49} 49%|████▉ | 1219/2500 [10:33:33<12:58:27, 36.46s/it] 49%|████▉ | 1220/2500 [10:33:46<10:30:36, 29.56s/it] {'loss': 0.0041, 'grad_norm': 0.30025409357392835, 'learning_rate': 5.12e-07, 'completion_length': 80.91964340209961, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1015625, 'epoch': 0.49} 49%|████▉ | 1220/2500 [10:33:46<10:30:36, 29.56s/it] 49%|████▉ | 1221/2500 [10:34:16<10:29:52, 29.55s/it] {'loss': 0.006, 'grad_norm': 1.4858742462763144, 'learning_rate': 5.116e-07, 'completion_length': 90.33929061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.150390625, 'epoch': 0.49} 49%|████▉ | 1221/2500 [10:34:16<10:29:52, 29.55s/it] 49%|████▉ | 1222/2500 [10:34:59<11:59:43, 33.79s/it] {'loss': 0.0035, 'grad_norm': 0.2514453766903487, 'learning_rate': 5.112e-07, 'completion_length': 86.65179061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.087890625, 'epoch': 0.49} 49%|████▉ | 1222/2500 [10:34:59<11:59:43, 33.79s/it] 49%|████▉ | 1223/2500 [10:35:30<11:40:23, 32.91s/it] {'loss': 0.0046, 'grad_norm': 1.1190267637769844, 'learning_rate': 5.108e-07, 'completion_length': 73.83036041259766, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.11572265625, 'epoch': 0.49} 49%|████▉ | 1223/2500 [10:35:30<11:40:23, 32.91s/it] 49%|████▉ | 1224/2500 [10:35:56<10:52:40, 30.69s/it] {'loss': 0.0058, 'grad_norm': 1.9451269917400682, 'learning_rate': 5.103999999999999e-07, 'completion_length': 84.42857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.09132473915815353, 'kl': 0.144775390625, 'epoch': 0.49} 49%|████▉ | 1224/2500 [10:35:56<10:52:40, 30.69s/it] 49%|████▉ | 1225/2500 [10:36:25<10:44:19, 30.32s/it] {'loss': 0.0038, 'grad_norm': 0.16490094527933352, 'learning_rate': 5.1e-07, 'completion_length': 72.7589340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.09423828125, 'epoch': 0.49} 49%|████▉ | 1225/2500 [10:36:25<10:44:19, 30.32s/it] 49%|████▉ | 1226/2500 [10:36:50<10:05:56, 28.54s/it] {'loss': 0.0044, 'grad_norm': 0.31582112634051934, 'learning_rate': 5.096000000000001e-07, 'completion_length': 72.88393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1103515625, 'epoch': 0.49} 49%|████▉ | 1226/2500 [10:36:50<10:05:56, 28.54s/it] 49%|████▉ | 1227/2500 [10:37:14<9:40:44, 27.37s/it] {'loss': 0.0038, 'grad_norm': 0.19843251406587878, 'learning_rate': 5.091999999999999e-07, 'completion_length': 76.65178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09375, 'epoch': 0.49} 49%|████▉ | 1227/2500 [10:37:14<9:40:44, 27.37s/it] 49%|████▉ | 1228/2500 [10:37:43<9:47:48, 27.73s/it] {'loss': 0.0154, 'grad_norm': 26.798962442262397, 'learning_rate': 5.088e-07, 'completion_length': 65.03572082519531, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.38623046875, 'epoch': 0.49} 49%|████▉ | 1228/2500 [10:37:43<9:47:48, 27.73s/it] 49%|████▉ | 1229/2500 [10:38:44<13:17:57, 37.67s/it] {'loss': 0.0175, 'grad_norm': 2.54169851646816, 'learning_rate': 5.084e-07, 'completion_length': 70.45536041259766, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.4365234375, 'epoch': 0.49} 49%|████▉ | 1229/2500 [10:38:44<13:17:57, 37.67s/it] 49%|████▉ | 1230/2500 [10:40:28<20:19:11, 57.60s/it] {'loss': 0.0089, 'grad_norm': 1.9416327546720384, 'learning_rate': 5.079999999999999e-07, 'completion_length': 66.83035850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.22216796875, 
'epoch': 0.49} 49%|████▉ | 1230/2500 [10:40:28<20:19:11, 57.60s/it] 49%|████▉ | 1231/2500 [10:40:55<17:04:52, 48.46s/it] {'loss': 0.0152, 'grad_norm': 3.1595924662676085, 'learning_rate': 5.076e-07, 'completion_length': 61.01785850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.380859375, 'epoch': 0.49} 49%|████▉ | 1231/2500 [10:40:55<17:04:52, 48.46s/it] 49%|████▉ | 1232/2500 [10:41:19<14:28:33, 41.10s/it] {'loss': 0.0188, 'grad_norm': 2.2696065286615483, 'learning_rate': 5.072e-07, 'completion_length': 73.74107360839844, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.946428656578064, 'reward_std': 0.13408026099205017, 'kl': 0.4697265625, 'epoch': 0.49} 49%|████▉ | 1232/2500 [10:41:19<14:28:33, 41.10s/it] 49%|████▉ | 1233/2500 [10:41:43<12:43:13, 36.14s/it] {'loss': 0.0057, 'grad_norm': 0.29299480142873796, 'learning_rate': 5.068e-07, 'completion_length': 59.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.142578125, 'epoch': 0.49} 49%|████▉ | 1233/2500 [10:41:43<12:43:13, 36.14s/it] 49%|████▉ | 1234/2500 [10:42:36<14:27:03, 41.09s/it] {'loss': 0.0289, 'grad_norm': 2.7724010725768724, 'learning_rate': 5.063999999999999e-07, 'completion_length': 65.82143020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.72265625, 'epoch': 0.49} 49%|████▉ | 1234/2500 [10:42:36<14:27:03, 41.09s/it] 49%|████▉ | 1235/2500 [10:43:08<13:31:07, 38.47s/it] {'loss': 0.0037, 'grad_norm': 0.1650120637832888, 'learning_rate': 5.06e-07, 'completion_length': 63.500003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.092041015625, 'epoch': 0.49} 49%|████▉ | 
1235/2500 [10:43:08<13:31:07, 38.47s/it] 49%|████▉ | 1236/2500 [10:43:35<12:13:59, 34.84s/it] {'loss': 0.0038, 'grad_norm': 1.938710351840459, 'learning_rate': 5.056e-07, 'completion_length': 63.67857360839844, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.094970703125, 'epoch': 0.49} 49%|████▉ | 1236/2500 [10:43:35<12:13:59, 34.84s/it] 49%|████▉ | 1237/2500 [10:44:53<16:46:27, 47.81s/it] {'loss': 0.0099, 'grad_norm': 3.759475484713496, 'learning_rate': 5.051999999999999e-07, 'completion_length': 65.65179061889648, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8839285969734192, 'reward_std': 0.08747542649507523, 'kl': 0.24853515625, 'epoch': 0.49} 49%|████▉ | 1237/2500 [10:44:53<16:46:27, 47.81s/it] 50%|████▉ | 1238/2500 [10:45:56<18:25:54, 52.58s/it] {'loss': 0.0058, 'grad_norm': 0.41413251451498057, 'learning_rate': 5.048e-07, 'completion_length': 58.71428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14501953125, 'epoch': 0.5} 50%|████▉ | 1238/2500 [10:45:57<18:25:54, 52.58s/it] 50%|████▉ | 1239/2500 [10:46:29<16:15:39, 46.42s/it] {'loss': 0.0045, 'grad_norm': 0.17769548740846997, 'learning_rate': 5.043999999999999e-07, 'completion_length': 65.25000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.113037109375, 'epoch': 0.5} 50%|████▉ | 1239/2500 [10:46:29<16:15:39, 46.42s/it] 50%|████▉ | 1240/2500 [10:46:52<13:49:40, 39.51s/it] {'loss': 0.0029, 'grad_norm': 0.14722278956060827, 'learning_rate': 5.04e-07, 'completion_length': 59.51785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.072998046875, 'epoch': 0.5} 50%|████▉ | 1240/2500 [10:46:52<13:49:40, 39.51s/it] 50%|████▉ | 1241/2500 [10:47:16<12:08:52, 
34.74s/it] {'loss': 0.0044, 'grad_norm': 0.33570006659183543, 'learning_rate': 5.036e-07, 'completion_length': 61.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1103515625, 'epoch': 0.5} 50%|████▉ | 1241/2500 [10:47:16<12:08:52, 34.74s/it] 50%|████▉ | 1242/2500 [10:47:40<11:02:59, 31.62s/it] {'loss': 0.0047, 'grad_norm': 0.9403115309297468, 'learning_rate': 5.032e-07, 'completion_length': 64.76786041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11669921875, 'epoch': 0.5} 50%|████▉ | 1242/2500 [10:47:40<11:02:59, 31.62s/it] 50%|████▉ | 1243/2500 [10:48:05<10:20:04, 29.60s/it] {'loss': 0.0039, 'grad_norm': 0.9329487537510884, 'learning_rate': 5.028e-07, 'completion_length': 59.12500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.097412109375, 'epoch': 0.5} 50%|████▉ | 1243/2500 [10:48:05<10:20:04, 29.60s/it] 50%|████▉ | 1244/2500 [10:48:29<9:47:37, 28.07s/it] {'loss': 0.0038, 'grad_norm': 0.186838120068473, 'learning_rate': 5.023999999999999e-07, 'completion_length': 61.91071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.093994140625, 'epoch': 0.5} 50%|████▉ | 1244/2500 [10:48:29<9:47:37, 28.07s/it] 50%|████▉ | 1245/2500 [10:48:54<9:25:09, 27.02s/it] {'loss': 0.004, 'grad_norm': 2.005061790798661, 'learning_rate': 5.02e-07, 'completion_length': 60.63393020629883, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.06222161650657654, 'kl': 0.099365234375, 'epoch': 0.5} 50%|████▉ | 1245/2500 [10:48:54<9:25:09, 27.02s/it] 50%|████▉ | 1246/2500 [10:49:17<9:02:38, 25.96s/it] {'loss': 0.0032, 'grad_norm': 0.16279159082505062, 'learning_rate': 
5.016e-07, 'completion_length': 60.11607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.080810546875, 'epoch': 0.5} 50%|████▉ | 1246/2500 [10:49:17<9:02:38, 25.96s/it] 50%|████▉ | 1247/2500 [10:49:41<8:47:32, 25.26s/it] {'loss': 0.0043, 'grad_norm': 0.2529000683050455, 'learning_rate': 5.012e-07, 'completion_length': 67.90178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10791015625, 'epoch': 0.5} 50%|████▉ | 1247/2500 [10:49:41<8:47:32, 25.26s/it] 50%|████▉ | 1248/2500 [10:50:08<8:57:57, 25.78s/it] {'loss': 0.0041, 'grad_norm': 12.051961328800882, 'learning_rate': 5.008e-07, 'completion_length': 69.65178680419922, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.101806640625, 'epoch': 0.5} 50%|████▉ | 1248/2500 [10:50:08<8:57:57, 25.78s/it] 50%|████▉ | 1249/2500 [10:50:31<8:40:53, 24.98s/it] {'loss': 0.0036, 'grad_norm': 0.15135634102273773, 'learning_rate': 5.003999999999999e-07, 'completion_length': 72.3035774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08935546875, 'epoch': 0.5} 50%|████▉ | 1249/2500 [10:50:31<8:40:53, 24.98s/it] 50%|█████ | 1250/2500 [10:50:57<8:46:26, 25.27s/it] {'loss': 0.0036, 'grad_norm': 1.3821875124579415, 'learning_rate': 5e-07, 'completion_length': 70.30357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.08984375, 'epoch': 0.5} 50%|█████ | 1250/2500 [10:50:57<8:46:26, 25.27s/it] 50%|█████ | 1251/2500 [10:51:20<8:34:41, 24.72s/it] {'loss': 0.0037, 'grad_norm': 0.6180358860204874, 'learning_rate': 4.996e-07, 'completion_length': 76.05357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 
'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.09228515625, 'epoch': 0.5} 50%|█████ | 1251/2500 [10:51:20<8:34:41, 24.72s/it] 50%|█████ | 1252/2500 [10:51:46<8:39:01, 24.95s/it] {'loss': 0.0039, 'grad_norm': 0.22437393838884134, 'learning_rate': 4.991999999999999e-07, 'completion_length': 75.79464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0966796875, 'epoch': 0.5} 50%|█████ | 1252/2500 [10:51:46<8:39:01, 24.95s/it] 50%|█████ | 1253/2500 [10:52:10<8:30:01, 24.54s/it] {'loss': 0.0028, 'grad_norm': 0.1509145467574842, 'learning_rate': 4.988e-07, 'completion_length': 60.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0697021484375, 'epoch': 0.5} 50%|█████ | 1253/2500 [10:52:10<8:30:01, 24.54s/it] 50%|█████ | 1254/2500 [10:52:36<8:39:30, 25.02s/it] {'loss': 0.0044, 'grad_norm': 0.8075387985870536, 'learning_rate': 4.984e-07, 'completion_length': 67.97321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.110595703125, 'epoch': 0.5} 50%|█████ | 1254/2500 [10:52:36<8:39:30, 25.02s/it] 50%|█████ | 1255/2500 [10:53:01<8:38:53, 25.01s/it] {'loss': 0.0049, 'grad_norm': 0.28691726941845186, 'learning_rate': 4.979999999999999e-07, 'completion_length': 65.25, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12255859375, 'epoch': 0.5} 50%|█████ | 1255/2500 [10:53:01<8:38:53, 25.01s/it] 50%|█████ | 1256/2500 [10:53:24<8:28:34, 24.53s/it] {'loss': 0.0041, 'grad_norm': 0.7786147906567182, 'learning_rate': 4.976e-07, 'completion_length': 67.77679061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1025390625, 'epoch': 0.5} 50%|█████ 
| 1256/2500 [10:53:24<8:28:34, 24.53s/it] 50%|█████ | 1257/2500 [10:53:50<8:34:35, 24.84s/it] {'loss': 0.0033, 'grad_norm': 0.13120915362137495, 'learning_rate': 4.972e-07, 'completion_length': 70.74107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.082763671875, 'epoch': 0.5} 50%|█████ | 1257/2500 [10:53:50<8:34:35, 24.84s/it] 50%|█████ | 1258/2500 [10:54:16<8:40:54, 25.16s/it] {'loss': 0.0035, 'grad_norm': 0.17034730953780206, 'learning_rate': 4.968e-07, 'completion_length': 62.642860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0869140625, 'epoch': 0.5} 50%|█████ | 1258/2500 [10:54:16<8:40:54, 25.16s/it] 50%|█████ | 1259/2500 [10:54:40<8:35:16, 24.91s/it] {'loss': 0.0035, 'grad_norm': 5.707936072037131, 'learning_rate': 4.964e-07, 'completion_length': 68.54464721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.05050762742757797, 'kl': 0.08837890625, 'epoch': 0.5} 50%|█████ | 1259/2500 [10:54:40<8:35:16, 24.91s/it] 50%|█████ | 1260/2500 [10:55:04<8:30:07, 24.68s/it] {'loss': 0.0056, 'grad_norm': 0.34761669660089783, 'learning_rate': 4.96e-07, 'completion_length': 65.34821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14111328125, 'epoch': 0.5} 50%|█████ | 1260/2500 [10:55:04<8:30:07, 24.68s/it] 50%|█████ | 1261/2500 [10:55:27<8:19:31, 24.19s/it] {'loss': 0.0044, 'grad_norm': 4.405514302761105, 'learning_rate': 4.956e-07, 'completion_length': 65.14286041259766, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.10986328125, 'epoch': 0.5} 50%|█████ | 1261/2500 [10:55:27<8:19:31, 24.19s/it] 50%|█████ | 1262/2500 [10:55:52<8:23:19, 24.39s/it] {'loss': 0.004, 'grad_norm': 
0.17713981320655425, 'learning_rate': 4.951999999999999e-07, 'completion_length': 60.812503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10107421875, 'epoch': 0.5} 50%|█████ | 1262/2500 [10:55:52<8:23:19, 24.39s/it] 51%|█████ | 1263/2500 [10:56:15<8:13:55, 23.96s/it] {'loss': 0.0038, 'grad_norm': 0.22717140987447876, 'learning_rate': 4.948e-07, 'completion_length': 61.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09521484375, 'epoch': 0.51} 51%|█████ | 1263/2500 [10:56:15<8:13:55, 23.96s/it] 51%|█████ | 1264/2500 [10:56:37<8:01:15, 23.36s/it] {'loss': 0.0048, 'grad_norm': 0.30058794424973095, 'learning_rate': 4.944e-07, 'completion_length': 62.955360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12060546875, 'epoch': 0.51} 51%|█████ | 1264/2500 [10:56:37<8:01:15, 23.36s/it] 51%|█████ | 1265/2500 [10:56:59<7:54:02, 23.03s/it] {'loss': 0.0042, 'grad_norm': 0.16424399202918286, 'learning_rate': 4.94e-07, 'completion_length': 64.53571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1044921875, 'epoch': 0.51} 51%|█████ | 1265/2500 [10:56:59<7:54:02, 23.03s/it] 51%|█████ | 1266/2500 [10:57:23<8:00:53, 23.38s/it] {'loss': 0.0049, 'grad_norm': 1.084227048794896, 'learning_rate': 4.935999999999999e-07, 'completion_length': 67.08036041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.123779296875, 'epoch': 0.51} 51%|█████ | 1266/2500 [10:57:23<8:00:53, 23.38s/it] 51%|█████ | 1267/2500 [10:57:53<8:36:39, 25.14s/it] {'loss': 0.0047, 'grad_norm': 1.256211271251895, 'learning_rate': 4.932e-07, 'completion_length': 63.312503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11767578125, 'epoch': 0.51} 51%|█████ | 1267/2500 [10:57:53<8:36:39, 25.14s/it] 51%|█████ | 1268/2500 [10:58:17<8:31:57, 24.93s/it] {'loss': 0.0027, 'grad_norm': 0.8202047514388133, 'learning_rate': 4.928e-07, 'completion_length': 66.47321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.067138671875, 'epoch': 0.51} 51%|█████ | 1268/2500 [10:58:17<8:31:57, 24.93s/it] 51%|█████ | 1269/2500 [10:58:41<8:25:31, 24.64s/it] {'loss': 0.004, 'grad_norm': 0.16627645782423514, 'learning_rate': 4.923999999999999e-07, 'completion_length': 66.03571510314941, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.100830078125, 'epoch': 0.51} 51%|█████ | 1269/2500 [10:58:41<8:25:31, 24.64s/it] 51%|█████ | 1270/2500 [10:59:05<8:19:35, 24.37s/it] {'loss': 0.0064, 'grad_norm': 1.3186832774089028, 'learning_rate': 4.92e-07, 'completion_length': 63.107147216796875, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.1591796875, 'epoch': 0.51} 51%|█████ | 1270/2500 [10:59:05<8:19:35, 24.37s/it] 51%|█████ | 1271/2500 [10:59:30<8:27:09, 24.76s/it] {'loss': 0.0039, 'grad_norm': 0.5276749347215453, 'learning_rate': 4.916e-07, 'completion_length': 60.267860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.096923828125, 'epoch': 0.51} 51%|█████ | 1271/2500 [10:59:30<8:27:09, 24.76s/it] 51%|█████ | 1272/2500 [10:59:54<8:22:44, 24.56s/it] {'loss': 0.0036, 'grad_norm': 1.0649808232370546, 'learning_rate': 4.912e-07, 'completion_length': 65.22321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 
1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.0908203125, 'epoch': 0.51} 51%|█████ | 1272/2500 [10:59:54<8:22:44, 24.56s/it] 51%|█████ | 1273/2500 [11:00:20<8:29:56, 24.94s/it] {'loss': 0.0042, 'grad_norm': 0.6162299304348383, 'learning_rate': 4.908e-07, 'completion_length': 66.81250381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.105712890625, 'epoch': 0.51} 51%|█████ | 1273/2500 [11:00:20<8:29:56, 24.94s/it] 51%|█████ | 1274/2500 [11:00:45<8:27:44, 24.85s/it] {'loss': 0.0081, 'grad_norm': 1.1942406440760758, 'learning_rate': 4.904e-07, 'completion_length': 55.75893211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.20361328125, 'epoch': 0.51} 51%|█████ | 1274/2500 [11:00:45<8:27:44, 24.85s/it] 51%|█████ | 1275/2500 [11:01:08<8:16:55, 24.34s/it] {'loss': 0.0032, 'grad_norm': 2.2905288484440587, 'learning_rate': 4.9e-07, 'completion_length': 60.27678871154785, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.0810546875, 'epoch': 0.51} 51%|█████ | 1275/2500 [11:01:08<8:16:55, 24.34s/it] 51%|█████ | 1276/2500 [11:01:31<8:06:32, 23.85s/it] {'loss': 0.0046, 'grad_norm': 0.15237145155277532, 'learning_rate': 4.895999999999999e-07, 'completion_length': 63.40178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115478515625, 'epoch': 0.51} 51%|█████ | 1276/2500 [11:01:31<8:06:32, 23.85s/it] 51%|█████ | 1277/2500 [11:01:55<8:05:57, 23.84s/it] {'loss': 0.0052, 'grad_norm': 1.4693009429145094, 'learning_rate': 4.892e-07, 'completion_length': 52.52678871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 
0.129150390625, 'epoch': 0.51} 51%|█████ | 1277/2500 [11:01:55<8:05:57, 23.84s/it] 51%|█████ | 1278/2500 [11:02:20<8:12:47, 24.20s/it] {'loss': 0.0039, 'grad_norm': 0.172403234962861, 'learning_rate': 4.888e-07, 'completion_length': 59.93750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09716796875, 'epoch': 0.51} 51%|█████ | 1278/2500 [11:02:20<8:12:47, 24.20s/it] 51%|█████ | 1279/2500 [11:02:44<8:11:38, 24.16s/it] {'loss': 0.0048, 'grad_norm': 0.17650623741829427, 'learning_rate': 4.884e-07, 'completion_length': 63.09821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12109375, 'epoch': 0.51} 51%|█████ | 1279/2500 [11:02:44<8:11:38, 24.16s/it] 51%|█████ | 1280/2500 [11:03:09<8:15:36, 24.37s/it] {'loss': 0.0042, 'grad_norm': 0.19423099501865534, 'learning_rate': 4.879999999999999e-07, 'completion_length': 54.31250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.104736328125, 'epoch': 0.51} 51%|█████ | 1280/2500 [11:03:09<8:15:36, 24.37s/it] 51%|█████ | 1281/2500 [11:03:33<8:14:56, 24.36s/it] {'loss': 0.0059, 'grad_norm': 0.32147659385202554, 'learning_rate': 4.876e-07, 'completion_length': 54.42857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.146240234375, 'epoch': 0.51} 51%|█████ | 1281/2500 [11:03:33<8:14:56, 24.36s/it] 51%|█████▏ | 1282/2500 [11:03:57<8:15:57, 24.43s/it] {'loss': 0.0061, 'grad_norm': 0.3224893428715166, 'learning_rate': 4.872e-07, 'completion_length': 57.69643211364746, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.15283203125, 'epoch': 0.51} 51%|█████▏ | 1282/2500 [11:03:57<8:15:57, 24.43s/it] 51%|█████▏ | 1283/2500 [11:04:20<8:06:19, 23.98s/it] {'loss': 0.0054, 'grad_norm': 
0.35069259768826133, 'learning_rate': 4.867999999999999e-07, 'completion_length': 53.24107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1357421875, 'epoch': 0.51} 51%|█████▏ | 1283/2500 [11:04:20<8:06:19, 23.98s/it] 51%|█████▏ | 1284/2500 [11:04:46<8:17:57, 24.57s/it] {'loss': 0.0053, 'grad_norm': 0.9683141152682949, 'learning_rate': 4.864e-07, 'completion_length': 56.67857360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13134765625, 'epoch': 0.51} 51%|█████▏ | 1284/2500 [11:04:46<8:17:57, 24.57s/it] 51%|█████▏ | 1285/2500 [11:05:10<8:13:21, 24.36s/it] {'loss': 0.0058, 'grad_norm': 0.166871353230372, 'learning_rate': 4.86e-07, 'completion_length': 61.77678871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1455078125, 'epoch': 0.51} 51%|█████▏ | 1285/2500 [11:05:10<8:13:21, 24.36s/it] 51%|█████▏ | 1286/2500 [11:05:35<8:15:29, 24.49s/it] {'loss': 0.0088, 'grad_norm': 2.7571250743321505, 'learning_rate': 4.856e-07, 'completion_length': 58.30357360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375000596046448, 'reward_std': 0.07576144114136696, 'kl': 0.22021484375, 'epoch': 0.51} 51%|█████▏ | 1286/2500 [11:05:35<8:15:29, 24.49s/it] 51%|█████▏ | 1287/2500 [11:06:00<8:16:51, 24.58s/it] {'loss': 0.004, 'grad_norm': 1.7512283357978005, 'learning_rate': 4.852e-07, 'completion_length': 58.33928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1005859375, 'epoch': 0.51} 51%|█████▏ | 1287/2500 [11:06:00<8:16:51, 24.58s/it] 52%|█████▏ | 1288/2500 [11:06:24<8:15:00, 24.51s/it] {'loss': 0.0042, 'grad_norm': 0.15390252996515874, 'learning_rate': 4.848e-07, 
'completion_length': 65.65178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.105712890625, 'epoch': 0.52} 52%|█████▏ | 1288/2500 [11:06:24<8:15:00, 24.51s/it] 52%|█████▏ | 1289/2500 [11:06:48<8:11:04, 24.33s/it] {'loss': 0.0052, 'grad_norm': 0.2961914861000559, 'learning_rate': 4.844e-07, 'completion_length': 58.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.130615234375, 'epoch': 0.52} 52%|█████▏ | 1289/2500 [11:06:48<8:11:04, 24.33s/it] 52%|█████▏ | 1290/2500 [11:07:12<8:05:38, 24.08s/it] {'loss': 0.0041, 'grad_norm': 0.8109927921617907, 'learning_rate': 4.839999999999999e-07, 'completion_length': 64.61607360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.102294921875, 'epoch': 0.52} 52%|█████▏ | 1290/2500 [11:07:12<8:05:38, 24.08s/it] 52%|█████▏ | 1291/2500 [11:07:36<8:07:29, 24.19s/it] {'loss': 0.0042, 'grad_norm': 0.22549511592956814, 'learning_rate': 4.835999999999999e-07, 'completion_length': 64.69643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.104248046875, 'epoch': 0.52} 52%|█████▏ | 1291/2500 [11:07:36<8:07:29, 24.19s/it] 52%|█████▏ | 1292/2500 [11:08:03<8:25:26, 25.10s/it] {'loss': 0.0087, 'grad_norm': 0.5688399727194384, 'learning_rate': 4.832e-07, 'completion_length': 71.41071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.21728515625, 'epoch': 0.52} 52%|█████▏ | 1292/2500 [11:08:03<8:25:26, 25.10s/it] 52%|█████▏ | 1293/2500 [11:08:27<8:17:28, 24.73s/it] {'loss': 0.0035, 'grad_norm': 2.9396245581912153, 'learning_rate': 4.828e-07, 'completion_length': 72.10714721679688, 'rewards/accuracy_reward': 0.9642857313156128, 
'rewards/format_reward': 1.0, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.088134765625, 'epoch': 0.52} 52%|█████▏ | 1293/2500 [11:08:27<8:17:28, 24.73s/it] 52%|█████▏ | 1294/2500 [11:08:52<8:16:48, 24.72s/it] {'loss': 0.0047, 'grad_norm': 1.3915666226855115, 'learning_rate': 4.823999999999999e-07, 'completion_length': 64.88393211364746, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.1181640625, 'epoch': 0.52} 52%|█████▏ | 1294/2500 [11:08:52<8:16:48, 24.72s/it] 52%|█████▏ | 1295/2500 [11:09:17<8:21:16, 24.96s/it] {'loss': 0.0029, 'grad_norm': 1.4147586280214983, 'learning_rate': 4.82e-07, 'completion_length': 70.8214340209961, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.071533203125, 'epoch': 0.52} 52%|█████▏ | 1295/2500 [11:09:17<8:21:16, 24.96s/it] 52%|█████▏ | 1296/2500 [11:09:45<8:37:53, 25.81s/it] {'loss': 0.0043, 'grad_norm': 0.11858941687148057, 'learning_rate': 4.816e-07, 'completion_length': 71.06250381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1083984375, 'epoch': 0.52} 52%|█████▏ | 1296/2500 [11:09:45<8:37:53, 25.81s/it] 52%|█████▏ | 1297/2500 [11:10:09<8:25:09, 25.19s/it] {'loss': 0.0039, 'grad_norm': 1.204629452607238, 'learning_rate': 4.812e-07, 'completion_length': 63.580360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.098388671875, 'epoch': 0.52} 52%|█████▏ | 1297/2500 [11:10:09<8:25:09, 25.19s/it] 52%|█████▏ | 1298/2500 [11:10:33<8:21:22, 25.03s/it] {'loss': 0.0041, 'grad_norm': 0.16501276896520606, 'learning_rate': 4.808e-07, 'completion_length': 68.41072082519531, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 
'reward': 2.0, 'reward_std': 0.0, 'kl': 0.102783203125, 'epoch': 0.52} 52%|█████▏ | 1298/2500 [11:10:34<8:21:22, 25.03s/it] 52%|█████▏ | 1299/2500 [11:10:59<8:25:13, 25.24s/it] {'loss': 0.0055, 'grad_norm': 1.889025547198666, 'learning_rate': 4.804e-07, 'completion_length': 68.04464721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.138427734375, 'epoch': 0.52} 52%|█████▏ | 1299/2500 [11:10:59<8:25:13, 25.24s/it] 52%|█████▏ | 1300/2500 [11:11:24<8:23:14, 25.16s/it] {'loss': 0.0044, 'grad_norm': 1.3885327966238883, 'learning_rate': 4.8e-07, 'completion_length': 66.5535774230957, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.110595703125, 'epoch': 0.52} 52%|█████▏ | 1300/2500 [11:11:24<8:23:14, 25.16s/it] 52%|█████▏ | 1301/2500 [11:12:25<11:53:40, 35.71s/it] {'loss': 0.0055, 'grad_norm': 1.3377898451723238, 'learning_rate': 4.796e-07, 'completion_length': 63.75893020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13818359375, 'epoch': 0.52} 52%|█████▏ | 1301/2500 [11:12:25<11:53:40, 35.71s/it] 52%|█████▏ | 1302/2500 [11:13:14<13:14:20, 39.78s/it] {'loss': 0.0136, 'grad_norm': 1.0254900437011614, 'learning_rate': 4.792e-07, 'completion_length': 66.88393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.340576171875, 'epoch': 0.52} 52%|█████▏ | 1302/2500 [11:13:14<13:14:20, 39.78s/it] 52%|█████▏ | 1303/2500 [11:14:43<18:09:55, 54.63s/it] {'loss': 0.0045, 'grad_norm': 1.1245112730458355, 'learning_rate': 4.788e-07, 'completion_length': 69.75000381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 
1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.112060546875, 'epoch': 0.52} 52%|█████▏ | 1303/2500 [11:14:43<18:09:55, 54.63s/it] 52%|█████▏ | 1304/2500 [11:15:12<15:35:17, 46.92s/it] {'loss': 0.003, 'grad_norm': 0.1574995076858158, 'learning_rate': 4.783999999999999e-07, 'completion_length': 64.23214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.076171875, 'epoch': 0.52} 52%|█████▏ | 1304/2500 [11:15:12<15:35:17, 46.92s/it] 52%|█████▏ | 1305/2500 [11:15:43<14:01:33, 42.25s/it] {'loss': 0.0032, 'grad_norm': 0.5894663187370213, 'learning_rate': 4.779999999999999e-07, 'completion_length': 64.99107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.079833984375, 'epoch': 0.52} 52%|█████▏ | 1305/2500 [11:15:43<14:01:33, 42.25s/it] 52%|█████▏ | 1306/2500 [11:16:16<13:04:46, 39.44s/it] {'loss': 0.0055, 'grad_norm': 1.1125032243258668, 'learning_rate': 4.776e-07, 'completion_length': 73.08928680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9285714626312256, 'reward_std': 0.10431019216775894, 'kl': 0.13720703125, 'epoch': 0.52} 52%|█████▏ | 1306/2500 [11:16:16<13:04:46, 39.44s/it] 52%|█████▏ | 1307/2500 [11:17:13<14:48:44, 44.70s/it] {'loss': 0.0031, 'grad_norm': 0.25892835545344595, 'learning_rate': 4.772e-07, 'completion_length': 65.28571891784668, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.07666015625, 'epoch': 0.52} 52%|█████▏ | 1307/2500 [11:17:13<14:48:44, 44.70s/it] 52%|█████▏ | 1308/2500 [11:18:20<16:57:25, 51.21s/it] {'loss': 0.0043, 'grad_norm': 1.606795990701598, 'learning_rate': 4.768e-07, 'completion_length': 65.95536041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.107177734375, 'epoch': 0.52} 52%|█████▏ | 1308/2500 [11:18:20<16:57:25, 51.21s/it] 52%|█████▏ | 1309/2500 [11:18:47<14:36:06, 44.14s/it] {'loss': 0.0101, 'grad_norm': 2.5838060159059983, 'learning_rate': 4.7639999999999995e-07, 'completion_length': 59.71428680419922, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.910714328289032, 'reward_std': 0.12887153029441833, 'kl': 0.25341796875, 'epoch': 0.52} 52%|█████▏ | 1309/2500 [11:18:47<14:36:06, 44.14s/it] 52%|█████▏ | 1310/2500 [11:19:12<12:40:06, 38.32s/it] {'loss': 0.0129, 'grad_norm': 2.071200963737904, 'learning_rate': 4.76e-07, 'completion_length': 59.42857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.321533203125, 'epoch': 0.52} 52%|█████▏ | 1310/2500 [11:19:12<12:40:06, 38.32s/it] 52%|█████▏ | 1311/2500 [11:19:39<11:33:04, 34.97s/it] {'loss': 0.0037, 'grad_norm': 0.13866689340793042, 'learning_rate': 4.756e-07, 'completion_length': 64.59821891784668, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.091552734375, 'epoch': 0.52} 52%|█████▏ | 1311/2500 [11:19:39<11:33:04, 34.97s/it] 52%|█████▏ | 1312/2500 [11:20:08<10:56:44, 33.17s/it] {'loss': 0.0041, 'grad_norm': 0.12897710153208378, 'learning_rate': 4.7519999999999997e-07, 'completion_length': 60.52678871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1015625, 'epoch': 0.52} 52%|█████▏ | 1312/2500 [11:20:08<10:56:44, 33.17s/it] 53%|█████▎ | 1313/2500 [11:20:33<10:04:46, 30.57s/it] {'loss': 0.0042, 'grad_norm': 0.9379689712546604, 'learning_rate': 4.748e-07, 'completion_length': 61.58035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 
'reward_std': 0.025253813713788986, 'kl': 0.1044921875, 'epoch': 0.53} 53%|█████▎ | 1313/2500 [11:20:33<10:04:46, 30.57s/it] 53%|█████▎ | 1314/2500 [11:20:58<9:32:10, 28.95s/it] {'loss': 0.0034, 'grad_norm': 0.2077576567089641, 'learning_rate': 4.7439999999999996e-07, 'completion_length': 60.10714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.085693359375, 'epoch': 0.53} 53%|█████▎ | 1314/2500 [11:20:58<9:32:10, 28.95s/it] 53%|█████▎ | 1315/2500 [11:21:21<8:55:23, 27.11s/it] {'loss': 0.0049, 'grad_norm': 0.22330843665178407, 'learning_rate': 4.7399999999999993e-07, 'completion_length': 55.18750190734863, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.12158203125, 'epoch': 0.53} 53%|█████▎ | 1315/2500 [11:21:21<8:55:23, 27.11s/it] 53%|█████▎ | 1316/2500 [11:22:04<10:29:12, 31.89s/it] {'loss': 0.0054, 'grad_norm': 1.8516762780294584, 'learning_rate': 4.736e-07, 'completion_length': 57.66964530944824, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.05050762742757797, 'kl': 0.134765625, 'epoch': 0.53} 53%|█████▎ | 1316/2500 [11:22:04<10:29:12, 31.89s/it] 53%|█████▎ | 1317/2500 [11:22:29<9:51:08, 29.98s/it] {'loss': 0.0038, 'grad_norm': 6.1614370246422645, 'learning_rate': 4.732e-07, 'completion_length': 60.47321701049805, 'rewards/accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.8839285969734192, 'reward_std': 0.08747542649507523, 'kl': 0.0947265625, 'epoch': 0.53} 53%|█████▎ | 1317/2500 [11:22:29<9:51:08, 29.98s/it] 53%|█████▎ | 1318/2500 [11:22:54<9:22:11, 28.54s/it] {'loss': 0.0067, 'grad_norm': 2.64038490158879, 'learning_rate': 4.728e-07, 'completion_length': 64.34821510314941, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 
'reward_std': 0.05831881985068321, 'kl': 0.16748046875, 'epoch': 0.53} 53%|█████▎ | 1318/2500 [11:22:54<9:22:11, 28.54s/it] 53%|█████▎ | 1319/2500 [11:23:19<9:01:29, 27.51s/it] {'loss': 0.0119, 'grad_norm': 3.002774147772918, 'learning_rate': 4.7239999999999997e-07, 'completion_length': 53.66071701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.29638671875, 'epoch': 0.53} 53%|█████▎ | 1319/2500 [11:23:19<9:01:29, 27.51s/it] 53%|█████▎ | 1320/2500 [11:23:50<9:21:16, 28.54s/it] {'loss': 0.0155, 'grad_norm': 4.3356637876758, 'learning_rate': 4.7199999999999994e-07, 'completion_length': 55.04464530944824, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.8750000596046448, 'reward_std': 0.23321620374917984, 'kl': 0.3857421875, 'epoch': 0.53} 53%|█████▎ | 1320/2500 [11:23:50<9:21:16, 28.54s/it] 53%|█████▎ | 1321/2500 [11:24:24<9:49:10, 29.98s/it] {'loss': 0.0077, 'grad_norm': 1.5604743661886635, 'learning_rate': 4.716e-07, 'completion_length': 56.64285850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.191162109375, 'epoch': 0.53} 53%|█████▎ | 1321/2500 [11:24:24<9:49:10, 29.98s/it] 53%|█████▎ | 1322/2500 [11:24:50<9:26:25, 28.85s/it] {'loss': 0.0122, 'grad_norm': 1.8993438151195707, 'learning_rate': 4.712e-07, 'completion_length': 61.875003814697266, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.3046875, 'epoch': 0.53} 53%|█████▎ | 1322/2500 [11:24:50<9:26:25, 28.85s/it] 53%|█████▎ | 1323/2500 [11:25:20<9:33:22, 29.23s/it] {'loss': 0.0041, 'grad_norm': 0.829666368057453, 'learning_rate': 4.7079999999999995e-07, 'completion_length': 58.57143211364746, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.10205078125, 'epoch': 0.53} 53%|█████▎ | 1323/2500 [11:25:20<9:33:22, 29.23s/it] 53%|█████▎ | 1324/2500 [11:25:53<9:51:50, 30.20s/it] {'loss': 0.0037, 'grad_norm': 1.9829652644984765, 'learning_rate': 4.704e-07, 'completion_length': 61.11607360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.093017578125, 'epoch': 0.53} 53%|█████▎ | 1324/2500 [11:25:53<9:51:50, 30.20s/it] 53%|█████▎ | 1325/2500 [11:26:25<10:07:20, 31.01s/it] {'loss': 0.0072, 'grad_norm': 1.3766843362015606, 'learning_rate': 4.6999999999999995e-07, 'completion_length': 61.294647216796875, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.179931640625, 'epoch': 0.53} 53%|█████▎ | 1325/2500 [11:26:25<10:07:20, 31.01s/it] 53%|█████▎ | 1326/2500 [11:26:53<9:46:49, 29.99s/it] {'loss': 0.0093, 'grad_norm': 3.5377338724142304, 'learning_rate': 4.6959999999999997e-07, 'completion_length': 61.937503814697266, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.2333984375, 'epoch': 0.53} 53%|█████▎ | 1326/2500 [11:26:53<9:46:49, 29.99s/it] 53%|█████▎ | 1327/2500 [11:27:39<11:18:13, 34.69s/it] {'loss': 0.005, 'grad_norm': 1.8749683332041407, 'learning_rate': 4.692e-07, 'completion_length': 59.72321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1240234375, 'epoch': 0.53} 53%|█████▎ | 1327/2500 [11:27:39<11:18:13, 34.69s/it] 53%|█████▎ | 1328/2500 [11:28:26<12:33:15, 38.56s/it] {'loss': 0.0049, 'grad_norm': 
0.9267185560358266, 'learning_rate': 4.6879999999999996e-07, 'completion_length': 58.392860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.123291015625, 'epoch': 0.53} 53%|█████▎ | 1328/2500 [11:28:26<12:33:15, 38.56s/it] 53%|█████▎ | 1329/2500 [11:29:08<12:50:10, 39.46s/it] {'loss': 0.01, 'grad_norm': 1.2451536071152631, 'learning_rate': 4.684e-07, 'completion_length': 65.78571891784668, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2490234375, 'epoch': 0.53} 53%|█████▎ | 1329/2500 [11:29:08<12:50:10, 39.46s/it] 53%|█████▎ | 1330/2500 [11:30:01<14:06:45, 43.42s/it] {'loss': 0.0172, 'grad_norm': 3.097318909135403, 'learning_rate': 4.68e-07, 'completion_length': 58.71428871154785, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.4306640625, 'epoch': 0.53} 53%|█████▎ | 1330/2500 [11:30:01<14:06:45, 43.42s/it] 53%|█████▎ | 1331/2500 [11:30:53<14:57:44, 46.08s/it] {'loss': 0.0076, 'grad_norm': 2.010298196001704, 'learning_rate': 4.676e-07, 'completion_length': 59.392860412597656, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780732154846, 'kl': 0.18994140625, 'epoch': 0.53} 53%|█████▎ | 1331/2500 [11:30:53<14:57:44, 46.08s/it] 53%|█████▎ | 1332/2500 [11:31:31<14:11:14, 43.73s/it] {'loss': 0.0033, 'grad_norm': 1.4872424548093712, 'learning_rate': 4.672e-07, 'completion_length': 64.07143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.083251953125, 'epoch': 0.53} 53%|█████▎ | 1332/2500 [11:31:31<14:11:14, 43.73s/it] 
53%|█████▎ | 1333/2500 [11:31:55<12:17:30, 37.92s/it] {'loss': 0.0033, 'grad_norm': 0.8641365946684337, 'learning_rate': 4.6679999999999997e-07, 'completion_length': 62.78571891784668, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.0830078125, 'epoch': 0.53} 53%|█████▎ | 1333/2500 [11:31:55<12:17:30, 37.92s/it] 53%|█████▎ | 1334/2500 [11:32:46<13:31:29, 41.76s/it] {'loss': 0.0031, 'grad_norm': 2.505595503656492, 'learning_rate': 4.6639999999999994e-07, 'completion_length': 75.04464721679688, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.076416015625, 'epoch': 0.53} 53%|█████▎ | 1334/2500 [11:32:46<13:31:29, 41.76s/it] 53%|█████▎ | 1335/2500 [11:34:14<18:00:03, 55.63s/it] {'loss': 0.0039, 'grad_norm': 1.7978447286150976, 'learning_rate': 4.66e-07, 'completion_length': 62.21428871154785, 'rewards/accuracy_reward': 0.8482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.848214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.09765625, 'epoch': 0.53} 53%|█████▎ | 1335/2500 [11:34:14<18:00:03, 55.63s/it] 53%|█████▎ | 1336/2500 [11:34:40<15:05:30, 46.68s/it] {'loss': 0.0029, 'grad_norm': 0.1516057449715052, 'learning_rate': 4.656e-07, 'completion_length': 63.12500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.07373046875, 'epoch': 0.53} 53%|█████▎ | 1336/2500 [11:34:40<15:05:30, 46.68s/it] 53%|█████▎ | 1337/2500 [11:35:05<12:58:58, 40.19s/it] {'loss': 0.0109, 'grad_norm': 1.3897775704618498, 'learning_rate': 4.6519999999999996e-07, 'completion_length': 72.64286041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.273193359375, 'epoch': 0.53} 53%|█████▎ | 1337/2500 
[11:35:05<12:58:58, 40.19s/it] 54%|█████▎ | 1338/2500 [11:35:28<11:19:26, 35.08s/it] {'loss': 0.0035, 'grad_norm': 0.9084755023607902, 'learning_rate': 4.648e-07, 'completion_length': 71.56250381469727, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.0885009765625, 'epoch': 0.54} 54%|█████▎ | 1338/2500 [11:35:28<11:19:26, 35.08s/it] 54%|█████▎ | 1339/2500 [11:35:55<10:31:17, 32.63s/it] {'loss': 0.0036, 'grad_norm': 1.4174358687839053, 'learning_rate': 4.6439999999999995e-07, 'completion_length': 70.74107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.08984375, 'epoch': 0.54} 54%|█████▎ | 1339/2500 [11:35:55<10:31:17, 32.63s/it] 54%|█████▎ | 1340/2500 [11:36:19<9:40:26, 30.02s/it] {'loss': 0.0041, 'grad_norm': 0.24063786277337584, 'learning_rate': 4.64e-07, 'completion_length': 67.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.102294921875, 'epoch': 0.54} 54%|█████▎ | 1340/2500 [11:36:19<9:40:26, 30.02s/it] 54%|█████▎ | 1341/2500 [11:37:19<12:34:39, 39.07s/it] {'loss': 0.0066, 'grad_norm': 0.9565429036398321, 'learning_rate': 4.636e-07, 'completion_length': 76.07143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1650390625, 'epoch': 0.54} 54%|█████▎ | 1341/2500 [11:37:19<12:34:39, 39.07s/it] 54%|█████▎ | 1342/2500 [11:38:47<17:16:58, 53.73s/it] {'loss': 0.0036, 'grad_norm': 1.0151718010905242, 'learning_rate': 4.6319999999999997e-07, 'completion_length': 71.61607360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.0894775390625, 
'epoch': 0.54} 54%|█████▎ | 1342/2500 [11:38:47<17:16:58, 53.73s/it] 54%|█████▎ | 1343/2500 [11:39:13<14:35:36, 45.41s/it] {'loss': 0.0031, 'grad_norm': 0.18558685565143215, 'learning_rate': 4.628e-07, 'completion_length': 75.24107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.07666015625, 'epoch': 0.54} 54%|█████▎ | 1343/2500 [11:39:13<14:35:36, 45.41s/it] 54%|█████▍ | 1344/2500 [11:39:38<12:35:22, 39.21s/it] {'loss': 0.0053, 'grad_norm': 1.3725484352377106, 'learning_rate': 4.6239999999999996e-07, 'completion_length': 73.24107360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9464285969734192, 'reward_std': 0.07839837670326233, 'kl': 0.13330078125, 'epoch': 0.54} 54%|█████▍ | 1344/2500 [11:39:38<12:35:22, 39.21s/it] 54%|█████▍ | 1345/2500 [11:40:16<12:26:49, 38.80s/it] {'loss': 0.0045, 'grad_norm': 0.8501087791651282, 'learning_rate': 4.62e-07, 'completion_length': 73.08928680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.111572265625, 'epoch': 0.54} 54%|█████▍ | 1345/2500 [11:40:16<12:26:49, 38.80s/it] 54%|█████▍ | 1346/2500 [11:40:40<11:00:37, 34.35s/it] {'loss': 0.003, 'grad_norm': 1.4128889888788763, 'learning_rate': 4.616e-07, 'completion_length': 74.93750381469727, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.10882645845413208, 'kl': 0.074462890625, 'epoch': 0.54} 54%|█████▍ | 1346/2500 [11:40:40<11:00:37, 34.35s/it] 54%|█████▍ | 1347/2500 [11:41:04<10:02:34, 31.36s/it] {'loss': 0.0034, 'grad_norm': 2.018393781990806, 'learning_rate': 4.612e-07, 'completion_length': 71.58929061889648, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 
0.025253813713788986, 'kl': 0.0859375, 'epoch': 0.54} 54%|█████▍ | 1347/2500 [11:41:04<10:02:34, 31.36s/it] 54%|█████▍ | 1348/2500 [11:41:28<9:21:18, 29.23s/it] {'loss': 0.0036, 'grad_norm': 1.4426958322981887, 'learning_rate': 4.6079999999999994e-07, 'completion_length': 70.31250381469727, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.09130859375, 'epoch': 0.54} 54%|█████▍ | 1348/2500 [11:41:28<9:21:18, 29.23s/it] 54%|█████▍ | 1349/2500 [11:41:54<8:59:33, 28.13s/it] {'loss': 0.0141, 'grad_norm': 2.03347174794309, 'learning_rate': 4.6039999999999997e-07, 'completion_length': 71.95536041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.35107421875, 'epoch': 0.54} 54%|█████▍ | 1349/2500 [11:41:54<8:59:33, 28.13s/it] 54%|█████▍ | 1350/2500 [11:42:18<8:38:16, 27.04s/it] {'loss': 0.0131, 'grad_norm': 2.1106269478108444, 'learning_rate': 4.6e-07, 'completion_length': 79.86607360839844, 'rewards/accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 0.973214328289032, 'reward': 1.8125000596046448, 'reward_std': 0.17816676944494247, 'kl': 0.3271484375, 'epoch': 0.54} 54%|█████▍ | 1350/2500 [11:42:18<8:38:16, 27.04s/it] 54%|█████▍ | 1351/2500 [11:42:45<8:35:44, 26.93s/it] {'loss': 0.004, 'grad_norm': 2.4523286780215017, 'learning_rate': 4.596e-07, 'completion_length': 69.17857360839844, 'rewards/accuracy_reward': 0.848214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8392857909202576, 'reward_std': 0.1963018849492073, 'kl': 0.099609375, 'epoch': 0.54} 54%|█████▍ | 1351/2500 [11:42:45<8:35:44, 26.93s/it] 54%|█████▍ | 1352/2500 [11:43:13<8:40:29, 27.20s/it] {'loss': 0.0043, 'grad_norm': 0.7850655348964426, 'learning_rate': 4.592e-07, 'completion_length': 74.08036041259766, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.108642578125, 'epoch': 0.54} 54%|█████▍ | 1352/2500 [11:43:13<8:40:29, 27.20s/it] 54%|█████▍ | 1353/2500 [11:43:43<8:57:48, 28.13s/it] {'loss': 0.0121, 'grad_norm': 5.576372712078236, 'learning_rate': 4.5879999999999995e-07, 'completion_length': 65.16071510314941, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.883928656578064, 'reward_std': 0.12956400960683823, 'kl': 0.302734375, 'epoch': 0.54} 54%|█████▍ | 1353/2500 [11:43:43<8:57:48, 28.13s/it] 54%|█████▍ | 1354/2500 [11:44:26<10:19:26, 32.43s/it] {'loss': 0.0031, 'grad_norm': 1.201891549007961, 'learning_rate': 4.584e-07, 'completion_length': 67.62500381469727, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.076416015625, 'epoch': 0.54} 54%|█████▍ | 1354/2500 [11:44:26<10:19:26, 32.43s/it] 54%|█████▍ | 1355/2500 [11:45:18<12:14:52, 38.51s/it] {'loss': 0.0038, 'grad_norm': 0.2761365540355422, 'learning_rate': 4.58e-07, 'completion_length': 61.732147216796875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09521484375, 'epoch': 0.54} 54%|█████▍ | 1355/2500 [11:45:18<12:14:52, 38.51s/it] 54%|█████▍ | 1356/2500 [11:45:56<12:07:18, 38.15s/it] {'loss': 0.0035, 'grad_norm': 1.7746102485874427, 'learning_rate': 4.5759999999999997e-07, 'completion_length': 59.125003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.087646484375, 'epoch': 0.54} 54%|█████▍ | 1356/2500 [11:45:56<12:07:18, 38.15s/it] 54%|█████▍ | 1357/2500 [11:46:21<10:51:52, 34.22s/it] {'loss': 0.0054, 'grad_norm': 0.5045499309331556, 'learning_rate': 4.572e-07, 'completion_length': 65.37500381469727, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.135009765625, 'epoch': 0.54} 54%|█████▍ | 1357/2500 [11:46:21<10:51:52, 34.22s/it] 54%|█████▍ | 1358/2500 [11:46:46<10:00:17, 31.54s/it] {'loss': 0.0051, 'grad_norm': 1.3461506547985587, 'learning_rate': 4.5679999999999996e-07, 'completion_length': 63.098215103149414, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.126708984375, 'epoch': 0.54} 54%|█████▍ | 1358/2500 [11:46:46<10:00:17, 31.54s/it] 54%|█████▍ | 1359/2500 [11:47:09<9:11:13, 28.99s/it] {'loss': 0.0034, 'grad_norm': 1.014881502075769, 'learning_rate': 4.5639999999999993e-07, 'completion_length': 60.41964530944824, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.085205078125, 'epoch': 0.54} 54%|█████▍ | 1359/2500 [11:47:09<9:11:13, 28.99s/it] 54%|█████▍ | 1360/2500 [11:47:33<8:43:21, 27.55s/it] {'loss': 0.0056, 'grad_norm': 0.4572436073053789, 'learning_rate': 4.56e-07, 'completion_length': 65.09821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14013671875, 'epoch': 0.54} 54%|█████▍ | 1360/2500 [11:47:33<8:43:21, 27.55s/it] 54%|█████▍ | 1361/2500 [11:47:58<8:27:32, 26.74s/it] {'loss': 0.0051, 'grad_norm': 0.46864796070768294, 'learning_rate': 4.556e-07, 'completion_length': 58.33928871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.126708984375, 'epoch': 0.54} 54%|█████▍ | 1361/2500 [11:47:58<8:27:32, 26.74s/it] 54%|█████▍ | 1362/2500 [11:48:22<8:13:09, 26.00s/it] {'loss': 0.005, 'grad_norm': 6.930297661063172, 'learning_rate': 4.5519999999999995e-07, 'completion_length': 56.45535850524902, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 
'reward_std': 0.03818017989397049, 'kl': 0.125244140625, 'epoch': 0.54} 54%|█████▍ | 1362/2500 [11:48:22<8:13:09, 26.00s/it] 55%|█████▍ | 1363/2500 [11:48:49<8:14:58, 26.12s/it] {'loss': 0.0036, 'grad_norm': 1.2357459812063658, 'learning_rate': 4.5479999999999997e-07, 'completion_length': 68.17857360839844, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.06222161650657654, 'kl': 0.0888671875, 'epoch': 0.55} 55%|█████▍ | 1363/2500 [11:48:49<8:14:58, 26.12s/it] 55%|█████▍ | 1364/2500 [11:49:12<8:00:02, 25.35s/it] {'loss': 0.0052, 'grad_norm': 2.4393411634522795, 'learning_rate': 4.544e-07, 'completion_length': 58.78571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.129150390625, 'epoch': 0.55} 55%|█████▍ | 1364/2500 [11:49:12<8:00:02, 25.35s/it] 55%|█████▍ | 1365/2500 [11:49:57<9:47:38, 31.06s/it] {'loss': 0.0045, 'grad_norm': 0.17789106522224754, 'learning_rate': 4.54e-07, 'completion_length': 63.669647216796875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.112548828125, 'epoch': 0.55} 55%|█████▍ | 1365/2500 [11:49:57<9:47:38, 31.06s/it] 55%|█████▍ | 1366/2500 [11:50:29<9:53:55, 31.42s/it] {'loss': 0.0044, 'grad_norm': 0.19999142201244396, 'learning_rate': 4.536e-07, 'completion_length': 60.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.111083984375, 'epoch': 0.55} 55%|█████▍ | 1366/2500 [11:50:29<9:53:55, 31.42s/it] 55%|█████▍ | 1367/2500 [11:52:30<18:22:11, 58.37s/it] {'loss': 0.0079, 'grad_norm': 2.12094345420618, 'learning_rate': 4.5319999999999996e-07, 'completion_length': 60.32143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 
'kl': 0.197021484375, 'epoch': 0.55} 55%|█████▍ | 1367/2500 [11:52:30<18:22:11, 58.37s/it] 55%|█████▍ | 1368/2500 [11:53:02<15:52:59, 50.51s/it] {'loss': 0.0041, 'grad_norm': 1.3484958559787614, 'learning_rate': 4.528e-07, 'completion_length': 55.28571701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.103515625, 'epoch': 0.55} 55%|█████▍ | 1368/2500 [11:53:02<15:52:59, 50.51s/it] 55%|█████▍ | 1369/2500 [11:53:26<13:19:52, 42.43s/it] {'loss': 0.0058, 'grad_norm': 0.5802891698494735, 'learning_rate': 4.524e-07, 'completion_length': 61.69643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1455078125, 'epoch': 0.55} 55%|█████▍ | 1369/2500 [11:53:26<13:19:52, 42.43s/it] 55%|█████▍ | 1370/2500 [11:53:50<11:32:12, 36.75s/it] {'loss': 0.006, 'grad_norm': 0.4034758343197092, 'learning_rate': 4.5199999999999997e-07, 'completion_length': 55.58928871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15087890625, 'epoch': 0.55} 55%|█████▍ | 1370/2500 [11:53:50<11:32:12, 36.75s/it] 55%|█████▍ | 1371/2500 [11:54:12<10:12:15, 32.54s/it] {'loss': 0.0079, 'grad_norm': 1.2515135669586896, 'learning_rate': 4.516e-07, 'completion_length': 49.205360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.197021484375, 'epoch': 0.55} 55%|█████▍ | 1371/2500 [11:54:12<10:12:15, 32.54s/it] 55%|█████▍ | 1372/2500 [11:54:36<9:20:12, 29.80s/it] {'loss': 0.0044, 'grad_norm': 0.19783523775402714, 'learning_rate': 4.5119999999999996e-07, 'completion_length': 58.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10986328125, 'epoch': 0.55} 55%|█████▍ | 1372/2500 [11:54:36<9:20:12, 
29.80s/it] 55%|█████▍ | 1373/2500 [11:54:59<8:41:41, 27.77s/it] {'loss': 0.0063, 'grad_norm': 1.349123714106259, 'learning_rate': 4.5079999999999993e-07, 'completion_length': 56.21428680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.158203125, 'epoch': 0.55} 55%|█████▍ | 1373/2500 [11:54:59<8:41:41, 27.77s/it] 55%|█████▍ | 1374/2500 [11:55:22<8:15:04, 26.38s/it] {'loss': 0.005, 'grad_norm': 0.8752404163009276, 'learning_rate': 4.504e-07, 'completion_length': 57.84821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.125244140625, 'epoch': 0.55} 55%|█████▍ | 1374/2500 [11:55:22<8:15:04, 26.38s/it] 55%|█████▌ | 1375/2500 [11:55:46<8:02:06, 25.71s/it] {'loss': 0.0047, 'grad_norm': 0.17289589713613707, 'learning_rate': 4.5e-07, 'completion_length': 56.500003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11669921875, 'epoch': 0.55} 55%|█████▌ | 1375/2500 [11:55:46<8:02:06, 25.71s/it] 55%|█████▌ | 1376/2500 [11:56:09<7:48:19, 25.00s/it] {'loss': 0.0125, 'grad_norm': 1.6065886612394795, 'learning_rate': 4.496e-07, 'completion_length': 60.53571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3125, 'epoch': 0.55} 55%|█████▌ | 1376/2500 [11:56:09<7:48:19, 25.00s/it] 55%|█████▌ | 1377/2500 [11:56:33<7:38:35, 24.50s/it] {'loss': 0.0119, 'grad_norm': 0.9597471536689344, 'learning_rate': 4.4919999999999997e-07, 'completion_length': 55.33928680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.29736328125, 'epoch': 0.55} 55%|█████▌ | 
1377/2500 [11:56:33<7:38:35, 24.50s/it] 55%|█████▌ | 1378/2500 [11:56:57<7:37:49, 24.48s/it] {'loss': 0.0036, 'grad_norm': 0.2656868799414774, 'learning_rate': 4.4879999999999994e-07, 'completion_length': 54.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.089599609375, 'epoch': 0.55} 55%|█████▌ | 1378/2500 [11:56:57<7:37:49, 24.48s/it] 55%|█████▌ | 1379/2500 [11:57:21<7:32:53, 24.24s/it] {'loss': 0.004, 'grad_norm': 1.5779650178610103, 'learning_rate': 4.484e-07, 'completion_length': 61.43750190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.10009765625, 'epoch': 0.55} 55%|█████▌ | 1379/2500 [11:57:21<7:32:53, 24.24s/it] 55%|█████▌ | 1380/2500 [11:57:46<7:36:20, 24.45s/it] {'loss': 0.0079, 'grad_norm': 1.1195557970258965, 'learning_rate': 4.48e-07, 'completion_length': 59.294647216796875, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.1982421875, 'epoch': 0.55} 55%|█████▌ | 1380/2500 [11:57:46<7:36:20, 24.45s/it] 55%|█████▌ | 1381/2500 [11:58:11<7:39:58, 24.66s/it] {'loss': 0.0045, 'grad_norm': 3.062063391306855, 'learning_rate': 4.4759999999999996e-07, 'completion_length': 62.848215103149414, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.11328125, 'epoch': 0.55} 55%|█████▌ | 1381/2500 [11:58:11<7:39:58, 24.66s/it] 55%|█████▌ | 1382/2500 [11:58:33<7:27:27, 24.01s/it] {'loss': 0.0059, 'grad_norm': 2.545407728493252, 'learning_rate': 4.472e-07, 'completion_length': 64.06250190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.148193359375, 'epoch': 
0.55} 55%|█████▌ | 1382/2500 [11:58:33<7:27:27, 24.01s/it] 55%|█████▌ | 1383/2500 [11:58:58<7:28:52, 24.11s/it] {'loss': 0.0062, 'grad_norm': 0.42339481489350345, 'learning_rate': 4.4679999999999995e-07, 'completion_length': 56.86607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.155517578125, 'epoch': 0.55} 55%|█████▌ | 1383/2500 [11:58:58<7:28:52, 24.11s/it] 55%|█████▌ | 1384/2500 [11:59:22<7:31:24, 24.27s/it] {'loss': 0.0043, 'grad_norm': 0.187598263017446, 'learning_rate': 4.464e-07, 'completion_length': 65.09821891784668, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.10693359375, 'epoch': 0.55} 55%|█████▌ | 1384/2500 [11:59:22<7:31:24, 24.27s/it] 55%|█████▌ | 1385/2500 [11:59:46<7:30:06, 24.22s/it] {'loss': 0.004, 'grad_norm': 1.342617297435614, 'learning_rate': 4.46e-07, 'completion_length': 57.94643020629883, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.1005859375, 'epoch': 0.55} 55%|█████▌ | 1385/2500 [11:59:46<7:30:06, 24.22s/it] 55%|█████▌ | 1386/2500 [12:00:10<7:24:49, 23.96s/it] {'loss': 0.0035, 'grad_norm': 2.055545425893607, 'learning_rate': 4.4559999999999997e-07, 'completion_length': 64.41964721679688, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343399360775948, 'kl': 0.087890625, 'epoch': 0.55} 55%|█████▌ | 1386/2500 [12:00:10<7:24:49, 23.96s/it] 55%|█████▌ | 1387/2500 [12:00:33<7:22:21, 23.85s/it] {'loss': 0.0052, 'grad_norm': 0.2799121545123168, 'learning_rate': 4.452e-07, 'completion_length': 63.13393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.129150390625, 'epoch': 0.55} 55%|█████▌ | 1387/2500 [12:00:33<7:22:21, 23.85s/it] 56%|█████▌ | 
1388/2500 [12:00:59<7:29:52, 24.27s/it] {'loss': 0.0036, 'grad_norm': 2.120071821247203, 'learning_rate': 4.4479999999999996e-07, 'completion_length': 70.21429061889648, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07124518603086472, 'kl': 0.09033203125, 'epoch': 0.56} 56%|█████▌ | 1388/2500 [12:00:59<7:29:52, 24.27s/it] 56%|█████▌ | 1389/2500 [12:01:25<7:43:48, 25.05s/it] {'loss': 0.004, 'grad_norm': 1.024622120938386, 'learning_rate': 4.444e-07, 'completion_length': 69.41071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.100341796875, 'epoch': 0.56} 56%|█████▌ | 1389/2500 [12:01:25<7:43:48, 25.05s/it] 56%|█████▌ | 1390/2500 [12:01:51<7:48:32, 25.33s/it] {'loss': 0.0034, 'grad_norm': 0.13623352871652653, 'learning_rate': 4.44e-07, 'completion_length': 63.55357551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0859375, 'epoch': 0.56} 56%|█████▌ | 1390/2500 [12:01:51<7:48:32, 25.33s/it] 56%|█████▌ | 1391/2500 [12:02:16<7:41:33, 24.97s/it] {'loss': 0.0112, 'grad_norm': 2.02172270425567, 'learning_rate': 4.436e-07, 'completion_length': 70.5714340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.278564453125, 'epoch': 0.56} 56%|█████▌ | 1391/2500 [12:02:16<7:41:33, 24.97s/it] 56%|█████▌ | 1392/2500 [12:02:46<8:10:04, 26.54s/it] {'loss': 0.0046, 'grad_norm': 1.9406578286675675, 'learning_rate': 4.4319999999999995e-07, 'completion_length': 66.50000381469727, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343398988246918, 'kl': 0.11572265625, 'epoch': 0.56} 56%|█████▌ | 1392/2500 [12:02:46<8:10:04, 26.54s/it] 56%|█████▌ | 
1393/2500 [12:03:08<7:44:54, 25.20s/it] {'loss': 0.0045, 'grad_norm': 0.2859231684465482, 'learning_rate': 4.428e-07, 'completion_length': 64.29464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.111572265625, 'epoch': 0.56} 56%|█████▌ | 1393/2500 [12:03:08<7:44:54, 25.20s/it] 56%|█████▌ | 1394/2500 [12:03:35<7:53:44, 25.70s/it] {'loss': 0.0113, 'grad_norm': 1.9337707959071386, 'learning_rate': 4.424e-07, 'completion_length': 65.4285774230957, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9107143878936768, 'reward_std': 0.13408026099205017, 'kl': 0.283203125, 'epoch': 0.56} 56%|█████▌ | 1394/2500 [12:03:35<7:53:44, 25.70s/it] 56%|█████▌ | 1395/2500 [12:04:01<7:57:28, 25.93s/it] {'loss': 0.0038, 'grad_norm': 3.0610676448779737, 'learning_rate': 4.4199999999999996e-07, 'completion_length': 75.37500381469727, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.08747543022036552, 'kl': 0.094482421875, 'epoch': 0.56} 56%|█████▌ | 1395/2500 [12:04:01<7:57:28, 25.93s/it] 56%|█████▌ | 1396/2500 [12:04:25<7:44:22, 25.24s/it] {'loss': 0.0375, 'grad_norm': 3.6635719438106413, 'learning_rate': 4.416e-07, 'completion_length': 65.80357360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.09138382598757744, 'kl': 0.9404296875, 'epoch': 0.56} 56%|█████▌ | 1396/2500 [12:04:25<7:44:22, 25.24s/it] 56%|█████▌ | 1397/2500 [12:04:51<7:50:33, 25.60s/it] {'loss': 0.015, 'grad_norm': 3.1520119083116818, 'learning_rate': 4.4119999999999995e-07, 'completion_length': 65.28571701049805, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.883928656578064, 'reward_std': 0.14700663089752197, 'kl': 0.3759765625, 'epoch': 0.56} 56%|█████▌ | 1397/2500 
[12:04:51<7:50:33, 25.60s/it] 56%|█████▌ | 1398/2500 [12:05:15<7:42:10, 25.16s/it] {'loss': 0.0093, 'grad_norm': 1.5397929380000044, 'learning_rate': 4.4080000000000003e-07, 'completion_length': 66.95536041259766, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285715222358704, 'reward_std': 0.0835726335644722, 'kl': 0.2314453125, 'epoch': 0.56} 56%|█████▌ | 1398/2500 [12:05:15<7:42:10, 25.16s/it] 56%|█████▌ | 1399/2500 [12:05:46<8:10:25, 26.73s/it] {'loss': 0.0038, 'grad_norm': 0.13203374789913924, 'learning_rate': 4.404e-07, 'completion_length': 60.50000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.094970703125, 'epoch': 0.56} 56%|█████▌ | 1399/2500 [12:05:46<8:10:25, 26.73s/it] 56%|█████▌ | 1400/2500 [12:06:09<7:49:03, 25.58s/it] {'loss': 0.0069, 'grad_norm': 3.1083015119836612, 'learning_rate': 4.3999999999999997e-07, 'completion_length': 61.41964530944824, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285715222358704, 'reward_std': 0.14579425007104874, 'kl': 0.17236328125, 'epoch': 0.56} 56%|█████▌ | 1400/2500 [12:06:09<7:49:03, 25.58s/it] 56%|█████▌ | 1401/2500 [12:07:08<10:56:54, 35.86s/it] {'loss': 0.0046, 'grad_norm': 0.8221232903721963, 'learning_rate': 4.396e-07, 'completion_length': 68.27678680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.11572265625, 'epoch': 0.56} 56%|█████▌ | 1401/2500 [12:07:08<10:56:54, 35.86s/it] 56%|█████▌ | 1402/2500 [12:07:29<9:33:28, 31.34s/it] {'loss': 0.0124, 'grad_norm': 2.638677257925489, 'learning_rate': 4.3919999999999996e-07, 'completion_length': 67.37500381469727, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 
0.10882644355297089, 'kl': 0.31201171875, 'epoch': 0.56} 56%|█████▌ | 1402/2500 [12:07:29<9:33:28, 31.34s/it] 56%|█████▌ | 1403/2500 [12:07:51<8:37:45, 28.32s/it] {'loss': 0.0135, 'grad_norm': 3.04723667188385, 'learning_rate': 4.388e-07, 'completion_length': 66.6964340209961, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.946428656578064, 'reward_std': 0.15152287483215332, 'kl': 0.33837890625, 'epoch': 0.56} 56%|█████▌ | 1403/2500 [12:07:51<8:37:45, 28.32s/it] 56%|█████▌ | 1404/2500 [12:08:12<8:02:13, 26.40s/it] {'loss': 0.0085, 'grad_norm': 1.5158967583072755, 'learning_rate': 4.384e-07, 'completion_length': 60.32143020629883, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.8928572535514832, 'reward_std': 0.10101525485515594, 'kl': 0.2138671875, 'epoch': 0.56} 56%|█████▌ | 1404/2500 [12:08:12<8:02:13, 26.40s/it] 56%|█████▌ | 1405/2500 [12:08:39<8:00:19, 26.32s/it] {'loss': 0.0157, 'grad_norm': 1.8262930485658009, 'learning_rate': 4.38e-07, 'completion_length': 64.91071701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.0739355981349945, 'kl': 0.39453125, 'epoch': 0.56} 56%|█████▌ | 1405/2500 [12:08:39<8:00:19, 26.32s/it] 56%|█████▌ | 1406/2500 [12:09:03<7:51:17, 25.85s/it] {'loss': 0.0084, 'grad_norm': 0.9879853737819658, 'learning_rate': 4.3759999999999995e-07, 'completion_length': 62.651790618896484, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.2099609375, 'epoch': 0.56} 56%|█████▌ | 1406/2500 [12:09:03<7:51:17, 25.85s/it] 56%|█████▋ | 1407/2500 [12:09:28<7:42:58, 25.41s/it] {'loss': 0.0053, 'grad_norm': 5.8418056832356635, 'learning_rate': 4.3719999999999997e-07, 'completion_length': 67.31250381469727, 'rewards/accuracy_reward': 0.9196428954601288, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.10101525112986565, 'kl': 0.132568359375, 'epoch': 0.56} 56%|█████▋ | 1407/2500 [12:09:28<7:42:58, 25.41s/it] 56%|█████▋ | 1408/2500 [12:09:58<8:08:08, 26.82s/it] {'loss': 0.0169, 'grad_norm': 1.2652621516688232, 'learning_rate': 4.368e-07, 'completion_length': 59.88393020629883, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.42138671875, 'epoch': 0.56} 56%|█████▋ | 1408/2500 [12:09:58<8:08:08, 26.82s/it] 56%|█████▋ | 1409/2500 [12:10:26<8:14:22, 27.19s/it] {'loss': 0.0077, 'grad_norm': 0.9267181839617691, 'learning_rate': 4.364e-07, 'completion_length': 59.88393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.193359375, 'epoch': 0.56} 56%|█████▋ | 1409/2500 [12:10:26<8:14:22, 27.19s/it] 56%|█████▋ | 1410/2500 [12:10:51<8:01:01, 26.48s/it] {'loss': 0.0049, 'grad_norm': 1.1789607274523086, 'learning_rate': 4.36e-07, 'completion_length': 63.58928680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.12158203125, 'epoch': 0.56} 56%|█████▋ | 1410/2500 [12:10:51<8:01:01, 26.48s/it] 56%|█████▋ | 1411/2500 [12:11:15<7:46:27, 25.70s/it] {'loss': 0.0049, 'grad_norm': 3.4345241580832977, 'learning_rate': 4.3559999999999996e-07, 'completion_length': 57.76785850524902, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.07124518603086472, 'kl': 0.121337890625, 'epoch': 0.56} 56%|█████▋ | 1411/2500 [12:11:15<7:46:27, 25.70s/it] 56%|█████▋ | 1412/2500 [12:11:41<7:48:17, 25.82s/it] {'loss': 0.0109, 'grad_norm': 1.2097340212269145, 'learning_rate': 4.352e-07, 'completion_length': 
61.39285850524902, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.27294921875, 'epoch': 0.56} 56%|█████▋ | 1412/2500 [12:11:41<7:48:17, 25.82s/it] 57%|█████▋ | 1413/2500 [12:12:04<7:35:25, 25.14s/it] {'loss': 0.0051, 'grad_norm': 1.3143265948533998, 'learning_rate': 4.348e-07, 'completion_length': 63.46428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.12841796875, 'epoch': 0.57} 57%|█████▋ | 1413/2500 [12:12:04<7:35:25, 25.14s/it] 57%|█████▋ | 1414/2500 [12:13:39<13:51:53, 45.96s/it] {'loss': 0.0106, 'grad_norm': 1.4222051177150454, 'learning_rate': 4.3439999999999997e-07, 'completion_length': 60.67857551574707, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553572535514832, 'reward_std': 0.08747542649507523, 'kl': 0.265625, 'epoch': 0.57} 57%|█████▋ | 1414/2500 [12:13:39<13:51:53, 45.96s/it] 57%|█████▋ | 1415/2500 [12:14:19<13:18:04, 44.13s/it] {'loss': 0.0161, 'grad_norm': 2.7043188205540303, 'learning_rate': 4.34e-07, 'completion_length': 65.9285774230957, 'rewards/accuracy_reward': 0.848214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.821428656578064, 'reward_std': 0.19051415473222733, 'kl': 0.40234375, 'epoch': 0.57} 57%|█████▋ | 1415/2500 [12:14:19<13:18:04, 44.13s/it] 57%|█████▋ | 1416/2500 [12:14:43<11:31:43, 38.29s/it] {'loss': 0.0065, 'grad_norm': 1.4087430402202734, 'learning_rate': 4.3359999999999997e-07, 'completion_length': 59.16964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.162353515625, 'epoch': 0.57} 57%|█████▋ | 1416/2500 [12:14:43<11:31:43, 38.29s/it] 57%|█████▋ | 1417/2500 [12:15:07<10:10:31, 33.82s/it] 
{'loss': 0.005, 'grad_norm': 0.6952852728208528, 'learning_rate': 4.3319999999999994e-07, 'completion_length': 59.705360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.12548828125, 'epoch': 0.57} 57%|█████▋ | 1417/2500 [12:15:07<10:10:31, 33.82s/it] 57%|█████▋ | 1418/2500 [12:15:29<9:10:05, 30.50s/it] {'loss': 0.0059, 'grad_norm': 1.3137480031018847, 'learning_rate': 4.328e-07, 'completion_length': 56.16071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.146728515625, 'epoch': 0.57} 57%|█████▋ | 1418/2500 [12:15:29<9:10:05, 30.50s/it] 57%|█████▋ | 1419/2500 [12:15:53<8:33:37, 28.51s/it] {'loss': 0.0056, 'grad_norm': 0.2402630513500315, 'learning_rate': 4.324e-07, 'completion_length': 60.63393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1396484375, 'epoch': 0.57} 57%|█████▋ | 1419/2500 [12:15:53<8:33:37, 28.51s/it] 57%|█████▋ | 1420/2500 [12:16:16<7:59:42, 26.65s/it] {'loss': 0.0046, 'grad_norm': 0.24356155417230593, 'learning_rate': 4.3199999999999995e-07, 'completion_length': 57.40178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114501953125, 'epoch': 0.57} 57%|█████▋ | 1420/2500 [12:16:16<7:59:42, 26.65s/it] 57%|█████▋ | 1421/2500 [12:16:39<7:41:59, 25.69s/it] {'loss': 0.0045, 'grad_norm': 1.4124159024967506, 'learning_rate': 4.316e-07, 'completion_length': 53.18750190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.112548828125, 'epoch': 0.57} 57%|█████▋ | 1421/2500 [12:16:39<7:41:59, 25.69s/it] 57%|█████▋ | 1422/2500 [12:17:04<7:36:18, 25.40s/it] {'loss': 0.0056, 'grad_norm': 1.48944010385179, 
'learning_rate': 4.312e-07, 'completion_length': 52.80357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.140625, 'epoch': 0.57} 57%|█████▋ | 1422/2500 [12:17:04<7:36:18, 25.40s/it] 57%|█████▋ | 1423/2500 [12:17:27<7:25:17, 24.81s/it] {'loss': 0.0046, 'grad_norm': 3.1577279446143374, 'learning_rate': 4.308e-07, 'completion_length': 49.22321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.114990234375, 'epoch': 0.57} 57%|█████▋ | 1423/2500 [12:17:27<7:25:17, 24.81s/it] 57%|█████▋ | 1424/2500 [12:17:51<7:21:01, 24.59s/it] {'loss': 0.0064, 'grad_norm': 0.3713075758695377, 'learning_rate': 4.304e-07, 'completion_length': 54.68750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.16064453125, 'epoch': 0.57} 57%|█████▋ | 1424/2500 [12:17:51<7:21:01, 24.59s/it] 57%|█████▋ | 1425/2500 [12:18:14<7:09:44, 23.99s/it] {'loss': 0.0045, 'grad_norm': 0.27176443877011364, 'learning_rate': 4.2999999999999996e-07, 'completion_length': 58.19643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1123046875, 'epoch': 0.57} 57%|█████▋ | 1425/2500 [12:18:14<7:09:44, 23.99s/it] 57%|█████▋ | 1426/2500 [12:18:37<7:07:08, 23.86s/it] {'loss': 0.0046, 'grad_norm': 0.2667154050783699, 'learning_rate': 4.296e-07, 'completion_length': 54.02678871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114013671875, 'epoch': 0.57} 57%|█████▋ | 1426/2500 [12:18:37<7:07:08, 23.86s/it] 57%|█████▋ | 1427/2500 [12:19:15<8:21:08, 28.02s/it] {'loss': 0.0096, 'grad_norm': 1.5873330236021193, 'learning_rate': 4.292e-07, 'completion_length': 52.37500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.24072265625, 'epoch': 0.57} 57%|█████▋ | 1427/2500 [12:19:15<8:21:08, 28.02s/it] 57%|█████▋ | 1428/2500 [12:19:42<8:12:35, 27.57s/it] {'loss': 0.006, 'grad_norm': 1.4364402565025771, 'learning_rate': 4.288e-07, 'completion_length': 51.517860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14892578125, 'epoch': 0.57} 57%|█████▋ | 1428/2500 [12:19:42<8:12:35, 27.57s/it] 57%|█████▋ | 1429/2500 [12:20:09<8:11:05, 27.51s/it] {'loss': 0.0058, 'grad_norm': 1.0804705903806797, 'learning_rate': 4.284e-07, 'completion_length': 55.89285850524902, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.053144559264183044, 'kl': 0.1455078125, 'epoch': 0.57} 57%|█████▋ | 1429/2500 [12:20:09<8:11:05, 27.51s/it] 57%|█████▋ | 1430/2500 [12:20:32<7:44:11, 26.03s/it] {'loss': 0.0051, 'grad_norm': 3.6352810556601534, 'learning_rate': 4.2799999999999997e-07, 'completion_length': 52.39285850524902, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.0835726335644722, 'kl': 0.128173828125, 'epoch': 0.57} 57%|█████▋ | 1430/2500 [12:20:32<7:44:11, 26.03s/it] 57%|█████▋ | 1431/2500 [12:20:57<7:38:28, 25.73s/it] {'loss': 0.0053, 'grad_norm': 0.33262799325581205, 'learning_rate': 4.2759999999999994e-07, 'completion_length': 48.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.133544921875, 'epoch': 0.57} 57%|█████▋ | 1431/2500 [12:20:57<7:38:28, 25.73s/it] 57%|█████▋ | 1432/2500 [12:21:20<7:27:36, 25.15s/it] {'loss': 0.006, 'grad_norm': 1.5050031005605062, 'learning_rate': 4.272e-07, 'completion_length': 51.74107360839844, 
'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.150390625, 'epoch': 0.57} 57%|█████▋ | 1432/2500 [12:21:20<7:27:36, 25.15s/it] 57%|█████▋ | 1433/2500 [12:21:44<7:19:16, 24.70s/it] {'loss': 0.0138, 'grad_norm': 2.7498212577787764, 'learning_rate': 4.268e-07, 'completion_length': 49.99107360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.3447265625, 'epoch': 0.57} 57%|█████▋ | 1433/2500 [12:21:44<7:19:16, 24.70s/it] 57%|█████▋ | 1434/2500 [12:22:08<7:16:43, 24.58s/it] {'loss': 0.0061, 'grad_norm': 2.696650763107826, 'learning_rate': 4.264e-07, 'completion_length': 50.500003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15185546875, 'epoch': 0.57} 57%|█████▋ | 1434/2500 [12:22:08<7:16:43, 24.58s/it] 57%|█████▋ | 1435/2500 [12:22:32<7:10:26, 24.25s/it] {'loss': 0.0111, 'grad_norm': 1.700147959710534, 'learning_rate': 4.26e-07, 'completion_length': 56.48214530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.27734375, 'epoch': 0.57} 57%|█████▋ | 1435/2500 [12:22:32<7:10:26, 24.25s/it] 57%|█████▋ | 1436/2500 [12:22:55<7:06:23, 24.04s/it] {'loss': 0.0137, 'grad_norm': 5.298681352874627, 'learning_rate': 4.2559999999999995e-07, 'completion_length': 48.080360412597656, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.910714328289032, 'reward_std': 0.09919501841068268, 'kl': 0.3427734375, 'epoch': 0.57} 57%|█████▋ | 1436/2500 [12:22:56<7:06:23, 24.04s/it] 57%|█████▋ | 1437/2500 [12:23:18<6:56:02, 23.48s/it] {'loss': 0.0162, 'grad_norm': 
3.9590102239264753, 'learning_rate': 4.252e-07, 'completion_length': 48.56250190734863, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9285714626312256, 'reward_std': 0.14970264583826065, 'kl': 0.4052734375, 'epoch': 0.57} 57%|█████▋ | 1437/2500 [12:23:18<6:56:02, 23.48s/it] 58%|█████▊ | 1438/2500 [12:23:42<6:58:42, 23.66s/it] {'loss': 0.0131, 'grad_norm': 3.607757005427599, 'learning_rate': 4.248e-07, 'completion_length': 53.90178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.326171875, 'epoch': 0.58} 58%|█████▊ | 1438/2500 [12:23:42<6:58:42, 23.66s/it] 58%|█████▊ | 1439/2500 [12:24:04<6:53:44, 23.40s/it] {'loss': 0.0159, 'grad_norm': 3.3172645358829036, 'learning_rate': 4.2439999999999996e-07, 'completion_length': 54.04464530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9375001192092896, 'reward_std': 0.15933407098054886, 'kl': 0.3984375, 'epoch': 0.58} 58%|█████▊ | 1439/2500 [12:24:04<6:53:44, 23.40s/it] 58%|█████▊ | 1440/2500 [12:24:28<6:54:53, 23.48s/it] {'loss': 0.0493, 'grad_norm': 7.619155377685216, 'learning_rate': 4.24e-07, 'completion_length': 47.08928871154785, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.9017858505249023, 'reward_std': 0.24290671199560165, 'kl': 1.23046875, 'epoch': 0.58} 58%|█████▊ | 1440/2500 [12:24:28<6:54:53, 23.48s/it] 58%|█████▊ | 1441/2500 [12:24:53<7:03:44, 24.01s/it] {'loss': 0.0403, 'grad_norm': 5.124475644418302, 'learning_rate': 4.2359999999999995e-07, 'completion_length': 54.16071701049805, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.8660715222358704, 'reward_std': 0.3213050812482834, 'kl': 1.005859375, 'epoch': 0.58} 58%|█████▊ | 1441/2500 [12:24:53<7:03:44, 
24.01s/it] 58%|█████▊ | 1442/2500 [12:25:19<7:11:37, 24.48s/it] {'loss': 0.0265, 'grad_norm': 3.7160404645045486, 'learning_rate': 4.232e-07, 'completion_length': 55.892860412597656, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.9196429252624512, 'reward_std': 0.1664527878165245, 'kl': 0.6640625, 'epoch': 0.58} 58%|█████▊ | 1442/2500 [12:25:19<7:11:37, 24.48s/it] 58%|█████▊ | 1443/2500 [12:25:42<7:02:25, 23.98s/it] {'loss': 0.0193, 'grad_norm': 2.7700084371039715, 'learning_rate': 4.228e-07, 'completion_length': 56.39285850524902, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.973214328289032, 'reward': 1.910714328289032, 'reward_std': 0.16197101026773453, 'kl': 0.4833984375, 'epoch': 0.58} 58%|█████▊ | 1443/2500 [12:25:42<7:02:25, 23.98s/it] 58%|█████▊ | 1444/2500 [12:26:05<6:57:49, 23.74s/it] {'loss': 0.0158, 'grad_norm': 1.5990081549476554, 'learning_rate': 4.2239999999999997e-07, 'completion_length': 50.15178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3935546875, 'epoch': 0.58} 58%|█████▊ | 1444/2500 [12:26:05<6:57:49, 23.74s/it] 58%|█████▊ | 1445/2500 [12:26:29<6:56:38, 23.70s/it] {'loss': 0.0063, 'grad_norm': 1.5935294832336573, 'learning_rate': 4.2199999999999994e-07, 'completion_length': 53.44643211364746, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.158203125, 'epoch': 0.58} 58%|█████▊ | 1445/2500 [12:26:29<6:56:38, 23.70s/it] 58%|█████▊ | 1446/2500 [12:26:59<7:31:22, 25.69s/it] {'loss': 0.0061, 'grad_norm': 0.5054185175339663, 'learning_rate': 4.2159999999999996e-07, 'completion_length': 51.42857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15283203125, 'epoch': 0.58} 
58%|█████▊ | 1446/2500 [12:26:59<7:31:22, 25.69s/it] 58%|█████▊ | 1447/2500 [12:27:22<7:19:11, 25.02s/it] {'loss': 0.0082, 'grad_norm': 0.7044312648164318, 'learning_rate': 4.212e-07, 'completion_length': 51.73214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.20556640625, 'epoch': 0.58} 58%|█████▊ | 1447/2500 [12:27:22<7:19:11, 25.02s/it] 58%|█████▊ | 1448/2500 [12:27:47<7:17:42, 24.96s/it] {'loss': 0.0042, 'grad_norm': 2.491990086067441, 'learning_rate': 4.208e-07, 'completion_length': 56.03571701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.10498046875, 'epoch': 0.58} 58%|█████▊ | 1448/2500 [12:27:47<7:17:42, 24.96s/it] 58%|█████▊ | 1449/2500 [12:28:10<7:05:41, 24.30s/it] {'loss': 0.0048, 'grad_norm': 0.3431470555801394, 'learning_rate': 4.204e-07, 'completion_length': 62.330360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.120849609375, 'epoch': 0.58} 58%|█████▊ | 1449/2500 [12:28:10<7:05:41, 24.30s/it] 58%|█████▊ | 1450/2500 [12:28:33<7:00:59, 24.06s/it] {'loss': 0.0046, 'grad_norm': 5.307330597796585, 'learning_rate': 4.1999999999999995e-07, 'completion_length': 59.75000190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.11572265625, 'epoch': 0.58} 58%|█████▊ | 1450/2500 [12:28:33<7:00:59, 24.06s/it] 58%|█████▊ | 1451/2500 [12:29:02<7:25:53, 25.50s/it] {'loss': 0.0066, 'grad_norm': 2.485773316274263, 'learning_rate': 4.1959999999999997e-07, 'completion_length': 63.00893211364746, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.8571429252624512, 'reward_std': 0.10101525485515594, 'kl': 0.16455078125, 'epoch': 0.58} 58%|█████▊ | 1451/2500 [12:29:02<7:25:53, 
25.50s/it] 58%|█████▊ | 1452/2500 [12:29:26<7:14:30, 24.88s/it] {'loss': 0.0039, 'grad_norm': 0.1890353383207547, 'learning_rate': 4.192e-07, 'completion_length': 67.45536041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.096435546875, 'epoch': 0.58} 58%|█████▊ | 1452/2500 [12:29:26<7:14:30, 24.88s/it] 58%|█████▊ | 1453/2500 [12:29:52<7:21:04, 25.28s/it] {'loss': 0.0041, 'grad_norm': 0.20775806548695475, 'learning_rate': 4.1879999999999996e-07, 'completion_length': 67.55357360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.102783203125, 'epoch': 0.58} 58%|█████▊ | 1453/2500 [12:29:52<7:21:04, 25.28s/it] 58%|█████▊ | 1454/2500 [12:30:18<7:24:49, 25.52s/it] {'loss': 0.0039, 'grad_norm': 0.1610159244735762, 'learning_rate': 4.184e-07, 'completion_length': 63.830360412597656, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.0966796875, 'epoch': 0.58} 58%|█████▊ | 1454/2500 [12:30:18<7:24:49, 25.52s/it] 58%|█████▊ | 1455/2500 [12:30:42<7:15:22, 25.00s/it] {'loss': 0.0035, 'grad_norm': 0.17497949627804926, 'learning_rate': 4.1799999999999996e-07, 'completion_length': 75.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0888671875, 'epoch': 0.58} 58%|█████▊ | 1455/2500 [12:30:42<7:15:22, 25.00s/it] 58%|█████▊ | 1456/2500 [12:31:54<11:22:38, 39.23s/it] {'loss': 0.0045, 'grad_norm': 1.0505977567008578, 'learning_rate': 4.1760000000000003e-07, 'completion_length': 69.10714721679688, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.11181640625, 'epoch': 0.58} 58%|█████▊ | 1456/2500 [12:31:54<11:22:38, 39.23s/it] 58%|█████▊ | 1457/2500 [12:33:34<16:36:42, 57.34s/it] 
{'loss': 0.0033, 'grad_norm': 0.18083616025134472, 'learning_rate': 4.172e-07, 'completion_length': 68.04464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.083740234375, 'epoch': 0.58} 58%|█████▊ | 1457/2500 [12:33:34<16:36:42, 57.34s/it] 58%|█████▊ | 1458/2500 [12:34:42<17:29:39, 60.44s/it] {'loss': 0.0032, 'grad_norm': 0.1660372369991102, 'learning_rate': 4.1679999999999997e-07, 'completion_length': 69.72321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.081298828125, 'epoch': 0.58} 58%|█████▊ | 1458/2500 [12:34:42<17:29:39, 60.44s/it] 58%|█████▊ | 1459/2500 [12:35:09<14:34:37, 50.41s/it] {'loss': 0.0049, 'grad_norm': 1.9069672311663077, 'learning_rate': 4.164e-07, 'completion_length': 68.70535850524902, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.05050762742757797, 'kl': 0.1220703125, 'epoch': 0.58} 58%|█████▊ | 1459/2500 [12:35:09<14:34:37, 50.41s/it] 58%|█████▊ | 1460/2500 [12:35:33<12:18:28, 42.60s/it] {'loss': 0.0039, 'grad_norm': 0.7853317204555793, 'learning_rate': 4.1599999999999997e-07, 'completion_length': 68.83036041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.097412109375, 'epoch': 0.58} 58%|█████▊ | 1460/2500 [12:35:33<12:18:28, 42.60s/it] 58%|█████▊ | 1461/2500 [12:36:00<10:54:35, 37.80s/it] {'loss': 0.0033, 'grad_norm': 0.1849547555611368, 'learning_rate': 4.156e-07, 'completion_length': 69.10714340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08251953125, 'epoch': 0.58} 58%|█████▊ | 1461/2500 [12:36:00<10:54:35, 37.80s/it] 58%|█████▊ | 1462/2500 [12:36:24<9:43:26, 33.72s/it] {'loss': 0.0037, 'grad_norm': 0.14925780417730014, 'learning_rate': 4.152e-07, 
'completion_length': 67.6160774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09130859375, 'epoch': 0.58} 58%|█████▊ | 1462/2500 [12:36:24<9:43:26, 33.72s/it] 59%|█████▊ | 1463/2500 [12:36:48<8:55:10, 30.96s/it] {'loss': 0.0038, 'grad_norm': 0.1857252573131869, 'learning_rate': 4.148e-07, 'completion_length': 66.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09375, 'epoch': 0.59} 59%|█████▊ | 1463/2500 [12:36:48<8:55:10, 30.96s/it] 59%|█████▊ | 1464/2500 [12:37:13<8:19:51, 28.95s/it] {'loss': 0.0036, 'grad_norm': 0.1319740835954639, 'learning_rate': 4.1439999999999995e-07, 'completion_length': 64.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.090576171875, 'epoch': 0.59} 59%|█████▊ | 1464/2500 [12:37:13<8:19:51, 28.95s/it] 59%|█████▊ | 1465/2500 [12:37:36<7:49:11, 27.20s/it] {'loss': 0.0036, 'grad_norm': 0.28704925064621095, 'learning_rate': 4.14e-07, 'completion_length': 62.437503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09033203125, 'epoch': 0.59} 59%|█████▊ | 1465/2500 [12:37:36<7:49:11, 27.20s/it] 59%|█████▊ | 1466/2500 [12:38:02<7:44:57, 26.98s/it] {'loss': 0.0038, 'grad_norm': 0.18201193317247588, 'learning_rate': 4.136e-07, 'completion_length': 75.85714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09423828125, 'epoch': 0.59} 59%|█████▊ | 1466/2500 [12:38:02<7:44:57, 26.98s/it] 59%|█████▊ | 1467/2500 [12:38:28<7:36:48, 26.53s/it] {'loss': 0.0035, 'grad_norm': 0.18648997096104833, 'learning_rate': 4.1319999999999997e-07, 'completion_length': 69.76786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08642578125, 'epoch': 0.59} 59%|█████▊ | 1467/2500 
[12:38:28<7:36:48, 26.53s/it] 59%|█████▊ | 1468/2500 [12:38:54<7:35:28, 26.48s/it] {'loss': 0.004, 'grad_norm': 0.15084875311414747, 'learning_rate': 4.128e-07, 'completion_length': 64.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.099609375, 'epoch': 0.59} 59%|█████▊ | 1468/2500 [12:38:54<7:35:28, 26.48s/it] 59%|█████▉ | 1469/2500 [12:39:17<7:18:31, 25.52s/it] {'loss': 0.0045, 'grad_norm': 0.1992131950706808, 'learning_rate': 4.1239999999999996e-07, 'completion_length': 64.23214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.112548828125, 'epoch': 0.59} 59%|█████▉ | 1469/2500 [12:39:17<7:18:31, 25.52s/it] 59%|█████▉ | 1470/2500 [12:39:41<7:08:38, 24.97s/it] {'loss': 0.003, 'grad_norm': 1.1846263216281947, 'learning_rate': 4.12e-07, 'completion_length': 69.25893020629883, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.075439453125, 'epoch': 0.59} 59%|█████▉ | 1470/2500 [12:39:41<7:08:38, 24.97s/it] 59%|█████▉ | 1471/2500 [12:40:11<7:35:31, 26.56s/it] {'loss': 0.0031, 'grad_norm': 0.15577636747527784, 'learning_rate': 4.116e-07, 'completion_length': 64.78571891784668, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.078857421875, 'epoch': 0.59} 59%|█████▉ | 1471/2500 [12:40:11<7:35:31, 26.56s/it] 59%|█████▉ | 1472/2500 [12:40:35<7:23:01, 25.86s/it] {'loss': 0.0036, 'grad_norm': 0.15040223894135285, 'learning_rate': 4.112e-07, 'completion_length': 68.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.090576171875, 'epoch': 0.59} 59%|█████▉ | 1472/2500 [12:40:36<7:23:01, 25.86s/it] 59%|█████▉ | 1473/2500 [12:40:59<7:10:16, 25.14s/it] {'loss': 0.004, 'grad_norm': 1.3546883960551481, 'learning_rate': 
4.108e-07, 'completion_length': 61.18750190734863, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.0986328125, 'epoch': 0.59} 59%|█████▉ | 1473/2500 [12:40:59<7:10:16, 25.14s/it] 59%|█████▉ | 1474/2500 [12:41:22<7:00:24, 24.59s/it] {'loss': 0.0038, 'grad_norm': 0.651751077989198, 'learning_rate': 4.1039999999999997e-07, 'completion_length': 62.312503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09423828125, 'epoch': 0.59} 59%|█████▉ | 1474/2500 [12:41:22<7:00:24, 24.59s/it] 59%|█████▉ | 1475/2500 [12:41:47<6:59:19, 24.55s/it] {'loss': 0.0045, 'grad_norm': 1.7005440240343646, 'learning_rate': 4.0999999999999994e-07, 'completion_length': 71.17857360839844, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9017857909202576, 'reward_std': 0.03696779906749725, 'kl': 0.113525390625, 'epoch': 0.59} 59%|█████▉ | 1475/2500 [12:41:47<6:59:19, 24.55s/it] 59%|█████▉ | 1476/2500 [12:42:10<6:52:08, 24.15s/it] {'loss': 0.0034, 'grad_norm': 1.9846900568675472, 'learning_rate': 4.096e-07, 'completion_length': 68.66964721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.084716796875, 'epoch': 0.59} 59%|█████▉ | 1476/2500 [12:42:10<6:52:08, 24.15s/it] 59%|█████▉ | 1477/2500 [12:42:33<6:47:34, 23.90s/it] {'loss': 0.0038, 'grad_norm': 0.20907489526573203, 'learning_rate': 4.092e-07, 'completion_length': 70.84821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.095458984375, 'epoch': 0.59} 59%|█████▉ | 1477/2500 [12:42:33<6:47:34, 23.90s/it] 59%|█████▉ | 1478/2500 [12:42:58<6:49:50, 24.06s/it] {'loss': 0.0229, 'grad_norm': 2.9814231961156485, 'learning_rate': 
4.0879999999999995e-07, 'completion_length': 69.66071701049805, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.8750000596046448, 'reward_std': 0.0739355981349945, 'kl': 0.57080078125, 'epoch': 0.59} 59%|█████▉ | 1478/2500 [12:42:58<6:49:50, 24.06s/it] 59%|█████▉ | 1479/2500 [12:43:21<6:46:09, 23.87s/it] {'loss': 0.003, 'grad_norm': 2.8833851614682873, 'learning_rate': 4.084e-07, 'completion_length': 77.61607360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.07568359375, 'epoch': 0.59} 59%|█████▉ | 1479/2500 [12:43:21<6:46:09, 23.87s/it] 59%|█████▉ | 1480/2500 [12:43:45<6:46:08, 23.89s/it] {'loss': 0.0053, 'grad_norm': 1.2106194914025685, 'learning_rate': 4.0799999999999995e-07, 'completion_length': 75.42857360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1318359375, 'epoch': 0.59} 59%|█████▉ | 1480/2500 [12:43:45<6:46:08, 23.89s/it] 59%|█████▉ | 1481/2500 [12:44:10<6:50:56, 24.20s/it] {'loss': 0.0041, 'grad_norm': 0.1874292037743049, 'learning_rate': 4.076e-07, 'completion_length': 64.90178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.102294921875, 'epoch': 0.59} 59%|█████▉ | 1481/2500 [12:44:10<6:50:56, 24.20s/it] 59%|█████▉ | 1482/2500 [12:44:33<6:42:54, 23.75s/it] {'loss': 0.0146, 'grad_norm': 1.6625063652728274, 'learning_rate': 4.072e-07, 'completion_length': 73.30357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.364990234375, 'epoch': 0.59} 59%|█████▉ | 1482/2500 [12:44:33<6:42:54, 23.75s/it] 59%|█████▉ | 1483/2500 [12:44:58<6:52:18, 24.32s/it] {'loss': 0.0037, 'grad_norm': 
0.13722372835994126, 'learning_rate': 4.0679999999999996e-07, 'completion_length': 71.56250381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09130859375, 'epoch': 0.59} 59%|█████▉ | 1483/2500 [12:44:58<6:52:18, 24.32s/it] 59%|█████▉ | 1484/2500 [12:45:22<6:47:46, 24.08s/it] {'loss': 0.0024, 'grad_norm': 0.10944476709252421, 'learning_rate': 4.064e-07, 'completion_length': 67.33036041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0611572265625, 'epoch': 0.59} 59%|█████▉ | 1484/2500 [12:45:22<6:47:46, 24.08s/it] 59%|█████▉ | 1485/2500 [12:45:45<6:45:00, 23.94s/it] {'loss': 0.0039, 'grad_norm': 0.13711761496880506, 'learning_rate': 4.06e-07, 'completion_length': 69.3660774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.097412109375, 'epoch': 0.59} 59%|█████▉ | 1485/2500 [12:45:45<6:45:00, 23.94s/it] 59%|█████▉ | 1486/2500 [12:46:08<6:38:45, 23.60s/it] {'loss': 0.003, 'grad_norm': 1.8722267816862763, 'learning_rate': 4.056e-07, 'completion_length': 70.1339340209961, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.075927734375, 'epoch': 0.59} 59%|█████▉ | 1486/2500 [12:46:08<6:38:45, 23.60s/it] 59%|█████▉ | 1487/2500 [12:46:32<6:39:42, 23.67s/it] {'loss': 0.0031, 'grad_norm': 4.263433909006657, 'learning_rate': 4.052e-07, 'completion_length': 63.44643211364746, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.077392578125, 'epoch': 0.59} 59%|█████▉ | 1487/2500 [12:46:32<6:39:42, 23.67s/it] 60%|█████▉ | 1488/2500 [12:46:57<6:44:51, 24.00s/it] {'loss': 0.0033, 'grad_norm': 0.11914005360557536, 'learning_rate': 4.0479999999999997e-07, 'completion_length': 72.33036041259766, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08154296875, 'epoch': 0.6} 60%|█████▉ | 1488/2500 [12:46:57<6:44:51, 24.00s/it] 60%|█████▉ | 1489/2500 [12:47:21<6:46:30, 24.12s/it] {'loss': 0.0341, 'grad_norm': 4.5196149337924405, 'learning_rate': 4.0439999999999994e-07, 'completion_length': 65.51786041259766, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9285714626312256, 'reward_std': 0.06613001227378845, 'kl': 0.853271484375, 'epoch': 0.6} 60%|█████▉ | 1489/2500 [12:47:21<6:46:30, 24.12s/it] 60%|█████▉ | 1490/2500 [12:47:43<6:35:22, 23.49s/it] {'loss': 0.034, 'grad_norm': 4.1010298243807375, 'learning_rate': 4.04e-07, 'completion_length': 68.75000381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9553571939468384, 'reward_std': 0.053144559264183044, 'kl': 0.850830078125, 'epoch': 0.6} 60%|█████▉ | 1490/2500 [12:47:43<6:35:22, 23.49s/it] 60%|█████▉ | 1491/2500 [12:48:07<6:38:14, 23.68s/it] {'loss': 0.1075, 'grad_norm': 8.581789436810896, 'learning_rate': 4.036e-07, 'completion_length': 69.32143020629883, 'rewards/accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.821428656578064, 'reward_std': 0.13622548431158066, 'kl': 2.6875, 'epoch': 0.6} 60%|█████▉ | 1491/2500 [12:48:07<6:38:14, 23.68s/it] 60%|█████▉ | 1492/2500 [12:48:31<6:35:46, 23.56s/it] {'loss': 0.0204, 'grad_norm': 4.495573826702255, 'learning_rate': 4.032e-07, 'completion_length': 75.50893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.50927734375, 'epoch': 0.6} 60%|█████▉ | 1492/2500 [12:48:31<6:35:46, 23.56s/it] 60%|█████▉ | 1493/2500 [12:48:56<6:43:57, 24.07s/it] {'loss': 0.0177, 'grad_norm': 1.2854447505592574, 'learning_rate': 4.028e-07, 'completion_length': 
73.06250381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.442138671875, 'epoch': 0.6} 60%|█████▉ | 1493/2500 [12:48:56<6:43:57, 24.07s/it] 60%|█████▉ | 1494/2500 [12:49:19<6:40:34, 23.89s/it] {'loss': 0.0032, 'grad_norm': 1.0619590715203728, 'learning_rate': 4.0239999999999995e-07, 'completion_length': 68.2589340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.0810546875, 'epoch': 0.6} 60%|█████▉ | 1494/2500 [12:49:19<6:40:34, 23.89s/it] 60%|█████▉ | 1495/2500 [12:49:46<6:56:06, 24.84s/it] {'loss': 0.0032, 'grad_norm': 0.09303540048613156, 'learning_rate': 4.02e-07, 'completion_length': 76.9910774230957, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.078857421875, 'epoch': 0.6} 60%|█████▉ | 1495/2500 [12:49:46<6:56:06, 24.84s/it] 60%|█████▉ | 1496/2500 [12:50:11<6:52:09, 24.63s/it] {'loss': 0.0164, 'grad_norm': 6.123804210159077, 'learning_rate': 4.016e-07, 'completion_length': 67.37500381469727, 'rewards/accuracy_reward': 0.866071492433548, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.794642984867096, 'reward_std': 0.025253813713788986, 'kl': 0.4111328125, 'epoch': 0.6} 60%|█████▉ | 1496/2500 [12:50:11<6:52:09, 24.63s/it] 60%|█████▉ | 1497/2500 [12:50:35<6:50:40, 24.57s/it] {'loss': 0.009, 'grad_norm': 11.45896543509154, 'learning_rate': 4.0119999999999997e-07, 'completion_length': 72.40178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9553571939468384, 'reward_std': 0.06543753296136856, 'kl': 0.225830078125, 'epoch': 0.6} 60%|█████▉ | 1497/2500 [12:50:35<6:50:40, 24.57s/it] 60%|█████▉ | 1498/2500 [12:50:59<6:48:43, 24.47s/it] {'loss': 0.022, 'grad_norm': 1.8198631845688689, 'learning_rate': 
4.008e-07, 'completion_length': 67.97322082519531, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.548583984375, 'epoch': 0.6} 60%|█████▉ | 1498/2500 [12:50:59<6:48:43, 24.47s/it] 60%|█████▉ | 1499/2500 [12:51:25<6:52:54, 24.75s/it] {'loss': 0.0041, 'grad_norm': 9.838988656429667, 'learning_rate': 4.0039999999999996e-07, 'completion_length': 68.66071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9642857313156128, 'reward_std': 0.05399492755532265, 'kl': 0.1015625, 'epoch': 0.6} 60%|█████▉ | 1499/2500 [12:51:25<6:52:54, 24.75s/it] 60%|██████ | 1500/2500 [12:51:49<6:48:45, 24.53s/it] {'loss': 0.0032, 'grad_norm': 0.7551305009742855, 'learning_rate': 4e-07, 'completion_length': 61.16071701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.080078125, 'epoch': 0.6} 60%|██████ | 1500/2500 [12:51:49<6:48:45, 24.53s/it] 60%|██████ | 1501/2500 [12:53:03<10:55:05, 39.34s/it] {'loss': 0.0033, 'grad_norm': 0.728549312637124, 'learning_rate': 3.996e-07, 'completion_length': 74.33929061889648, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.083740234375, 'epoch': 0.6} 60%|██████ | 1501/2500 [12:53:03<10:55:05, 39.34s/it] 60%|██████ | 1502/2500 [12:53:36<10:25:40, 37.62s/it] {'loss': 0.0039, 'grad_norm': 0.16413902042957665, 'learning_rate': 3.992e-07, 'completion_length': 69.46428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0986328125, 'epoch': 0.6} 60%|██████ | 1502/2500 [12:53:36<10:25:40, 37.62s/it] 60%|██████ | 1503/2500 [12:54:05<9:42:48, 35.07s/it] {'loss': 0.003, 'grad_norm': 3.192302333710487, 'learning_rate': 
3.9879999999999994e-07, 'completion_length': 69.30357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.074951171875, 'epoch': 0.6} 60%|██████ | 1503/2500 [12:54:05<9:42:48, 35.07s/it] 60%|██████ | 1504/2500 [12:54:55<10:54:26, 39.42s/it] {'loss': 0.0031, 'grad_norm': 0.19112920382136367, 'learning_rate': 3.9839999999999997e-07, 'completion_length': 66.81250381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0765380859375, 'epoch': 0.6} 60%|██████ | 1504/2500 [12:54:55<10:54:26, 39.42s/it] 60%|██████ | 1505/2500 [12:55:18<9:34:36, 34.65s/it] {'loss': 0.0035, 'grad_norm': 1.253121515286527, 'learning_rate': 3.98e-07, 'completion_length': 70.54464721679688, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.087890625, 'epoch': 0.6} 60%|██████ | 1505/2500 [12:55:18<9:34:36, 34.65s/it] 60%|██████ | 1506/2500 [12:56:10<10:57:41, 39.70s/it] {'loss': 0.0032, 'grad_norm': 0.7711352534688003, 'learning_rate': 3.976e-07, 'completion_length': 70.20536041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.0791015625, 'epoch': 0.6} 60%|██████ | 1506/2500 [12:56:10<10:57:41, 39.70s/it] 60%|██████ | 1507/2500 [12:57:22<13:36:22, 49.33s/it] {'loss': 0.0037, 'grad_norm': 0.16658191383412788, 'learning_rate': 3.972e-07, 'completion_length': 57.767860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09326171875, 'epoch': 0.6} 60%|██████ | 1507/2500 [12:57:22<13:36:22, 49.33s/it] 60%|██████ | 1508/2500 [12:58:08<13:21:27, 48.48s/it] {'loss': 0.0042, 'grad_norm': 0.1397206087519346, 'learning_rate': 3.9679999999999995e-07, 'completion_length': 
70.20536041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.104736328125, 'epoch': 0.6} 60%|██████ | 1508/2500 [12:58:08<13:21:27, 48.48s/it] 60%|██████ | 1509/2500 [12:58:34<11:30:50, 41.83s/it] {'loss': 0.0041, 'grad_norm': 0.9892124404273472, 'learning_rate': 3.964e-07, 'completion_length': 62.78571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.101806640625, 'epoch': 0.6} 60%|██████ | 1509/2500 [12:58:34<11:30:50, 41.83s/it] 60%|██████ | 1510/2500 [12:59:09<10:54:09, 39.65s/it] {'loss': 0.0033, 'grad_norm': 0.14515216710925027, 'learning_rate': 3.96e-07, 'completion_length': 67.87500381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.082763671875, 'epoch': 0.6} 60%|██████ | 1510/2500 [12:59:09<10:54:09, 39.65s/it] 60%|██████ | 1511/2500 [12:59:36<9:52:44, 35.96s/it] {'loss': 0.0036, 'grad_norm': 0.11158098102169156, 'learning_rate': 3.9559999999999997e-07, 'completion_length': 59.87500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0888671875, 'epoch': 0.6} 60%|██████ | 1511/2500 [12:59:36<9:52:44, 35.96s/it] 60%|██████ | 1512/2500 [13:00:32<11:28:07, 41.79s/it] {'loss': 0.0036, 'grad_norm': 1.5827299245513977, 'learning_rate': 3.952e-07, 'completion_length': 66.74107360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.089111328125, 'epoch': 0.6} 60%|██████ | 1512/2500 [13:00:32<11:28:07, 41.79s/it] 61%|██████ | 1513/2500 [13:01:07<10:53:36, 39.73s/it] {'loss': 0.0026, 'grad_norm': 0.20343732208181758, 'learning_rate': 3.9479999999999996e-07, 'completion_length': 72.52679061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 
2.0, 'reward_std': 0.0, 'kl': 0.0654296875, 'epoch': 0.61} 61%|██████ | 1513/2500 [13:01:07<10:53:36, 39.73s/it] 61%|██████ | 1514/2500 [13:01:42<10:33:24, 38.54s/it] {'loss': 0.004, 'grad_norm': 0.16587331169135713, 'learning_rate': 3.9439999999999993e-07, 'completion_length': 68.43750381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.100830078125, 'epoch': 0.61} 61%|██████ | 1514/2500 [13:01:42<10:33:24, 38.54s/it] 61%|██████ | 1515/2500 [13:02:08<9:30:17, 34.74s/it] {'loss': 0.0043, 'grad_norm': 0.19554823412993605, 'learning_rate': 3.94e-07, 'completion_length': 69.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1064453125, 'epoch': 0.61} 61%|██████ | 1515/2500 [13:02:08<9:30:17, 34.74s/it] 61%|██████ | 1516/2500 [13:02:32<8:35:40, 31.44s/it] {'loss': 0.0036, 'grad_norm': 0.21947166420779618, 'learning_rate': 3.936e-07, 'completion_length': 64.45536041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.090576171875, 'epoch': 0.61} 61%|██████ | 1516/2500 [13:02:32<8:35:40, 31.44s/it] 61%|██████ | 1517/2500 [13:02:55<7:55:32, 29.03s/it] {'loss': 0.0032, 'grad_norm': 0.1272411891351078, 'learning_rate': 3.932e-07, 'completion_length': 64.30357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0810546875, 'epoch': 0.61} 61%|██████ | 1517/2500 [13:02:55<7:55:32, 29.03s/it] 61%|██████ | 1518/2500 [13:03:20<7:32:02, 27.62s/it] {'loss': 0.0034, 'grad_norm': 0.13306161970509398, 'learning_rate': 3.9279999999999997e-07, 'completion_length': 68.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0849609375, 'epoch': 0.61} 61%|██████ | 1518/2500 [13:03:20<7:32:02, 27.62s/it] 61%|██████ | 1519/2500 [13:03:45<7:20:00, 26.91s/it] {'loss': 0.0037, 
'grad_norm': 0.17053510545808262, 'learning_rate': 3.924e-07, 'completion_length': 70.12500381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0927734375, 'epoch': 0.61} 61%|██████ | 1519/2500 [13:03:45<7:20:00, 26.91s/it] 61%|██████ | 1520/2500 [13:04:10<7:12:09, 26.46s/it] {'loss': 0.0039, 'grad_norm': 0.1260105342844192, 'learning_rate': 3.92e-07, 'completion_length': 65.93750381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.096435546875, 'epoch': 0.61} 61%|██████ | 1520/2500 [13:04:10<7:12:09, 26.46s/it] 61%|██████ | 1521/2500 [13:04:35<7:02:02, 25.87s/it] {'loss': 0.0039, 'grad_norm': 0.14716527718948152, 'learning_rate': 3.916e-07, 'completion_length': 72.21429061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.097412109375, 'epoch': 0.61} 61%|██████ | 1521/2500 [13:04:35<7:02:02, 25.87s/it] 61%|██████ | 1522/2500 [13:05:29<9:18:25, 34.26s/it] {'loss': 0.0359, 'grad_norm': 10.015623438816561, 'learning_rate': 3.9119999999999996e-07, 'completion_length': 67.51786041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.896484375, 'epoch': 0.61} 61%|██████ | 1522/2500 [13:05:29<9:18:25, 34.26s/it] 61%|██████ | 1523/2500 [13:06:39<12:14:12, 45.09s/it] {'loss': 0.0034, 'grad_norm': 0.13906062941455338, 'learning_rate': 3.908e-07, 'completion_length': 72.2589340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08447265625, 'epoch': 0.61} 61%|██████ | 1523/2500 [13:06:39<12:14:12, 45.09s/it] 61%|██████ | 1524/2500 [13:07:03<10:29:02, 38.67s/it] {'loss': 0.0037, 'grad_norm': 1.4597807385481705, 'learning_rate': 3.904e-07, 'completion_length': 67.09822082519531, 'rewards/accuracy_reward': 0.9464285969734192, 
'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.092529296875, 'epoch': 0.61} 61%|██████ | 1524/2500 [13:07:03<10:29:02, 38.67s/it] 61%|██████ | 1525/2500 [13:07:28<9:21:41, 34.57s/it] {'loss': 0.003, 'grad_norm': 1.2334623663916053, 'learning_rate': 3.8999999999999997e-07, 'completion_length': 75.09821701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.073974609375, 'epoch': 0.61} 61%|██████ | 1525/2500 [13:07:28<9:21:41, 34.57s/it] 61%|██████ | 1526/2500 [13:07:54<8:42:33, 32.19s/it] {'loss': 0.003, 'grad_norm': 0.16917713854321217, 'learning_rate': 3.896e-07, 'completion_length': 68.65178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.073974609375, 'epoch': 0.61} 61%|██████ | 1526/2500 [13:07:54<8:42:33, 32.19s/it] 61%|██████ | 1527/2500 [13:08:18<8:00:47, 29.65s/it] {'loss': 0.0037, 'grad_norm': 0.1828402036721452, 'learning_rate': 3.8919999999999996e-07, 'completion_length': 68.58036041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09228515625, 'epoch': 0.61} 61%|██████ | 1527/2500 [13:08:18<8:00:47, 29.65s/it] 61%|██████ | 1528/2500 [13:08:48<7:59:14, 29.58s/it] {'loss': 0.0156, 'grad_norm': 1.5969303031219644, 'learning_rate': 3.888e-07, 'completion_length': 66.35714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9553571939468384, 'reward_std': 0.06543753296136856, 'kl': 0.38916015625, 'epoch': 0.61} 61%|██████ | 1528/2500 [13:08:48<7:59:14, 29.58s/it] 61%|██████ | 1529/2500 [13:09:12<7:33:19, 28.01s/it] {'loss': 0.003, 'grad_norm': 0.2631474164528452, 'learning_rate': 3.884e-07, 'completion_length': 65.15178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 
'reward_std': 0.0, 'kl': 0.076416015625, 'epoch': 0.61} 61%|██████ | 1529/2500 [13:09:12<7:33:19, 28.01s/it] 61%|██████ | 1530/2500 [13:09:36<7:14:08, 26.85s/it] {'loss': 0.004, 'grad_norm': 0.9364191868769529, 'learning_rate': 3.88e-07, 'completion_length': 66.35714721679688, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.10107421875, 'epoch': 0.61} 61%|██████ | 1530/2500 [13:09:36<7:14:08, 26.85s/it] 61%|██████ | 1531/2500 [13:10:00<7:01:30, 26.10s/it] {'loss': 0.0043, 'grad_norm': 0.16264238134441594, 'learning_rate': 3.876e-07, 'completion_length': 61.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.108642578125, 'epoch': 0.61} 61%|██████ | 1531/2500 [13:10:00<7:01:30, 26.10s/it] 61%|██████▏ | 1532/2500 [13:10:23<6:42:58, 24.98s/it] {'loss': 0.0036, 'grad_norm': 1.3847381145663828, 'learning_rate': 3.8719999999999997e-07, 'completion_length': 60.63393020629883, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.090087890625, 'epoch': 0.61} 61%|██████▏ | 1532/2500 [13:10:23<6:42:58, 24.98s/it] 61%|██████▏ | 1533/2500 [13:11:05<8:05:08, 30.10s/it] {'loss': 0.0057, 'grad_norm': 2.723428613246699, 'learning_rate': 3.8679999999999994e-07, 'completion_length': 68.02679061889648, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.14208984375, 'epoch': 0.61} 61%|██████▏ | 1533/2500 [13:11:05<8:05:08, 30.10s/it] 61%|██████▏ | 1534/2500 [13:11:31<7:44:16, 28.84s/it] {'loss': 0.0194, 'grad_norm': 8.577863037441626, 'learning_rate': 3.864e-07, 'completion_length': 57.901790618896484, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 
1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.485595703125, 'epoch': 0.61} 61%|██████▏ | 1534/2500 [13:11:31<7:44:16, 28.84s/it] 61%|██████▏ | 1535/2500 [13:11:56<7:25:59, 27.73s/it] {'loss': 0.0036, 'grad_norm': 0.12210107282285718, 'learning_rate': 3.86e-07, 'completion_length': 70.98214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.090576171875, 'epoch': 0.61} 61%|██████▏ | 1535/2500 [13:11:56<7:25:59, 27.73s/it] 61%|██████▏ | 1536/2500 [13:12:20<7:10:03, 26.77s/it] {'loss': 0.0049, 'grad_norm': 0.24524474621005488, 'learning_rate': 3.8559999999999996e-07, 'completion_length': 67.03571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12255859375, 'epoch': 0.61} 61%|██████▏ | 1536/2500 [13:12:20<7:10:03, 26.77s/it] 61%|██████▏ | 1537/2500 [13:12:46<7:02:00, 26.29s/it] {'loss': 0.0038, 'grad_norm': 1.0709770905586151, 'learning_rate': 3.852e-07, 'completion_length': 66.60714721679688, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.095703125, 'epoch': 0.61} 61%|██████▏ | 1537/2500 [13:12:46<7:02:00, 26.29s/it] 62%|██████▏ | 1538/2500 [13:13:09<6:48:12, 25.46s/it] {'loss': 0.0059, 'grad_norm': 2.372524635339071, 'learning_rate': 3.8479999999999995e-07, 'completion_length': 62.61607360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.1474609375, 'epoch': 0.62} 62%|██████▏ | 1538/2500 [13:13:09<6:48:12, 25.46s/it] 62%|██████▏ | 1539/2500 [13:13:36<6:55:47, 25.96s/it] {'loss': 0.004, 'grad_norm': 1.901827536665394, 'learning_rate': 3.8440000000000003e-07, 'completion_length': 61.31250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 
'reward_std': 0.025253813713788986, 'kl': 0.099365234375, 'epoch': 0.62} 62%|██████▏ | 1539/2500 [13:13:36<6:55:47, 25.96s/it] 62%|██████▏ | 1540/2500 [13:14:01<6:51:26, 25.71s/it] {'loss': 0.0104, 'grad_norm': 1.5303502142489405, 'learning_rate': 3.84e-07, 'completion_length': 73.1785774230957, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.257568359375, 'epoch': 0.62} 62%|██████▏ | 1540/2500 [13:14:01<6:51:26, 25.71s/it] 62%|██████▏ | 1541/2500 [13:14:25<6:43:21, 25.24s/it] {'loss': 0.0035, 'grad_norm': 0.15147887719762007, 'learning_rate': 3.8359999999999997e-07, 'completion_length': 67.27679061889648, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08837890625, 'epoch': 0.62} 62%|██████▏ | 1541/2500 [13:14:25<6:43:21, 25.24s/it] 62%|██████▏ | 1542/2500 [13:14:53<6:55:53, 26.05s/it] {'loss': 0.0045, 'grad_norm': 0.1412343084479251, 'learning_rate': 3.832e-07, 'completion_length': 66.11607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.112060546875, 'epoch': 0.62} 62%|██████▏ | 1542/2500 [13:14:53<6:55:53, 26.05s/it] 62%|██████▏ | 1543/2500 [13:15:29<7:41:44, 28.95s/it] {'loss': 0.0036, 'grad_norm': 1.800905832932526, 'learning_rate': 3.8279999999999996e-07, 'completion_length': 70.29464721679688, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.089111328125, 'epoch': 0.62} 62%|██████▏ | 1543/2500 [13:15:29<7:41:44, 28.95s/it] 62%|██████▏ | 1544/2500 [13:15:53<7:18:02, 27.49s/it] {'loss': 0.013, 'grad_norm': 4.412544559528582, 'learning_rate': 3.824e-07, 'completion_length': 56.71428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 
0.05050762742757797, 'kl': 0.32421875, 'epoch': 0.62} 62%|██████▏ | 1544/2500 [13:15:53<7:18:02, 27.49s/it] 62%|██████▏ | 1545/2500 [13:16:17<6:58:23, 26.29s/it] {'loss': 0.0039, 'grad_norm': 0.18222134163052048, 'learning_rate': 3.82e-07, 'completion_length': 70.04464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09716796875, 'epoch': 0.62} 62%|██████▏ | 1545/2500 [13:16:17<6:58:23, 26.29s/it] 62%|██████▏ | 1546/2500 [13:16:42<6:54:49, 26.09s/it] {'loss': 0.0036, 'grad_norm': 2.2618343383778434, 'learning_rate': 3.816e-07, 'completion_length': 61.80357551574707, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.08935546875, 'epoch': 0.62} 62%|██████▏ | 1546/2500 [13:16:42<6:54:49, 26.09s/it] 62%|██████▏ | 1547/2500 [13:17:07<6:47:40, 25.67s/it] {'loss': 0.0039, 'grad_norm': 0.15780777492819295, 'learning_rate': 3.8119999999999995e-07, 'completion_length': 66.83928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.096435546875, 'epoch': 0.62} 62%|██████▏ | 1547/2500 [13:17:07<6:47:40, 25.67s/it] 62%|██████▏ | 1548/2500 [13:17:31<6:39:28, 25.18s/it] {'loss': 0.0046, 'grad_norm': 0.2527946040061849, 'learning_rate': 3.808e-07, 'completion_length': 65.04464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114501953125, 'epoch': 0.62} 62%|██████▏ | 1548/2500 [13:17:31<6:39:28, 25.18s/it] 62%|██████▏ | 1549/2500 [13:17:55<6:33:04, 24.80s/it] {'loss': 0.0046, 'grad_norm': 2.6859949653731015, 'learning_rate': 3.804e-07, 'completion_length': 64.25893211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.115478515625, 'epoch': 0.62} 62%|██████▏ | 1549/2500 [13:17:55<6:33:04, 
24.80s/it] 62%|██████▏ | 1550/2500 [13:18:19<6:28:07, 24.51s/it] {'loss': 0.0039, 'grad_norm': 0.6600806477281813, 'learning_rate': 3.7999999999999996e-07, 'completion_length': 70.04464721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.098388671875, 'epoch': 0.62} 62%|██████▏ | 1550/2500 [13:18:19<6:28:07, 24.51s/it] 62%|██████▏ | 1551/2500 [13:18:42<6:23:46, 24.26s/it] {'loss': 0.004, 'grad_norm': 0.23490214610035587, 'learning_rate': 3.796e-07, 'completion_length': 59.080360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.100830078125, 'epoch': 0.62} 62%|██████▏ | 1551/2500 [13:18:43<6:23:46, 24.26s/it] 62%|██████▏ | 1552/2500 [13:19:06<6:19:20, 24.01s/it] {'loss': 0.0052, 'grad_norm': 0.37895879785855013, 'learning_rate': 3.7919999999999995e-07, 'completion_length': 66.41964340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12939453125, 'epoch': 0.62} 62%|██████▏ | 1552/2500 [13:19:06<6:19:20, 24.01s/it] 62%|██████▏ | 1553/2500 [13:19:30<6:18:07, 23.96s/it] {'loss': 0.0037, 'grad_norm': 1.7555480047355765, 'learning_rate': 3.7880000000000003e-07, 'completion_length': 60.517860412597656, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9017857313156128, 'reward_std': 0.03696779906749725, 'kl': 0.093505859375, 'epoch': 0.62} 62%|██████▏ | 1553/2500 [13:19:30<6:18:07, 23.96s/it] 62%|██████▏ | 1554/2500 [13:19:55<6:24:12, 24.37s/it] {'loss': 0.0041, 'grad_norm': 1.2556742129503833, 'learning_rate': 3.784e-07, 'completion_length': 70.81250381469727, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.101806640625, 'epoch': 0.62} 62%|██████▏ | 1554/2500 [13:19:55<6:24:12, 24.37s/it] 62%|██████▏ | 1555/2500 [13:20:18<6:17:38, 23.98s/it] 
{'loss': 0.0057, 'grad_norm': 0.9367732202554531, 'learning_rate': 3.7799999999999997e-07, 'completion_length': 64.92857360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.142578125, 'epoch': 0.62} 62%|██████▏ | 1555/2500 [13:20:18<6:17:38, 23.98s/it] 62%|██████▏ | 1556/2500 [13:20:43<6:21:32, 24.25s/it] {'loss': 0.005, 'grad_norm': 5.545691500091435, 'learning_rate': 3.776e-07, 'completion_length': 65.06250190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.125244140625, 'epoch': 0.62} 62%|██████▏ | 1556/2500 [13:20:43<6:21:32, 24.25s/it] 62%|██████▏ | 1557/2500 [13:21:08<6:22:20, 24.33s/it] {'loss': 0.007, 'grad_norm': 0.2757250716286916, 'learning_rate': 3.7719999999999996e-07, 'completion_length': 60.41071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.17529296875, 'epoch': 0.62} 62%|██████▏ | 1557/2500 [13:21:08<6:22:20, 24.33s/it] 62%|██████▏ | 1558/2500 [13:21:32<6:20:52, 24.26s/it] {'loss': 0.0044, 'grad_norm': 0.12948073015039505, 'learning_rate': 3.768e-07, 'completion_length': 65.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10888671875, 'epoch': 0.62} 62%|██████▏ | 1558/2500 [13:21:32<6:20:52, 24.26s/it] 62%|██████▏ | 1559/2500 [13:21:57<6:24:38, 24.53s/it] {'loss': 0.0056, 'grad_norm': 0.24563585589488715, 'learning_rate': 3.764e-07, 'completion_length': 58.705360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.140625, 'epoch': 0.62} 62%|██████▏ | 1559/2500 [13:21:57<6:24:38, 24.53s/it] 62%|██████▏ | 1560/2500 [13:22:22<6:25:31, 24.61s/it] {'loss': 0.0089, 'grad_norm': 2.1784015408872834, 'learning_rate': 3.76e-07, 
'completion_length': 62.98214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.22216796875, 'epoch': 0.62} 62%|██████▏ | 1560/2500 [13:22:22<6:25:31, 24.61s/it] 62%|██████▏ | 1561/2500 [13:22:46<6:22:48, 24.46s/it] {'loss': 0.0052, 'grad_norm': 0.20009258035089136, 'learning_rate': 3.7559999999999995e-07, 'completion_length': 62.32143211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.130615234375, 'epoch': 0.62} 62%|██████▏ | 1561/2500 [13:22:46<6:22:48, 24.46s/it] 62%|██████▏ | 1562/2500 [13:23:26<7:38:38, 29.34s/it] {'loss': 0.0043, 'grad_norm': 0.1896239203501147, 'learning_rate': 3.7519999999999997e-07, 'completion_length': 65.35714721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10791015625, 'epoch': 0.62} 62%|██████▏ | 1562/2500 [13:23:26<7:38:38, 29.34s/it] 63%|██████▎ | 1563/2500 [13:23:53<7:23:40, 28.41s/it] {'loss': 0.006, 'grad_norm': 1.2233428444816759, 'learning_rate': 3.748e-07, 'completion_length': 64.76786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.150390625, 'epoch': 0.63} 63%|██████▎ | 1563/2500 [13:23:53<7:23:40, 28.41s/it] 63%|██████▎ | 1564/2500 [13:24:18<7:07:05, 27.38s/it] {'loss': 0.0033, 'grad_norm': 0.1319159101282365, 'learning_rate': 3.744e-07, 'completion_length': 63.562503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0823974609375, 'epoch': 0.63} 63%|██████▎ | 1564/2500 [13:24:18<7:07:05, 27.38s/it] 63%|██████▎ | 1565/2500 [13:24:43<6:57:44, 26.81s/it] {'loss': 0.0039, 'grad_norm': 0.12593824782809218, 'learning_rate': 3.74e-07, 'completion_length': 75.71428680419922, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.097412109375, 'epoch': 0.63} 63%|██████▎ | 1565/2500 [13:24:43<6:57:44, 26.81s/it] 63%|██████▎ | 1566/2500 [13:25:09<6:52:31, 26.50s/it] {'loss': 0.0058, 'grad_norm': 0.3295153566464491, 'learning_rate': 3.7359999999999996e-07, 'completion_length': 60.73214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.145263671875, 'epoch': 0.63} 63%|██████▎ | 1566/2500 [13:25:09<6:52:31, 26.50s/it] 63%|██████▎ | 1567/2500 [13:25:35<6:50:15, 26.38s/it] {'loss': 0.0044, 'grad_norm': 0.7662095560505817, 'learning_rate': 3.732e-07, 'completion_length': 69.50893020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1103515625, 'epoch': 0.63} 63%|██████▎ | 1567/2500 [13:25:35<6:50:15, 26.38s/it] 63%|██████▎ | 1568/2500 [13:26:03<6:59:11, 26.99s/it] {'loss': 0.004, 'grad_norm': 0.20024006671018807, 'learning_rate': 3.728e-07, 'completion_length': 75.83036041259766, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.100341796875, 'epoch': 0.63} 63%|██████▎ | 1568/2500 [13:26:03<6:59:11, 26.99s/it] 63%|██████▎ | 1569/2500 [13:26:29<6:50:08, 26.43s/it] {'loss': 0.0047, 'grad_norm': 0.14818966410366374, 'learning_rate': 3.7239999999999997e-07, 'completion_length': 60.455360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.118408203125, 'epoch': 0.63} 63%|██████▎ | 1569/2500 [13:26:29<6:50:08, 26.43s/it] 63%|██████▎ | 1570/2500 [13:26:54<6:45:04, 26.13s/it] {'loss': 0.0045, 'grad_norm': 0.1604368671401126, 'learning_rate': 3.72e-07, 'completion_length': 70.65178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11279296875, 'epoch': 0.63} 63%|██████▎ | 1570/2500 
[13:26:54<6:45:04, 26.13s/it] 63%|██████▎ | 1571/2500 [13:27:20<6:46:09, 26.23s/it] {'loss': 0.0048, 'grad_norm': 1.2997126213512333, 'learning_rate': 3.7159999999999997e-07, 'completion_length': 66.72321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.119140625, 'epoch': 0.63} 63%|██████▎ | 1571/2500 [13:27:20<6:46:09, 26.23s/it] 63%|██████▎ | 1572/2500 [13:27:45<6:39:13, 25.81s/it] {'loss': 0.0098, 'grad_norm': 1.1709693695772834, 'learning_rate': 3.7119999999999994e-07, 'completion_length': 66.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.24609375, 'epoch': 0.63} 63%|██████▎ | 1572/2500 [13:27:45<6:39:13, 25.81s/it] 63%|██████▎ | 1573/2500 [13:28:09<6:29:08, 25.19s/it] {'loss': 0.004, 'grad_norm': 0.7656155581970064, 'learning_rate': 3.708e-07, 'completion_length': 65.25000381469727, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.099609375, 'epoch': 0.63} 63%|██████▎ | 1573/2500 [13:28:09<6:29:08, 25.19s/it] 63%|██████▎ | 1574/2500 [13:28:33<6:23:48, 24.87s/it] {'loss': 0.0032, 'grad_norm': 0.1250605713747409, 'learning_rate': 3.704e-07, 'completion_length': 63.750003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0791015625, 'epoch': 0.63} 63%|██████▎ | 1574/2500 [13:28:33<6:23:48, 24.87s/it] 63%|██████▎ | 1575/2500 [13:28:59<6:27:00, 25.10s/it] {'loss': 0.0046, 'grad_norm': 0.18872996968143013, 'learning_rate': 3.7e-07, 'completion_length': 66.31250381469727, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115234375, 'epoch': 0.63} 63%|██████▎ | 1575/2500 [13:28:59<6:27:00, 25.10s/it] 63%|██████▎ | 1576/2500 
[13:29:34<7:11:14, 28.00s/it] {'loss': 0.0054, 'grad_norm': 2.1177412811526786, 'learning_rate': 3.696e-07, 'completion_length': 61.062503814697266, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9464285969734192, 'reward_std': 0.13408026099205017, 'kl': 0.13525390625, 'epoch': 0.63} 63%|██████▎ | 1576/2500 [13:29:34<7:11:14, 28.00s/it] 63%|██████▎ | 1577/2500 [13:30:10<7:49:48, 30.54s/it] {'loss': 0.0051, 'grad_norm': 0.193305761974216, 'learning_rate': 3.6919999999999994e-07, 'completion_length': 61.93750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12744140625, 'epoch': 0.63} 63%|██████▎ | 1577/2500 [13:30:10<7:49:48, 30.54s/it] 63%|██████▎ | 1578/2500 [13:30:47<8:20:14, 32.55s/it] {'loss': 0.004, 'grad_norm': 0.1392134608273183, 'learning_rate': 3.688e-07, 'completion_length': 66.01785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.099609375, 'epoch': 0.63} 63%|██████▎ | 1578/2500 [13:30:47<8:20:14, 32.55s/it] 63%|██████▎ | 1579/2500 [13:31:16<8:02:29, 31.43s/it] {'loss': 0.0042, 'grad_norm': 1.489913340441126, 'learning_rate': 3.684e-07, 'completion_length': 67.26786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.10546875, 'epoch': 0.63} 63%|██████▎ | 1579/2500 [13:31:16<8:02:29, 31.43s/it] 63%|██████▎ | 1580/2500 [13:31:41<7:33:15, 29.56s/it] {'loss': 0.0056, 'grad_norm': 1.7345107807288938, 'learning_rate': 3.6799999999999996e-07, 'completion_length': 64.76786041259766, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0739355981349945, 'kl': 0.140869140625, 'epoch': 0.63} 63%|██████▎ | 1580/2500 [13:31:41<7:33:15, 29.56s/it] 63%|██████▎ | 1581/2500 [13:32:10<7:29:54, 29.37s/it] 
{'loss': 0.0057, 'grad_norm': 1.2975821415789979, 'learning_rate': 3.676e-07, 'completion_length': 72.37500381469727, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.142578125, 'epoch': 0.63} 63%|██████▎ | 1581/2500 [13:32:10<7:29:54, 29.37s/it] 63%|██████▎ | 1582/2500 [13:32:49<8:14:23, 32.31s/it] {'loss': 0.0038, 'grad_norm': 0.237085898794575, 'learning_rate': 3.672e-07, 'completion_length': 61.34821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09619140625, 'epoch': 0.63} 63%|██████▎ | 1582/2500 [13:32:49<8:14:23, 32.31s/it] 63%|██████▎ | 1583/2500 [13:34:09<11:52:55, 46.65s/it] {'loss': 0.0036, 'grad_norm': 1.2890458787552703, 'learning_rate': 3.668e-07, 'completion_length': 70.26786041259766, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.090576171875, 'epoch': 0.63} 63%|██████▎ | 1583/2500 [13:34:09<11:52:55, 46.65s/it] 63%|██████▎ | 1584/2500 [13:34:34<10:09:43, 39.94s/it] {'loss': 0.0068, 'grad_norm': 1.7291024240027, 'learning_rate': 3.664e-07, 'completion_length': 65.89286041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.17041015625, 'epoch': 0.63} 63%|██████▎ | 1584/2500 [13:34:34<10:09:43, 39.94s/it] 63%|██████▎ | 1585/2500 [13:34:59<8:59:55, 35.41s/it] {'loss': 0.0052, 'grad_norm': 0.3750994766272289, 'learning_rate': 3.6599999999999997e-07, 'completion_length': 58.580360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13134765625, 'epoch': 0.63} 63%|██████▎ | 1585/2500 [13:34:59<8:59:55, 35.41s/it] 63%|██████▎ | 1586/2500 [13:35:23<8:08:44, 32.08s/it] {'loss': 0.0087, 'grad_norm': 0.8588239299836918, 
'learning_rate': 3.6559999999999994e-07, 'completion_length': 57.54464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2158203125, 'epoch': 0.63} 63%|██████▎ | 1586/2500 [13:35:23<8:08:44, 32.08s/it] 63%|██████▎ | 1587/2500 [13:35:57<8:18:13, 32.74s/it] {'loss': 0.0054, 'grad_norm': 1.2359011572815641, 'learning_rate': 3.652e-07, 'completion_length': 61.642860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13427734375, 'epoch': 0.63} 63%|██████▎ | 1587/2500 [13:35:57<8:18:13, 32.74s/it] 64%|██████▎ | 1588/2500 [13:36:27<8:03:52, 31.83s/it] {'loss': 0.0047, 'grad_norm': 2.0307156073520622, 'learning_rate': 3.648e-07, 'completion_length': 64.49107360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.11865234375, 'epoch': 0.64} 64%|██████▎ | 1588/2500 [13:36:27<8:03:52, 31.83s/it] 64%|██████▎ | 1589/2500 [13:36:55<7:47:07, 30.77s/it] {'loss': 0.0049, 'grad_norm': 0.17338235495364315, 'learning_rate': 3.644e-07, 'completion_length': 63.29464340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.121826171875, 'epoch': 0.64} 64%|██████▎ | 1589/2500 [13:36:55<7:47:07, 30.77s/it] 64%|██████▎ | 1590/2500 [13:37:39<8:46:12, 34.69s/it] {'loss': 0.0058, 'grad_norm': 1.9368228147741309, 'learning_rate': 3.64e-07, 'completion_length': 60.40178871154785, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.1435546875, 'epoch': 0.64} 64%|██████▎ | 1590/2500 [13:37:39<8:46:12, 34.69s/it] 64%|██████▎ | 1591/2500 [13:38:03<7:56:54, 31.48s/it] {'loss': 0.0045, 'grad_norm': 
0.13227781174940756, 'learning_rate': 3.6359999999999995e-07, 'completion_length': 62.357147216796875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11181640625, 'epoch': 0.64} 64%|██████▎ | 1591/2500 [13:38:03<7:56:54, 31.48s/it] 64%|██████▎ | 1592/2500 [13:38:42<8:30:34, 33.74s/it] {'loss': 0.0121, 'grad_norm': 2.5484913684979857, 'learning_rate': 3.632e-07, 'completion_length': 63.29464530944824, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.946428656578064, 'reward_std': 0.1289060041308403, 'kl': 0.3037109375, 'epoch': 0.64} 64%|██████▎ | 1592/2500 [13:38:42<8:30:34, 33.74s/it] 64%|██████▎ | 1593/2500 [13:39:07<7:52:12, 31.24s/it] {'loss': 0.0126, 'grad_norm': 4.417324512052419, 'learning_rate': 3.628e-07, 'completion_length': 61.86607360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3154296875, 'epoch': 0.64} 64%|██████▎ | 1593/2500 [13:39:07<7:52:12, 31.24s/it] 64%|██████▍ | 1594/2500 [13:39:37<7:43:38, 30.70s/it] {'loss': 0.0049, 'grad_norm': 0.1844175575047092, 'learning_rate': 3.6239999999999996e-07, 'completion_length': 62.84821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.123291015625, 'epoch': 0.64} 64%|██████▍ | 1594/2500 [13:39:37<7:43:38, 30.70s/it] 64%|██████▍ | 1595/2500 [13:40:03<7:21:53, 29.30s/it] {'loss': 0.0057, 'grad_norm': 1.5497532553187499, 'learning_rate': 3.62e-07, 'completion_length': 67.3035774230957, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.141845703125, 'epoch': 0.64} 64%|██████▍ | 1595/2500 [13:40:03<7:21:53, 29.30s/it] 64%|██████▍ | 1596/2500 [13:40:28<7:02:08, 28.02s/it] {'loss': 0.0042, 'grad_norm': 
1.081399202059292, 'learning_rate': 3.6159999999999996e-07, 'completion_length': 65.33929061889648, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.10498046875, 'epoch': 0.64} 64%|██████▍ | 1596/2500 [13:40:28<7:02:08, 28.02s/it] 64%|██████▍ | 1597/2500 [13:41:02<7:26:58, 29.70s/it] {'loss': 0.0109, 'grad_norm': 1.8449755683193672, 'learning_rate': 3.612e-07, 'completion_length': 65.37500381469727, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.14579425007104874, 'kl': 0.2744140625, 'epoch': 0.64} 64%|██████▍ | 1597/2500 [13:41:02<7:26:58, 29.70s/it] 64%|██████▍ | 1598/2500 [13:41:25<7:00:15, 27.96s/it] {'loss': 0.0058, 'grad_norm': 1.0625173266226589, 'learning_rate': 3.608e-07, 'completion_length': 60.05357551574707, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.14453125, 'epoch': 0.64} 64%|██████▍ | 1598/2500 [13:41:25<7:00:15, 27.96s/it] 64%|██████▍ | 1599/2500 [13:41:50<6:42:55, 26.83s/it] {'loss': 0.0048, 'grad_norm': 0.19692191593770808, 'learning_rate': 3.6039999999999997e-07, 'completion_length': 60.267860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.120361328125, 'epoch': 0.64} 64%|██████▍ | 1599/2500 [13:41:50<6:42:55, 26.83s/it] 64%|██████▍ | 1600/2500 [13:42:15<6:36:18, 26.42s/it] {'loss': 0.0051, 'grad_norm': 0.1606234303659649, 'learning_rate': 3.6e-07, 'completion_length': 56.43750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.128173828125, 'epoch': 0.64} 64%|██████▍ | 1600/2500 [13:42:15<6:36:18, 26.42s/it] 64%|██████▍ | 1601/2500 [13:43:16<9:12:40, 36.89s/it] {'loss': 0.0058, 'grad_norm': 4.736147098435262, 
'learning_rate': 3.5959999999999996e-07, 'completion_length': 55.04464530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.14453125, 'epoch': 0.64} 64%|██████▍ | 1601/2500 [13:43:16<9:12:40, 36.89s/it] 64%|██████▍ | 1602/2500 [13:44:18<11:03:00, 44.30s/it] {'loss': 0.0045, 'grad_norm': 1.809158807589496, 'learning_rate': 3.592e-07, 'completion_length': 63.392860412597656, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.11279296875, 'epoch': 0.64} 64%|██████▍ | 1602/2500 [13:44:18<11:03:00, 44.30s/it] 64%|██████▍ | 1603/2500 [13:44:45<9:46:00, 39.20s/it] {'loss': 0.0045, 'grad_norm': 3.400860932754363, 'learning_rate': 3.588e-07, 'completion_length': 54.49107360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.11328125, 'epoch': 0.64} 64%|██████▍ | 1603/2500 [13:44:45<9:46:00, 39.20s/it] 64%|██████▍ | 1604/2500 [13:45:48<11:30:04, 46.21s/it] {'loss': 0.0037, 'grad_norm': 0.14835785677147242, 'learning_rate': 3.584e-07, 'completion_length': 58.91071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09228515625, 'epoch': 0.64} 64%|██████▍ | 1604/2500 [13:45:48<11:30:04, 46.21s/it] 64%|██████▍ | 1605/2500 [13:46:11<9:46:10, 39.30s/it] {'loss': 0.005, 'grad_norm': 0.8578068264542921, 'learning_rate': 3.5799999999999995e-07, 'completion_length': 58.40178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.12451171875, 'epoch': 0.64} 64%|██████▍ | 1605/2500 [13:46:11<9:46:10, 39.30s/it] 64%|██████▍ | 1606/2500 [13:46:34<8:34:35, 34.54s/it] {'loss': 0.0052, 'grad_norm': 
1.6464092106283494, 'learning_rate': 3.5759999999999997e-07, 'completion_length': 62.767860412597656, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.12890625, 'epoch': 0.64} 64%|██████▍ | 1606/2500 [13:46:34<8:34:35, 34.54s/it] 64%|██████▍ | 1607/2500 [13:47:00<7:52:53, 31.77s/it] {'loss': 0.0045, 'grad_norm': 0.16157565658371775, 'learning_rate': 3.572e-07, 'completion_length': 56.15178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11181640625, 'epoch': 0.64} 64%|██████▍ | 1607/2500 [13:47:00<7:52:53, 31.77s/it] 64%|██████▍ | 1608/2500 [13:47:24<7:19:01, 29.53s/it] {'loss': 0.006, 'grad_norm': 4.691557081768352, 'learning_rate': 3.5679999999999997e-07, 'completion_length': 55.32143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14990234375, 'epoch': 0.64} 64%|██████▍ | 1608/2500 [13:47:24<7:19:01, 29.53s/it] 64%|██████▍ | 1609/2500 [13:47:48<6:54:26, 27.91s/it] {'loss': 0.004, 'grad_norm': 1.8614913328283555, 'learning_rate': 3.564e-07, 'completion_length': 56.22321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.098876953125, 'epoch': 0.64} 64%|██████▍ | 1609/2500 [13:47:48<6:54:26, 27.91s/it] 64%|██████▍ | 1610/2500 [13:48:28<7:45:43, 31.40s/it] {'loss': 0.007, 'grad_norm': 0.7675332417484764, 'learning_rate': 3.5599999999999996e-07, 'completion_length': 59.098215103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.17431640625, 'epoch': 0.64} 64%|██████▍ | 1610/2500 [13:48:28<7:45:43, 31.40s/it] 64%|██████▍ | 1611/2500 [13:48:53<7:18:24, 29.59s/it] 
{'loss': 0.0075, 'grad_norm': 11.627341849844445, 'learning_rate': 3.5560000000000003e-07, 'completion_length': 60.32143211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.1875, 'epoch': 0.64} 64%|██████▍ | 1611/2500 [13:48:53<7:18:24, 29.59s/it] 64%|██████▍ | 1612/2500 [13:49:16<6:48:22, 27.59s/it] {'loss': 0.018, 'grad_norm': 2.500642415341456, 'learning_rate': 3.552e-07, 'completion_length': 61.25000190734863, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8928571939468384, 'reward_std': 0.1201249361038208, 'kl': 0.4501953125, 'epoch': 0.64} 64%|██████▍ | 1612/2500 [13:49:16<6:48:22, 27.59s/it] 65%|██████▍ | 1613/2500 [13:49:40<6:32:09, 26.53s/it] {'loss': 0.0165, 'grad_norm': 3.6411115133704337, 'learning_rate': 3.548e-07, 'completion_length': 56.38393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.41259765625, 'epoch': 0.65} 65%|██████▍ | 1613/2500 [13:49:40<6:32:09, 26.53s/it] 65%|██████▍ | 1614/2500 [13:50:04<6:18:01, 25.60s/it] {'loss': 0.0209, 'grad_norm': 4.851727836719922, 'learning_rate': 3.544e-07, 'completion_length': 60.29464530944824, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8928572535514832, 'reward_std': 0.1670861840248108, 'kl': 0.5205078125, 'epoch': 0.65} 65%|██████▍ | 1614/2500 [13:50:04<6:18:01, 25.60s/it] 65%|██████▍ | 1615/2500 [13:51:41<11:34:15, 47.07s/it] {'loss': 0.006, 'grad_norm': 1.159659572237781, 'learning_rate': 3.5399999999999997e-07, 'completion_length': 56.55357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.150390625, 'epoch': 0.65} 65%|██████▍ | 
1615/2500 [13:51:41<11:34:15, 47.07s/it] 65%|██████▍ | 1616/2500 [13:52:07<9:59:58, 40.72s/it] {'loss': 0.013, 'grad_norm': 2.560058581849135, 'learning_rate': 3.536e-07, 'completion_length': 52.36607360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.15152287483215332, 'kl': 0.32470703125, 'epoch': 0.65} 65%|██████▍ | 1616/2500 [13:52:07<9:59:58, 40.72s/it] 65%|██████▍ | 1617/2500 [13:52:30<8:41:13, 35.42s/it] {'loss': 0.0104, 'grad_norm': 1.1836184306586914, 'learning_rate': 3.532e-07, 'completion_length': 59.437503814697266, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.26025390625, 'epoch': 0.65} 65%|██████▍ | 1617/2500 [13:52:30<8:41:13, 35.42s/it] 65%|██████▍ | 1618/2500 [13:52:55<7:57:00, 32.45s/it] {'loss': 0.0106, 'grad_norm': 1.994380353835959, 'learning_rate': 3.528e-07, 'completion_length': 54.74107360839844, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.10882644727826118, 'kl': 0.265625, 'epoch': 0.65} 65%|██████▍ | 1618/2500 [13:52:55<7:57:00, 32.45s/it] 65%|██████▍ | 1619/2500 [13:53:19<7:18:53, 29.89s/it] {'loss': 0.0124, 'grad_norm': 1.563115991923645, 'learning_rate': 3.5239999999999995e-07, 'completion_length': 52.19643211364746, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375001192092896, 'reward_std': 0.1379830539226532, 'kl': 0.30908203125, 'epoch': 0.65} 65%|██████▍ | 1619/2500 [13:53:19<7:18:53, 29.89s/it] 65%|██████▍ | 1620/2500 [13:53:42<6:45:46, 27.67s/it] {'loss': 0.0051, 'grad_norm': 4.024020857823671, 'learning_rate': 3.52e-07, 'completion_length': 57.339290618896484, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 
0.025253813713788986, 'kl': 0.12841796875, 'epoch': 0.65} 65%|██████▍ | 1620/2500 [13:53:42<6:45:46, 27.67s/it] 65%|██████▍ | 1621/2500 [13:54:05<6:27:02, 26.42s/it] {'loss': 0.0055, 'grad_norm': 0.2755400964826367, 'learning_rate': 3.516e-07, 'completion_length': 60.73214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13720703125, 'epoch': 0.65} 65%|██████▍ | 1621/2500 [13:54:05<6:27:02, 26.42s/it] 65%|██████▍ | 1622/2500 [13:54:28<6:11:12, 25.37s/it] {'loss': 0.0046, 'grad_norm': 0.17848804249487743, 'learning_rate': 3.512e-07, 'completion_length': 59.26785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114990234375, 'epoch': 0.65} 65%|██████▍ | 1622/2500 [13:54:28<6:11:12, 25.37s/it] 65%|██████▍ | 1623/2500 [13:54:54<6:15:33, 25.69s/it] {'loss': 0.0056, 'grad_norm': 3.4858456733947913, 'learning_rate': 3.508e-07, 'completion_length': 54.330360412597656, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.140380859375, 'epoch': 0.65} 65%|██████▍ | 1623/2500 [13:54:54<6:15:33, 25.69s/it] 65%|██████▍ | 1624/2500 [13:55:18<6:07:44, 25.19s/it] {'loss': 0.0055, 'grad_norm': 0.25938361878692384, 'learning_rate': 3.5039999999999996e-07, 'completion_length': 54.76785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.136962890625, 'epoch': 0.65} 65%|██████▍ | 1624/2500 [13:55:18<6:07:44, 25.19s/it] 65%|██████▌ | 1625/2500 [13:55:42<5:59:44, 24.67s/it] {'loss': 0.0044, 'grad_norm': 0.23123066060126102, 'learning_rate': 3.5e-07, 'completion_length': 60.09821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.109130859375, 'epoch': 0.65} 65%|██████▌ | 1625/2500 [13:55:42<5:59:44, 24.67s/it] 65%|██████▌ | 1626/2500 
[13:56:05<5:52:34, 24.20s/it] {'loss': 0.0285, 'grad_norm': 5.3112279019154744, 'learning_rate': 3.496e-07, 'completion_length': 48.71428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.71044921875, 'epoch': 0.65} 65%|██████▌ | 1626/2500 [13:56:05<5:52:34, 24.20s/it] 65%|██████▌ | 1627/2500 [13:56:29<5:50:25, 24.08s/it] {'loss': 0.0071, 'grad_norm': 1.6081426561406862, 'learning_rate': 3.492e-07, 'completion_length': 61.15178871154785, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.17724609375, 'epoch': 0.65} 65%|██████▌ | 1627/2500 [13:56:29<5:50:25, 24.08s/it] 65%|██████▌ | 1628/2500 [13:56:53<5:50:14, 24.10s/it] {'loss': 0.0053, 'grad_norm': 0.253197260301849, 'learning_rate': 3.488e-07, 'completion_length': 50.39285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.131591796875, 'epoch': 0.65} 65%|██████▌ | 1628/2500 [13:56:53<5:50:14, 24.10s/it] 65%|██████▌ | 1629/2500 [13:57:17<5:48:48, 24.03s/it] {'loss': 0.0051, 'grad_norm': 0.2099900643553055, 'learning_rate': 3.4839999999999997e-07, 'completion_length': 55.375003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12646484375, 'epoch': 0.65} 65%|██████▌ | 1629/2500 [13:57:17<5:48:48, 24.03s/it] 65%|██████▌ | 1630/2500 [13:57:41<5:51:06, 24.21s/it] {'loss': 0.009, 'grad_norm': 1.6432615812537645, 'learning_rate': 3.4799999999999994e-07, 'completion_length': 55.99107551574707, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.223876953125, 'epoch': 0.65} 65%|██████▌ | 1630/2500 [13:57:42<5:51:06, 24.21s/it] 65%|██████▌ | 1631/2500 
[13:58:04<5:42:46, 23.67s/it] {'loss': 0.0044, 'grad_norm': 0.1804091684206068, 'learning_rate': 3.476e-07, 'completion_length': 51.86607551574707, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.111083984375, 'epoch': 0.65} 65%|██████▌ | 1631/2500 [13:58:04<5:42:46, 23.67s/it] 65%|██████▌ | 1632/2500 [13:58:27<5:40:15, 23.52s/it] {'loss': 0.0046, 'grad_norm': 0.24254163493650302, 'learning_rate': 3.472e-07, 'completion_length': 52.11607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1142578125, 'epoch': 0.65} 65%|██████▌ | 1632/2500 [13:58:27<5:40:15, 23.52s/it] 65%|██████▌ | 1633/2500 [13:59:09<7:00:57, 29.13s/it] {'loss': 0.0048, 'grad_norm': 0.20869931110854384, 'learning_rate': 3.4679999999999996e-07, 'completion_length': 53.16964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11962890625, 'epoch': 0.65} 65%|██████▌ | 1633/2500 [13:59:09<7:00:57, 29.13s/it] 65%|██████▌ | 1634/2500 [14:00:46<11:53:30, 49.43s/it] {'loss': 0.0042, 'grad_norm': 0.17979892360019864, 'learning_rate': 3.464e-07, 'completion_length': 62.25893211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1064453125, 'epoch': 0.65} 65%|██████▌ | 1634/2500 [14:00:46<11:53:30, 49.43s/it] 65%|██████▌ | 1635/2500 [14:02:23<15:16:26, 63.57s/it] {'loss': 0.004, 'grad_norm': 1.1155396901918289, 'learning_rate': 3.4599999999999995e-07, 'completion_length': 58.62500190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.099609375, 'epoch': 0.65} 65%|██████▌ | 1635/2500 [14:02:23<15:16:26, 63.57s/it] 65%|██████▌ | 1636/2500 [14:04:10<18:24:27, 76.70s/it] {'loss': 0.0059, 'grad_norm': 0.2834457988368736, 'learning_rate': 
3.456e-07, 'completion_length': 58.32143211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14697265625, 'epoch': 0.65} 65%|██████▌ | 1636/2500 [14:04:10<18:24:27, 76.70s/it] 65%|██████▌ | 1637/2500 [14:05:06<16:54:35, 70.54s/it] {'loss': 0.0097, 'grad_norm': 4.794077633837228, 'learning_rate': 3.452e-07, 'completion_length': 56.91071701049805, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780732154846, 'kl': 0.2412109375, 'epoch': 0.65} 65%|██████▌ | 1637/2500 [14:05:06<16:54:35, 70.54s/it] 66%|██████▌ | 1638/2500 [14:05:29<13:29:47, 56.37s/it] {'loss': 0.0054, 'grad_norm': 0.3560464781311533, 'learning_rate': 3.4479999999999996e-07, 'completion_length': 59.482147216796875, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.133544921875, 'epoch': 0.66} 66%|██████▌ | 1638/2500 [14:05:29<13:29:47, 56.37s/it] 66%|██████▌ | 1639/2500 [14:05:54<11:12:58, 46.90s/it] {'loss': 0.0128, 'grad_norm': 7.074588963314438, 'learning_rate': 3.444e-07, 'completion_length': 57.90178680419922, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.12444883212447166, 'kl': 0.31982421875, 'epoch': 0.66} 66%|██████▌ | 1639/2500 [14:05:54<11:12:58, 46.90s/it] 66%|██████▌ | 1640/2500 [14:06:17<9:29:45, 39.75s/it] {'loss': 0.0065, 'grad_norm': 2.8517266015479366, 'learning_rate': 3.4399999999999996e-07, 'completion_length': 55.97321701049805, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.12054042518138885, 'kl': 0.16162109375, 'epoch': 0.66} 66%|██████▌ | 1640/2500 [14:06:17<9:29:45, 39.75s/it] 66%|██████▌ | 1641/2500 [14:06:43<8:29:28, 35.59s/it] {'loss': 0.0054, 'grad_norm': 1.0831837938417461, 
'learning_rate': 3.436e-07, 'completion_length': 65.20536041259766, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.135009765625, 'epoch': 0.66} 66%|██████▌ | 1641/2500 [14:06:43<8:29:28, 35.59s/it] 66%|██████▌ | 1642/2500 [14:07:08<7:44:04, 32.45s/it] {'loss': 0.0072, 'grad_norm': 3.3652777811023937, 'learning_rate': 3.432e-07, 'completion_length': 60.55357360839844, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285715222358704, 'reward_std': 0.12444322556257248, 'kl': 0.180419921875, 'epoch': 0.66} 66%|██████▌ | 1642/2500 [14:07:08<7:44:04, 32.45s/it] 66%|██████▌ | 1643/2500 [14:07:33<7:10:21, 30.13s/it] {'loss': 0.0049, 'grad_norm': 0.9934527145351082, 'learning_rate': 3.4279999999999997e-07, 'completion_length': 60.14285850524902, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.12255859375, 'epoch': 0.66} 66%|██████▌ | 1643/2500 [14:07:33<7:10:21, 30.13s/it] 66%|██████▌ | 1644/2500 [14:07:57<6:44:57, 28.38s/it] {'loss': 0.0067, 'grad_norm': 1.3635923369691796, 'learning_rate': 3.4239999999999994e-07, 'completion_length': 55.812503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.16650390625, 'epoch': 0.66} 66%|██████▌ | 1644/2500 [14:07:57<6:44:57, 28.38s/it] 66%|██████▌ | 1645/2500 [14:08:21<6:22:46, 26.86s/it] {'loss': 0.0042, 'grad_norm': 0.7950494742190988, 'learning_rate': 3.42e-07, 'completion_length': 59.88393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.10400390625, 'epoch': 0.66} 66%|██████▌ | 1645/2500 [14:08:21<6:22:46, 26.86s/it] 
66%|██████▌ | 1646/2500 [14:08:44<6:09:25, 25.95s/it] {'loss': 0.0055, 'grad_norm': 0.3181095508282648, 'learning_rate': 3.416e-07, 'completion_length': 57.49107551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13671875, 'epoch': 0.66} 66%|██████▌ | 1646/2500 [14:08:45<6:09:25, 25.95s/it] 66%|██████▌ | 1647/2500 [14:09:08<6:00:01, 25.32s/it] {'loss': 0.0058, 'grad_norm': 0.24835907918496855, 'learning_rate': 3.412e-07, 'completion_length': 59.500003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.144775390625, 'epoch': 0.66} 66%|██████▌ | 1647/2500 [14:09:08<6:00:01, 25.32s/it] 66%|██████▌ | 1648/2500 [14:09:31<5:48:47, 24.56s/it] {'loss': 0.0099, 'grad_norm': 2.325527381070594, 'learning_rate': 3.408e-07, 'completion_length': 58.50000190734863, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.2470703125, 'epoch': 0.66} 66%|██████▌ | 1648/2500 [14:09:31<5:48:47, 24.56s/it] 66%|██████▌ | 1649/2500 [14:09:57<5:54:00, 24.96s/it] {'loss': 0.0042, 'grad_norm': 0.28143504749085885, 'learning_rate': 3.4039999999999995e-07, 'completion_length': 61.56250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.105224609375, 'epoch': 0.66} 66%|██████▌ | 1649/2500 [14:09:57<5:54:00, 24.96s/it] 66%|██████▌ | 1650/2500 [14:10:50<7:51:26, 33.28s/it] {'loss': 0.0105, 'grad_norm': 1.041937388395897, 'learning_rate': 3.4000000000000003e-07, 'completion_length': 57.750003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.261474609375, 'epoch': 0.66} 66%|██████▌ | 1650/2500 [14:10:50<7:51:26, 33.28s/it] 66%|██████▌ | 1651/2500 [14:11:14<7:11:57, 30.53s/it] 
{'loss': 0.0067, 'grad_norm': 1.7151331290193048, 'learning_rate': 3.396e-07, 'completion_length': 57.31250190734863, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.0835726335644722, 'kl': 0.16796875, 'epoch': 0.66} 66%|██████▌ | 1651/2500 [14:11:14<7:11:57, 30.53s/it] 66%|██████▌ | 1652/2500 [14:11:38<6:43:34, 28.55s/it] {'loss': 0.0068, 'grad_norm': 2.5758180654736376, 'learning_rate': 3.3919999999999997e-07, 'completion_length': 52.50000190734863, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.169921875, 'epoch': 0.66} 66%|██████▌ | 1652/2500 [14:11:38<6:43:34, 28.55s/it] 66%|██████▌ | 1653/2500 [14:12:02<6:26:13, 27.36s/it] {'loss': 0.0053, 'grad_norm': 0.8024074927632744, 'learning_rate': 3.388e-07, 'completion_length': 58.12500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13232421875, 'epoch': 0.66} 66%|██████▌ | 1653/2500 [14:12:02<6:26:13, 27.36s/it] 66%|██████▌ | 1654/2500 [14:12:26<6:10:49, 26.30s/it] {'loss': 0.0096, 'grad_norm': 2.5274266018152853, 'learning_rate': 3.3839999999999996e-07, 'completion_length': 60.94643211364746, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.11594516783952713, 'kl': 0.239501953125, 'epoch': 0.66} 66%|██████▌ | 1654/2500 [14:12:26<6:10:49, 26.30s/it] 66%|██████▌ | 1655/2500 [14:12:50<5:58:51, 25.48s/it] {'loss': 0.0112, 'grad_norm': 3.619175951258607, 'learning_rate': 3.38e-07, 'completion_length': 50.50000190734863, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.910714328289032, 'reward_std': 0.13408026099205017, 'kl': 0.281005859375, 'epoch': 0.66} 
66%|██████▌ | 1655/2500 [14:12:50<5:58:51, 25.48s/it] 66%|██████▌ | 1656/2500 [14:13:14<5:53:16, 25.11s/it] {'loss': 0.0053, 'grad_norm': 1.1803821823330183, 'learning_rate': 3.376e-07, 'completion_length': 55.93750190734863, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.133056640625, 'epoch': 0.66} 66%|██████▌ | 1656/2500 [14:13:14<5:53:16, 25.11s/it] 66%|██████▋ | 1657/2500 [14:13:38<5:47:26, 24.73s/it] {'loss': 0.0167, 'grad_norm': 2.220807245145482, 'learning_rate': 3.372e-07, 'completion_length': 54.13393211364746, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.8571429252624512, 'reward_std': 0.19503602385520935, 'kl': 0.4169921875, 'epoch': 0.66} 66%|██████▋ | 1657/2500 [14:13:38<5:47:26, 24.73s/it] 66%|██████▋ | 1658/2500 [14:14:02<5:44:22, 24.54s/it] {'loss': 0.0059, 'grad_norm': 2.5496559619315486, 'learning_rate': 3.368e-07, 'completion_length': 55.09821701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.1484375, 'epoch': 0.66} 66%|██████▋ | 1658/2500 [14:14:02<5:44:22, 24.54s/it] 66%|██████▋ | 1659/2500 [14:14:24<5:35:20, 23.92s/it] {'loss': 0.0082, 'grad_norm': 2.305496237436785, 'learning_rate': 3.3639999999999997e-07, 'completion_length': 53.44643020629883, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.205078125, 'epoch': 0.66} 66%|██████▋ | 1659/2500 [14:14:24<5:35:20, 23.92s/it] 66%|██████▋ | 1660/2500 [14:14:45<5:20:22, 22.88s/it] {'loss': 0.0314, 'grad_norm': 6.443409986225945, 'learning_rate': 3.36e-07, 'completion_length': 56.56250190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.955357164144516, 
'reward': 1.9196429252624512, 'reward_std': 0.20466744899749756, 'kl': 0.7861328125, 'epoch': 0.66} 66%|██████▋ | 1660/2500 [14:14:45<5:20:22, 22.88s/it] 66%|██████▋ | 1661/2500 [14:15:06<5:10:37, 22.21s/it] {'loss': 0.0355, 'grad_norm': 3.6629463865178384, 'learning_rate': 3.356e-07, 'completion_length': 57.24107360839844, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.8660715222358704, 'reward_std': 0.1767766997218132, 'kl': 0.88671875, 'epoch': 0.66} 66%|██████▋ | 1661/2500 [14:15:06<5:10:37, 22.21s/it] 66%|██████▋ | 1662/2500 [14:15:26<5:01:39, 21.60s/it] {'loss': 0.0225, 'grad_norm': 5.538879245718847, 'learning_rate': 3.352e-07, 'completion_length': 52.267860412597656, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9196429252624512, 'reward_std': 0.22728431224822998, 'kl': 0.56201171875, 'epoch': 0.66} 66%|██████▋ | 1662/2500 [14:15:26<5:01:39, 21.60s/it] 67%|██████▋ | 1663/2500 [14:15:46<4:55:17, 21.17s/it] {'loss': 0.0151, 'grad_norm': 4.843773133796462, 'learning_rate': 3.3479999999999995e-07, 'completion_length': 54.07143020629883, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.955357164144516, 'reward': 1.910714328289032, 'reward_std': 0.25253812968730927, 'kl': 0.3779296875, 'epoch': 0.67} 67%|██████▋ | 1663/2500 [14:15:46<4:55:17, 21.17s/it] 67%|██████▋ | 1664/2500 [14:16:06<4:50:08, 20.82s/it] {'loss': 0.0075, 'grad_norm': 3.51678358731097, 'learning_rate': 3.344e-07, 'completion_length': 56.49107360839844, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.08747543022036552, 'kl': 0.1865234375, 'epoch': 0.67} 67%|██████▋ | 1664/2500 [14:16:06<4:50:08, 20.82s/it] 67%|██████▋ | 1665/2500 [14:16:28<4:57:06, 21.35s/it] {'loss': 0.014, 'grad_norm': 3.5674750385618914, 'learning_rate': 3.34e-07, 'completion_length': 51.67857360839844, 
'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.919642984867096, 'reward_std': 0.1872248351573944, 'kl': 0.3505859375, 'epoch': 0.67} 67%|██████▋ | 1665/2500 [14:16:28<4:57:06, 21.35s/it] 67%|██████▋ | 1666/2500 [14:16:49<4:52:00, 21.01s/it] {'loss': 0.0243, 'grad_norm': 3.285766611744747, 'learning_rate': 3.3359999999999997e-07, 'completion_length': 53.33928871154785, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9285714626312256, 'reward_std': 0.18458788841962814, 'kl': 0.6083984375, 'epoch': 0.67} 67%|██████▋ | 1666/2500 [14:16:49<4:52:00, 21.01s/it] 67%|██████▋ | 1667/2500 [14:17:08<4:44:32, 20.50s/it] {'loss': 0.0067, 'grad_norm': 1.519269750018896, 'learning_rate': 3.332e-07, 'completion_length': 57.517860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.16650390625, 'epoch': 0.67} 67%|██████▋ | 1667/2500 [14:17:08<4:44:32, 20.50s/it] 67%|██████▋ | 1668/2500 [14:17:30<4:52:44, 21.11s/it] {'loss': 0.0115, 'grad_norm': 1.792159019489227, 'learning_rate': 3.3279999999999996e-07, 'completion_length': 48.22321701049805, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.28759765625, 'epoch': 0.67} 67%|██████▋ | 1668/2500 [14:17:31<4:52:44, 21.11s/it] 67%|██████▋ | 1669/2500 [14:18:20<6:50:43, 29.66s/it] {'loss': 0.0054, 'grad_norm': 0.7753760420708053, 'learning_rate': 3.3239999999999993e-07, 'completion_length': 46.43750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.134033203125, 'epoch': 0.67} 67%|██████▋ | 1669/2500 [14:18:20<6:50:43, 29.66s/it] 67%|██████▋ | 1670/2500 [14:19:03<7:45:50, 33.67s/it] {'loss': 0.0143, 'grad_norm': 
20.513278296942055, 'learning_rate': 3.32e-07, 'completion_length': 47.392860412597656, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857909202576, 'reward_std': 0.05831881985068321, 'kl': 0.357421875, 'epoch': 0.67} 67%|██████▋ | 1670/2500 [14:19:03<7:45:50, 33.67s/it] 67%|██████▋ | 1671/2500 [14:19:22<6:43:26, 29.20s/it] {'loss': 0.0105, 'grad_norm': 1.01180115230599, 'learning_rate': 3.316e-07, 'completion_length': 46.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.26123046875, 'epoch': 0.67} 67%|██████▋ | 1671/2500 [14:19:22<6:43:26, 29.20s/it] 67%|██████▋ | 1672/2500 [14:19:41<5:59:38, 26.06s/it] {'loss': 0.0047, 'grad_norm': 1.080803119212503, 'learning_rate': 3.312e-07, 'completion_length': 53.02678680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.118408203125, 'epoch': 0.67} 67%|██████▋ | 1672/2500 [14:19:41<5:59:38, 26.06s/it] 67%|██████▋ | 1673/2500 [14:20:26<7:20:32, 31.96s/it] {'loss': 0.0144, 'grad_norm': 4.09229065045252, 'learning_rate': 3.3079999999999997e-07, 'completion_length': 50.74107360839844, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375001192092896, 'reward_std': 0.07003280520439148, 'kl': 0.359375, 'epoch': 0.67} 67%|██████▋ | 1673/2500 [14:20:26<7:20:32, 31.96s/it] 67%|██████▋ | 1674/2500 [14:20:52<6:52:51, 29.99s/it] {'loss': 0.0094, 'grad_norm': 4.080036040691627, 'learning_rate': 3.304e-07, 'completion_length': 55.19643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.23486328125, 'epoch': 0.67} 67%|██████▋ | 1674/2500 [14:20:52<6:52:51, 29.99s/it] 67%|██████▋ | 1675/2500 [14:21:23<6:57:51, 30.39s/it] {'loss': 
0.0046, 'grad_norm': 1.0151536203310039, 'learning_rate': 3.3e-07, 'completion_length': 48.43750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.11572265625, 'epoch': 0.67} 67%|██████▋ | 1675/2500 [14:21:23<6:57:51, 30.39s/it] 67%|██████▋ | 1676/2500 [14:21:42<6:10:22, 26.97s/it] {'loss': 0.0037, 'grad_norm': 0.18259392935633104, 'learning_rate': 3.296e-07, 'completion_length': 55.80357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09375, 'epoch': 0.67} 67%|██████▋ | 1676/2500 [14:21:42<6:10:22, 26.97s/it] 67%|██████▋ | 1677/2500 [14:22:00<5:34:45, 24.40s/it] {'loss': 0.0086, 'grad_norm': 0.8686740075027628, 'learning_rate': 3.2919999999999996e-07, 'completion_length': 60.15178871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.21435546875, 'epoch': 0.67} 67%|██████▋ | 1677/2500 [14:22:00<5:34:45, 24.40s/it] 67%|██████▋ | 1678/2500 [14:22:19<5:09:33, 22.60s/it] {'loss': 0.0063, 'grad_norm': 0.3085681267690047, 'learning_rate': 3.288e-07, 'completion_length': 57.33928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.157958984375, 'epoch': 0.67} 67%|██████▋ | 1678/2500 [14:22:19<5:09:33, 22.60s/it] 67%|██████▋ | 1679/2500 [14:22:54<5:59:09, 26.25s/it] {'loss': 0.0055, 'grad_norm': 0.2957508676050862, 'learning_rate': 3.284e-07, 'completion_length': 54.31250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13818359375, 'epoch': 0.67} 67%|██████▋ | 1679/2500 [14:22:54<5:59:09, 26.25s/it] 67%|██████▋ | 1680/2500 [14:23:13<5:29:25, 24.10s/it] {'loss': 0.0046, 'grad_norm': 0.20474658031019255, 'learning_rate': 3.28e-07, 'completion_length': 
61.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1142578125, 'epoch': 0.67} 67%|██████▋ | 1680/2500 [14:23:13<5:29:25, 24.10s/it] 67%|██████▋ | 1681/2500 [14:23:32<5:07:48, 22.55s/it] {'loss': 0.0054, 'grad_norm': 0.29688150214080145, 'learning_rate': 3.276e-07, 'completion_length': 58.22321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.135986328125, 'epoch': 0.67} 67%|██████▋ | 1681/2500 [14:23:32<5:07:48, 22.55s/it] 67%|██████▋ | 1682/2500 [14:23:50<4:48:54, 21.19s/it] {'loss': 0.0056, 'grad_norm': 5.192854319195633, 'learning_rate': 3.2719999999999997e-07, 'completion_length': 54.83035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14013671875, 'epoch': 0.67} 67%|██████▋ | 1682/2500 [14:23:50<4:48:54, 21.19s/it] 67%|██████▋ | 1683/2500 [14:24:13<4:55:22, 21.69s/it] {'loss': 0.005, 'grad_norm': 1.7823042497772945, 'learning_rate': 3.268e-07, 'completion_length': 51.77678871154785, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.06343398988246918, 'kl': 0.12451171875, 'epoch': 0.67} 67%|██████▋ | 1683/2500 [14:24:13<4:55:22, 21.69s/it] 67%|██████▋ | 1684/2500 [14:24:31<4:40:21, 20.61s/it] {'loss': 0.005, 'grad_norm': 0.2864824374360529, 'learning_rate': 3.264e-07, 'completion_length': 52.10714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12548828125, 'epoch': 0.67} 67%|██████▋ | 1684/2500 [14:24:31<4:40:21, 20.61s/it] 67%|██████▋ | 1685/2500 [14:25:08<5:47:53, 25.61s/it] {'loss': 0.0072, 'grad_norm': 2.4687149165580924, 'learning_rate': 3.26e-07, 'completion_length': 56.47321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.180419921875, 'epoch': 0.67} 67%|██████▋ | 1685/2500 [14:25:08<5:47:53, 25.61s/it] 67%|██████▋ | 1686/2500 [14:25:27<5:20:07, 23.60s/it] {'loss': 0.0062, 'grad_norm': 0.3327989964227514, 'learning_rate': 3.256e-07, 'completion_length': 57.67857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15380859375, 'epoch': 0.67} 67%|██████▋ | 1686/2500 [14:25:27<5:20:07, 23.60s/it] 67%|██████▋ | 1687/2500 [14:25:46<5:00:57, 22.21s/it] {'loss': 0.0053, 'grad_norm': 0.2909188564464754, 'learning_rate': 3.252e-07, 'completion_length': 62.16964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.133056640625, 'epoch': 0.67} 67%|██████▋ | 1687/2500 [14:25:46<5:00:57, 22.21s/it] 68%|██████▊ | 1688/2500 [14:26:04<4:45:16, 21.08s/it] {'loss': 0.0047, 'grad_norm': 1.171280722852935, 'learning_rate': 3.2479999999999994e-07, 'completion_length': 54.98214530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.11767578125, 'epoch': 0.68} 68%|██████▊ | 1688/2500 [14:26:04<4:45:16, 21.08s/it] 68%|██████▊ | 1689/2500 [14:26:27<4:49:53, 21.45s/it] {'loss': 0.0081, 'grad_norm': 0.49474965618473504, 'learning_rate': 3.244e-07, 'completion_length': 57.25893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.201416015625, 'epoch': 0.68} 68%|██████▊ | 1689/2500 [14:26:27<4:49:53, 21.45s/it] 68%|██████▊ | 1690/2500 [14:27:01<5:40:49, 25.25s/it] {'loss': 0.0053, 'grad_norm': 1.5470637280433432, 'learning_rate': 3.24e-07, 'completion_length': 61.89285850524902, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 
0.1318359375, 'epoch': 0.68} 68%|██████▊ | 1690/2500 [14:27:01<5:40:49, 25.25s/it] 68%|██████▊ | 1691/2500 [14:27:26<5:40:11, 25.23s/it] {'loss': 0.0069, 'grad_norm': 0.4868643623484184, 'learning_rate': 3.2359999999999996e-07, 'completion_length': 55.37500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1728515625, 'epoch': 0.68} 68%|██████▊ | 1691/2500 [14:27:26<5:40:11, 25.23s/it] 68%|██████▊ | 1692/2500 [14:28:08<6:50:14, 30.46s/it] {'loss': 0.0048, 'grad_norm': 0.14578778223402133, 'learning_rate': 3.232e-07, 'completion_length': 57.83035850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.118896484375, 'epoch': 0.68} 68%|██████▊ | 1692/2500 [14:28:09<6:50:14, 30.46s/it] 68%|██████▊ | 1693/2500 [14:28:45<7:13:16, 32.21s/it] {'loss': 0.0054, 'grad_norm': 0.2573612939596985, 'learning_rate': 3.2279999999999995e-07, 'completion_length': 51.830360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1357421875, 'epoch': 0.68} 68%|██████▊ | 1693/2500 [14:28:45<7:13:16, 32.21s/it] 68%|██████▊ | 1694/2500 [14:29:04<6:19:56, 28.28s/it] {'loss': 0.0045, 'grad_norm': 0.17802381952787205, 'learning_rate': 3.2240000000000003e-07, 'completion_length': 56.68750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.112548828125, 'epoch': 0.68} 68%|██████▊ | 1694/2500 [14:29:04<6:19:56, 28.28s/it] 68%|██████▊ | 1695/2500 [14:29:23<5:44:09, 25.65s/it] {'loss': 0.0057, 'grad_norm': 1.5681732101343206, 'learning_rate': 3.22e-07, 'completion_length': 61.01786231994629, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.14208984375, 'epoch': 0.68} 68%|██████▊ | 1695/2500 [14:29:23<5:44:09, 25.65s/it] 68%|██████▊ | 1696/2500 
[14:29:42<5:15:20, 23.53s/it] {'loss': 0.0052, 'grad_norm': 0.2510530710779614, 'learning_rate': 3.2159999999999997e-07, 'completion_length': 50.95535850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.130859375, 'epoch': 0.68} 68%|██████▊ | 1696/2500 [14:29:42<5:15:20, 23.53s/it] 68%|██████▊ | 1697/2500 [14:30:01<4:56:44, 22.17s/it] {'loss': 0.0051, 'grad_norm': 2.0557420112683737, 'learning_rate': 3.212e-07, 'completion_length': 60.02678871154785, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.126953125, 'epoch': 0.68} 68%|██████▊ | 1697/2500 [14:30:01<4:56:44, 22.17s/it] 68%|██████▊ | 1698/2500 [14:30:33<5:35:38, 25.11s/it] {'loss': 0.0057, 'grad_norm': 1.8302286105165002, 'learning_rate': 3.2079999999999996e-07, 'completion_length': 56.99107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14306640625, 'epoch': 0.68} 68%|██████▊ | 1698/2500 [14:30:33<5:35:38, 25.11s/it] 68%|██████▊ | 1699/2500 [14:30:52<5:10:55, 23.29s/it] {'loss': 0.0046, 'grad_norm': 0.25092924166326463, 'learning_rate': 3.204e-07, 'completion_length': 62.36607551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115478515625, 'epoch': 0.68} 68%|██████▊ | 1699/2500 [14:30:52<5:10:55, 23.29s/it] 68%|██████▊ | 1700/2500 [14:31:11<4:51:39, 21.87s/it] {'loss': 0.0085, 'grad_norm': 1.1317576467401478, 'learning_rate': 3.2e-07, 'completion_length': 56.63393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.211669921875, 'epoch': 0.68} 68%|██████▊ | 1700/2500 [14:31:11<4:51:39, 21.87s/it] 68%|██████▊ | 1701/2500 [14:32:51<10:02:23, 45.24s/it] {'loss': 0.0068, 
'grad_norm': 1.0998863403638304, 'learning_rate': 3.196e-07, 'completion_length': 64.6339340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.17041015625, 'epoch': 0.68} 68%|██████▊ | 1701/2500 [14:32:51<10:02:23, 45.24s/it] 68%|██████▊ | 1702/2500 [14:33:12<8:26:44, 38.10s/it] {'loss': 0.0034, 'grad_norm': 0.9184717265878984, 'learning_rate': 3.1919999999999995e-07, 'completion_length': 69.78571701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.084716796875, 'epoch': 0.68} 68%|██████▊ | 1702/2500 [14:33:12<8:26:44, 38.10s/it] 68%|██████▊ | 1703/2500 [14:33:36<7:30:06, 33.88s/it] {'loss': 0.0032, 'grad_norm': 3.047417562941879, 'learning_rate': 3.1879999999999997e-07, 'completion_length': 61.437503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.078857421875, 'epoch': 0.68} 68%|██████▊ | 1703/2500 [14:33:36<7:30:06, 33.88s/it] 68%|██████▊ | 1704/2500 [14:33:55<6:31:59, 29.55s/it] {'loss': 0.0049, 'grad_norm': 2.063851051340682, 'learning_rate': 3.184e-07, 'completion_length': 57.50000190734863, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.12353515625, 'epoch': 0.68} 68%|██████▊ | 1704/2500 [14:33:55<6:31:59, 29.55s/it] 68%|██████▊ | 1705/2500 [14:34:18<6:04:08, 27.48s/it] {'loss': 0.006, 'grad_norm': 2.3430451851831626, 'learning_rate': 3.18e-07, 'completion_length': 56.66964530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.149658203125, 'epoch': 0.68} 68%|██████▊ | 1705/2500 [14:34:18<6:04:08, 27.48s/it] 
68%|██████▊ | 1706/2500 [14:34:42<5:50:54, 26.52s/it] {'loss': 0.0093, 'grad_norm': 1.625000488860597, 'learning_rate': 3.176e-07, 'completion_length': 61.035715103149414, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.2333984375, 'epoch': 0.68} 68%|██████▊ | 1706/2500 [14:34:42<5:50:54, 26.52s/it] 68%|██████▊ | 1707/2500 [14:35:03<5:27:14, 24.76s/it] {'loss': 0.0102, 'grad_norm': 2.5608736128600715, 'learning_rate': 3.1719999999999996e-07, 'completion_length': 54.71428871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.255859375, 'epoch': 0.68} 68%|██████▊ | 1707/2500 [14:35:03<5:27:14, 24.76s/it] 68%|██████▊ | 1708/2500 [14:35:22<5:05:44, 23.16s/it] {'loss': 0.0069, 'grad_norm': 3.8204225062028723, 'learning_rate': 3.1680000000000003e-07, 'completion_length': 54.91071701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.173583984375, 'epoch': 0.68} 68%|██████▊ | 1708/2500 [14:35:22<5:05:44, 23.16s/it] 68%|██████▊ | 1709/2500 [14:35:42<4:50:20, 22.02s/it] {'loss': 0.0083, 'grad_norm': 2.702617034456442, 'learning_rate': 3.164e-07, 'completion_length': 57.96428871154785, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9375001192092896, 'reward_std': 0.1541598215699196, 'kl': 0.2080078125, 'epoch': 0.68} 68%|██████▊ | 1709/2500 [14:35:42<4:50:20, 22.02s/it] 68%|██████▊ | 1710/2500 [14:36:02<4:43:41, 21.55s/it] {'loss': 0.0075, 'grad_norm': 2.7726126146250323, 'learning_rate': 3.1599999999999997e-07, 'completion_length': 53.73214530944824, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 
'reward_std': 0.13919542729854584, 'kl': 0.18701171875, 'epoch': 0.68} 68%|██████▊ | 1710/2500 [14:36:02<4:43:41, 21.55s/it] 68%|██████▊ | 1711/2500 [14:36:21<4:34:05, 20.84s/it] {'loss': 0.008, 'grad_norm': 0.618121928066461, 'learning_rate': 3.156e-07, 'completion_length': 56.74107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.20068359375, 'epoch': 0.68} 68%|██████▊ | 1711/2500 [14:36:21<4:34:05, 20.84s/it] 68%|██████▊ | 1712/2500 [14:36:41<4:29:55, 20.55s/it] {'loss': 0.0079, 'grad_norm': 1.5056560195984854, 'learning_rate': 3.1519999999999996e-07, 'completion_length': 50.68750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.19677734375, 'epoch': 0.68} 68%|██████▊ | 1712/2500 [14:36:41<4:29:55, 20.55s/it] 69%|██████▊ | 1713/2500 [14:37:00<4:24:10, 20.14s/it] {'loss': 0.0125, 'grad_norm': 1.9019203078442908, 'learning_rate': 3.148e-07, 'completion_length': 52.01785850524902, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.31396484375, 'epoch': 0.69} 69%|██████▊ | 1713/2500 [14:37:00<4:24:10, 20.14s/it] 69%|██████▊ | 1714/2500 [14:37:20<4:22:28, 20.04s/it] {'loss': 0.0074, 'grad_norm': 0.6284328498261892, 'learning_rate': 3.144e-07, 'completion_length': 52.95535850524902, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.185546875, 'epoch': 0.69} 69%|██████▊ | 1714/2500 [14:37:20<4:22:28, 20.04s/it] 69%|██████▊ | 1715/2500 [14:37:41<4:24:08, 20.19s/it] {'loss': 0.0071, 'grad_norm': 2.3331725181523173, 'learning_rate': 3.14e-07, 'completion_length': 57.35714530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 
'reward_std': 0.07576143741607666, 'kl': 0.1787109375, 'epoch': 0.69} 69%|██████▊ | 1715/2500 [14:37:41<4:24:08, 20.19s/it] 69%|██████▊ | 1716/2500 [14:38:00<4:20:42, 19.95s/it] {'loss': 0.0224, 'grad_norm': 3.0303863331846634, 'learning_rate': 3.1359999999999995e-07, 'completion_length': 57.36607360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.560546875, 'epoch': 0.69} 69%|██████▊ | 1716/2500 [14:38:00<4:20:42, 19.95s/it] 69%|██████▊ | 1717/2500 [14:38:21<4:23:11, 20.17s/it] {'loss': 0.0091, 'grad_norm': 2.947175322806932, 'learning_rate': 3.1319999999999997e-07, 'completion_length': 57.96428871154785, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375001192092896, 'reward_std': 0.1379830539226532, 'kl': 0.22607421875, 'epoch': 0.69} 69%|██████▊ | 1717/2500 [14:38:21<4:23:11, 20.17s/it] 69%|██████▊ | 1718/2500 [14:38:41<4:22:41, 20.16s/it] {'loss': 0.0099, 'grad_norm': 5.71170395472267, 'learning_rate': 3.128e-07, 'completion_length': 52.98214530944824, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.24755859375, 'epoch': 0.69} 69%|██████▊ | 1718/2500 [14:38:41<4:22:41, 20.16s/it] 69%|██████▉ | 1719/2500 [14:39:02<4:25:58, 20.43s/it] {'loss': 0.0439, 'grad_norm': 7.249860830361686, 'learning_rate': 3.124e-07, 'completion_length': 50.07143020629883, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.0739355981349945, 'kl': 1.09326171875, 'epoch': 0.69} 69%|██████▉ | 1719/2500 [14:39:02<4:25:58, 20.43s/it] 69%|██████▉ | 1720/2500 [14:39:21<4:21:11, 20.09s/it] {'loss': 0.0068, 'grad_norm': 1.670603855924059, 'learning_rate': 3.12e-07, 'completion_length': 54.57143211364746, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1689453125, 'epoch': 0.69} 69%|██████▉ | 1720/2500 [14:39:21<4:21:11, 20.09s/it] 69%|██████▉ | 1721/2500 [14:39:43<4:27:23, 20.59s/it] {'loss': 0.0141, 'grad_norm': 1.3740857221382308, 'learning_rate': 3.1159999999999996e-07, 'completion_length': 54.82143211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3525390625, 'epoch': 0.69} 69%|██████▉ | 1721/2500 [14:39:43<4:27:23, 20.59s/it] 69%|██████▉ | 1722/2500 [14:40:03<4:25:56, 20.51s/it] {'loss': 0.0128, 'grad_norm': 1.3919466754549916, 'learning_rate': 3.112e-07, 'completion_length': 51.82143020629883, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.3193359375, 'epoch': 0.69} 69%|██████▉ | 1722/2500 [14:40:03<4:25:56, 20.51s/it] 69%|██████▉ | 1723/2500 [14:40:23<4:23:25, 20.34s/it] {'loss': 0.0172, 'grad_norm': 2.2263543515396513, 'learning_rate': 3.108e-07, 'completion_length': 58.32143020629883, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375000596046448, 'reward_std': 0.13671719282865524, 'kl': 0.4296875, 'epoch': 0.69} 69%|██████▉ | 1723/2500 [14:40:23<4:23:25, 20.34s/it] 69%|██████▉ | 1724/2500 [14:40:43<4:20:51, 20.17s/it] {'loss': 0.0142, 'grad_norm': 1.57438025407426, 'learning_rate': 3.104e-07, 'completion_length': 59.205360412597656, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.35498046875, 'epoch': 0.69} 69%|██████▉ | 1724/2500 [14:40:43<4:20:51, 20.17s/it] 69%|██████▉ | 1725/2500 [14:41:03<4:18:35, 20.02s/it] {'loss': 0.0097, 
'grad_norm': 1.1935903111545803, 'learning_rate': 3.1e-07, 'completion_length': 52.67857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.24365234375, 'epoch': 0.69} 69%|██████▉ | 1725/2500 [14:41:03<4:18:35, 20.02s/it] 69%|██████▉ | 1726/2500 [14:41:23<4:17:33, 19.97s/it] {'loss': 0.0066, 'grad_norm': 0.6229588689626657, 'learning_rate': 3.0959999999999997e-07, 'completion_length': 58.63393211364746, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.16455078125, 'epoch': 0.69} 69%|██████▉ | 1726/2500 [14:41:23<4:17:33, 19.97s/it] 69%|██████▉ | 1727/2500 [14:41:44<4:21:56, 20.33s/it] {'loss': 0.0079, 'grad_norm': 0.8162812248197652, 'learning_rate': 3.0919999999999994e-07, 'completion_length': 52.75893211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1962890625, 'epoch': 0.69} 69%|██████▉ | 1727/2500 [14:41:44<4:21:56, 20.33s/it] 69%|██████▉ | 1728/2500 [14:42:04<4:21:26, 20.32s/it] {'loss': 0.0061, 'grad_norm': 1.7637287898559397, 'learning_rate': 3.088e-07, 'completion_length': 54.75000190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1513671875, 'epoch': 0.69} 69%|██████▉ | 1728/2500 [14:42:04<4:21:26, 20.32s/it] 69%|██████▉ | 1729/2500 [14:42:24<4:21:25, 20.34s/it] {'loss': 0.0049, 'grad_norm': 0.20965142735875386, 'learning_rate': 3.084e-07, 'completion_length': 49.64285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12353515625, 'epoch': 0.69} 69%|██████▉ | 1729/2500 [14:42:24<4:21:25, 20.34s/it] 69%|██████▉ | 1730/2500 [14:42:47<4:29:23, 20.99s/it] {'loss': 0.0056, 'grad_norm': 2.285364763863449, 'learning_rate': 3.08e-07, 'completion_length': 53.49107551574707, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1396484375, 'epoch': 0.69} 69%|██████▉ | 1730/2500 [14:42:47<4:29:23, 20.99s/it] 69%|██████▉ | 1731/2500 [14:43:09<4:32:30, 21.26s/it] {'loss': 0.0058, 'grad_norm': 0.25740196428648193, 'learning_rate': 3.076e-07, 'completion_length': 53.38393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.144287109375, 'epoch': 0.69} 69%|██████▉ | 1731/2500 [14:43:09<4:32:30, 21.26s/it] 69%|██████▉ | 1732/2500 [14:43:30<4:32:45, 21.31s/it] {'loss': 0.0087, 'grad_norm': 1.7142068761960139, 'learning_rate': 3.0719999999999995e-07, 'completion_length': 59.50893211364746, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.21826171875, 'epoch': 0.69} 69%|██████▉ | 1732/2500 [14:43:30<4:32:45, 21.31s/it] 69%|██████▉ | 1733/2500 [14:43:52<4:35:10, 21.53s/it] {'loss': 0.0045, 'grad_norm': 0.18153451232958454, 'learning_rate': 3.068e-07, 'completion_length': 55.125003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.113037109375, 'epoch': 0.69} 69%|██████▉ | 1733/2500 [14:43:53<4:35:10, 21.53s/it] 69%|██████▉ | 1734/2500 [14:44:14<4:33:49, 21.45s/it] {'loss': 0.0086, 'grad_norm': 3.704996429205709, 'learning_rate': 3.064e-07, 'completion_length': 52.38393020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21533203125, 'epoch': 0.69} 69%|██████▉ | 1734/2500 [14:44:14<4:33:49, 21.45s/it] 69%|██████▉ | 1735/2500 [14:44:35<4:31:49, 21.32s/it] {'loss': 0.0054, 'grad_norm': 1.8683354991728331, 'learning_rate': 3.0599999999999996e-07, 'completion_length': 52.96428871154785, 'rewards/accuracy_reward': 
0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.135009765625, 'epoch': 0.69} 69%|██████▉ | 1735/2500 [14:44:35<4:31:49, 21.32s/it] 69%|██████▉ | 1736/2500 [14:44:58<4:38:58, 21.91s/it] {'loss': 0.0067, 'grad_norm': 2.1588990964413717, 'learning_rate': 3.056e-07, 'completion_length': 60.026790618896484, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1669921875, 'epoch': 0.69} 69%|██████▉ | 1736/2500 [14:44:58<4:38:58, 21.91s/it] 69%|██████▉ | 1737/2500 [14:45:19<4:35:23, 21.66s/it] {'loss': 0.0051, 'grad_norm': 0.2629656750572834, 'learning_rate': 3.052e-07, 'completion_length': 57.34821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12744140625, 'epoch': 0.69} 69%|██████▉ | 1737/2500 [14:45:19<4:35:23, 21.66s/it] 70%|██████▉ | 1738/2500 [14:45:41<4:36:28, 21.77s/it] {'loss': 0.0058, 'grad_norm': 9.667987406753285, 'learning_rate': 3.048e-07, 'completion_length': 55.59821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14453125, 'epoch': 0.7} 70%|██████▉ | 1738/2500 [14:45:41<4:36:28, 21.77s/it] 70%|██████▉ | 1739/2500 [14:46:02<4:34:24, 21.64s/it] {'loss': 0.0043, 'grad_norm': 0.19958139187289753, 'learning_rate': 3.044e-07, 'completion_length': 52.017860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1064453125, 'epoch': 0.7} 70%|██████▉ | 1739/2500 [14:46:02<4:34:24, 21.64s/it] 70%|██████▉ | 1740/2500 [14:46:24<4:32:29, 21.51s/it] {'loss': 0.0047, 'grad_norm': 1.2351485594093587, 'learning_rate': 3.0399999999999997e-07, 'completion_length': 52.21428871154785, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 
1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.117431640625, 'epoch': 0.7} 70%|██████▉ | 1740/2500 [14:46:24<4:32:29, 21.51s/it] 70%|██████▉ | 1741/2500 [14:46:45<4:33:11, 21.60s/it] {'loss': 0.0109, 'grad_norm': 1.3101474600044514, 'learning_rate': 3.036e-07, 'completion_length': 52.52678871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.272705078125, 'epoch': 0.7} 70%|██████▉ | 1741/2500 [14:46:45<4:33:11, 21.60s/it] 70%|██████▉ | 1742/2500 [14:47:07<4:32:02, 21.53s/it] {'loss': 0.0055, 'grad_norm': 0.984368350569565, 'learning_rate': 3.032e-07, 'completion_length': 55.75893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.13720703125, 'epoch': 0.7} 70%|██████▉ | 1742/2500 [14:47:07<4:32:02, 21.53s/it] 70%|██████▉ | 1743/2500 [14:47:30<4:36:33, 21.92s/it] {'loss': 0.0043, 'grad_norm': 0.9608612844608386, 'learning_rate': 3.028e-07, 'completion_length': 54.48214530944824, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.107666015625, 'epoch': 0.7} 70%|██████▉ | 1743/2500 [14:47:30<4:36:33, 21.92s/it] 70%|██████▉ | 1744/2500 [14:47:50<4:31:56, 21.58s/it] {'loss': 0.0035, 'grad_norm': 0.12869718526352444, 'learning_rate': 3.024e-07, 'completion_length': 55.66964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0869140625, 'epoch': 0.7} 70%|██████▉ | 1744/2500 [14:47:50<4:31:56, 21.58s/it] 70%|██████▉ | 1745/2500 [14:48:12<4:33:33, 21.74s/it] {'loss': 0.0066, 'grad_norm': 1.3805410492702659, 'learning_rate': 3.02e-07, 'completion_length': 61.09821701049805, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 
1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.165771484375, 'epoch': 0.7} 70%|██████▉ | 1745/2500 [14:48:12<4:33:33, 21.74s/it] 70%|██████▉ | 1746/2500 [14:48:34<4:32:19, 21.67s/it] {'loss': 0.0331, 'grad_norm': 4.2202559666798445, 'learning_rate': 3.0159999999999995e-07, 'completion_length': 54.92857360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9375000596046448, 'reward_std': 0.1192745715379715, 'kl': 0.828125, 'epoch': 0.7} 70%|██████▉ | 1746/2500 [14:48:34<4:32:19, 21.67s/it] 70%|██████▉ | 1747/2500 [14:48:57<4:36:42, 22.05s/it] {'loss': 0.0045, 'grad_norm': 0.1771446022365083, 'learning_rate': 3.012e-07, 'completion_length': 51.84821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.112060546875, 'epoch': 0.7} 70%|██████▉ | 1747/2500 [14:48:57<4:36:42, 22.05s/it] 70%|██████▉ | 1748/2500 [14:49:18<4:34:00, 21.86s/it] {'loss': 0.0057, 'grad_norm': 1.0974063778215026, 'learning_rate': 3.008e-07, 'completion_length': 55.75000190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14208984375, 'epoch': 0.7} 70%|██████▉ | 1748/2500 [14:49:18<4:34:00, 21.86s/it] 70%|██████▉ | 1749/2500 [14:49:40<4:33:18, 21.84s/it] {'loss': 0.0079, 'grad_norm': 2.0948614348765533, 'learning_rate': 3.0039999999999996e-07, 'completion_length': 56.642860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.197265625, 'epoch': 0.7} 70%|██████▉ | 1749/2500 [14:49:40<4:33:18, 21.84s/it] 70%|███████ | 1750/2500 [14:50:02<4:33:22, 21.87s/it] {'loss': 0.0052, 'grad_norm': 1.5204952227685418, 'learning_rate': 3e-07, 'completion_length': 56.58928871154785, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 
'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.129638671875, 'epoch': 0.7} 70%|███████ | 1750/2500 [14:50:02<4:33:22, 21.87s/it] 70%|███████ | 1751/2500 [14:50:24<4:31:49, 21.77s/it] {'loss': 0.0063, 'grad_norm': 0.41684466334497833, 'learning_rate': 2.9959999999999996e-07, 'completion_length': 56.76785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.158203125, 'epoch': 0.7} 70%|███████ | 1751/2500 [14:50:24<4:31:49, 21.77s/it] 70%|███████ | 1752/2500 [14:50:47<4:39:31, 22.42s/it] {'loss': 0.005, 'grad_norm': 1.7170101409002716, 'learning_rate': 2.9920000000000003e-07, 'completion_length': 58.55357551574707, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780732154846, 'kl': 0.1259765625, 'epoch': 0.7} 70%|███████ | 1752/2500 [14:50:48<4:39:31, 22.42s/it] 70%|███████ | 1753/2500 [14:51:10<4:38:07, 22.34s/it] {'loss': 0.0056, 'grad_norm': 0.24590768190203552, 'learning_rate': 2.988e-07, 'completion_length': 52.99107551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14013671875, 'epoch': 0.7} 70%|███████ | 1753/2500 [14:51:10<4:38:07, 22.34s/it] 70%|███████ | 1754/2500 [14:51:32<4:36:02, 22.20s/it] {'loss': 0.0048, 'grad_norm': 1.070864128173798, 'learning_rate': 2.9839999999999997e-07, 'completion_length': 52.830360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.121337890625, 'epoch': 0.7} 70%|███████ | 1754/2500 [14:51:32<4:36:02, 22.20s/it] 70%|███████ | 1755/2500 [14:51:55<4:41:13, 22.65s/it] {'loss': 0.0131, 'grad_norm': 4.883894540263249, 'learning_rate': 2.98e-07, 'completion_length': 57.32143211364746, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780732154846, 'kl': 0.3271484375, 'epoch': 0.7} 70%|███████ | 1755/2500 [14:51:55<4:41:13, 22.65s/it] 70%|███████ | 1756/2500 [14:52:18<4:42:53, 22.81s/it] {'loss': 0.0057, 'grad_norm': 0.2686633420435744, 'learning_rate': 2.9759999999999996e-07, 'completion_length': 53.68750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.142822265625, 'epoch': 0.7} 70%|███████ | 1756/2500 [14:52:18<4:42:53, 22.81s/it] 70%|███████ | 1757/2500 [14:52:41<4:41:07, 22.70s/it] {'loss': 0.0058, 'grad_norm': 1.4747861172292263, 'learning_rate': 2.972e-07, 'completion_length': 52.75000190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.14599609375, 'epoch': 0.7} 70%|███████ | 1757/2500 [14:52:41<4:41:07, 22.70s/it] 70%|███████ | 1758/2500 [14:53:03<4:40:30, 22.68s/it] {'loss': 0.005, 'grad_norm': 0.39723438210013035, 'learning_rate': 2.968e-07, 'completion_length': 56.88393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12646484375, 'epoch': 0.7} 70%|███████ | 1758/2500 [14:53:04<4:40:30, 22.68s/it] 70%|███████ | 1759/2500 [14:53:25<4:37:19, 22.46s/it] {'loss': 0.0243, 'grad_norm': 2.3019798700960914, 'learning_rate': 2.964e-07, 'completion_length': 55.20535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.61181640625, 'epoch': 0.7} 70%|███████ | 1759/2500 [14:53:25<4:37:19, 22.46s/it] 70%|███████ | 1760/2500 [14:53:47<4:35:15, 22.32s/it] {'loss': 0.0148, 'grad_norm': 1.065648201206162, 'learning_rate': 2.9599999999999995e-07, 'completion_length': 55.19643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 
'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.368896484375, 'epoch': 0.7} 70%|███████ | 1760/2500 [14:53:47<4:35:15, 22.32s/it] 70%|███████ | 1761/2500 [14:54:09<4:32:16, 22.11s/it] {'loss': 0.0047, 'grad_norm': 0.21056695564515596, 'learning_rate': 2.9559999999999997e-07, 'completion_length': 60.21428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11669921875, 'epoch': 0.7} 70%|███████ | 1761/2500 [14:54:09<4:32:16, 22.11s/it] 70%|███████ | 1762/2500 [14:54:31<4:31:35, 22.08s/it] {'loss': 0.0049, 'grad_norm': 0.8398391770468904, 'learning_rate': 2.952e-07, 'completion_length': 54.03571701049805, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.122802734375, 'epoch': 0.7} 70%|███████ | 1762/2500 [14:54:31<4:31:35, 22.08s/it] 71%|███████ | 1763/2500 [14:54:54<4:35:46, 22.45s/it] {'loss': 0.0057, 'grad_norm': 3.785285901176575, 'learning_rate': 2.948e-07, 'completion_length': 52.87500190734863, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.07514797151088715, 'kl': 0.142578125, 'epoch': 0.71} 71%|███████ | 1763/2500 [14:54:54<4:35:46, 22.45s/it] 71%|███████ | 1764/2500 [14:55:16<4:34:03, 22.34s/it] {'loss': 0.004, 'grad_norm': 0.12569650509595975, 'learning_rate': 2.944e-07, 'completion_length': 62.464290618896484, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.099609375, 'epoch': 0.71} 71%|███████ | 1764/2500 [14:55:16<4:34:03, 22.34s/it] 71%|███████ | 1765/2500 [14:55:39<4:34:22, 22.40s/it] {'loss': 0.0196, 'grad_norm': 2.871481687849634, 'learning_rate': 2.9399999999999996e-07, 'completion_length': 59.05357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 
1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.48828125, 'epoch': 0.71} 71%|███████ | 1765/2500 [14:55:39<4:34:22, 22.40s/it] 71%|███████ | 1766/2500 [14:56:02<4:35:27, 22.52s/it] {'loss': 0.005, 'grad_norm': 0.1993491344786532, 'learning_rate': 2.9360000000000003e-07, 'completion_length': 57.830360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12451171875, 'epoch': 0.71} 71%|███████ | 1766/2500 [14:56:02<4:35:27, 22.52s/it] 71%|███████ | 1767/2500 [14:56:24<4:32:25, 22.30s/it] {'loss': 0.0044, 'grad_norm': 2.237838036403834, 'learning_rate': 2.932e-07, 'completion_length': 62.66964530944824, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.08868780359625816, 'kl': 0.10986328125, 'epoch': 0.71} 71%|███████ | 1767/2500 [14:56:24<4:32:25, 22.30s/it] 71%|███████ | 1768/2500 [14:56:48<4:40:10, 22.96s/it] {'loss': 0.0043, 'grad_norm': 0.2976267916101487, 'learning_rate': 2.928e-07, 'completion_length': 54.71428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.108154296875, 'epoch': 0.71} 71%|███████ | 1768/2500 [14:56:48<4:40:10, 22.96s/it] 71%|███████ | 1769/2500 [14:57:10<4:35:58, 22.65s/it] {'loss': 0.0047, 'grad_norm': 0.2305619040444755, 'learning_rate': 2.924e-07, 'completion_length': 57.22321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1181640625, 'epoch': 0.71} 71%|███████ | 1769/2500 [14:57:10<4:35:58, 22.65s/it] 71%|███████ | 1770/2500 [14:57:32<4:34:30, 22.56s/it] {'loss': 0.0085, 'grad_norm': 1.6439459017780151, 'learning_rate': 2.9199999999999997e-07, 'completion_length': 59.25893020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.212890625, 'epoch': 
0.71} 71%|███████ | 1770/2500 [14:57:32<4:34:30, 22.56s/it] 71%|███████ | 1771/2500 [14:57:55<4:35:24, 22.67s/it] {'loss': 0.0052, 'grad_norm': 0.2611388291158165, 'learning_rate': 2.916e-07, 'completion_length': 57.87500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1298828125, 'epoch': 0.71} 71%|███████ | 1771/2500 [14:57:55<4:35:24, 22.67s/it] 71%|███████ | 1772/2500 [14:58:18<4:35:58, 22.75s/it] {'loss': 0.0047, 'grad_norm': 1.8927326690116688, 'learning_rate': 2.912e-07, 'completion_length': 64.34821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1171875, 'epoch': 0.71} 71%|███████ | 1772/2500 [14:58:18<4:35:58, 22.75s/it] 71%|███████ | 1773/2500 [14:58:42<4:38:41, 23.00s/it] {'loss': 0.0055, 'grad_norm': 1.1860929099777233, 'learning_rate': 2.908e-07, 'completion_length': 59.580360412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.13818359375, 'epoch': 0.71} 71%|███████ | 1773/2500 [14:58:42<4:38:41, 23.00s/it] 71%|███████ | 1774/2500 [14:59:05<4:37:45, 22.96s/it] {'loss': 0.005, 'grad_norm': 0.16270445030191882, 'learning_rate': 2.9039999999999995e-07, 'completion_length': 64.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12451171875, 'epoch': 0.71} 71%|███████ | 1774/2500 [14:59:05<4:37:45, 22.96s/it] 71%|███████ | 1775/2500 [14:59:27<4:34:43, 22.74s/it] {'loss': 0.0045, 'grad_norm': 1.2676422239566654, 'learning_rate': 2.9e-07, 'completion_length': 65.00000381469727, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.112548828125, 'epoch': 0.71} 71%|███████ | 1775/2500 [14:59:27<4:34:43, 22.74s/it] 
71%|███████ | 1776/2500 [14:59:52<4:41:31, 23.33s/it] {'loss': 0.0064, 'grad_norm': 1.030986476820134, 'learning_rate': 2.896e-07, 'completion_length': 62.04464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1591796875, 'epoch': 0.71} 71%|███████ | 1776/2500 [14:59:52<4:41:31, 23.33s/it] 71%|███████ | 1777/2500 [15:00:13<4:33:23, 22.69s/it] {'loss': 0.0053, 'grad_norm': 1.766270189271136, 'learning_rate': 2.892e-07, 'completion_length': 58.580360412597656, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.133056640625, 'epoch': 0.71} 71%|███████ | 1777/2500 [15:00:13<4:33:23, 22.69s/it] 71%|███████ | 1778/2500 [15:00:34<4:28:29, 22.31s/it] {'loss': 0.0057, 'grad_norm': 1.3080309956169165, 'learning_rate': 2.888e-07, 'completion_length': 51.267860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.142578125, 'epoch': 0.71} 71%|███████ | 1778/2500 [15:00:34<4:28:29, 22.31s/it] 71%|███████ | 1779/2500 [15:00:58<4:32:09, 22.65s/it] {'loss': 0.0068, 'grad_norm': 1.2430141462530377, 'learning_rate': 2.8839999999999996e-07, 'completion_length': 50.99107360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.16943359375, 'epoch': 0.71} 71%|███████ | 1779/2500 [15:00:58<4:32:09, 22.65s/it] 71%|███████ | 1780/2500 [15:01:21<4:33:26, 22.79s/it] {'loss': 0.0047, 'grad_norm': 2.0881123197134186, 'learning_rate': 2.88e-07, 'completion_length': 57.642860412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 
0.11767578125, 'epoch': 0.71} 71%|███████ | 1780/2500 [15:01:21<4:33:26, 22.79s/it] 71%|███████ | 1781/2500 [15:01:44<4:34:17, 22.89s/it] {'loss': 0.01, 'grad_norm': 2.606883913055399, 'learning_rate': 2.876e-07, 'completion_length': 53.63393020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.10882644355297089, 'kl': 0.25146484375, 'epoch': 0.71} 71%|███████ | 1781/2500 [15:01:44<4:34:17, 22.89s/it] 71%|███████▏ | 1782/2500 [15:02:06<4:29:30, 22.52s/it] {'loss': 0.0048, 'grad_norm': 0.3508733829751012, 'learning_rate': 2.872e-07, 'completion_length': 47.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1201171875, 'epoch': 0.71} 71%|███████▏ | 1782/2500 [15:02:06<4:29:30, 22.52s/it] 71%|███████▏ | 1783/2500 [15:02:28<4:27:14, 22.36s/it] {'loss': 0.0058, 'grad_norm': 3.2952900858869456, 'learning_rate': 2.868e-07, 'completion_length': 50.34821701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.144775390625, 'epoch': 0.71} 71%|███████▏ | 1783/2500 [15:02:28<4:27:14, 22.36s/it] 71%|███████▏ | 1784/2500 [15:02:51<4:31:26, 22.75s/it] {'loss': 0.0164, 'grad_norm': 2.2675818112008175, 'learning_rate': 2.8639999999999997e-07, 'completion_length': 53.80357360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.4091796875, 'epoch': 0.71} 71%|███████▏ | 1784/2500 [15:02:51<4:31:26, 22.75s/it] 71%|███████▏ | 1785/2500 [15:03:15<4:35:12, 23.09s/it] {'loss': 0.0194, 'grad_norm': 2.8078204910397906, 'learning_rate': 2.8599999999999994e-07, 'completion_length': 58.982147216796875, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9553571939468384, 'reward': 
1.9017857909202576, 'reward_std': 0.20020467042922974, 'kl': 0.486328125, 'epoch': 0.71} 71%|███████▏ | 1785/2500 [15:03:15<4:35:12, 23.09s/it] 71%|███████▏ | 1786/2500 [15:03:38<4:33:27, 22.98s/it] {'loss': 0.0319, 'grad_norm': 4.677690129885302, 'learning_rate': 2.856e-07, 'completion_length': 50.77678871154785, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.1289059966802597, 'kl': 0.794921875, 'epoch': 0.71} 71%|███████▏ | 1786/2500 [15:03:38<4:33:27, 22.98s/it] 71%|███████▏ | 1787/2500 [15:04:00<4:31:50, 22.88s/it] {'loss': 0.0142, 'grad_norm': 2.5747797452707593, 'learning_rate': 2.852e-07, 'completion_length': 47.50893211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.35498046875, 'epoch': 0.71} 71%|███████▏ | 1787/2500 [15:04:00<4:31:50, 22.88s/it] 72%|███████▏ | 1788/2500 [15:04:22<4:28:02, 22.59s/it] {'loss': 0.0088, 'grad_norm': 0.846629892383582, 'learning_rate': 2.848e-07, 'completion_length': 50.09821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.220703125, 'epoch': 0.72} 72%|███████▏ | 1788/2500 [15:04:22<4:28:02, 22.59s/it] 72%|███████▏ | 1789/2500 [15:04:44<4:24:40, 22.34s/it] {'loss': 0.0082, 'grad_norm': 1.8639569003671053, 'learning_rate': 2.844e-07, 'completion_length': 53.12500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.205078125, 'epoch': 0.72} 72%|███████▏ | 1789/2500 [15:04:44<4:24:40, 22.34s/it] 72%|███████▏ | 1790/2500 [15:05:06<4:24:09, 22.32s/it] {'loss': 0.0075, 'grad_norm': 1.7288857219352347, 'learning_rate': 2.8399999999999995e-07, 'completion_length': 53.66071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.18798828125, 'epoch': 0.72} 72%|███████▏ | 1790/2500 [15:05:06<4:24:09, 22.32s/it] 72%|███████▏ | 1791/2500 [15:05:28<4:21:17, 22.11s/it] {'loss': 0.007, 'grad_norm': 1.6764466316853026, 'learning_rate': 2.836e-07, 'completion_length': 51.52678680419922, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.17431640625, 'epoch': 0.72} 72%|███████▏ | 1791/2500 [15:05:28<4:21:17, 22.11s/it] 72%|███████▏ | 1792/2500 [15:05:52<4:29:13, 22.82s/it] {'loss': 0.0096, 'grad_norm': 2.4902450304491226, 'learning_rate': 2.832e-07, 'completion_length': 60.41964530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.24072265625, 'epoch': 0.72} 72%|███████▏ | 1792/2500 [15:05:52<4:29:13, 22.82s/it] 72%|███████▏ | 1793/2500 [15:06:15<4:29:33, 22.88s/it] {'loss': 0.0045, 'grad_norm': 0.31868141592046845, 'learning_rate': 2.8279999999999996e-07, 'completion_length': 50.05357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.111328125, 'epoch': 0.72} 72%|███████▏ | 1793/2500 [15:06:15<4:29:33, 22.88s/it] 72%|███████▏ | 1794/2500 [15:06:37<4:24:23, 22.47s/it] {'loss': 0.0044, 'grad_norm': 2.1267098146148777, 'learning_rate': 2.824e-07, 'completion_length': 54.14285850524902, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.110595703125, 'epoch': 0.72} 72%|███████▏ | 1794/2500 [15:06:37<4:24:23, 22.47s/it] 72%|███████▏ | 1795/2500 [15:06:59<4:23:53, 22.46s/it] {'loss': 0.0062, 'grad_norm': 1.3132740153641482, 'learning_rate': 2.8199999999999996e-07, 'completion_length': 
56.71428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15478515625, 'epoch': 0.72} 72%|███████▏ | 1795/2500 [15:06:59<4:23:53, 22.46s/it] 72%|███████▏ | 1796/2500 [15:07:22<4:22:13, 22.35s/it] {'loss': 0.0042, 'grad_norm': 0.22934071876612866, 'learning_rate': 2.816e-07, 'completion_length': 57.205360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10498046875, 'epoch': 0.72} 72%|███████▏ | 1796/2500 [15:07:22<4:22:13, 22.35s/it] 72%|███████▏ | 1797/2500 [15:07:44<4:20:46, 22.26s/it] {'loss': 0.0102, 'grad_norm': 2.204055856940215, 'learning_rate': 2.812e-07, 'completion_length': 56.55357360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.2529296875, 'epoch': 0.72} 72%|███████▏ | 1797/2500 [15:07:44<4:20:46, 22.26s/it] 72%|███████▏ | 1798/2500 [15:08:05<4:18:39, 22.11s/it] {'loss': 0.0053, 'grad_norm': 1.543595722394112, 'learning_rate': 2.8079999999999997e-07, 'completion_length': 52.625003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1318359375, 'epoch': 0.72} 72%|███████▏ | 1798/2500 [15:08:05<4:18:39, 22.11s/it] 72%|███████▏ | 1799/2500 [15:08:28<4:18:48, 22.15s/it] {'loss': 0.0091, 'grad_norm': 1.2979091828652658, 'learning_rate': 2.804e-07, 'completion_length': 54.42857360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857909202576, 'reward_std': 0.07839837297797203, 'kl': 0.22705078125, 'epoch': 0.72} 72%|███████▏ | 1799/2500 [15:08:28<4:18:48, 22.15s/it] 72%|███████▏ | 1800/2500 [15:08:51<4:23:54, 22.62s/it] {'loss': 0.0074, 'grad_norm': 1.2449149433412434, 
'learning_rate': 2.8e-07, 'completion_length': 54.83035850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.18505859375, 'epoch': 0.72} 72%|███████▏ | 1800/2500 [15:08:51<4:23:54, 22.62s/it] 72%|███████▏ | 1801/2500 [15:10:02<7:12:41, 37.14s/it] {'loss': 0.0062, 'grad_norm': 0.2843066590488176, 'learning_rate': 2.796e-07, 'completion_length': 47.410715103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15625, 'epoch': 0.72} 72%|███████▏ | 1801/2500 [15:10:02<7:12:41, 37.14s/it] 72%|███████▏ | 1802/2500 [15:10:24<6:18:57, 32.58s/it] {'loss': 0.0053, 'grad_norm': 0.24375898571135093, 'learning_rate': 2.792e-07, 'completion_length': 52.64285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1337890625, 'epoch': 0.72} 72%|███████▏ | 1802/2500 [15:10:24<6:18:57, 32.58s/it] 72%|███████▏ | 1803/2500 [15:10:46<5:42:13, 29.46s/it] {'loss': 0.0101, 'grad_norm': 1.2770738050553267, 'learning_rate': 2.788e-07, 'completion_length': 52.77678871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.25390625, 'epoch': 0.72} 72%|███████▏ | 1803/2500 [15:10:46<5:42:13, 29.46s/it] 72%|███████▏ | 1804/2500 [15:11:09<5:18:30, 27.46s/it] {'loss': 0.0082, 'grad_norm': 1.892598261049553, 'learning_rate': 2.7839999999999995e-07, 'completion_length': 59.437503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.205810546875, 'epoch': 0.72} 72%|███████▏ | 1804/2500 [15:11:09<5:18:30, 27.46s/it] 72%|███████▏ | 1805/2500 [15:11:31<4:58:17, 25.75s/it] {'loss': 0.0075, 'grad_norm': 2.2100832443486587, 'learning_rate': 
2.7800000000000003e-07, 'completion_length': 57.91071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.18798828125, 'epoch': 0.72} 72%|███████▏ | 1805/2500 [15:11:31<4:58:17, 25.75s/it] 72%|███████▏ | 1806/2500 [15:11:54<4:47:19, 24.84s/it] {'loss': 0.0052, 'grad_norm': 0.31223411402954254, 'learning_rate': 2.776e-07, 'completion_length': 64.48214721679688, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.129150390625, 'epoch': 0.72} 72%|███████▏ | 1806/2500 [15:11:54<4:47:19, 24.84s/it] 72%|███████▏ | 1807/2500 [15:12:16<4:37:23, 24.02s/it] {'loss': 0.006, 'grad_norm': 1.5064995667948653, 'learning_rate': 2.7719999999999997e-07, 'completion_length': 52.03571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14990234375, 'epoch': 0.72} 72%|███████▏ | 1807/2500 [15:12:16<4:37:23, 24.02s/it] 72%|███████▏ | 1808/2500 [15:12:39<4:35:36, 23.90s/it] {'loss': 0.0184, 'grad_norm': 2.878649221477408, 'learning_rate': 2.768e-07, 'completion_length': 50.66071701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9375000596046448, 'reward_std': 0.1767766922712326, 'kl': 0.458984375, 'epoch': 0.72} 72%|███████▏ | 1808/2500 [15:12:39<4:35:36, 23.90s/it] 72%|███████▏ | 1809/2500 [15:13:03<4:34:38, 23.85s/it] {'loss': 0.0113, 'grad_norm': 1.3165695605058396, 'learning_rate': 2.7639999999999996e-07, 'completion_length': 51.38393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.28173828125, 'epoch': 0.72} 72%|███████▏ | 1809/2500 [15:13:03<4:34:38, 23.85s/it] 72%|███████▏ | 1810/2500 [15:13:24<4:23:47, 22.94s/it] 
{'loss': 0.0044, 'grad_norm': 1.8555459727563335, 'learning_rate': 2.7600000000000004e-07, 'completion_length': 53.66964530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.111083984375, 'epoch': 0.72} 72%|███████▏ | 1810/2500 [15:13:24<4:23:47, 22.94s/it] 72%|███████▏ | 1811/2500 [15:13:46<4:20:26, 22.68s/it] {'loss': 0.0052, 'grad_norm': 2.179505979686376, 'learning_rate': 2.756e-07, 'completion_length': 57.15178871154785, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.12890625, 'epoch': 0.72} 72%|███████▏ | 1811/2500 [15:13:46<4:20:26, 22.68s/it] 72%|███████▏ | 1812/2500 [15:14:08<4:17:45, 22.48s/it] {'loss': 0.0083, 'grad_norm': 0.6810349084099613, 'learning_rate': 2.752e-07, 'completion_length': 52.50893211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.20654296875, 'epoch': 0.72} 72%|███████▏ | 1812/2500 [15:14:08<4:17:45, 22.48s/it] 73%|███████▎ | 1813/2500 [15:14:30<4:15:28, 22.31s/it] {'loss': 0.0168, 'grad_norm': 2.064159275274745, 'learning_rate': 2.748e-07, 'completion_length': 57.78571701049805, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.8928571939468384, 'reward_std': 0.22545847296714783, 'kl': 0.4189453125, 'epoch': 0.73} 73%|███████▎ | 1813/2500 [15:14:30<4:15:28, 22.31s/it] 73%|███████▎ | 1814/2500 [15:14:52<4:15:46, 22.37s/it] {'loss': 0.0294, 'grad_norm': 3.1894087887489198, 'learning_rate': 2.7439999999999997e-07, 'completion_length': 52.74107360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8750000596046448, 'reward_std': 0.15152287483215332, 'kl': 0.734375, 'epoch': 0.73} 73%|███████▎ | 1814/2500 [15:14:53<4:15:46, 22.37s/it] 
73%|███████▎ | 1815/2500 [15:15:15<4:16:41, 22.48s/it] {'loss': 0.0155, 'grad_norm': 2.934248780670352, 'learning_rate': 2.74e-07, 'completion_length': 45.00000190734863, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8928572535514832, 'reward_std': 0.10101525112986565, 'kl': 0.3876953125, 'epoch': 0.73} 73%|███████▎ | 1815/2500 [15:15:15<4:16:41, 22.48s/it] 73%|███████▎ | 1816/2500 [15:15:38<4:17:08, 22.56s/it] {'loss': 0.0157, 'grad_norm': 1.7336947502111804, 'learning_rate': 2.736e-07, 'completion_length': 53.42857360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.392578125, 'epoch': 0.73} 73%|███████▎ | 1816/2500 [15:15:38<4:17:08, 22.56s/it] 73%|███████▎ | 1817/2500 [15:16:00<4:16:33, 22.54s/it] {'loss': 0.0105, 'grad_norm': 15.16056994468964, 'learning_rate': 2.732e-07, 'completion_length': 43.93750190734863, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.2626953125, 'epoch': 0.73} 73%|███████▎ | 1817/2500 [15:16:00<4:16:33, 22.54s/it] 73%|███████▎ | 1818/2500 [15:16:22<4:13:13, 22.28s/it] {'loss': 0.011, 'grad_norm': 4.2479058103035685, 'learning_rate': 2.7279999999999995e-07, 'completion_length': 54.812503814697266, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.275390625, 'epoch': 0.73} 73%|███████▎ | 1818/2500 [15:16:22<4:13:13, 22.28s/it] 73%|███████▎ | 1819/2500 [15:16:46<4:19:09, 22.83s/it] {'loss': 0.0255, 'grad_norm': 2.4893191522055726, 'learning_rate': 2.724e-07, 'completion_length': 52.017860412597656, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9375000596046448, 'reward_std': 0.09918941557407379, 
'kl': 0.638671875, 'epoch': 0.73} 73%|███████▎ | 1819/2500 [15:16:46<4:19:09, 22.83s/it] 73%|███████▎ | 1820/2500 [15:17:08<4:15:37, 22.56s/it] {'loss': 0.0143, 'grad_norm': 2.3058299135409, 'learning_rate': 2.72e-07, 'completion_length': 49.44643020629883, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.3583984375, 'epoch': 0.73} 73%|███████▎ | 1820/2500 [15:17:08<4:15:37, 22.56s/it] 73%|███████▎ | 1821/2500 [15:17:29<4:09:02, 22.01s/it] {'loss': 0.0105, 'grad_norm': 2.231103468382558, 'learning_rate': 2.7159999999999997e-07, 'completion_length': 47.91964530944824, 'rewards/accuracy_reward': 0.9107142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8928571939468384, 'reward_std': 0.10101525485515594, 'kl': 0.26318359375, 'epoch': 0.73} 73%|███████▎ | 1821/2500 [15:17:29<4:09:02, 22.01s/it] 73%|███████▎ | 1822/2500 [15:17:51<4:10:09, 22.14s/it] {'loss': 0.0111, 'grad_norm': 4.465796668603516, 'learning_rate': 2.712e-07, 'completion_length': 42.93750190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.946428656578064, 'reward_std': 0.13408026099205017, 'kl': 0.27783203125, 'epoch': 0.73} 73%|███████▎ | 1822/2500 [15:17:51<4:10:09, 22.14s/it] 73%|███████▎ | 1823/2500 [15:18:12<4:05:32, 21.76s/it] {'loss': 0.0259, 'grad_norm': 2.5612571432787665, 'learning_rate': 2.7079999999999996e-07, 'completion_length': 50.85714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.6474609375, 'epoch': 0.73} 73%|███████▎ | 1823/2500 [15:18:12<4:05:32, 21.76s/it] 73%|███████▎ | 1824/2500 [15:18:33<4:00:38, 21.36s/it] {'loss': 0.007, 'grad_norm': 0.5574060950218468, 'learning_rate': 2.704e-07, 'completion_length': 48.80357360839844, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.175048828125, 'epoch': 0.73} 73%|███████▎ | 1824/2500 [15:18:33<4:00:38, 21.36s/it] 73%|███████▎ | 1825/2500 [15:18:54<4:01:23, 21.46s/it] {'loss': 0.0117, 'grad_norm': 1.9403709919634893, 'learning_rate': 2.7e-07, 'completion_length': 45.59821701049805, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553572535514832, 'reward_std': 0.10882644355297089, 'kl': 0.291015625, 'epoch': 0.73} 73%|███████▎ | 1825/2500 [15:18:54<4:01:23, 21.46s/it] 73%|███████▎ | 1826/2500 [15:19:15<3:58:08, 21.20s/it] {'loss': 0.0066, 'grad_norm': 0.4730343570659675, 'learning_rate': 2.696e-07, 'completion_length': 48.98214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.166015625, 'epoch': 0.73} 73%|███████▎ | 1826/2500 [15:19:15<3:58:08, 21.20s/it] 73%|███████▎ | 1827/2500 [15:19:36<3:56:48, 21.11s/it] {'loss': 0.0217, 'grad_norm': 2.905646053072243, 'learning_rate': 2.692e-07, 'completion_length': 52.22321701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.5419921875, 'epoch': 0.73} 73%|███████▎ | 1827/2500 [15:19:36<3:56:48, 21.11s/it] 73%|███████▎ | 1828/2500 [15:19:57<3:58:13, 21.27s/it] {'loss': 0.0065, 'grad_norm': 1.2625039973023153, 'learning_rate': 2.6879999999999997e-07, 'completion_length': 55.55357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.163818359375, 'epoch': 0.73} 73%|███████▎ | 1828/2500 [15:19:57<3:58:13, 21.27s/it] 73%|███████▎ | 1829/2500 [15:20:19<3:57:44, 21.26s/it] {'loss': 0.0063, 'grad_norm': 0.814166726008938, 'learning_rate': 2.684e-07, 'completion_length': 48.723215103149414, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1572265625, 'epoch': 0.73} 73%|███████▎ | 1829/2500 [15:20:19<3:57:44, 21.26s/it] 73%|███████▎ | 1830/2500 [15:20:40<3:58:21, 21.35s/it] {'loss': 0.0056, 'grad_norm': 3.6165397926281924, 'learning_rate': 2.68e-07, 'completion_length': 52.02678680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.138671875, 'epoch': 0.73} 73%|███████▎ | 1830/2500 [15:20:40<3:58:21, 21.35s/it] 73%|███████▎ | 1831/2500 [15:21:02<3:59:26, 21.48s/it] {'loss': 0.012, 'grad_norm': 1.18627764811029, 'learning_rate': 2.676e-07, 'completion_length': 49.58035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.30078125, 'epoch': 0.73} 73%|███████▎ | 1831/2500 [15:21:02<3:59:26, 21.48s/it] 73%|███████▎ | 1832/2500 [15:21:24<4:00:33, 21.61s/it] {'loss': 0.0096, 'grad_norm': 1.4273171444011317, 'learning_rate': 2.6719999999999996e-07, 'completion_length': 53.41964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.2392578125, 'epoch': 0.73} 73%|███████▎ | 1832/2500 [15:21:24<4:00:33, 21.61s/it] 73%|███████▎ | 1833/2500 [15:21:47<4:05:17, 22.06s/it] {'loss': 0.0063, 'grad_norm': 0.4041013107101521, 'learning_rate': 2.668e-07, 'completion_length': 53.00000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.158203125, 'epoch': 0.73} 73%|███████▎ | 1833/2500 [15:21:47<4:05:17, 22.06s/it] 73%|███████▎ | 1834/2500 [15:22:09<4:03:48, 21.97s/it] {'loss': 0.0049, 'grad_norm': 0.28745250386370874, 'learning_rate': 2.664e-07, 'completion_length': 47.125003814697266, 'rewards/accuracy_reward': 
1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12158203125, 'epoch': 0.73} 73%|███████▎ | 1834/2500 [15:22:09<4:03:48, 21.97s/it] 73%|███████▎ | 1835/2500 [15:22:31<4:02:58, 21.92s/it] {'loss': 0.0059, 'grad_norm': 1.8785218511452717, 'learning_rate': 2.66e-07, 'completion_length': 51.24107360839844, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.0835726335644722, 'kl': 0.1474609375, 'epoch': 0.73} 73%|███████▎ | 1835/2500 [15:22:31<4:02:58, 21.92s/it] 73%|███████▎ | 1836/2500 [15:22:53<4:05:41, 22.20s/it] {'loss': 0.005, 'grad_norm': 3.257554166236943, 'learning_rate': 2.656e-07, 'completion_length': 46.330360412597656, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.124755859375, 'epoch': 0.73} 73%|███████▎ | 1836/2500 [15:22:53<4:05:41, 22.20s/it] 73%|███████▎ | 1837/2500 [15:23:15<4:03:03, 22.00s/it] {'loss': 0.0145, 'grad_norm': 1.874188342122581, 'learning_rate': 2.6519999999999997e-07, 'completion_length': 50.11607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.36279296875, 'epoch': 0.73} 73%|███████▎ | 1837/2500 [15:23:15<4:03:03, 22.00s/it] 74%|███████▎ | 1838/2500 [15:23:36<3:59:24, 21.70s/it] {'loss': 0.0089, 'grad_norm': 3.637601610819644, 'learning_rate': 2.648e-07, 'completion_length': 54.74107360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.1379830539226532, 'kl': 0.22314453125, 'epoch': 0.74} 74%|███████▎ | 1838/2500 [15:23:36<3:59:24, 21.70s/it] 74%|███████▎ | 1839/2500 [15:23:58<4:00:48, 21.86s/it] {'loss': 0.006, 'grad_norm': 0.4060357932356188, 'learning_rate': 2.644e-07, 'completion_length': 47.68750190734863, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.150390625, 'epoch': 0.74} 74%|███████▎ | 1839/2500 [15:23:58<4:00:48, 21.86s/it] 74%|███████▎ | 1840/2500 [15:24:20<4:00:21, 21.85s/it] {'loss': 0.0088, 'grad_norm': 1.0731333023856762, 'learning_rate': 2.64e-07, 'completion_length': 54.285715103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21923828125, 'epoch': 0.74} 74%|███████▎ | 1840/2500 [15:24:20<4:00:21, 21.85s/it] 74%|███████▎ | 1841/2500 [15:24:42<3:59:23, 21.80s/it] {'loss': 0.0083, 'grad_norm': 0.6240079692765116, 'learning_rate': 2.636e-07, 'completion_length': 47.76785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.20654296875, 'epoch': 0.74} 74%|███████▎ | 1841/2500 [15:24:42<3:59:23, 21.80s/it] 74%|███████▎ | 1842/2500 [15:25:04<4:00:29, 21.93s/it] {'loss': 0.0035, 'grad_norm': 0.18213322768433998, 'learning_rate': 2.632e-07, 'completion_length': 53.38393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08740234375, 'epoch': 0.74} 74%|███████▎ | 1842/2500 [15:25:04<4:00:29, 21.93s/it] 74%|███████▎ | 1843/2500 [15:25:26<3:59:44, 21.89s/it] {'loss': 0.0121, 'grad_norm': 1.0085632526017607, 'learning_rate': 2.6279999999999994e-07, 'completion_length': 54.48214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.302734375, 'epoch': 0.74} 74%|███████▎ | 1843/2500 [15:25:26<3:59:44, 21.89s/it] 74%|███████▍ | 1844/2500 [15:25:49<4:03:42, 22.29s/it] {'loss': 0.0148, 'grad_norm': 1.5846675844344627, 'learning_rate': 2.624e-07, 'completion_length': 53.36607360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 
'reward_std': 0.06613001227378845, 'kl': 0.36962890625, 'epoch': 0.74} 74%|███████▍ | 1844/2500 [15:25:49<4:03:42, 22.29s/it] 74%|███████▍ | 1845/2500 [15:26:11<4:01:23, 22.11s/it] {'loss': 0.0081, 'grad_norm': 0.7841249604661188, 'learning_rate': 2.62e-07, 'completion_length': 49.52678871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.20263671875, 'epoch': 0.74} 74%|███████▍ | 1845/2500 [15:26:11<4:01:23, 22.11s/it] 74%|███████▍ | 1846/2500 [15:26:32<3:58:28, 21.88s/it] {'loss': 0.0072, 'grad_norm': 0.9223559772710527, 'learning_rate': 2.616e-07, 'completion_length': 51.660715103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.17919921875, 'epoch': 0.74} 74%|███████▍ | 1846/2500 [15:26:32<3:58:28, 21.88s/it] 74%|███████▍ | 1847/2500 [15:26:55<4:02:17, 22.26s/it] {'loss': 0.0076, 'grad_norm': 1.813871040995592, 'learning_rate': 2.612e-07, 'completion_length': 51.580360412597656, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.18994140625, 'epoch': 0.74} 74%|███████▍ | 1847/2500 [15:26:55<4:02:17, 22.26s/it] 74%|███████▍ | 1848/2500 [15:27:16<3:57:30, 21.86s/it] {'loss': 0.012, 'grad_norm': 2.9920313191547203, 'learning_rate': 2.6079999999999995e-07, 'completion_length': 49.59821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.2998046875, 'epoch': 0.74} 74%|███████▍ | 1848/2500 [15:27:16<3:57:30, 21.86s/it] 74%|███████▍ | 1849/2500 [15:27:39<3:59:53, 22.11s/it] {'loss': 0.006, 'grad_norm': 2.025172841944384, 'learning_rate': 2.6040000000000003e-07, 'completion_length': 51.49107360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 
'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.1513671875, 'epoch': 0.74} 74%|███████▍ | 1849/2500 [15:27:39<3:59:53, 22.11s/it] 74%|███████▍ | 1850/2500 [15:28:00<3:57:59, 21.97s/it] {'loss': 0.006, 'grad_norm': 38.73437600152145, 'learning_rate': 2.6e-07, 'completion_length': 48.517860412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.151123046875, 'epoch': 0.74} 74%|███████▍ | 1850/2500 [15:28:00<3:57:59, 21.97s/it] 74%|███████▍ | 1851/2500 [15:28:24<4:03:00, 22.47s/it] {'loss': 0.0081, 'grad_norm': 1.372938275438763, 'learning_rate': 2.5959999999999997e-07, 'completion_length': 51.18750190734863, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.08868780732154846, 'kl': 0.203125, 'epoch': 0.74} 74%|███████▍ | 1851/2500 [15:28:24<4:03:00, 22.47s/it] 74%|███████▍ | 1852/2500 [15:28:46<3:59:32, 22.18s/it] {'loss': 0.0276, 'grad_norm': 4.457515854296181, 'learning_rate': 2.592e-07, 'completion_length': 46.05357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9553571939468384, 'reward_std': 0.06543753296136856, 'kl': 0.68701171875, 'epoch': 0.74} 74%|███████▍ | 1852/2500 [15:28:46<3:59:32, 22.18s/it] 74%|███████▍ | 1853/2500 [15:29:07<3:56:08, 21.90s/it] {'loss': 0.0071, 'grad_norm': 2.5174236079504535, 'learning_rate': 2.5879999999999996e-07, 'completion_length': 55.66964530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.17626953125, 'epoch': 0.74} 74%|███████▍ | 1853/2500 [15:29:07<3:56:08, 21.90s/it] 74%|███████▍ | 1854/2500 [15:29:29<3:55:50, 21.91s/it] {'loss': 0.0058, 'grad_norm': 2.720976333891396, 'learning_rate': 2.584e-07, 
'completion_length': 52.88393020629883, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.11272924020886421, 'kl': 0.1455078125, 'epoch': 0.74} 74%|███████▍ | 1854/2500 [15:29:29<3:55:50, 21.91s/it] 74%|███████▍ | 1855/2500 [15:29:53<4:02:32, 22.56s/it] {'loss': 0.0045, 'grad_norm': 0.15984017653977994, 'learning_rate': 2.58e-07, 'completion_length': 47.65178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.111572265625, 'epoch': 0.74} 74%|███████▍ | 1855/2500 [15:29:53<4:02:32, 22.56s/it] 74%|███████▍ | 1856/2500 [15:30:15<4:00:59, 22.45s/it] {'loss': 0.0047, 'grad_norm': 5.206903830836278, 'learning_rate': 2.576e-07, 'completion_length': 50.642860412597656, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.118408203125, 'epoch': 0.74} 74%|███████▍ | 1856/2500 [15:30:15<4:00:59, 22.45s/it] 74%|███████▍ | 1857/2500 [15:30:37<3:58:11, 22.23s/it] {'loss': 0.0063, 'grad_norm': 1.3670044684046063, 'learning_rate': 2.5719999999999995e-07, 'completion_length': 49.75000190734863, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.158203125, 'epoch': 0.74} 74%|███████▍ | 1857/2500 [15:30:37<3:58:11, 22.23s/it] 74%|███████▍ | 1858/2500 [15:30:58<3:54:15, 21.89s/it] {'loss': 0.0053, 'grad_norm': 0.2639950806983036, 'learning_rate': 2.5679999999999997e-07, 'completion_length': 54.96428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.132568359375, 'epoch': 0.74} 74%|███████▍ | 1858/2500 [15:30:58<3:54:15, 21.89s/it] 74%|███████▍ | 1859/2500 [15:31:18<3:49:30, 21.48s/it] {'loss': 0.0052, 'grad_norm': 0.1743169024073221, 'learning_rate': 2.564e-07, 
'completion_length': 52.89285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12939453125, 'epoch': 0.74} 74%|███████▍ | 1859/2500 [15:31:18<3:49:30, 21.48s/it] 74%|███████▍ | 1860/2500 [15:31:39<3:45:18, 21.12s/it] {'loss': 0.0054, 'grad_norm': 0.18263486913721622, 'learning_rate': 2.56e-07, 'completion_length': 51.84821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1337890625, 'epoch': 0.74} 74%|███████▍ | 1860/2500 [15:31:39<3:45:18, 21.12s/it] 74%|███████▍ | 1861/2500 [15:32:01<3:48:21, 21.44s/it] {'loss': 0.009, 'grad_norm': 0.7176701808258977, 'learning_rate': 2.556e-07, 'completion_length': 52.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.224609375, 'epoch': 0.74} 74%|███████▍ | 1861/2500 [15:32:01<3:48:21, 21.44s/it] 74%|███████▍ | 1862/2500 [15:32:23<3:50:19, 21.66s/it] {'loss': 0.0046, 'grad_norm': 0.8412736110800344, 'learning_rate': 2.5519999999999996e-07, 'completion_length': 52.910715103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.114990234375, 'epoch': 0.74} 74%|███████▍ | 1862/2500 [15:32:23<3:50:19, 21.66s/it] 75%|███████▍ | 1863/2500 [15:32:45<3:50:06, 21.67s/it] {'loss': 0.0048, 'grad_norm': 0.25767307161399666, 'learning_rate': 2.5480000000000003e-07, 'completion_length': 53.41071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.118896484375, 'epoch': 0.75} 75%|███████▍ | 1863/2500 [15:32:45<3:50:06, 21.67s/it] 75%|███████▍ | 1864/2500 [15:33:06<3:49:22, 21.64s/it] {'loss': 0.0072, 'grad_norm': 1.4961463556010568, 'learning_rate': 2.544e-07, 'completion_length': 56.16964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1796875, 'epoch': 0.75} 75%|███████▍ | 1864/2500 [15:33:06<3:49:22, 21.64s/it] 75%|███████▍ | 1865/2500 [15:33:28<3:50:14, 21.76s/it] {'loss': 0.0051, 'grad_norm': 0.24894955961583773, 'learning_rate': 2.5399999999999997e-07, 'completion_length': 54.51785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12744140625, 'epoch': 0.75} 75%|███████▍ | 1865/2500 [15:33:28<3:50:14, 21.76s/it] 75%|███████▍ | 1866/2500 [15:33:52<3:57:16, 22.46s/it] {'loss': 0.0182, 'grad_norm': 0.9451057476811138, 'learning_rate': 2.536e-07, 'completion_length': 50.29464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.45703125, 'epoch': 0.75} 75%|███████▍ | 1866/2500 [15:33:52<3:57:16, 22.46s/it] 75%|███████▍ | 1867/2500 [15:34:14<3:54:42, 22.25s/it] {'loss': 0.0056, 'grad_norm': 0.24796355855332672, 'learning_rate': 2.5319999999999996e-07, 'completion_length': 53.06250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14013671875, 'epoch': 0.75} 75%|███████▍ | 1867/2500 [15:34:14<3:54:42, 22.25s/it] 75%|███████▍ | 1868/2500 [15:34:37<3:55:17, 22.34s/it] {'loss': 0.0053, 'grad_norm': 1.2383083062143474, 'learning_rate': 2.528e-07, 'completion_length': 51.74107551574707, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.05399492755532265, 'kl': 0.132080078125, 'epoch': 0.75} 75%|███████▍ | 1868/2500 [15:34:37<3:55:17, 22.34s/it] 75%|███████▍ | 1869/2500 [15:35:00<3:57:45, 22.61s/it] {'loss': 0.0127, 'grad_norm': 1.9184369417196412, 'learning_rate': 2.524e-07, 'completion_length': 50.90178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.31787109375, 'epoch': 0.75} 75%|███████▍ | 1869/2500 [15:35:00<3:57:45, 22.61s/it] 75%|███████▍ | 1870/2500 [15:35:24<4:01:15, 22.98s/it] {'loss': 0.0061, 'grad_norm': 0.42345194025678595, 'learning_rate': 2.52e-07, 'completion_length': 53.93750190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15234375, 'epoch': 0.75} 75%|███████▍ | 1870/2500 [15:35:24<4:01:15, 22.98s/it] 75%|███████▍ | 1871/2500 [15:35:46<3:57:42, 22.68s/it] {'loss': 0.0053, 'grad_norm': 0.26344323018460913, 'learning_rate': 2.516e-07, 'completion_length': 52.187503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.132568359375, 'epoch': 0.75} 75%|███████▍ | 1871/2500 [15:35:46<3:57:42, 22.68s/it] 75%|███████▍ | 1872/2500 [15:36:10<4:03:15, 23.24s/it] {'loss': 0.0068, 'grad_norm': 0.36158708743438467, 'learning_rate': 2.5119999999999997e-07, 'completion_length': 54.15178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.16845703125, 'epoch': 0.75} 75%|███████▍ | 1872/2500 [15:36:10<4:03:15, 23.24s/it] 75%|███████▍ | 1873/2500 [15:36:32<3:59:09, 22.89s/it] {'loss': 0.0151, 'grad_norm': 2.6175926390866406, 'learning_rate': 2.508e-07, 'completion_length': 55.08035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.37744140625, 'epoch': 0.75} 75%|███████▍ | 1873/2500 [15:36:32<3:59:09, 22.89s/it] 75%|███████▍ | 1874/2500 [15:36:58<4:06:53, 23.66s/it] {'loss': 0.0053, 'grad_norm': 0.325437813663332, 'learning_rate': 2.504e-07, 'completion_length': 51.03571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 
0.132568359375, 'epoch': 0.75} 75%|███████▍ | 1874/2500 [15:36:58<4:06:53, 23.66s/it] 75%|███████▌ | 1875/2500 [15:37:17<3:52:51, 22.35s/it] {'loss': 0.0055, 'grad_norm': 1.316262171345082, 'learning_rate': 2.5e-07, 'completion_length': 51.73214530944824, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.137939453125, 'epoch': 0.75} 75%|███████▌ | 1875/2500 [15:37:17<3:52:51, 22.35s/it] 75%|███████▌ | 1876/2500 [15:37:36<3:40:51, 21.24s/it] {'loss': 0.005, 'grad_norm': 0.2341535305007084, 'learning_rate': 2.4959999999999996e-07, 'completion_length': 51.90178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.126220703125, 'epoch': 0.75} 75%|███████▌ | 1876/2500 [15:37:36<3:40:51, 21.24s/it] 75%|███████▌ | 1877/2500 [15:37:55<3:34:13, 20.63s/it] {'loss': 0.0073, 'grad_norm': 5.730232149400196, 'learning_rate': 2.492e-07, 'completion_length': 57.830360412597656, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.0835726335644722, 'kl': 0.18359375, 'epoch': 0.75} 75%|███████▌ | 1877/2500 [15:37:55<3:34:13, 20.63s/it] 75%|███████▌ | 1878/2500 [15:38:14<3:29:39, 20.22s/it] {'loss': 0.0055, 'grad_norm': 0.3369811167902739, 'learning_rate': 2.488e-07, 'completion_length': 53.46428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13818359375, 'epoch': 0.75} 75%|███████▌ | 1878/2500 [15:38:14<3:29:39, 20.22s/it] 75%|███████▌ | 1879/2500 [15:38:33<3:25:26, 19.85s/it] {'loss': 0.0423, 'grad_norm': 5.395869067539904, 'learning_rate': 2.484e-07, 'completion_length': 55.54464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 1.0595703125, 'epoch': 
0.75} 75%|███████▌ | 1879/2500 [15:38:33<3:25:26, 19.85s/it] 75%|███████▌ | 1880/2500 [15:38:52<3:22:46, 19.62s/it] {'loss': 0.0053, 'grad_norm': 3.6607909444120104, 'learning_rate': 2.48e-07, 'completion_length': 54.23214530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.131591796875, 'epoch': 0.75} 75%|███████▌ | 1880/2500 [15:38:52<3:22:46, 19.62s/it] 75%|███████▌ | 1881/2500 [15:39:11<3:19:29, 19.34s/it] {'loss': 0.0074, 'grad_norm': 0.5590416811632096, 'learning_rate': 2.4759999999999997e-07, 'completion_length': 58.87500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1865234375, 'epoch': 0.75} 75%|███████▌ | 1881/2500 [15:39:11<3:19:29, 19.34s/it] 75%|███████▌ | 1882/2500 [15:39:31<3:20:03, 19.42s/it] {'loss': 0.0058, 'grad_norm': 2.478025197650616, 'learning_rate': 2.472e-07, 'completion_length': 60.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.144287109375, 'epoch': 0.75} 75%|███████▌ | 1882/2500 [15:39:31<3:20:03, 19.42s/it] 75%|███████▌ | 1883/2500 [15:39:50<3:18:23, 19.29s/it] {'loss': 0.0089, 'grad_norm': 1.8129300998463314, 'learning_rate': 2.4679999999999996e-07, 'completion_length': 58.687503814697266, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.07839837670326233, 'kl': 0.2216796875, 'epoch': 0.75} 75%|███████▌ | 1883/2500 [15:39:50<3:18:23, 19.29s/it] 75%|███████▌ | 1884/2500 [15:40:09<3:17:16, 19.21s/it] {'loss': 0.052, 'grad_norm': 11.03459687678307, 'learning_rate': 2.464e-07, 'completion_length': 55.29464530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9553571939468384, 'reward_std': 
0.10365219414234161, 'kl': 1.29296875, 'epoch': 0.75} 75%|███████▌ | 1884/2500 [15:40:09<3:17:16, 19.21s/it] 75%|███████▌ | 1885/2500 [15:40:28<3:17:01, 19.22s/it] {'loss': 0.0285, 'grad_norm': 4.706113475208539, 'learning_rate': 2.46e-07, 'completion_length': 55.660715103149414, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.71484375, 'epoch': 0.75} 75%|███████▌ | 1885/2500 [15:40:28<3:17:01, 19.22s/it] 75%|███████▌ | 1886/2500 [15:40:47<3:15:25, 19.10s/it] {'loss': 0.0059, 'grad_norm': 1.3231344795395128, 'learning_rate': 2.456e-07, 'completion_length': 55.37500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.146484375, 'epoch': 0.75} 75%|███████▌ | 1886/2500 [15:40:47<3:15:25, 19.10s/it] 75%|███████▌ | 1887/2500 [15:41:06<3:14:29, 19.04s/it] {'loss': 0.0106, 'grad_norm': 2.6631335408264323, 'learning_rate': 2.452e-07, 'completion_length': 57.61607551574707, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.10365218669176102, 'kl': 0.26513671875, 'epoch': 0.75} 75%|███████▌ | 1887/2500 [15:41:06<3:14:29, 19.04s/it] 76%|███████▌ | 1888/2500 [15:41:25<3:14:30, 19.07s/it] {'loss': 0.0057, 'grad_norm': 1.877985972136206, 'learning_rate': 2.4479999999999997e-07, 'completion_length': 57.83928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.142822265625, 'epoch': 0.76} 76%|███████▌ | 1888/2500 [15:41:25<3:14:30, 19.07s/it] 76%|███████▌ | 1889/2500 [15:41:44<3:13:22, 18.99s/it] {'loss': 0.0269, 'grad_norm': 6.586621497163516, 'learning_rate': 2.444e-07, 'completion_length': 53.85714530944824, 'rewards/accuracy_reward': 
0.9642857611179352, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9375000596046448, 'reward_std': 0.09918941184878349, 'kl': 0.671875, 'epoch': 0.76} 76%|███████▌ | 1889/2500 [15:41:44<3:13:22, 18.99s/it] 76%|███████▌ | 1890/2500 [15:42:03<3:15:56, 19.27s/it] {'loss': 0.0061, 'grad_norm': 1.188945302586787, 'learning_rate': 2.4399999999999996e-07, 'completion_length': 61.75893211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15234375, 'epoch': 0.76} 76%|███████▌ | 1890/2500 [15:42:03<3:15:56, 19.27s/it] 76%|███████▌ | 1891/2500 [15:42:23<3:15:34, 19.27s/it] {'loss': 0.0049, 'grad_norm': 0.36375620536686915, 'learning_rate': 2.436e-07, 'completion_length': 57.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.122802734375, 'epoch': 0.76} 76%|███████▌ | 1891/2500 [15:42:23<3:15:34, 19.27s/it] 76%|███████▌ | 1892/2500 [15:42:42<3:14:15, 19.17s/it] {'loss': 0.0113, 'grad_norm': 1.8014148961845788, 'learning_rate': 2.432e-07, 'completion_length': 58.48214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.28125, 'epoch': 0.76} 76%|███████▌ | 1892/2500 [15:42:42<3:14:15, 19.17s/it] 76%|███████▌ | 1893/2500 [15:43:00<3:11:21, 18.92s/it] {'loss': 0.0177, 'grad_norm': 3.488921714972026, 'learning_rate': 2.428e-07, 'completion_length': 53.60714530944824, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9196429252624512, 'reward_std': 0.1379830539226532, 'kl': 0.44287109375, 'epoch': 0.76} 76%|███████▌ | 1893/2500 [15:43:00<3:11:21, 18.92s/it] 76%|███████▌ | 1894/2500 [15:43:20<3:13:49, 19.19s/it] {'loss': 0.0071, 'grad_norm': 1.9086009117300786, 'learning_rate': 2.424e-07, 'completion_length': 
60.017860412597656, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.177734375, 'epoch': 0.76} 76%|███████▌ | 1894/2500 [15:43:20<3:13:49, 19.19s/it] 76%|███████▌ | 1895/2500 [15:43:40<3:16:05, 19.45s/it] {'loss': 0.0163, 'grad_norm': 1.5564692347519375, 'learning_rate': 2.4199999999999997e-07, 'completion_length': 53.26785850524902, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.08856847509741783, 'kl': 0.40576171875, 'epoch': 0.76} 76%|███████▌ | 1895/2500 [15:43:40<3:16:05, 19.45s/it] 76%|███████▌ | 1896/2500 [15:44:00<3:17:05, 19.58s/it] {'loss': 0.0164, 'grad_norm': 3.229752912910265, 'learning_rate': 2.416e-07, 'completion_length': 57.01785850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.4091796875, 'epoch': 0.76} 76%|███████▌ | 1896/2500 [15:44:00<3:17:05, 19.58s/it] 76%|███████▌ | 1897/2500 [15:44:19<3:15:34, 19.46s/it] {'loss': 0.0174, 'grad_norm': 1.4926206248627014, 'learning_rate': 2.4119999999999996e-07, 'completion_length': 52.08035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.435791015625, 'epoch': 0.76} 76%|███████▌ | 1897/2500 [15:44:19<3:15:34, 19.46s/it] 76%|███████▌ | 1898/2500 [15:44:39<3:17:24, 19.68s/it] {'loss': 0.0842, 'grad_norm': 7.377673837422486, 'learning_rate': 2.408e-07, 'completion_length': 54.062503814697266, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.8928572535514832, 'reward_std': 0.1802247166633606, 'kl': 2.10546875, 'epoch': 0.76} 76%|███████▌ | 1898/2500 [15:44:39<3:17:24, 19.68s/it] 76%|███████▌ | 1899/2500 
[15:44:59<3:18:36, 19.83s/it] {'loss': 0.0411, 'grad_norm': 3.217192141904128, 'learning_rate': 2.404e-07, 'completion_length': 58.28571701049805, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.883928656578064, 'reward_std': 0.23591221868991852, 'kl': 1.0322265625, 'epoch': 0.76} 76%|███████▌ | 1899/2500 [15:44:59<3:18:36, 19.83s/it] 76%|███████▌ | 1900/2500 [15:45:19<3:17:00, 19.70s/it] {'loss': 0.0055, 'grad_norm': 2.582270306121465, 'learning_rate': 2.4e-07, 'completion_length': 53.89285850524902, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.138671875, 'epoch': 0.76} 76%|███████▌ | 1900/2500 [15:45:19<3:17:00, 19.70s/it] 76%|███████▌ | 1901/2500 [15:46:20<5:22:12, 32.27s/it] {'loss': 0.0036, 'grad_norm': 0.15563692471919807, 'learning_rate': 2.396e-07, 'completion_length': 59.38393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08935546875, 'epoch': 0.76} 76%|███████▌ | 1901/2500 [15:46:20<5:22:12, 32.27s/it] 76%|███████▌ | 1902/2500 [15:46:41<4:47:55, 28.89s/it] {'loss': 0.0081, 'grad_norm': 1.6898378905593332, 'learning_rate': 2.3919999999999997e-07, 'completion_length': 47.99107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.201171875, 'epoch': 0.76} 76%|███████▌ | 1902/2500 [15:46:41<4:47:55, 28.89s/it] 76%|███████▌ | 1903/2500 [15:47:02<4:23:12, 26.45s/it] {'loss': 0.0194, 'grad_norm': 3.05010852508556, 'learning_rate': 2.388e-07, 'completion_length': 61.062503814697266, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.946428656578064, 'reward_std': 0.15152288600802422, 'kl': 0.484130859375, 'epoch': 0.76} 76%|███████▌ | 1903/2500 [15:47:02<4:23:12, 26.45s/it] 
76%|███████▌ | 1904/2500 [15:47:21<4:01:44, 24.34s/it] {'loss': 0.0083, 'grad_norm': 4.430669079765138, 'learning_rate': 2.384e-07, 'completion_length': 54.17857360839844, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9464285969734192, 'reward_std': 0.07124518603086472, 'kl': 0.20703125, 'epoch': 0.76} 76%|███████▌ | 1904/2500 [15:47:21<4:01:44, 24.34s/it] 76%|███████▌ | 1905/2500 [15:47:42<3:48:46, 23.07s/it] {'loss': 0.0055, 'grad_norm': 0.3000349158639262, 'learning_rate': 2.38e-07, 'completion_length': 54.39285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13671875, 'epoch': 0.76} 76%|███████▌ | 1905/2500 [15:47:42<3:48:46, 23.07s/it] 76%|███████▌ | 1906/2500 [15:48:03<3:42:02, 22.43s/it] {'loss': 0.0082, 'grad_norm': 1.1712754707092967, 'learning_rate': 2.3759999999999998e-07, 'completion_length': 57.02678871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.206787109375, 'epoch': 0.76} 76%|███████▌ | 1906/2500 [15:48:03<3:42:02, 22.43s/it] 76%|███████▋ | 1907/2500 [15:48:22<3:33:39, 21.62s/it] {'loss': 0.0065, 'grad_norm': 0.4050263296178699, 'learning_rate': 2.3719999999999998e-07, 'completion_length': 50.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.162841796875, 'epoch': 0.76} 76%|███████▋ | 1907/2500 [15:48:22<3:33:39, 21.62s/it] 76%|███████▋ | 1908/2500 [15:48:43<3:29:13, 21.21s/it] {'loss': 0.0125, 'grad_norm': 1.3986367288651336, 'learning_rate': 2.368e-07, 'completion_length': 53.91071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.3134765625, 'epoch': 0.76} 76%|███████▋ | 1908/2500 [15:48:43<3:29:13, 
21.21s/it] 76%|███████▋ | 1909/2500 [15:49:02<3:24:17, 20.74s/it] {'loss': 0.0038, 'grad_norm': 3.240205915058362, 'learning_rate': 2.364e-07, 'completion_length': 57.23214530944824, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.10101525112986565, 'kl': 0.094482421875, 'epoch': 0.76} 76%|███████▋ | 1909/2500 [15:49:02<3:24:17, 20.74s/it] 76%|███████▋ | 1910/2500 [15:49:21<3:19:32, 20.29s/it] {'loss': 0.0047, 'grad_norm': 1.7934581417607973, 'learning_rate': 2.3599999999999997e-07, 'completion_length': 51.02678680419922, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.11767578125, 'epoch': 0.76} 76%|███████▋ | 1910/2500 [15:49:21<3:19:32, 20.29s/it] 76%|███████▋ | 1911/2500 [15:49:42<3:18:45, 20.25s/it] {'loss': 0.0076, 'grad_norm': 0.9785941980890637, 'learning_rate': 2.356e-07, 'completion_length': 53.24107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.19140625, 'epoch': 0.76} 76%|███████▋ | 1911/2500 [15:49:42<3:18:45, 20.25s/it] 76%|███████▋ | 1912/2500 [15:50:01<3:17:21, 20.14s/it] {'loss': 0.0052, 'grad_norm': 0.26622280840106166, 'learning_rate': 2.352e-07, 'completion_length': 55.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.131103515625, 'epoch': 0.76} 76%|███████▋ | 1912/2500 [15:50:01<3:17:21, 20.14s/it] 77%|███████▋ | 1913/2500 [15:50:22<3:18:22, 20.28s/it] {'loss': 0.0116, 'grad_norm': 2.2811748132273038, 'learning_rate': 2.3479999999999998e-07, 'completion_length': 57.312503814697266, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.2890625, 'epoch': 
0.77} 77%|███████▋ | 1913/2500 [15:50:22<3:18:22, 20.28s/it] 77%|███████▋ | 1914/2500 [15:50:42<3:15:59, 20.07s/it] {'loss': 0.0035, 'grad_norm': 0.2173862206667848, 'learning_rate': 2.3439999999999998e-07, 'completion_length': 56.01785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08740234375, 'epoch': 0.77} 77%|███████▋ | 1914/2500 [15:50:42<3:15:59, 20.07s/it] 77%|███████▋ | 1915/2500 [15:51:04<3:21:13, 20.64s/it] {'loss': 0.004, 'grad_norm': 0.832660254021462, 'learning_rate': 2.34e-07, 'completion_length': 54.50893211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.100830078125, 'epoch': 0.77} 77%|███████▋ | 1915/2500 [15:51:04<3:21:13, 20.64s/it] 77%|███████▋ | 1916/2500 [15:51:23<3:17:54, 20.33s/it] {'loss': 0.0038, 'grad_norm': 0.30197365018340216, 'learning_rate': 2.336e-07, 'completion_length': 50.41964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.094970703125, 'epoch': 0.77} 77%|███████▋ | 1916/2500 [15:51:23<3:17:54, 20.33s/it] 77%|███████▋ | 1917/2500 [15:51:44<3:19:21, 20.52s/it] {'loss': 0.0054, 'grad_norm': 3.06124927728769, 'learning_rate': 2.3319999999999997e-07, 'completion_length': 52.49107551574707, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285715222358704, 'reward_std': 0.06613001227378845, 'kl': 0.134765625, 'epoch': 0.77} 77%|███████▋ | 1917/2500 [15:51:44<3:19:21, 20.52s/it] 77%|███████▋ | 1918/2500 [15:52:06<3:24:16, 21.06s/it] {'loss': 0.009, 'grad_norm': 0.9993526022144569, 'learning_rate': 2.328e-07, 'completion_length': 56.35714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.22509765625, 'epoch': 0.77} 77%|███████▋ | 1918/2500 
[15:52:06<3:24:16, 21.06s/it] 77%|███████▋ | 1919/2500 [15:52:30<3:31:21, 21.83s/it] {'loss': 0.0083, 'grad_norm': 3.003595215788205, 'learning_rate': 2.324e-07, 'completion_length': 55.13393211364746, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.2060546875, 'epoch': 0.77} 77%|███████▋ | 1919/2500 [15:52:30<3:31:21, 21.83s/it] 77%|███████▋ | 1920/2500 [15:52:53<3:34:26, 22.18s/it] {'loss': 0.0038, 'grad_norm': 0.20067246672227324, 'learning_rate': 2.32e-07, 'completion_length': 51.96428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0947265625, 'epoch': 0.77} 77%|███████▋ | 1920/2500 [15:52:53<3:34:26, 22.18s/it] 77%|███████▋ | 1921/2500 [15:53:16<3:34:59, 22.28s/it] {'loss': 0.0068, 'grad_norm': 7.155150402256017, 'learning_rate': 2.3159999999999998e-07, 'completion_length': 54.73214530944824, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.883928656578064, 'reward_std': 0.08620956540107727, 'kl': 0.17041015625, 'epoch': 0.77} 77%|███████▋ | 1921/2500 [15:53:16<3:34:59, 22.28s/it] 77%|███████▋ | 1922/2500 [15:53:37<3:32:45, 22.09s/it] {'loss': 0.0077, 'grad_norm': 1.1030493854222752, 'learning_rate': 2.3119999999999998e-07, 'completion_length': 60.69643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.193359375, 'epoch': 0.77} 77%|███████▋ | 1922/2500 [15:53:37<3:32:45, 22.09s/it] 77%|███████▋ | 1923/2500 [15:53:59<3:32:24, 22.09s/it] {'loss': 0.0054, 'grad_norm': 0.25902439136395067, 'learning_rate': 2.308e-07, 'completion_length': 49.16964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13525390625, 'epoch': 0.77} 77%|███████▋ | 1923/2500 [15:53:59<3:32:24, 
22.09s/it] 77%|███████▋ | 1924/2500 [15:54:21<3:30:18, 21.91s/it] {'loss': 0.1494, 'grad_norm': 22.431282638167108, 'learning_rate': 2.3039999999999997e-07, 'completion_length': 52.27678680419922, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375000596046448, 'reward_std': 0.15933407098054886, 'kl': 3.73828125, 'epoch': 0.77} 77%|███████▋ | 1924/2500 [15:54:21<3:30:18, 21.91s/it] 77%|███████▋ | 1925/2500 [15:54:45<3:35:41, 22.51s/it] {'loss': 0.0066, 'grad_norm': 1.9946635558683692, 'learning_rate': 2.3e-07, 'completion_length': 58.66964530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.10365219414234161, 'kl': 0.166015625, 'epoch': 0.77} 77%|███████▋ | 1925/2500 [15:54:45<3:35:41, 22.51s/it] 77%|███████▋ | 1926/2500 [15:55:07<3:34:25, 22.41s/it] {'loss': 0.0054, 'grad_norm': 1.3936626739909688, 'learning_rate': 2.296e-07, 'completion_length': 58.25000190734863, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.134765625, 'epoch': 0.77} 77%|███████▋ | 1926/2500 [15:55:07<3:34:25, 22.41s/it] 77%|███████▋ | 1927/2500 [15:55:29<3:33:03, 22.31s/it] {'loss': 0.0036, 'grad_norm': 0.1496836877739568, 'learning_rate': 2.292e-07, 'completion_length': 55.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.08984375, 'epoch': 0.77} 77%|███████▋ | 1927/2500 [15:55:29<3:33:03, 22.31s/it] 77%|███████▋ | 1928/2500 [15:55:52<3:33:27, 22.39s/it] {'loss': 0.0046, 'grad_norm': 0.24149469603828808, 'learning_rate': 2.2879999999999998e-07, 'completion_length': 52.00000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.114990234375, 'epoch': 0.77} 77%|███████▋ | 1928/2500 [15:55:52<3:33:27, 22.39s/it] 
77%|███████▋ | 1929/2500 [15:56:13<3:31:05, 22.18s/it] {'loss': 0.0066, 'grad_norm': 2.0104520507731025, 'learning_rate': 2.2839999999999998e-07, 'completion_length': 49.57143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.16455078125, 'epoch': 0.77} 77%|███████▋ | 1929/2500 [15:56:13<3:31:05, 22.18s/it] 77%|███████▋ | 1930/2500 [15:56:34<3:27:26, 21.84s/it] {'loss': 0.0088, 'grad_norm': 2.571717724509484, 'learning_rate': 2.28e-07, 'completion_length': 56.40178871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2197265625, 'epoch': 0.77} 77%|███████▋ | 1930/2500 [15:56:34<3:27:26, 21.84s/it] 77%|███████▋ | 1931/2500 [15:56:56<3:25:50, 21.71s/it] {'loss': 0.0124, 'grad_norm': 1.6289576217179402, 'learning_rate': 2.2759999999999997e-07, 'completion_length': 54.79464530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.31103515625, 'epoch': 0.77} 77%|███████▋ | 1931/2500 [15:56:56<3:25:50, 21.71s/it] 77%|███████▋ | 1932/2500 [15:57:16<3:20:47, 21.21s/it] {'loss': 0.0049, 'grad_norm': 0.8628617305404055, 'learning_rate': 2.272e-07, 'completion_length': 56.05357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.121337890625, 'epoch': 0.77} 77%|███████▋ | 1932/2500 [15:57:16<3:20:47, 21.21s/it] 77%|███████▋ | 1933/2500 [15:57:37<3:21:33, 21.33s/it] {'loss': 0.0043, 'grad_norm': 0.26513077599377083, 'learning_rate': 2.268e-07, 'completion_length': 60.30357551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.107177734375, 'epoch': 0.77} 77%|███████▋ | 1933/2500 
[15:57:37<3:21:33, 21.33s/it] 77%|███████▋ | 1934/2500 [15:57:58<3:19:29, 21.15s/it] {'loss': 0.0075, 'grad_norm': 1.3616023474572059, 'learning_rate': 2.264e-07, 'completion_length': 58.90178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.18701171875, 'epoch': 0.77} 77%|███████▋ | 1934/2500 [15:57:58<3:19:29, 21.15s/it] 77%|███████▋ | 1935/2500 [15:58:19<3:19:31, 21.19s/it] {'loss': 0.0146, 'grad_norm': 4.323164555694374, 'learning_rate': 2.2599999999999999e-07, 'completion_length': 51.41964530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.15152288600802422, 'kl': 0.36328125, 'epoch': 0.77} 77%|███████▋ | 1935/2500 [15:58:19<3:19:31, 21.19s/it] 77%|███████▋ | 1936/2500 [15:58:41<3:19:40, 21.24s/it] {'loss': 0.008, 'grad_norm': 1.7719102592892058, 'learning_rate': 2.2559999999999998e-07, 'completion_length': 59.75000190734863, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553572535514832, 'reward_std': 0.08747542649507523, 'kl': 0.19921875, 'epoch': 0.77} 77%|███████▋ | 1936/2500 [15:58:41<3:19:40, 21.24s/it] 77%|███████▋ | 1937/2500 [15:59:03<3:21:14, 21.45s/it] {'loss': 0.0047, 'grad_norm': 1.272140193138457, 'learning_rate': 2.252e-07, 'completion_length': 58.875003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.116943359375, 'epoch': 0.77} 77%|███████▋ | 1937/2500 [15:59:03<3:21:14, 21.45s/it] 78%|███████▊ | 1938/2500 [15:59:24<3:19:59, 21.35s/it] {'loss': 0.005, 'grad_norm': 2.761766723360585, 'learning_rate': 2.248e-07, 'completion_length': 64.56250190734863, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 
'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.125, 'epoch': 0.78} 78%|███████▊ | 1938/2500 [15:59:24<3:19:59, 21.35s/it] 78%|███████▊ | 1939/2500 [15:59:46<3:22:07, 21.62s/it] {'loss': 0.004, 'grad_norm': 1.8176642317821368, 'learning_rate': 2.2439999999999997e-07, 'completion_length': 53.46428871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.10107421875, 'epoch': 0.78} 78%|███████▊ | 1939/2500 [15:59:46<3:22:07, 21.62s/it] 78%|███████▊ | 1940/2500 [16:00:09<3:24:45, 21.94s/it] {'loss': 0.0089, 'grad_norm': 2.3325020857570946, 'learning_rate': 2.24e-07, 'completion_length': 57.973215103149414, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017858505249023, 'reward_std': 0.10882645100355148, 'kl': 0.22119140625, 'epoch': 0.78} 78%|███████▊ | 1940/2500 [16:00:09<3:24:45, 21.94s/it] 78%|███████▊ | 1941/2500 [16:00:30<3:23:43, 21.87s/it] {'loss': 0.0098, 'grad_norm': 3.4711510434528505, 'learning_rate': 2.236e-07, 'completion_length': 62.43750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.24609375, 'epoch': 0.78} 78%|███████▊ | 1941/2500 [16:00:30<3:23:43, 21.87s/it] 78%|███████▊ | 1942/2500 [16:00:57<3:36:34, 23.29s/it] {'loss': 0.0065, 'grad_norm': 1.9736094349951583, 'learning_rate': 2.232e-07, 'completion_length': 57.892860412597656, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.06222161650657654, 'kl': 0.162109375, 'epoch': 0.78} 78%|███████▊ | 1942/2500 [16:00:57<3:36:34, 23.29s/it] 78%|███████▊ | 1943/2500 [16:01:19<3:32:14, 22.86s/it] {'loss': 0.0156, 'grad_norm': 1.9282661960637015, 'learning_rate': 2.2279999999999998e-07, 'completion_length': 58.54464530944824, 
'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.3896484375, 'epoch': 0.78} 78%|███████▊ | 1943/2500 [16:01:19<3:32:14, 22.86s/it] 78%|███████▊ | 1944/2500 [16:01:46<3:44:02, 24.18s/it] {'loss': 0.0094, 'grad_norm': 2.385959043464703, 'learning_rate': 2.2239999999999998e-07, 'completion_length': 61.50000190734863, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.05050762742757797, 'kl': 0.2353515625, 'epoch': 0.78} 78%|███████▊ | 1944/2500 [16:01:46<3:44:02, 24.18s/it] 78%|███████▊ | 1945/2500 [16:02:10<3:42:14, 24.03s/it] {'loss': 0.0056, 'grad_norm': 0.8933246311115481, 'learning_rate': 2.22e-07, 'completion_length': 59.84821701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.138916015625, 'epoch': 0.78} 78%|███████▊ | 1945/2500 [16:02:10<3:42:14, 24.03s/it] 78%|███████▊ | 1946/2500 [16:02:32<3:36:00, 23.39s/it] {'loss': 0.0068, 'grad_norm': 1.755050962549167, 'learning_rate': 2.2159999999999997e-07, 'completion_length': 60.99107360839844, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.1689453125, 'epoch': 0.78} 78%|███████▊ | 1946/2500 [16:02:32<3:36:00, 23.39s/it] 78%|███████▊ | 1947/2500 [16:02:55<3:35:31, 23.38s/it] {'loss': 0.0096, 'grad_norm': 3.930969427011947, 'learning_rate': 2.212e-07, 'completion_length': 59.23214530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9375000596046448, 'reward_std': 0.1767766922712326, 'kl': 0.23974609375, 'epoch': 0.78} 78%|███████▊ | 1947/2500 [16:02:55<3:35:31, 23.38s/it] 78%|███████▊ | 1948/2500 [16:03:18<3:32:44, 
23.12s/it] {'loss': 0.0209, 'grad_norm': 4.185856382032292, 'learning_rate': 2.208e-07, 'completion_length': 64.27678871154785, 'rewards/accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.8571429252624512, 'reward_std': 0.3044358566403389, 'kl': 0.5234375, 'epoch': 0.78} 78%|███████▊ | 1948/2500 [16:03:18<3:32:44, 23.12s/it] 78%|███████▊ | 1949/2500 [16:03:40<3:31:38, 23.05s/it] {'loss': 0.0104, 'grad_norm': 2.7835853258166634, 'learning_rate': 2.2040000000000001e-07, 'completion_length': 66.00893020629883, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.12626906856894493, 'kl': 0.260498046875, 'epoch': 0.78} 78%|███████▊ | 1949/2500 [16:03:40<3:31:38, 23.05s/it] 78%|███████▊ | 1950/2500 [16:04:05<3:35:58, 23.56s/it] {'loss': 0.0292, 'grad_norm': 4.5704349167637774, 'learning_rate': 2.1999999999999998e-07, 'completion_length': 55.41964530944824, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9285714626312256, 'reward_std': 0.17941363155841827, 'kl': 0.73046875, 'epoch': 0.78} 78%|███████▊ | 1950/2500 [16:04:05<3:35:58, 23.56s/it] 78%|███████▊ | 1951/2500 [16:04:26<3:28:55, 22.83s/it] {'loss': 0.0066, 'grad_norm': 1.3164088091415551, 'learning_rate': 2.1959999999999998e-07, 'completion_length': 57.66071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.166015625, 'epoch': 0.78} 78%|███████▊ | 1951/2500 [16:04:26<3:28:55, 22.83s/it] 78%|███████▊ | 1952/2500 [16:04:50<3:29:26, 22.93s/it] {'loss': 0.006, 'grad_norm': 0.9354365120726098, 'learning_rate': 2.192e-07, 'completion_length': 54.61607360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 
0.149169921875, 'epoch': 0.78} 78%|███████▊ | 1952/2500 [16:04:50<3:29:26, 22.93s/it] 78%|███████▊ | 1953/2500 [16:05:12<3:27:31, 22.76s/it] {'loss': 0.0094, 'grad_norm': 2.915968688235676, 'learning_rate': 2.1879999999999997e-07, 'completion_length': 58.05357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9375000596046448, 'reward_std': 0.1767766922712326, 'kl': 0.2333984375, 'epoch': 0.78} 78%|███████▊ | 1953/2500 [16:05:12<3:27:31, 22.76s/it] 78%|███████▊ | 1954/2500 [16:05:34<3:24:56, 22.52s/it] {'loss': 0.006, 'grad_norm': 1.348292378316263, 'learning_rate': 2.184e-07, 'completion_length': 58.30357551574707, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.149169921875, 'epoch': 0.78} 78%|███████▊ | 1954/2500 [16:05:34<3:24:56, 22.52s/it] 78%|███████▊ | 1955/2500 [16:05:57<3:27:36, 22.86s/it] {'loss': 0.013, 'grad_norm': 3.274351376342991, 'learning_rate': 2.18e-07, 'completion_length': 57.38393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.32568359375, 'epoch': 0.78} 78%|███████▊ | 1955/2500 [16:05:57<3:27:36, 22.86s/it] 78%|███████▊ | 1956/2500 [16:06:19<3:24:21, 22.54s/it] {'loss': 0.0098, 'grad_norm': 2.4636404811478285, 'learning_rate': 2.176e-07, 'completion_length': 53.07143211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.24560546875, 'epoch': 0.78} 78%|███████▊ | 1956/2500 [16:06:19<3:24:21, 22.54s/it] 78%|███████▊ | 1957/2500 [16:06:41<3:21:12, 22.23s/it] {'loss': 0.0047, 'grad_norm': 16.427645601437096, 'learning_rate': 2.1719999999999999e-07, 'completion_length': 55.27678680419922, 'rewards/accuracy_reward': 
0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.10040178894996643, 'kl': 0.117919921875, 'epoch': 0.78} 78%|███████▊ | 1957/2500 [16:06:41<3:21:12, 22.23s/it] 78%|███████▊ | 1958/2500 [16:07:04<3:22:53, 22.46s/it] {'loss': 0.01, 'grad_norm': 0.9796607266738057, 'learning_rate': 2.1679999999999998e-07, 'completion_length': 55.50000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.24951171875, 'epoch': 0.78} 78%|███████▊ | 1958/2500 [16:07:04<3:22:53, 22.46s/it] 78%|███████▊ | 1959/2500 [16:07:25<3:18:24, 22.00s/it] {'loss': 0.0063, 'grad_norm': 2.1928781688991426, 'learning_rate': 2.164e-07, 'completion_length': 50.58928680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15771484375, 'epoch': 0.78} 78%|███████▊ | 1959/2500 [16:07:25<3:18:24, 22.00s/it] 78%|███████▊ | 1960/2500 [16:07:48<3:21:27, 22.38s/it] {'loss': 0.0129, 'grad_norm': 1.8814455790514473, 'learning_rate': 2.1599999999999998e-07, 'completion_length': 50.42857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.32373046875, 'epoch': 0.78} 78%|███████▊ | 1960/2500 [16:07:48<3:21:27, 22.38s/it] 78%|███████▊ | 1961/2500 [16:08:10<3:20:40, 22.34s/it] {'loss': 0.0063, 'grad_norm': 0.6623609691949378, 'learning_rate': 2.156e-07, 'completion_length': 52.74107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15869140625, 'epoch': 0.78} 78%|███████▊ | 1961/2500 [16:08:10<3:20:40, 22.34s/it] 78%|███████▊ | 1962/2500 [16:08:33<3:20:50, 22.40s/it] {'loss': 0.0206, 'grad_norm': 1.6694906511624699, 'learning_rate': 2.152e-07, 'completion_length': 50.580360412597656, 'rewards/accuracy_reward': 
0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.5166015625, 'epoch': 0.78} 78%|███████▊ | 1962/2500 [16:08:33<3:20:50, 22.40s/it] 79%|███████▊ | 1963/2500 [16:08:57<3:25:41, 22.98s/it] {'loss': 0.0108, 'grad_norm': 1.710647018552604, 'learning_rate': 2.148e-07, 'completion_length': 53.54464530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.26904296875, 'epoch': 0.79} 79%|███████▊ | 1963/2500 [16:08:57<3:25:41, 22.98s/it] 79%|███████▊ | 1964/2500 [16:09:19<3:23:07, 22.74s/it] {'loss': 0.0149, 'grad_norm': 0.8312132839729334, 'learning_rate': 2.144e-07, 'completion_length': 52.11607360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.37158203125, 'epoch': 0.79} 79%|███████▊ | 1964/2500 [16:09:19<3:23:07, 22.74s/it] 79%|███████▊ | 1965/2500 [16:09:40<3:18:38, 22.28s/it] {'loss': 0.0082, 'grad_norm': 1.9670507412899574, 'learning_rate': 2.1399999999999998e-07, 'completion_length': 54.70535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.205078125, 'epoch': 0.79} 79%|███████▊ | 1965/2500 [16:09:41<3:18:38, 22.28s/it] 79%|███████▊ | 1966/2500 [16:10:03<3:20:11, 22.49s/it] {'loss': 0.0118, 'grad_norm': 1.044355943681459, 'learning_rate': 2.136e-07, 'completion_length': 49.88393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2958984375, 'epoch': 0.79} 79%|███████▊ | 1966/2500 [16:10:04<3:20:11, 22.49s/it] 79%|███████▊ | 1967/2500 [16:10:25<3:17:57, 22.28s/it] {'loss': 0.005, 'grad_norm': 
0.9902697593074881, 'learning_rate': 2.132e-07, 'completion_length': 52.33928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.12548828125, 'epoch': 0.79} 79%|███████▊ | 1967/2500 [16:10:25<3:17:57, 22.28s/it] 79%|███████▊ | 1968/2500 [16:10:50<3:24:01, 23.01s/it] {'loss': 0.0081, 'grad_norm': 1.2157711338585375, 'learning_rate': 2.1279999999999997e-07, 'completion_length': 50.37500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.203125, 'epoch': 0.79} 79%|███████▊ | 1968/2500 [16:10:50<3:24:01, 23.01s/it] 79%|███████▉ | 1969/2500 [16:11:12<3:22:19, 22.86s/it] {'loss': 0.0087, 'grad_norm': 1.115818871677455, 'learning_rate': 2.124e-07, 'completion_length': 55.01785850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21826171875, 'epoch': 0.79} 79%|███████▉ | 1969/2500 [16:11:12<3:22:19, 22.86s/it] 79%|███████▉ | 1970/2500 [16:11:35<3:22:14, 22.90s/it] {'loss': 0.0061, 'grad_norm': 1.8272207267301264, 'learning_rate': 2.12e-07, 'completion_length': 51.267860412597656, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.15185546875, 'epoch': 0.79} 79%|███████▉ | 1970/2500 [16:11:35<3:22:14, 22.90s/it] 79%|███████▉ | 1971/2500 [16:11:58<3:20:41, 22.76s/it] {'loss': 0.0062, 'grad_norm': 0.600054982542014, 'learning_rate': 2.116e-07, 'completion_length': 55.74107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.154052734375, 'epoch': 0.79} 79%|███████▉ | 1971/2500 [16:11:58<3:20:41, 22.76s/it] 79%|███████▉ | 1972/2500 [16:12:19<3:15:52, 22.26s/it] 
{'loss': 0.0072, 'grad_norm': 2.4925519823345055, 'learning_rate': 2.1119999999999999e-07, 'completion_length': 58.687503814697266, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.17919921875, 'epoch': 0.79} 79%|███████▉ | 1972/2500 [16:12:19<3:15:52, 22.26s/it] 79%|███████▉ | 1973/2500 [16:12:42<3:17:20, 22.47s/it] {'loss': 0.0065, 'grad_norm': 0.8356061548553075, 'learning_rate': 2.1079999999999998e-07, 'completion_length': 54.80357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.16162109375, 'epoch': 0.79} 79%|███████▉ | 1973/2500 [16:12:42<3:17:20, 22.47s/it] 79%|███████▉ | 1974/2500 [16:13:05<3:18:55, 22.69s/it] {'loss': 0.0042, 'grad_norm': 0.22523672351808485, 'learning_rate': 2.104e-07, 'completion_length': 60.15178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.103759765625, 'epoch': 0.79} 79%|███████▉ | 1974/2500 [16:13:05<3:18:55, 22.69s/it] 79%|███████▉ | 1975/2500 [16:13:28<3:18:03, 22.64s/it] {'loss': 0.0052, 'grad_norm': 0.2868939428340259, 'learning_rate': 2.0999999999999997e-07, 'completion_length': 54.36607551574707, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12890625, 'epoch': 0.79} 79%|███████▉ | 1975/2500 [16:13:28<3:18:03, 22.64s/it] 79%|███████▉ | 1976/2500 [16:13:50<3:17:48, 22.65s/it] {'loss': 0.0056, 'grad_norm': 0.2865814128039107, 'learning_rate': 2.096e-07, 'completion_length': 54.04464530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14111328125, 'epoch': 0.79} 79%|███████▉ | 1976/2500 [16:13:50<3:17:48, 22.65s/it] 79%|███████▉ | 1977/2500 [16:14:12<3:14:21, 22.30s/it] {'loss': 0.0041, 'grad_norm': 0.21036707160122678, 
'learning_rate': 2.092e-07, 'completion_length': 55.57143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1025390625, 'epoch': 0.79} 79%|███████▉ | 1977/2500 [16:14:12<3:14:21, 22.30s/it] 79%|███████▉ | 1978/2500 [16:14:34<3:13:13, 22.21s/it] {'loss': 0.0046, 'grad_norm': 0.25227195209679076, 'learning_rate': 2.0880000000000002e-07, 'completion_length': 56.72321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1162109375, 'epoch': 0.79} 79%|███████▉ | 1978/2500 [16:14:34<3:13:13, 22.21s/it] 79%|███████▉ | 1979/2500 [16:14:57<3:16:10, 22.59s/it] {'loss': 0.0085, 'grad_norm': 2.7802442697087, 'learning_rate': 2.0839999999999999e-07, 'completion_length': 58.30357551574707, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375001192092896, 'reward_std': 0.1379830539226532, 'kl': 0.212890625, 'epoch': 0.79} 79%|███████▉ | 1979/2500 [16:14:57<3:16:10, 22.59s/it] 79%|███████▉ | 1980/2500 [16:15:18<3:11:20, 22.08s/it] {'loss': 0.0055, 'grad_norm': 0.49165309999262863, 'learning_rate': 2.0799999999999998e-07, 'completion_length': 57.88393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.138671875, 'epoch': 0.79} 79%|███████▉ | 1980/2500 [16:15:18<3:11:20, 22.08s/it] 79%|███████▉ | 1981/2500 [16:15:43<3:19:00, 23.01s/it] {'loss': 0.0069, 'grad_norm': 0.4618716946956537, 'learning_rate': 2.076e-07, 'completion_length': 48.58928680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1728515625, 'epoch': 0.79} 79%|███████▉ | 1981/2500 [16:15:43<3:19:00, 23.01s/it] 79%|███████▉ | 1982/2500 [16:16:06<3:16:23, 22.75s/it] {'loss': 0.0068, 'grad_norm': 1.3314772376224484, 'learning_rate': 2.0719999999999998e-07, 'completion_length': 54.39285850524902, 'rewards/accuracy_reward': 
0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.169921875, 'epoch': 0.79} 79%|███████▉ | 1982/2500 [16:16:06<3:16:23, 22.75s/it] 79%|███████▉ | 1983/2500 [16:16:26<3:09:50, 22.03s/it] {'loss': 0.0091, 'grad_norm': 1.002650231853406, 'learning_rate': 2.068e-07, 'completion_length': 62.580360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.22705078125, 'epoch': 0.79} 79%|███████▉ | 1983/2500 [16:16:26<3:09:50, 22.03s/it] 79%|███████▉ | 1984/2500 [16:16:49<3:11:17, 22.24s/it] {'loss': 0.0054, 'grad_norm': 1.9991831741188233, 'learning_rate': 2.064e-07, 'completion_length': 54.48214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.134765625, 'epoch': 0.79} 79%|███████▉ | 1984/2500 [16:16:49<3:11:17, 22.24s/it] 79%|███████▉ | 1985/2500 [16:17:11<3:12:09, 22.39s/it] {'loss': 0.0031, 'grad_norm': 1.5161714066069905, 'learning_rate': 2.06e-07, 'completion_length': 57.88393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.07666015625, 'epoch': 0.79} 79%|███████▉ | 1985/2500 [16:17:11<3:12:09, 22.39s/it] 79%|███████▉ | 1986/2500 [16:17:34<3:11:27, 22.35s/it] {'loss': 0.0079, 'grad_norm': 2.478948931786611, 'learning_rate': 2.056e-07, 'completion_length': 53.017860412597656, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.19677734375, 'epoch': 0.79} 79%|███████▉ | 1986/2500 [16:17:34<3:11:27, 22.35s/it] 79%|███████▉ | 1987/2500 [16:17:56<3:11:01, 22.34s/it] {'loss': 0.006, 'grad_norm': 0.974590814768639, 'learning_rate': 
2.0519999999999998e-07, 'completion_length': 59.74107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.15087890625, 'epoch': 0.79} 79%|███████▉ | 1987/2500 [16:17:56<3:11:01, 22.34s/it] 80%|███████▉ | 1988/2500 [16:18:16<3:06:01, 21.80s/it] {'loss': 0.0053, 'grad_norm': 1.526316262892034, 'learning_rate': 2.048e-07, 'completion_length': 50.017860412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.13134765625, 'epoch': 0.8} 80%|███████▉ | 1988/2500 [16:18:16<3:06:01, 21.80s/it] 80%|███████▉ | 1989/2500 [16:18:39<3:07:07, 21.97s/it] {'loss': 0.004, 'grad_norm': 1.9074115985542026, 'learning_rate': 2.0439999999999998e-07, 'completion_length': 54.89285850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09912109375, 'epoch': 0.8} 80%|███████▉ | 1989/2500 [16:18:39<3:07:07, 21.97s/it] 80%|███████▉ | 1990/2500 [16:19:02<3:08:36, 22.19s/it] {'loss': 0.0096, 'grad_norm': 1.3436101778388179, 'learning_rate': 2.0399999999999997e-07, 'completion_length': 63.312503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.23974609375, 'epoch': 0.8} 80%|███████▉ | 1990/2500 [16:19:02<3:08:36, 22.19s/it] 80%|███████▉ | 1991/2500 [16:19:23<3:07:06, 22.06s/it] {'loss': 0.0197, 'grad_norm': 2.125527330598862, 'learning_rate': 2.036e-07, 'completion_length': 62.50893020629883, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9196429252624512, 'reward_std': 0.10882644355297089, 'kl': 0.494140625, 'epoch': 0.8} 80%|███████▉ | 1991/2500 [16:19:23<3:07:06, 22.06s/it] 80%|███████▉ | 1992/2500 
[16:19:46<3:09:15, 22.35s/it] {'loss': 0.0109, 'grad_norm': 1.3757570446678637, 'learning_rate': 2.032e-07, 'completion_length': 51.06250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.27197265625, 'epoch': 0.8} 80%|███████▉ | 1992/2500 [16:19:46<3:09:15, 22.35s/it] 80%|███████▉ | 1993/2500 [16:20:08<3:06:38, 22.09s/it] {'loss': 0.0051, 'grad_norm': 0.2731701566844069, 'learning_rate': 2.028e-07, 'completion_length': 59.27678871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12841796875, 'epoch': 0.8} 80%|███████▉ | 1993/2500 [16:20:08<3:06:38, 22.09s/it] 80%|███████▉ | 1994/2500 [16:20:29<3:03:08, 21.72s/it] {'loss': 0.1425, 'grad_norm': 22.329308146914727, 'learning_rate': 2.0239999999999999e-07, 'completion_length': 51.15178871154785, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9375001192092896, 'reward_std': 0.1418914571404457, 'kl': 3.554931640625, 'epoch': 0.8} 80%|███████▉ | 1994/2500 [16:20:29<3:03:08, 21.72s/it] 80%|███████▉ | 1995/2500 [16:20:51<3:04:29, 21.92s/it] {'loss': 0.0111, 'grad_norm': 1.5641469075808483, 'learning_rate': 2.02e-07, 'completion_length': 59.84821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2783203125, 'epoch': 0.8} 80%|███████▉ | 1995/2500 [16:20:51<3:04:29, 21.92s/it] 80%|███████▉ | 1996/2500 [16:21:14<3:05:53, 22.13s/it] {'loss': 0.0285, 'grad_norm': 6.471317121456232, 'learning_rate': 2.016e-07, 'completion_length': 62.02678680419922, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.708984375, 'epoch': 0.8} 80%|███████▉ | 1996/2500 
[16:21:14<3:05:53, 22.13s/it] 80%|███████▉ | 1997/2500 [16:21:36<3:06:19, 22.23s/it] {'loss': 0.0079, 'grad_norm': 0.7128533043675467, 'learning_rate': 2.0119999999999998e-07, 'completion_length': 55.62500190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.197265625, 'epoch': 0.8} 80%|███████▉ | 1997/2500 [16:21:36<3:06:19, 22.23s/it] 80%|███████▉ | 1998/2500 [16:22:00<3:10:57, 22.82s/it] {'loss': 0.0189, 'grad_norm': 2.209650586374696, 'learning_rate': 2.008e-07, 'completion_length': 63.82143020629883, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.15152288228273392, 'kl': 0.4736328125, 'epoch': 0.8} 80%|███████▉ | 1998/2500 [16:22:00<3:10:57, 22.82s/it] 80%|███████▉ | 1999/2500 [16:22:23<3:10:19, 22.79s/it] {'loss': 0.0122, 'grad_norm': 2.692551441960337, 'learning_rate': 2.004e-07, 'completion_length': 53.97321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553571939468384, 'reward_std': 0.12626906856894493, 'kl': 0.30517578125, 'epoch': 0.8} 80%|███████▉ | 1999/2500 [16:22:23<3:10:19, 22.79s/it] 80%|████████ | 2000/2500 [16:22:45<3:07:42, 22.53s/it] {'loss': 0.0117, 'grad_norm': 1.936224170503143, 'learning_rate': 2e-07, 'completion_length': 68.20536041259766, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375000596046448, 'reward_std': 0.13671719282865524, 'kl': 0.2919921875, 'epoch': 0.8} 80%|████████ | 2000/2500 [16:22:45<3:07:42, 22.53s/it] 80%|████████ | 2001/2500 [16:24:00<5:19:06, 38.37s/it] {'loss': 0.0086, 'grad_norm': 1.1213821742797316, 'learning_rate': 1.996e-07, 'completion_length': 60.11607360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2158203125, 
'epoch': 0.8} 80%|████████ | 2001/2500 [16:24:00<5:19:06, 38.37s/it] 80%|████████ | 2002/2500 [16:24:21<4:34:39, 33.09s/it] {'loss': 0.0061, 'grad_norm': 1.3316652568587708, 'learning_rate': 1.9919999999999998e-07, 'completion_length': 59.41071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15234375, 'epoch': 0.8} 80%|████████ | 2002/2500 [16:24:21<4:34:39, 33.09s/it] 80%|████████ | 2003/2500 [16:24:42<4:04:34, 29.53s/it] {'loss': 0.0089, 'grad_norm': 1.8038566793195996, 'learning_rate': 1.988e-07, 'completion_length': 53.46428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.2236328125, 'epoch': 0.8} 80%|████████ | 2003/2500 [16:24:42<4:04:34, 29.53s/it] 80%|████████ | 2004/2500 [16:25:04<3:44:39, 27.18s/it] {'loss': 0.0116, 'grad_norm': 2.5655242277151427, 'learning_rate': 1.9839999999999998e-07, 'completion_length': 55.71428871154785, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553571939468384, 'reward_std': 0.07924874126911163, 'kl': 0.2900390625, 'epoch': 0.8} 80%|████████ | 2004/2500 [16:25:04<3:44:39, 27.18s/it] 80%|████████ | 2005/2500 [16:25:25<3:28:40, 25.29s/it] {'loss': 0.1103, 'grad_norm': 19.36975040913762, 'learning_rate': 1.98e-07, 'completion_length': 53.83035850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9553571939468384, 'reward_std': 0.10365219414234161, 'kl': 2.7568359375, 'epoch': 0.8} 80%|████████ | 2005/2500 [16:25:25<3:28:40, 25.29s/it] 80%|████████ | 2006/2500 [16:25:48<3:21:47, 24.51s/it] {'loss': 0.0134, 'grad_norm': 2.4245860688562817, 'learning_rate': 1.976e-07, 'completion_length': 60.250003814697266, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 
0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.333984375, 'epoch': 0.8} 80%|████████ | 2006/2500 [16:25:48<3:21:47, 24.51s/it] 80%|████████ | 2007/2500 [16:26:09<3:12:51, 23.47s/it] {'loss': 0.0211, 'grad_norm': 3.9327509592209977, 'learning_rate': 1.9719999999999997e-07, 'completion_length': 51.69643020629883, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.946428656578064, 'reward_std': 0.13408026099205017, 'kl': 0.52734375, 'epoch': 0.8} 80%|████████ | 2007/2500 [16:26:09<3:12:51, 23.47s/it] 80%|████████ | 2008/2500 [16:26:30<3:07:47, 22.90s/it] {'loss': 0.0153, 'grad_norm': 7.90194065612208, 'learning_rate': 1.968e-07, 'completion_length': 58.13393211364746, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.3818359375, 'epoch': 0.8} 80%|████████ | 2008/2500 [16:26:30<3:07:47, 22.90s/it] 80%|████████ | 2009/2500 [16:26:53<3:06:25, 22.78s/it] {'loss': 0.0057, 'grad_norm': 0.8307035481090413, 'learning_rate': 1.9639999999999999e-07, 'completion_length': 51.98214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14111328125, 'epoch': 0.8} 80%|████████ | 2009/2500 [16:26:53<3:06:25, 22.78s/it] 80%|████████ | 2010/2500 [16:27:15<3:04:52, 22.64s/it] {'loss': 0.0145, 'grad_norm': 1.2463198193049825, 'learning_rate': 1.96e-07, 'completion_length': 51.27678680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.36279296875, 'epoch': 0.8} 80%|████████ | 2010/2500 [16:27:15<3:04:52, 22.64s/it] 80%|████████ | 2011/2500 [16:27:37<3:02:45, 22.42s/it] {'loss': 0.0115, 'grad_norm': 2.5869606471958324, 'learning_rate': 1.9559999999999998e-07, 'completion_length': 51.57143020629883, 
'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.15152288228273392, 'kl': 0.2880859375, 'epoch': 0.8} 80%|████████ | 2011/2500 [16:27:37<3:02:45, 22.42s/it] 80%|████████ | 2012/2500 [16:28:01<3:05:57, 22.86s/it] {'loss': 0.0121, 'grad_norm': 4.741231917150175, 'learning_rate': 1.952e-07, 'completion_length': 51.598215103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3017578125, 'epoch': 0.8} 80%|████████ | 2012/2500 [16:28:01<3:05:57, 22.86s/it] 81%|████████ | 2013/2500 [16:28:22<3:00:58, 22.30s/it] {'loss': 0.0133, 'grad_norm': 3.1454354123336397, 'learning_rate': 1.948e-07, 'completion_length': 51.83035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3330078125, 'epoch': 0.81} 81%|████████ | 2013/2500 [16:28:22<3:00:58, 22.30s/it] 81%|████████ | 2014/2500 [16:28:43<2:57:31, 21.92s/it] {'loss': 0.0092, 'grad_norm': 2.328631444048383, 'learning_rate': 1.944e-07, 'completion_length': 48.392860412597656, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.23095703125, 'epoch': 0.81} 81%|████████ | 2014/2500 [16:28:43<2:57:31, 21.92s/it] 81%|████████ | 2015/2500 [16:29:07<3:02:41, 22.60s/it] {'loss': 0.0073, 'grad_norm': 3.2673612842944193, 'learning_rate': 1.94e-07, 'completion_length': 48.035715103149414, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.07576143741607666, 'kl': 0.18310546875, 'epoch': 0.81} 81%|████████ | 2015/2500 [16:29:07<3:02:41, 22.60s/it] 81%|████████ | 2016/2500 [16:29:29<3:00:38, 22.39s/it] {'loss': 0.0096, 
'grad_norm': 1.3011954245319661, 'learning_rate': 1.9359999999999999e-07, 'completion_length': 56.41964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.23974609375, 'epoch': 0.81} 81%|████████ | 2016/2500 [16:29:29<3:00:38, 22.39s/it] 81%|████████ | 2017/2500 [16:29:52<3:02:20, 22.65s/it] {'loss': 0.0128, 'grad_norm': 3.2493727455375696, 'learning_rate': 1.932e-07, 'completion_length': 50.70535850524902, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.10882644355297089, 'kl': 0.3212890625, 'epoch': 0.81} 81%|████████ | 2017/2500 [16:29:52<3:02:20, 22.65s/it] 81%|████████ | 2018/2500 [16:30:13<2:58:17, 22.19s/it] {'loss': 0.0082, 'grad_norm': 2.1385921818084728, 'learning_rate': 1.9279999999999998e-07, 'completion_length': 50.84821701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.2041015625, 'epoch': 0.81} 81%|████████ | 2018/2500 [16:30:13<2:58:17, 22.19s/it] 81%|████████ | 2019/2500 [16:30:35<2:55:55, 21.95s/it] {'loss': 0.0077, 'grad_norm': 3.3277188394803003, 'learning_rate': 1.9239999999999998e-07, 'completion_length': 48.70535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.19287109375, 'epoch': 0.81} 81%|████████ | 2019/2500 [16:30:35<2:55:55, 21.95s/it] 81%|████████ | 2020/2500 [16:30:58<2:59:45, 22.47s/it] {'loss': 0.1409, 'grad_norm': 21.27793509243998, 'learning_rate': 1.92e-07, 'completion_length': 42.41071701049805, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.883928656578064, 'reward_std': 0.22028982639312744, 'kl': 3.529296875, 'epoch': 0.81} 
81%|████████ | 2020/2500 [16:30:58<2:59:45, 22.47s/it] 81%|████████ | 2021/2500 [16:31:20<2:56:36, 22.12s/it] {'loss': 0.011, 'grad_norm': 4.214767961225819, 'learning_rate': 1.916e-07, 'completion_length': 47.65178680419922, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.2744140625, 'epoch': 0.81} 81%|████████ | 2021/2500 [16:31:20<2:56:36, 22.12s/it] 81%|████████ | 2022/2500 [16:31:41<2:54:36, 21.92s/it] {'loss': 0.0203, 'grad_norm': 3.723150323285479, 'learning_rate': 1.912e-07, 'completion_length': 50.142860412597656, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9285714626312256, 'reward_std': 0.1632368564605713, 'kl': 0.505859375, 'epoch': 0.81} 81%|████████ | 2022/2500 [16:31:41<2:54:36, 21.92s/it] 81%|████████ | 2023/2500 [16:32:03<2:54:40, 21.97s/it] {'loss': 0.0373, 'grad_norm': 3.7005901149806455, 'learning_rate': 1.908e-07, 'completion_length': 50.12500190734863, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9107143878936768, 'reward_std': 0.1643299236893654, 'kl': 0.9326171875, 'epoch': 0.81} 81%|████████ | 2023/2500 [16:32:03<2:54:40, 21.97s/it] 81%|████████ | 2024/2500 [16:32:24<2:51:18, 21.59s/it] {'loss': 0.0093, 'grad_norm': 1.2260043075595357, 'learning_rate': 1.904e-07, 'completion_length': 52.79464530944824, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.231689453125, 'epoch': 0.81} 81%|████████ | 2024/2500 [16:32:24<2:51:18, 21.59s/it] 81%|████████ | 2025/2500 [16:32:46<2:52:31, 21.79s/it] {'loss': 0.0087, 'grad_norm': 1.6725489584375022, 'learning_rate': 1.8999999999999998e-07, 'completion_length': 53.49107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 
'reward_std': 0.025253813713788986, 'kl': 0.216796875, 'epoch': 0.81} 81%|████████ | 2025/2500 [16:32:46<2:52:31, 21.79s/it] 81%|████████ | 2026/2500 [16:33:08<2:52:35, 21.85s/it] {'loss': 0.0058, 'grad_norm': 1.894458037966532, 'learning_rate': 1.8959999999999998e-07, 'completion_length': 48.45535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.14599609375, 'epoch': 0.81} 81%|████████ | 2026/2500 [16:33:08<2:52:35, 21.85s/it] 81%|████████ | 2027/2500 [16:33:30<2:51:11, 21.71s/it] {'loss': 0.0231, 'grad_norm': 4.4940500881458805, 'learning_rate': 1.892e-07, 'completion_length': 55.82143020629883, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9196429252624512, 'reward_std': 0.22728432714939117, 'kl': 0.5771484375, 'epoch': 0.81} 81%|████████ | 2027/2500 [16:33:30<2:51:11, 21.71s/it] 81%|████████ | 2028/2500 [16:33:52<2:53:37, 22.07s/it] {'loss': 0.0371, 'grad_norm': 4.065887482116038, 'learning_rate': 1.888e-07, 'completion_length': 54.42857360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.11663764715194702, 'kl': 0.92431640625, 'epoch': 0.81} 81%|████████ | 2028/2500 [16:33:52<2:53:37, 22.07s/it] 81%|████████ | 2029/2500 [16:34:13<2:49:54, 21.64s/it] {'loss': 0.0134, 'grad_norm': 1.4658386204554947, 'learning_rate': 1.884e-07, 'completion_length': 49.473215103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.3359375, 'epoch': 0.81} 81%|████████ | 2029/2500 [16:34:13<2:49:54, 21.64s/it] 81%|████████ | 2030/2500 [16:34:35<2:49:33, 21.65s/it] {'loss': 0.0145, 'grad_norm': 3.4148451582863726, 'learning_rate': 1.88e-07, 'completion_length': 51.535715103149414, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 
1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.361328125, 'epoch': 0.81} 81%|████████ | 2030/2500 [16:34:35<2:49:33, 21.65s/it] 81%|████████ | 2031/2500 [16:34:58<2:53:19, 22.17s/it] {'loss': 0.0234, 'grad_norm': 2.988827450244977, 'learning_rate': 1.8759999999999999e-07, 'completion_length': 54.46428871154785, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9196429252624512, 'reward_std': 0.18849068135023117, 'kl': 0.5849609375, 'epoch': 0.81} 81%|████████ | 2031/2500 [16:34:58<2:53:19, 22.17s/it] 81%|████████▏ | 2032/2500 [16:35:19<2:49:32, 21.74s/it] {'loss': 0.0286, 'grad_norm': 3.0969922066591966, 'learning_rate': 1.872e-07, 'completion_length': 52.73214530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.712890625, 'epoch': 0.81} 81%|████████▏ | 2032/2500 [16:35:19<2:49:32, 21.74s/it] 81%|████████▏ | 2033/2500 [16:35:40<2:47:24, 21.51s/it] {'loss': 0.0103, 'grad_norm': 1.3131852214205206, 'learning_rate': 1.8679999999999998e-07, 'completion_length': 52.93750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.25830078125, 'epoch': 0.81} 81%|████████▏ | 2033/2500 [16:35:40<2:47:24, 21.51s/it] 81%|████████▏ | 2034/2500 [16:36:02<2:49:23, 21.81s/it] {'loss': 0.01, 'grad_norm': 2.0562893782524405, 'learning_rate': 1.864e-07, 'completion_length': 56.45535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.25048828125, 'epoch': 0.81} 81%|████████▏ | 2034/2500 [16:36:02<2:49:23, 21.81s/it] 81%|████████▏ | 2035/2500 [16:36:24<2:47:45, 21.65s/it] {'loss': 0.0137, 'grad_norm': 1.7533292206743838, 'learning_rate': 1.86e-07, 'completion_length': 
57.88393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.3408203125, 'epoch': 0.81} 81%|████████▏ | 2035/2500 [16:36:24<2:47:45, 21.65s/it] 81%|████████▏ | 2036/2500 [16:36:45<2:46:13, 21.49s/it] {'loss': 0.0133, 'grad_norm': 1.2755943520960749, 'learning_rate': 1.8559999999999997e-07, 'completion_length': 51.91964530944824, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857909202576, 'reward_std': 0.053144559264183044, 'kl': 0.333984375, 'epoch': 0.81} 81%|████████▏ | 2036/2500 [16:36:45<2:46:13, 21.49s/it] 81%|████████▏ | 2037/2500 [16:37:06<2:45:42, 21.47s/it] {'loss': 0.012, 'grad_norm': 1.105656911652814, 'learning_rate': 1.852e-07, 'completion_length': 51.15178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.298828125, 'epoch': 0.81} 81%|████████▏ | 2037/2500 [16:37:06<2:45:42, 21.47s/it] 82%|████████▏ | 2038/2500 [16:37:27<2:42:50, 21.15s/it] {'loss': 0.0101, 'grad_norm': 2.7798464576842137, 'learning_rate': 1.848e-07, 'completion_length': 53.71428871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.2529296875, 'epoch': 0.82} 82%|████████▏ | 2038/2500 [16:37:27<2:42:50, 21.15s/it] 82%|████████▏ | 2039/2500 [16:37:49<2:44:24, 21.40s/it] {'loss': 0.0077, 'grad_norm': 3.0807143728044775, 'learning_rate': 1.844e-07, 'completion_length': 51.473215103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1923828125, 'epoch': 0.82} 82%|████████▏ | 2039/2500 [16:37:49<2:44:24, 21.40s/it] 82%|████████▏ | 2040/2500 [16:38:10<2:45:12, 21.55s/it] {'loss': 0.0787, 'grad_norm': 22.618710596083996, 
'learning_rate': 1.8399999999999998e-07, 'completion_length': 56.80357551574707, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.07839837297797203, 'kl': 1.96484375, 'epoch': 0.82} 82%|████████▏ | 2040/2500 [16:38:10<2:45:12, 21.55s/it] 82%|████████▏ | 2041/2500 [16:38:32<2:44:38, 21.52s/it] {'loss': 0.0057, 'grad_norm': 1.7788179849736145, 'learning_rate': 1.836e-07, 'completion_length': 61.08928871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1416015625, 'epoch': 0.82} 82%|████████▏ | 2041/2500 [16:38:32<2:44:38, 21.52s/it] 82%|████████▏ | 2042/2500 [16:38:54<2:46:09, 21.77s/it] {'loss': 0.0082, 'grad_norm': 1.766725006044646, 'learning_rate': 1.832e-07, 'completion_length': 56.34821701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.20556640625, 'epoch': 0.82} 82%|████████▏ | 2042/2500 [16:38:54<2:46:09, 21.77s/it] 82%|████████▏ | 2043/2500 [16:39:15<2:44:31, 21.60s/it] {'loss': 0.1403, 'grad_norm': 23.706806204209453, 'learning_rate': 1.8279999999999997e-07, 'completion_length': 54.02678680419922, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8660714626312256, 'reward_std': 0.15415982902050018, 'kl': 3.4912109375, 'epoch': 0.82} 82%|████████▏ | 2043/2500 [16:39:15<2:44:31, 21.60s/it] 82%|████████▏ | 2044/2500 [16:39:37<2:44:47, 21.68s/it] {'loss': 0.0072, 'grad_norm': 2.6047995566278574, 'learning_rate': 1.824e-07, 'completion_length': 51.72321701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.1806640625, 'epoch': 0.82} 82%|████████▏ | 2044/2500 [16:39:37<2:44:47, 21.68s/it] 
82%|████████▏ | 2045/2500 [16:39:58<2:41:34, 21.31s/it] {'loss': 0.0046, 'grad_norm': 1.9154490450790276, 'learning_rate': 1.82e-07, 'completion_length': 54.35714340209961, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9107143878936768, 'reward_std': 0.05050762742757797, 'kl': 0.115966796875, 'epoch': 0.82} 82%|████████▏ | 2045/2500 [16:39:58<2:41:34, 21.31s/it] 82%|████████▏ | 2046/2500 [16:40:20<2:42:19, 21.45s/it] {'loss': 0.0058, 'grad_norm': 0.360539161859477, 'learning_rate': 1.816e-07, 'completion_length': 55.74107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14453125, 'epoch': 0.82} 82%|████████▏ | 2046/2500 [16:40:20<2:42:19, 21.45s/it] 82%|████████▏ | 2047/2500 [16:40:41<2:42:32, 21.53s/it] {'loss': 0.0062, 'grad_norm': 0.45212100919442394, 'learning_rate': 1.8119999999999998e-07, 'completion_length': 53.60714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15478515625, 'epoch': 0.82} 82%|████████▏ | 2047/2500 [16:40:41<2:42:32, 21.53s/it] 82%|████████▏ | 2048/2500 [16:41:03<2:42:19, 21.55s/it] {'loss': 0.0222, 'grad_norm': 26.723850127142487, 'learning_rate': 1.8079999999999998e-07, 'completion_length': 52.98214530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.8928572535514832, 'reward_std': 0.0835726335644722, 'kl': 0.5546875, 'epoch': 0.82} 82%|████████▏ | 2048/2500 [16:41:03<2:42:19, 21.55s/it] 82%|████████▏ | 2049/2500 [16:41:24<2:40:57, 21.41s/it] {'loss': 0.0056, 'grad_norm': 0.3506241505042895, 'learning_rate': 1.804e-07, 'completion_length': 53.95535850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.140625, 'epoch': 0.82} 82%|████████▏ | 2049/2500 [16:41:24<2:40:57, 21.41s/it] 82%|████████▏ | 2050/2500 [16:41:46<2:43:05, 21.75s/it] 
{'loss': 0.0065, 'grad_norm': 0.3895580657154238, 'learning_rate': 1.8e-07, 'completion_length': 53.50893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1630859375, 'epoch': 0.82} 82%|████████▏ | 2050/2500 [16:41:46<2:43:05, 21.75s/it] 82%|████████▏ | 2051/2500 [16:42:08<2:41:32, 21.59s/it] {'loss': 0.0645, 'grad_norm': 24.133068685715454, 'learning_rate': 1.796e-07, 'completion_length': 52.57143020629883, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.910714328289032, 'reward_std': 0.09528662264347076, 'kl': 1.609619140625, 'epoch': 0.82} 82%|████████▏ | 2051/2500 [16:42:08<2:41:32, 21.59s/it] 82%|████████▏ | 2052/2500 [16:42:29<2:40:51, 21.54s/it] {'loss': 0.0043, 'grad_norm': 0.2672386013439083, 'learning_rate': 1.792e-07, 'completion_length': 51.75000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.108642578125, 'epoch': 0.82} 82%|████████▏ | 2052/2500 [16:42:29<2:40:51, 21.54s/it] 82%|████████▏ | 2053/2500 [16:42:50<2:38:50, 21.32s/it] {'loss': 0.0956, 'grad_norm': 15.046471209434241, 'learning_rate': 1.7879999999999999e-07, 'completion_length': 46.46428680419922, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.821428656578064, 'reward_std': 0.24302531778812408, 'kl': 2.3876953125, 'epoch': 0.82} 82%|████████▏ | 2053/2500 [16:42:50<2:38:50, 21.32s/it] 82%|████████▏ | 2054/2500 [16:43:11<2:37:18, 21.16s/it] {'loss': 0.0031, 'grad_norm': 0.22821514383203975, 'learning_rate': 1.7839999999999998e-07, 'completion_length': 54.437503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.077392578125, 'epoch': 0.82} 82%|████████▏ | 2054/2500 [16:43:11<2:37:18, 21.16s/it] 82%|████████▏ | 2055/2500 [16:43:32<2:37:14, 21.20s/it] {'loss': 0.0236, 'grad_norm': 
5.796662543849074, 'learning_rate': 1.7799999999999998e-07, 'completion_length': 54.500003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.5908203125, 'epoch': 0.82} 82%|████████▏ | 2055/2500 [16:43:32<2:37:14, 21.20s/it] 82%|████████▏ | 2056/2500 [16:43:54<2:38:30, 21.42s/it] {'loss': 0.0044, 'grad_norm': 0.4293881459021464, 'learning_rate': 1.776e-07, 'completion_length': 55.392860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11083984375, 'epoch': 0.82} 82%|████████▏ | 2056/2500 [16:43:54<2:38:30, 21.42s/it] 82%|████████▏ | 2057/2500 [16:44:15<2:38:17, 21.44s/it] {'loss': 0.0059, 'grad_norm': 0.3283132788883188, 'learning_rate': 1.772e-07, 'completion_length': 56.72321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1484375, 'epoch': 0.82} 82%|████████▏ | 2057/2500 [16:44:15<2:38:17, 21.44s/it] 82%|████████▏ | 2058/2500 [16:44:37<2:37:09, 21.33s/it] {'loss': 0.0449, 'grad_norm': 36.18695889491017, 'learning_rate': 1.768e-07, 'completion_length': 56.48214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 1.124267578125, 'epoch': 0.82} 82%|████████▏ | 2058/2500 [16:44:37<2:37:09, 21.33s/it] 82%|████████▏ | 2059/2500 [16:45:03<2:48:15, 22.89s/it] {'loss': 0.0416, 'grad_norm': 46.18386380751456, 'learning_rate': 1.764e-07, 'completion_length': 45.73214530944824, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.848214328289032, 'reward': 1.8035715222358704, 'reward_std': 0.21661832928657532, 'kl': 1.041015625, 'epoch': 0.82} 82%|████████▏ | 2059/2500 [16:45:03<2:48:15, 22.89s/it] 82%|████████▏ | 2060/2500 [16:45:25<2:44:48, 22.47s/it] {'loss': 0.0211, 'grad_norm': 
29.2919613536848, 'learning_rate': 1.76e-07, 'completion_length': 55.41964530944824, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.821428656578064, 'reward_std': 0.14970264583826065, 'kl': 0.52880859375, 'epoch': 0.82} 82%|████████▏ | 2060/2500 [16:45:25<2:44:48, 22.47s/it] 82%|████████▏ | 2061/2500 [16:45:47<2:43:53, 22.40s/it] {'loss': 0.0083, 'grad_norm': 6.076586266738614, 'learning_rate': 1.756e-07, 'completion_length': 58.482147216796875, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2080078125, 'epoch': 0.82} 82%|████████▏ | 2061/2500 [16:45:47<2:43:53, 22.40s/it] 82%|████████▏ | 2062/2500 [16:46:08<2:40:56, 22.05s/it] {'loss': 0.092, 'grad_norm': 40.03661270051147, 'learning_rate': 1.7519999999999998e-07, 'completion_length': 48.75893211364746, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.8482143878936768, 'reward_std': 0.10361771285533905, 'kl': 2.3046875, 'epoch': 0.82} 82%|████████▏ | 2062/2500 [16:46:08<2:40:56, 22.05s/it] 83%|████████▎ | 2063/2500 [16:46:29<2:37:38, 21.64s/it] {'loss': 0.0174, 'grad_norm': 83.19980307461047, 'learning_rate': 1.748e-07, 'completion_length': 53.82143211364746, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.433349609375, 'epoch': 0.83} 83%|████████▎ | 2063/2500 [16:46:29<2:37:38, 21.64s/it] 83%|████████▎ | 2064/2500 [16:46:52<2:41:36, 22.24s/it] {'loss': 0.0267, 'grad_norm': 37.16712145227509, 'learning_rate': 1.744e-07, 'completion_length': 52.85714530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9375000596046448, 'reward_std': 0.1418914496898651, 'kl': 0.66845703125, 'epoch': 0.83} 83%|████████▎ | 2064/2500 
[16:46:52<2:41:36, 22.24s/it] 83%|████████▎ | 2065/2500 [16:47:14<2:40:59, 22.21s/it] {'loss': 0.0163, 'grad_norm': 4.860943542680425, 'learning_rate': 1.7399999999999997e-07, 'completion_length': 53.50893020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9375000596046448, 'reward_std': 0.10365219414234161, 'kl': 0.40673828125, 'epoch': 0.83} 83%|████████▎ | 2065/2500 [16:47:14<2:40:59, 22.21s/it] 83%|████████▎ | 2066/2500 [16:47:36<2:38:55, 21.97s/it] {'loss': 0.0272, 'grad_norm': 43.507044152444294, 'learning_rate': 1.736e-07, 'completion_length': 44.25893020629883, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.7410714626312256, 'reward': 1.6696429252624512, 'reward_std': 0.18742558360099792, 'kl': 0.681640625, 'epoch': 0.83} 83%|████████▎ | 2066/2500 [16:47:36<2:38:55, 21.97s/it] 83%|████████▎ | 2067/2500 [16:47:59<2:40:47, 22.28s/it] {'loss': 0.0154, 'grad_norm': 28.57442746593766, 'learning_rate': 1.732e-07, 'completion_length': 52.97321701049805, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.8750000596046448, 'reward_std': 0.11663764342665672, 'kl': 0.3837890625, 'epoch': 0.83} 83%|████████▎ | 2067/2500 [16:47:59<2:40:47, 22.28s/it] 83%|████████▎ | 2068/2500 [16:48:21<2:39:07, 22.10s/it] {'loss': 0.0081, 'grad_norm': 9.5146082225441, 'learning_rate': 1.728e-07, 'completion_length': 52.348215103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.928571492433548, 'reward': 1.9196429252624512, 'reward_std': 0.06343399360775948, 'kl': 0.201171875, 'epoch': 0.83} 83%|████████▎ | 2068/2500 [16:48:21<2:39:07, 22.10s/it] 83%|████████▎ | 2069/2500 [16:48:43<2:38:39, 22.09s/it] {'loss': 0.056, 'grad_norm': 18.484102622760172, 'learning_rate': 1.7239999999999998e-07, 'completion_length': 55.16964530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 
0.9285714626312256, 'reward': 1.9017857909202576, 'reward_std': 0.16262901946902275, 'kl': 1.39697265625, 'epoch': 0.83} 83%|████████▎ | 2069/2500 [16:48:43<2:38:39, 22.09s/it] 83%|████████▎ | 2070/2500 [16:49:04<2:37:17, 21.95s/it] {'loss': 0.0373, 'grad_norm': 7.949321557751448, 'learning_rate': 1.7199999999999998e-07, 'completion_length': 46.06250190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.8392857313156128, 'reward_std': 0.06331466138362885, 'kl': 0.93310546875, 'epoch': 0.83} 83%|████████▎ | 2070/2500 [16:49:04<2:37:17, 21.95s/it] 83%|████████▎ | 2071/2500 [16:49:26<2:35:29, 21.75s/it] {'loss': 0.0114, 'grad_norm': 1.3043093119494644, 'learning_rate': 1.716e-07, 'completion_length': 58.57143211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.28564453125, 'epoch': 0.83} 83%|████████▎ | 2071/2500 [16:49:26<2:35:29, 21.75s/it] 83%|████████▎ | 2072/2500 [16:49:48<2:35:56, 21.86s/it] {'loss': 0.0256, 'grad_norm': 5.319164425907401, 'learning_rate': 1.7119999999999997e-07, 'completion_length': 56.267860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.63818359375, 'epoch': 0.83} 83%|████████▎ | 2072/2500 [16:49:48<2:35:56, 21.86s/it] 83%|████████▎ | 2073/2500 [16:50:10<2:37:19, 22.11s/it] {'loss': 0.0054, 'grad_norm': 0.33913035737984165, 'learning_rate': 1.708e-07, 'completion_length': 59.25000190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1357421875, 'epoch': 0.83} 83%|████████▎ | 2073/2500 [16:50:10<2:37:19, 22.11s/it] 83%|████████▎ | 2074/2500 [16:50:32<2:35:15, 21.87s/it] {'loss': 0.0446, 'grad_norm': 4.943154655474771, 'learning_rate': 1.704e-07, 'completion_length': 53.53571701049805, 
'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.8928571939468384, 'reward_std': 0.11382229626178741, 'kl': 1.111328125, 'epoch': 0.83} 83%|████████▎ | 2074/2500 [16:50:32<2:35:15, 21.87s/it] 83%|████████▎ | 2075/2500 [16:50:54<2:35:18, 21.93s/it] {'loss': 0.0154, 'grad_norm': 20.560179875459962, 'learning_rate': 1.7000000000000001e-07, 'completion_length': 53.74107551574707, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.8482143878936768, 'reward_std': 0.1030978113412857, 'kl': 0.38427734375, 'epoch': 0.83} 83%|████████▎ | 2075/2500 [16:50:54<2:35:18, 21.93s/it] 83%|████████▎ | 2076/2500 [16:51:15<2:33:56, 21.78s/it] {'loss': 0.0076, 'grad_norm': 20.252870196177742, 'learning_rate': 1.6959999999999998e-07, 'completion_length': 56.47321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.18994140625, 'epoch': 0.83} 83%|████████▎ | 2076/2500 [16:51:15<2:33:56, 21.78s/it] 83%|████████▎ | 2077/2500 [16:51:36<2:31:53, 21.55s/it] {'loss': 0.0332, 'grad_norm': 22.151216670273758, 'learning_rate': 1.6919999999999998e-07, 'completion_length': 54.01785850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.82958984375, 'epoch': 0.83} 83%|████████▎ | 2077/2500 [16:51:36<2:31:53, 21.55s/it] 83%|████████▎ | 2078/2500 [16:51:58<2:32:04, 21.62s/it] {'loss': 0.0046, 'grad_norm': 0.31029472706096534, 'learning_rate': 1.688e-07, 'completion_length': 60.91964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115966796875, 'epoch': 0.83} 83%|████████▎ | 2078/2500 [16:51:58<2:32:04, 21.62s/it] 83%|████████▎ | 2079/2500 [16:52:19<2:30:31, 21.45s/it] {'loss': 0.0215, 'grad_norm': 
14.166304217081606, 'learning_rate': 1.684e-07, 'completion_length': 54.48214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.5390625, 'epoch': 0.83} 83%|████████▎ | 2079/2500 [16:52:19<2:30:31, 21.45s/it] 83%|████████▎ | 2080/2500 [16:52:41<2:31:17, 21.61s/it] {'loss': 0.0051, 'grad_norm': 0.2499975230095395, 'learning_rate': 1.68e-07, 'completion_length': 55.99107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.126220703125, 'epoch': 0.83} 83%|████████▎ | 2080/2500 [16:52:41<2:31:17, 21.61s/it] 83%|████████▎ | 2081/2500 [16:53:03<2:31:56, 21.76s/it] {'loss': 0.0045, 'grad_norm': 0.9364002619246099, 'learning_rate': 1.676e-07, 'completion_length': 56.955360412597656, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.11279296875, 'epoch': 0.83} 83%|████████▎ | 2081/2500 [16:53:03<2:31:56, 21.76s/it] 83%|████████▎ | 2082/2500 [16:53:24<2:29:35, 21.47s/it] {'loss': 0.0043, 'grad_norm': 0.22481157409802047, 'learning_rate': 1.672e-07, 'completion_length': 56.45535850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1083984375, 'epoch': 0.83} 83%|████████▎ | 2082/2500 [16:53:24<2:29:35, 21.47s/it] 83%|████████▎ | 2083/2500 [16:53:45<2:29:10, 21.46s/it] {'loss': 0.0288, 'grad_norm': 5.531832833837751, 'learning_rate': 1.6679999999999998e-07, 'completion_length': 53.85714530944824, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9285714626312256, 'reward_std': 0.12444323301315308, 'kl': 0.718017578125, 'epoch': 0.83} 83%|████████▎ | 2083/2500 [16:53:45<2:29:10, 21.46s/it] 83%|████████▎ | 2084/2500 [16:54:07<2:30:07, 21.65s/it] {'loss': 0.0045, 'grad_norm': 0.29720545381706664, 
'learning_rate': 1.6639999999999998e-07, 'completion_length': 61.16964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11328125, 'epoch': 0.83} 83%|████████▎ | 2084/2500 [16:54:07<2:30:07, 21.65s/it] 83%|████████▎ | 2085/2500 [16:54:28<2:28:14, 21.43s/it] {'loss': 0.0688, 'grad_norm': 10.118692496627103, 'learning_rate': 1.66e-07, 'completion_length': 53.33035850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9464285969734192, 'reward_std': 0.09919501841068268, 'kl': 1.71728515625, 'epoch': 0.83} 83%|████████▎ | 2085/2500 [16:54:28<2:28:14, 21.43s/it] 83%|████████▎ | 2086/2500 [16:54:49<2:27:16, 21.35s/it] {'loss': 0.0128, 'grad_norm': 4.860588243476005, 'learning_rate': 1.656e-07, 'completion_length': 53.66964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3212890625, 'epoch': 0.83} 83%|████████▎ | 2086/2500 [16:54:50<2:27:16, 21.35s/it] 83%|████████▎ | 2087/2500 [16:55:12<2:28:20, 21.55s/it] {'loss': 0.0482, 'grad_norm': 8.651036834358905, 'learning_rate': 1.652e-07, 'completion_length': 58.47321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9464285969734192, 'reward_std': 0.08868780732154846, 'kl': 1.20068359375, 'epoch': 0.83} 83%|████████▎ | 2087/2500 [16:55:12<2:28:20, 21.55s/it] 84%|████████▎ | 2088/2500 [16:55:33<2:26:58, 21.40s/it] {'loss': 0.0549, 'grad_norm': 8.981921540382483, 'learning_rate': 1.648e-07, 'completion_length': 60.80357360839844, 'rewards/accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8660714626312256, 'reward_std': 0.1379830539226532, 'kl': 1.37353515625, 'epoch': 0.84} 84%|████████▎ | 2088/2500 [16:55:33<2:26:58, 21.40s/it] 84%|████████▎ | 2089/2500 [16:55:55<2:29:40, 
21.85s/it] {'loss': 0.0389, 'grad_norm': 4.3077952628992024, 'learning_rate': 1.644e-07, 'completion_length': 58.955360412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.06613001227378845, 'kl': 0.975341796875, 'epoch': 0.84} 84%|████████▎ | 2089/2500 [16:55:55<2:29:40, 21.85s/it] 84%|████████▎ | 2090/2500 [16:56:16<2:27:27, 21.58s/it] {'loss': 0.0084, 'grad_norm': 1.7904580291560956, 'learning_rate': 1.64e-07, 'completion_length': 60.54464530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.21044921875, 'epoch': 0.84} 84%|████████▎ | 2090/2500 [16:56:16<2:27:27, 21.58s/it] 84%|████████▎ | 2091/2500 [16:56:39<2:28:59, 21.86s/it] {'loss': 0.0078, 'grad_norm': 3.2536332315113303, 'learning_rate': 1.6359999999999998e-07, 'completion_length': 55.69643020629883, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.883928656578064, 'reward_std': 0.12626906484365463, 'kl': 0.19580078125, 'epoch': 0.84} 84%|████████▎ | 2091/2500 [16:56:39<2:28:59, 21.86s/it] 84%|████████▎ | 2092/2500 [16:57:02<2:30:46, 22.17s/it] {'loss': 0.0054, 'grad_norm': 5.445724877725941, 'learning_rate': 1.632e-07, 'completion_length': 57.10714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.134521484375, 'epoch': 0.84} 84%|████████▎ | 2092/2500 [16:57:02<2:30:46, 22.17s/it] 84%|████████▎ | 2093/2500 [16:57:22<2:25:51, 21.50s/it] {'loss': 0.0276, 'grad_norm': 3.0768081262117053, 'learning_rate': 1.628e-07, 'completion_length': 53.27678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.689453125, 'epoch': 0.84} 
84%|████████▎ | 2093/2500 [16:57:22<2:25:51, 21.50s/it] 84%|████████▍ | 2094/2500 [16:57:43<2:25:10, 21.45s/it] {'loss': 0.0065, 'grad_norm': 0.485739607111491, 'learning_rate': 1.6239999999999997e-07, 'completion_length': 56.44643211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.16162109375, 'epoch': 0.84} 84%|████████▍ | 2094/2500 [16:57:43<2:25:10, 21.45s/it] 84%|████████▍ | 2095/2500 [16:58:05<2:25:08, 21.50s/it] {'loss': 0.0055, 'grad_norm': 0.2829229932481401, 'learning_rate': 1.62e-07, 'completion_length': 56.54464530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13623046875, 'epoch': 0.84} 84%|████████▍ | 2095/2500 [16:58:05<2:25:08, 21.50s/it] 84%|████████▍ | 2096/2500 [16:58:26<2:24:10, 21.41s/it] {'loss': 0.0363, 'grad_norm': 9.051624213526567, 'learning_rate': 1.616e-07, 'completion_length': 54.85714530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.904296875, 'epoch': 0.84} 84%|████████▍ | 2096/2500 [16:58:26<2:24:10, 21.41s/it] 84%|████████▍ | 2097/2500 [16:58:48<2:25:46, 21.70s/it] {'loss': 0.007, 'grad_norm': 0.905992083171454, 'learning_rate': 1.6120000000000001e-07, 'completion_length': 56.03571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.17529296875, 'epoch': 0.84} 84%|████████▍ | 2097/2500 [16:58:49<2:25:46, 21.70s/it] 84%|████████▍ | 2098/2500 [16:59:12<2:29:02, 22.25s/it] {'loss': 0.0168, 'grad_norm': 3.904400268497841, 'learning_rate': 1.6079999999999998e-07, 'completion_length': 56.72321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.42041015625, 
'epoch': 0.84} 84%|████████▍ | 2098/2500 [16:59:12<2:29:02, 22.25s/it] 84%|████████▍ | 2099/2500 [16:59:33<2:27:26, 22.06s/it] {'loss': 0.0208, 'grad_norm': 4.597630080284373, 'learning_rate': 1.6039999999999998e-07, 'completion_length': 54.23214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.5185546875, 'epoch': 0.84} 84%|████████▍ | 2099/2500 [16:59:33<2:27:26, 22.06s/it] 84%|████████▍ | 2100/2500 [16:59:56<2:27:38, 22.15s/it] {'loss': 0.0343, 'grad_norm': 14.011028454152958, 'learning_rate': 1.6e-07, 'completion_length': 53.27678680419922, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.10365218669176102, 'kl': 0.85546875, 'epoch': 0.84} 84%|████████▍ | 2100/2500 [16:59:56<2:27:38, 22.15s/it] 84%|████████▍ | 2101/2500 [17:01:00<3:50:19, 34.64s/it] {'loss': 0.0063, 'grad_norm': 2.2225651230024375, 'learning_rate': 1.5959999999999997e-07, 'completion_length': 54.437503814697266, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.15869140625, 'epoch': 0.84} 84%|████████▍ | 2101/2500 [17:01:00<3:50:19, 34.64s/it] 84%|████████▍ | 2102/2500 [17:01:20<3:20:53, 30.29s/it] {'loss': 0.0147, 'grad_norm': 5.2314503106621615, 'learning_rate': 1.592e-07, 'completion_length': 60.607147216796875, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.1030978113412857, 'kl': 0.365478515625, 'epoch': 0.84} 84%|████████▍ | 2102/2500 [17:01:20<3:20:53, 30.29s/it] 84%|████████▍ | 2103/2500 [17:01:41<3:01:37, 27.45s/it] {'loss': 0.0064, 'grad_norm': 1.703263664004354, 'learning_rate': 1.588e-07, 'completion_length': 54.81250190734863, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.15966796875, 'epoch': 0.84} 84%|████████▍ | 2103/2500 [17:01:41<3:01:37, 27.45s/it] 84%|████████▍ | 2104/2500 [17:02:00<2:44:23, 24.91s/it] {'loss': 0.04, 'grad_norm': 5.128113093309668, 'learning_rate': 1.5840000000000002e-07, 'completion_length': 50.46428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.9970703125, 'epoch': 0.84} 84%|████████▍ | 2104/2500 [17:02:00<2:44:23, 24.91s/it] 84%|████████▍ | 2105/2500 [17:02:20<2:34:25, 23.46s/it] {'loss': 0.0042, 'grad_norm': 2.114760916079981, 'learning_rate': 1.5799999999999999e-07, 'completion_length': 56.21428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.10595703125, 'epoch': 0.84} 84%|████████▍ | 2105/2500 [17:02:20<2:34:25, 23.46s/it] 84%|████████▍ | 2106/2500 [17:02:39<2:25:55, 22.22s/it] {'loss': 0.0039, 'grad_norm': 0.22364447951511265, 'learning_rate': 1.5759999999999998e-07, 'completion_length': 56.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09765625, 'epoch': 0.84} 84%|████████▍ | 2106/2500 [17:02:39<2:25:55, 22.22s/it] 84%|████████▍ | 2107/2500 [17:02:59<2:21:17, 21.57s/it] {'loss': 0.0074, 'grad_norm': 1.370602400697527, 'learning_rate': 1.572e-07, 'completion_length': 56.21428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.18505859375, 'epoch': 0.84} 84%|████████▍ | 2107/2500 [17:02:59<2:21:17, 21.57s/it] 84%|████████▍ | 2108/2500 [17:03:18<2:15:45, 20.78s/it] {'loss': 0.0033, 'grad_norm': 0.1355136250200453, 'learning_rate': 1.5679999999999997e-07, 
'completion_length': 57.90178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.083740234375, 'epoch': 0.84} 84%|████████▍ | 2108/2500 [17:03:18<2:15:45, 20.78s/it] 84%|████████▍ | 2109/2500 [17:03:38<2:13:33, 20.49s/it] {'loss': 0.0039, 'grad_norm': 0.11790382299371929, 'learning_rate': 1.564e-07, 'completion_length': 61.19643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.096435546875, 'epoch': 0.84} 84%|████████▍ | 2109/2500 [17:03:38<2:13:33, 20.49s/it] 84%|████████▍ | 2110/2500 [17:03:58<2:12:28, 20.38s/it] {'loss': 0.0045, 'grad_norm': 0.29977329550856424, 'learning_rate': 1.56e-07, 'completion_length': 60.892860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11328125, 'epoch': 0.84} 84%|████████▍ | 2110/2500 [17:03:58<2:12:28, 20.38s/it] 84%|████████▍ | 2111/2500 [17:04:17<2:08:49, 19.87s/it] {'loss': 0.05, 'grad_norm': 14.932573196209859, 'learning_rate': 1.556e-07, 'completion_length': 55.72321701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.910714328289032, 'reward_std': 0.1643299162387848, 'kl': 1.251953125, 'epoch': 0.84} 84%|████████▍ | 2111/2500 [17:04:17<2:08:49, 19.87s/it] 84%|████████▍ | 2112/2500 [17:04:35<2:06:41, 19.59s/it] {'loss': 0.0204, 'grad_norm': 6.895284580519761, 'learning_rate': 1.552e-07, 'completion_length': 51.08035850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9464285969734192, 'reward_std': 0.06331466138362885, 'kl': 0.509765625, 'epoch': 0.84} 84%|████████▍ | 2112/2500 [17:04:35<2:06:41, 19.59s/it] 85%|████████▍ | 2113/2500 [17:04:55<2:06:17, 19.58s/it] {'loss': 0.0425, 'grad_norm': 8.583024893777393, 'learning_rate': 1.5479999999999998e-07, 'completion_length': 50.33928871154785, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9642857313156128, 'reward_std': 0.05399492755532265, 'kl': 1.061767578125, 'epoch': 0.85} 85%|████████▍ | 2113/2500 [17:04:55<2:06:17, 19.58s/it] 85%|████████▍ | 2114/2500 [17:05:14<2:04:38, 19.37s/it] {'loss': 0.0477, 'grad_norm': 4.155966264208332, 'learning_rate': 1.544e-07, 'completion_length': 50.03571701049805, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9196429252624512, 'reward_std': 0.12708015739917755, 'kl': 1.1982421875, 'epoch': 0.85} 85%|████████▍ | 2114/2500 [17:05:14<2:04:38, 19.37s/it] 85%|████████▍ | 2115/2500 [17:05:33<2:04:09, 19.35s/it] {'loss': 0.0328, 'grad_norm': 7.311728381949586, 'learning_rate': 1.54e-07, 'completion_length': 56.10714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.82177734375, 'epoch': 0.85} 85%|████████▍ | 2115/2500 [17:05:33<2:04:09, 19.35s/it] 85%|████████▍ | 2116/2500 [17:05:53<2:03:43, 19.33s/it] {'loss': 0.0444, 'grad_norm': 9.862152053953768, 'learning_rate': 1.5359999999999997e-07, 'completion_length': 57.67857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9464285969734192, 'reward_std': 0.06331466138362885, 'kl': 1.1025390625, 'epoch': 0.85} 85%|████████▍ | 2116/2500 [17:05:53<2:03:43, 19.33s/it] 85%|████████▍ | 2117/2500 [17:06:11<2:02:34, 19.20s/it] {'loss': 0.0177, 'grad_norm': 3.313417950644015, 'learning_rate': 1.532e-07, 'completion_length': 53.68750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.44140625, 'epoch': 0.85} 85%|████████▍ | 2117/2500 [17:06:11<2:02:34, 19.20s/it] 85%|████████▍ | 2118/2500 [17:06:31<2:02:08, 19.19s/it] 
{'loss': 0.0375, 'grad_norm': 5.82794022383178, 'learning_rate': 1.528e-07, 'completion_length': 56.91071701049805, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.8750000596046448, 'reward_std': 0.033065006136894226, 'kl': 0.93701171875, 'epoch': 0.85} 85%|████████▍ | 2118/2500 [17:06:31<2:02:08, 19.19s/it] 85%|████████▍ | 2119/2500 [17:06:49<2:00:28, 18.97s/it] {'loss': 0.0294, 'grad_norm': 3.345389543804503, 'learning_rate': 1.524e-07, 'completion_length': 54.08928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.736083984375, 'epoch': 0.85} 85%|████████▍ | 2119/2500 [17:06:49<2:00:28, 18.97s/it] 85%|████████▍ | 2120/2500 [17:07:09<2:01:18, 19.15s/it] {'loss': 0.058, 'grad_norm': 11.822956537560305, 'learning_rate': 1.5199999999999998e-07, 'completion_length': 48.63393211364746, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.8660715222358704, 'reward_std': 0.1835171803832054, 'kl': 1.447265625, 'epoch': 0.85} 85%|████████▍ | 2120/2500 [17:07:09<2:01:18, 19.15s/it] 85%|████████▍ | 2121/2500 [17:07:28<2:00:46, 19.12s/it] {'loss': 0.0303, 'grad_norm': 3.324815579563663, 'learning_rate': 1.516e-07, 'completion_length': 51.14285850524902, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.15152288228273392, 'kl': 0.75732421875, 'epoch': 0.85} 85%|████████▍ | 2121/2500 [17:07:28<2:00:46, 19.12s/it] 85%|████████▍ | 2122/2500 [17:07:48<2:02:16, 19.41s/it] {'loss': 0.0343, 'grad_norm': 3.7722785925188598, 'learning_rate': 1.512e-07, 'completion_length': 64.7589340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9642857313156128, 'reward_std': 0.05399492755532265, 'kl': 0.857421875, 'epoch': 
0.85} 85%|████████▍ | 2122/2500 [17:07:48<2:02:16, 19.41s/it] 85%|████████▍ | 2123/2500 [17:08:07<2:01:16, 19.30s/it] {'loss': 0.0803, 'grad_norm': 25.68659003311336, 'learning_rate': 1.5079999999999997e-07, 'completion_length': 54.27678871154785, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9196428656578064, 'reward': 1.8928571939468384, 'reward_std': 0.0835726335644722, 'kl': 2.005126953125, 'epoch': 0.85} 85%|████████▍ | 2123/2500 [17:08:07<2:01:16, 19.30s/it] 85%|████████▍ | 2124/2500 [17:08:26<2:01:07, 19.33s/it] {'loss': 0.0395, 'grad_norm': 8.838970037457191, 'learning_rate': 1.504e-07, 'completion_length': 60.000003814697266, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9375001192092896, 'reward_std': 0.1297563686966896, 'kl': 0.9873046875, 'epoch': 0.85} 85%|████████▍ | 2124/2500 [17:08:26<2:01:07, 19.33s/it] 85%|████████▌ | 2125/2500 [17:08:46<2:02:24, 19.58s/it] {'loss': 0.0531, 'grad_norm': 14.02018652891871, 'learning_rate': 1.5e-07, 'completion_length': 50.16964530944824, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.830357164144516, 'reward': 1.7678572535514832, 'reward_std': 0.24895472079515457, 'kl': 1.33203125, 'epoch': 0.85} 85%|████████▌ | 2125/2500 [17:08:46<2:02:24, 19.58s/it] 85%|████████▌ | 2126/2500 [17:09:06<2:01:20, 19.47s/it] {'loss': 0.0494, 'grad_norm': 7.346929912216324, 'learning_rate': 1.4960000000000002e-07, 'completion_length': 54.10714530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.9196429252624512, 'reward_std': 0.1872248277068138, 'kl': 1.234375, 'epoch': 0.85} 85%|████████▌ | 2126/2500 [17:09:06<2:01:20, 19.47s/it] 85%|████████▌ | 2127/2500 [17:09:25<2:00:48, 19.43s/it] {'loss': 0.0649, 'grad_norm': 14.351060893455266, 'learning_rate': 1.4919999999999999e-07, 'completion_length': 50.705360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 
'rewards/format_reward': 0.910714328289032, 'reward': 1.9017857909202576, 'reward_std': 0.15049393102526665, 'kl': 1.625, 'epoch': 0.85} 85%|████████▌ | 2127/2500 [17:09:25<2:00:48, 19.43s/it] 85%|████████▌ | 2128/2500 [17:09:44<1:59:08, 19.22s/it] {'loss': 0.0385, 'grad_norm': 15.474465939098833, 'learning_rate': 1.4879999999999998e-07, 'completion_length': 56.66071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.05050762742757797, 'kl': 0.963134765625, 'epoch': 0.85} 85%|████████▌ | 2128/2500 [17:09:44<1:59:08, 19.22s/it] 85%|████████▌ | 2129/2500 [17:10:03<1:59:07, 19.27s/it] {'loss': 0.0038, 'grad_norm': 1.7323538015291817, 'learning_rate': 1.484e-07, 'completion_length': 54.13393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.09521484375, 'epoch': 0.85} 85%|████████▌ | 2129/2500 [17:10:03<1:59:07, 19.27s/it] 85%|████████▌ | 2130/2500 [17:10:23<1:59:28, 19.38s/it] {'loss': 0.0529, 'grad_norm': 12.718857187124968, 'learning_rate': 1.4799999999999998e-07, 'completion_length': 53.267860412597656, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.910714328289032, 'reward': 1.848214328289032, 'reward_std': 0.23036431521177292, 'kl': 1.328125, 'epoch': 0.85} 85%|████████▌ | 2130/2500 [17:10:23<1:59:28, 19.38s/it] 85%|████████▌ | 2131/2500 [17:10:45<2:03:51, 20.14s/it] {'loss': 0.0417, 'grad_norm': 19.465055083408416, 'learning_rate': 1.476e-07, 'completion_length': 53.92857360839844, 'rewards/accuracy_reward': 0.9017857313156128, 'rewards/format_reward': 0.910714328289032, 'reward': 1.8125001192092896, 'reward_std': 0.2215556874871254, 'kl': 1.044921875, 'epoch': 0.85} 85%|████████▌ | 2131/2500 [17:10:45<2:03:51, 20.14s/it] 85%|████████▌ | 2132/2500 [17:11:04<2:02:39, 20.00s/it] {'loss': 0.0234, 'grad_norm': 3.7358663770124885, 'learning_rate': 
1.472e-07, 'completion_length': 59.00893020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.58447265625, 'epoch': 0.85} 85%|████████▌ | 2132/2500 [17:11:04<2:02:39, 20.00s/it] 85%|████████▌ | 2133/2500 [17:11:24<2:01:24, 19.85s/it] {'loss': 0.0767, 'grad_norm': 14.165327151414308, 'learning_rate': 1.4680000000000002e-07, 'completion_length': 54.18750190734863, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.8839285969734192, 'reward_std': 0.053144559264183044, 'kl': 1.915771484375, 'epoch': 0.85} 85%|████████▌ | 2133/2500 [17:11:24<2:01:24, 19.85s/it] 85%|████████▌ | 2134/2500 [17:11:43<2:00:32, 19.76s/it] {'loss': 0.0172, 'grad_norm': 4.367340613934552, 'learning_rate': 1.464e-07, 'completion_length': 55.42857360839844, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9375000596046448, 'reward_std': 0.07576144114136696, 'kl': 0.430419921875, 'epoch': 0.85} 85%|████████▌ | 2134/2500 [17:11:43<2:00:32, 19.76s/it] 85%|████████▌ | 2135/2500 [17:12:03<1:59:26, 19.63s/it] {'loss': 0.0141, 'grad_norm': 4.509200766788163, 'learning_rate': 1.4599999999999998e-07, 'completion_length': 55.973215103149414, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.35009765625, 'epoch': 0.85} 85%|████████▌ | 2135/2500 [17:12:03<1:59:26, 19.63s/it] 85%|████████▌ | 2136/2500 [17:12:22<1:58:56, 19.61s/it] {'loss': 0.0062, 'grad_norm': 0.29904904926015297, 'learning_rate': 1.456e-07, 'completion_length': 58.642860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.154296875, 'epoch': 0.85} 85%|████████▌ | 2136/2500 [17:12:22<1:58:56, 19.61s/it] 85%|████████▌ | 2137/2500 
[17:12:41<1:56:44, 19.30s/it] {'loss': 0.0937, 'grad_norm': 25.23702140854771, 'learning_rate': 1.4519999999999998e-07, 'completion_length': 47.75893020629883, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.8750001192092896, 'reward_std': 0.21244414895772934, 'kl': 2.34375, 'epoch': 0.85} 85%|████████▌ | 2137/2500 [17:12:41<1:56:44, 19.30s/it] 86%|████████▌ | 2138/2500 [17:13:00<1:56:51, 19.37s/it] {'loss': 0.0305, 'grad_norm': 7.992204576764207, 'learning_rate': 1.448e-07, 'completion_length': 50.830360412597656, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.10365218669176102, 'kl': 0.7626953125, 'epoch': 0.86} 86%|████████▌ | 2138/2500 [17:13:00<1:56:51, 19.37s/it] 86%|████████▌ | 2139/2500 [17:13:19<1:55:21, 19.17s/it] {'loss': 0.0292, 'grad_norm': 2.2701567232960427, 'learning_rate': 1.444e-07, 'completion_length': 53.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.732421875, 'epoch': 0.86} 86%|████████▌ | 2139/2500 [17:13:19<1:55:21, 19.17s/it] 86%|████████▌ | 2140/2500 [17:13:38<1:55:39, 19.28s/it] {'loss': 0.0316, 'grad_norm': 6.119391782587633, 'learning_rate': 1.44e-07, 'completion_length': 48.58928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.9285714626312256, 'reward_std': 0.07103024423122406, 'kl': 0.792724609375, 'epoch': 0.86} 86%|████████▌ | 2140/2500 [17:13:39<1:55:39, 19.28s/it] 86%|████████▌ | 2141/2500 [17:13:58<1:55:46, 19.35s/it] {'loss': 0.0512, 'grad_norm': 4.162080310018337, 'learning_rate': 1.436e-07, 'completion_length': 54.89285850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 1.27734375, 
'epoch': 0.86} 86%|████████▌ | 2141/2500 [17:13:58<1:55:46, 19.35s/it] 86%|████████▌ | 2142/2500 [17:14:18<1:56:05, 19.46s/it] {'loss': 0.0134, 'grad_norm': 8.90194633555125, 'learning_rate': 1.4319999999999999e-07, 'completion_length': 52.02678871154785, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375000596046448, 'reward_std': 0.11394162103533745, 'kl': 0.333984375, 'epoch': 0.86} 86%|████████▌ | 2142/2500 [17:14:18<1:56:05, 19.46s/it] 86%|████████▌ | 2143/2500 [17:14:37<1:55:45, 19.45s/it] {'loss': 0.0316, 'grad_norm': 3.5247493437131485, 'learning_rate': 1.428e-07, 'completion_length': 55.43750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.79345703125, 'epoch': 0.86} 86%|████████▌ | 2143/2500 [17:14:37<1:55:45, 19.45s/it] 86%|████████▌ | 2144/2500 [17:14:56<1:54:44, 19.34s/it] {'loss': 0.0044, 'grad_norm': 2.459634276409348, 'learning_rate': 1.424e-07, 'completion_length': 52.71428871154785, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.111083984375, 'epoch': 0.86} 86%|████████▌ | 2144/2500 [17:14:56<1:54:44, 19.34s/it] 86%|████████▌ | 2145/2500 [17:15:16<1:54:57, 19.43s/it] {'loss': 0.021, 'grad_norm': 2.7447760337425096, 'learning_rate': 1.4199999999999997e-07, 'completion_length': 52.37500190734863, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.08747542649507523, 'kl': 0.52294921875, 'epoch': 0.86} 86%|████████▌ | 2145/2500 [17:15:16<1:54:57, 19.43s/it] 86%|████████▌ | 2146/2500 [17:15:36<1:55:36, 19.59s/it] {'loss': 0.0144, 'grad_norm': 3.051859503402913, 'learning_rate': 1.416e-07, 'completion_length': 53.61607551574707, 'rewards/accuracy_reward': 1.0, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.359130859375, 'epoch': 0.86} 86%|████████▌ | 2146/2500 [17:15:36<1:55:36, 19.59s/it] 86%|████████▌ | 2147/2500 [17:15:55<1:55:07, 19.57s/it] {'loss': 0.0057, 'grad_norm': 11.347513512159086, 'learning_rate': 1.412e-07, 'completion_length': 58.19643211364746, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.06222161650657654, 'kl': 0.14306640625, 'epoch': 0.86} 86%|████████▌ | 2147/2500 [17:15:55<1:55:07, 19.57s/it] 86%|████████▌ | 2148/2500 [17:16:15<1:54:11, 19.46s/it] {'loss': 0.0056, 'grad_norm': 4.269362657428506, 'learning_rate': 1.408e-07, 'completion_length': 57.46428871154785, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.07839837297797203, 'kl': 0.14013671875, 'epoch': 0.86} 86%|████████▌ | 2148/2500 [17:16:15<1:54:11, 19.46s/it] 86%|████████▌ | 2149/2500 [17:16:33<1:52:52, 19.30s/it] {'loss': 0.008, 'grad_norm': 1.6041609903359133, 'learning_rate': 1.4039999999999999e-07, 'completion_length': 56.60714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.200439453125, 'epoch': 0.86} 86%|████████▌ | 2149/2500 [17:16:33<1:52:52, 19.30s/it] 86%|████████▌ | 2150/2500 [17:16:52<1:51:38, 19.14s/it] {'loss': 0.0104, 'grad_norm': 0.8697890338739679, 'learning_rate': 1.4e-07, 'completion_length': 50.937503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.261962890625, 'epoch': 0.86} 86%|████████▌ | 2150/2500 [17:16:52<1:51:38, 19.14s/it] 86%|████████▌ | 2151/2500 [17:17:11<1:51:22, 19.15s/it] {'loss': 0.0057, 'grad_norm': 0.42395416122822777, 
'learning_rate': 1.396e-07, 'completion_length': 52.51785850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.141845703125, 'epoch': 0.86} 86%|████████▌ | 2151/2500 [17:17:11<1:51:22, 19.15s/it] 86%|████████▌ | 2152/2500 [17:17:30<1:50:39, 19.08s/it] {'loss': 0.0082, 'grad_norm': 0.634901792676593, 'learning_rate': 1.3919999999999998e-07, 'completion_length': 55.19643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2041015625, 'epoch': 0.86} 86%|████████▌ | 2152/2500 [17:17:30<1:50:39, 19.08s/it] 86%|████████▌ | 2153/2500 [17:17:49<1:49:05, 18.86s/it] {'loss': 0.0069, 'grad_norm': 0.38113882039337393, 'learning_rate': 1.388e-07, 'completion_length': 51.81250190734863, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.17138671875, 'epoch': 0.86} 86%|████████▌ | 2153/2500 [17:17:49<1:49:05, 18.86s/it] 86%|████████▌ | 2154/2500 [17:18:08<1:50:18, 19.13s/it] {'loss': 0.0049, 'grad_norm': 0.3874490857950852, 'learning_rate': 1.384e-07, 'completion_length': 55.642860412597656, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.121826171875, 'epoch': 0.86} 86%|████████▌ | 2154/2500 [17:18:08<1:50:18, 19.13s/it] 86%|████████▌ | 2155/2500 [17:18:27<1:48:59, 18.95s/it] {'loss': 0.0055, 'grad_norm': 0.25172617149559573, 'learning_rate': 1.3800000000000002e-07, 'completion_length': 56.11607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.136474609375, 'epoch': 0.86} 86%|████████▌ | 2155/2500 [17:18:27<1:48:59, 18.95s/it] 86%|████████▌ | 2156/2500 [17:18:46<1:48:47, 18.98s/it] {'loss': 0.0089, 'grad_norm': 1.4508647803998613, 'learning_rate': 1.376e-07, 'completion_length': 55.51785850524902, 'rewards/accuracy_reward': 
0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.22265625, 'epoch': 0.86} 86%|████████▌ | 2156/2500 [17:18:46<1:48:47, 18.98s/it] 86%|████████▋ | 2157/2500 [17:19:05<1:47:43, 18.85s/it] {'loss': 0.0074, 'grad_norm': 1.3296473228039538, 'learning_rate': 1.3719999999999998e-07, 'completion_length': 53.46428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.184326171875, 'epoch': 0.86} 86%|████████▋ | 2157/2500 [17:19:05<1:47:43, 18.85s/it] 86%|████████▋ | 2158/2500 [17:19:23<1:47:05, 18.79s/it] {'loss': 0.0106, 'grad_norm': 1.756378583561765, 'learning_rate': 1.368e-07, 'completion_length': 56.91964530944824, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.263671875, 'epoch': 0.86} 86%|████████▋ | 2158/2500 [17:19:23<1:47:05, 18.79s/it] 86%|████████▋ | 2159/2500 [17:19:42<1:45:57, 18.64s/it] {'loss': 0.0062, 'grad_norm': 0.3092472468560785, 'learning_rate': 1.3639999999999998e-07, 'completion_length': 51.79464530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1552734375, 'epoch': 0.86} 86%|████████▋ | 2159/2500 [17:19:42<1:45:57, 18.64s/it] 86%|████████▋ | 2160/2500 [17:20:00<1:45:11, 18.56s/it] {'loss': 0.005, 'grad_norm': 1.9213442840904282, 'learning_rate': 1.36e-07, 'completion_length': 51.02678680419922, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.124755859375, 'epoch': 0.86} 86%|████████▋ | 2160/2500 [17:20:00<1:45:11, 18.56s/it] 86%|████████▋ | 2161/2500 [17:20:18<1:44:40, 18.53s/it] {'loss': 0.0071, 'grad_norm': 1.3310733743447571, 'learning_rate': 1.356e-07, 'completion_length': 
54.72321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.17724609375, 'epoch': 0.86} 86%|████████▋ | 2161/2500 [17:20:18<1:44:40, 18.53s/it] 86%|████████▋ | 2162/2500 [17:20:38<1:46:06, 18.84s/it] {'loss': 0.0118, 'grad_norm': 2.949185693851759, 'learning_rate': 1.352e-07, 'completion_length': 59.312503814697266, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.294921875, 'epoch': 0.86} 86%|████████▋ | 2162/2500 [17:20:38<1:46:06, 18.84s/it] 87%|████████▋ | 2163/2500 [17:20:58<1:47:13, 19.09s/it] {'loss': 0.0056, 'grad_norm': 6.687989446088185, 'learning_rate': 1.348e-07, 'completion_length': 51.14285850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.14111328125, 'epoch': 0.87} 87%|████████▋ | 2163/2500 [17:20:58<1:47:13, 19.09s/it] 87%|████████▋ | 2164/2500 [17:21:16<1:46:04, 18.94s/it] {'loss': 0.0049, 'grad_norm': 0.2536471432323566, 'learning_rate': 1.3439999999999999e-07, 'completion_length': 52.41964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.122802734375, 'epoch': 0.87} 87%|████████▋ | 2164/2500 [17:21:16<1:46:04, 18.94s/it] 87%|████████▋ | 2165/2500 [17:21:34<1:44:31, 18.72s/it] {'loss': 0.0069, 'grad_norm': 0.33551310581524474, 'learning_rate': 1.34e-07, 'completion_length': 48.98214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.17236328125, 'epoch': 0.87} 87%|████████▋ | 2165/2500 [17:21:34<1:44:31, 18.72s/it] 87%|████████▋ | 2166/2500 [17:21:54<1:45:52, 19.02s/it] {'loss': 0.0085, 'grad_norm': 1.3192063876913074, 'learning_rate': 1.3359999999999998e-07, 
'completion_length': 57.875003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21337890625, 'epoch': 0.87} 87%|████████▋ | 2166/2500 [17:21:54<1:45:52, 19.02s/it] 87%|████████▋ | 2167/2500 [17:22:14<1:47:29, 19.37s/it] {'loss': 0.0213, 'grad_norm': 3.5181300874269263, 'learning_rate': 1.332e-07, 'completion_length': 52.517860412597656, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.1289059966802597, 'kl': 0.5322265625, 'epoch': 0.87} 87%|████████▋ | 2167/2500 [17:22:14<1:47:29, 19.37s/it] 87%|████████▋ | 2168/2500 [17:22:33<1:46:23, 19.23s/it] {'loss': 0.0206, 'grad_norm': 1.9183153084084241, 'learning_rate': 1.328e-07, 'completion_length': 55.36607360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9464285969734192, 'reward_std': 0.11663764715194702, 'kl': 0.51611328125, 'epoch': 0.87} 87%|████████▋ | 2168/2500 [17:22:33<1:46:23, 19.23s/it] 87%|████████▋ | 2169/2500 [17:22:52<1:45:17, 19.09s/it] {'loss': 0.0036, 'grad_norm': 0.15868781794829886, 'learning_rate': 1.324e-07, 'completion_length': 53.72321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.0888671875, 'epoch': 0.87} 87%|████████▋ | 2169/2500 [17:22:52<1:45:17, 19.09s/it] 87%|████████▋ | 2170/2500 [17:23:11<1:44:59, 19.09s/it] {'loss': 0.0056, 'grad_norm': 0.2826269638067637, 'learning_rate': 1.32e-07, 'completion_length': 50.49107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.140380859375, 'epoch': 0.87} 87%|████████▋ | 2170/2500 [17:23:11<1:44:59, 19.09s/it] 87%|████████▋ | 2171/2500 [17:23:31<1:45:36, 19.26s/it] {'loss': 0.0606, 'grad_norm': 8.760152757758616, 'learning_rate': 1.316e-07, 
'completion_length': 52.13393020629883, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9285714626312256, 'reward_std': 0.16323687136173248, 'kl': 1.51953125, 'epoch': 0.87} 87%|████████▋ | 2171/2500 [17:23:31<1:45:36, 19.26s/it] 87%|████████▋ | 2172/2500 [17:23:51<1:46:12, 19.43s/it] {'loss': 0.0056, 'grad_norm': 2.5140551160028406, 'learning_rate': 1.312e-07, 'completion_length': 54.97321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.138671875, 'epoch': 0.87} 87%|████████▋ | 2172/2500 [17:23:51<1:46:12, 19.43s/it] 87%|████████▋ | 2173/2500 [17:24:10<1:45:28, 19.35s/it] {'loss': 0.0202, 'grad_norm': 1.772217400944317, 'learning_rate': 1.308e-07, 'completion_length': 52.98214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.50390625, 'epoch': 0.87} 87%|████████▋ | 2173/2500 [17:24:10<1:45:28, 19.35s/it] 87%|████████▋ | 2174/2500 [17:24:29<1:45:33, 19.43s/it] {'loss': 0.0048, 'grad_norm': 1.0407715366649817, 'learning_rate': 1.3039999999999998e-07, 'completion_length': 52.96428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.11962890625, 'epoch': 0.87} 87%|████████▋ | 2174/2500 [17:24:29<1:45:33, 19.43s/it] 87%|████████▋ | 2175/2500 [17:24:48<1:44:20, 19.26s/it] {'loss': 0.0053, 'grad_norm': 0.28135462444371234, 'learning_rate': 1.3e-07, 'completion_length': 52.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.133544921875, 'epoch': 0.87} 87%|████████▋ | 2175/2500 [17:24:48<1:44:20, 19.26s/it] 87%|████████▋ | 2176/2500 [17:25:08<1:45:18, 19.50s/it] {'loss': 0.0069, 'grad_norm': 
0.516851585735461, 'learning_rate': 1.296e-07, 'completion_length': 54.16071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.17236328125, 'epoch': 0.87} 87%|████████▋ | 2176/2500 [17:25:08<1:45:18, 19.50s/it] 87%|████████▋ | 2177/2500 [17:25:27<1:43:31, 19.23s/it] {'loss': 0.005, 'grad_norm': 0.2036844163650806, 'learning_rate': 1.292e-07, 'completion_length': 52.05357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.125732421875, 'epoch': 0.87} 87%|████████▋ | 2177/2500 [17:25:27<1:43:31, 19.23s/it] 87%|████████▋ | 2178/2500 [17:25:47<1:44:28, 19.47s/it] {'loss': 0.0063, 'grad_norm': 4.191281877700665, 'learning_rate': 1.288e-07, 'completion_length': 47.89285850524902, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.09138382971286774, 'kl': 0.158203125, 'epoch': 0.87} 87%|████████▋ | 2178/2500 [17:25:47<1:44:28, 19.47s/it] 87%|████████▋ | 2179/2500 [17:26:06<1:43:53, 19.42s/it] {'loss': 0.0046, 'grad_norm': 0.3358840952609574, 'learning_rate': 1.2839999999999999e-07, 'completion_length': 51.09821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.115966796875, 'epoch': 0.87} 87%|████████▋ | 2179/2500 [17:26:06<1:43:53, 19.42s/it] 87%|████████▋ | 2180/2500 [17:26:25<1:42:18, 19.18s/it] {'loss': 0.0048, 'grad_norm': 0.2232797543704071, 'learning_rate': 1.28e-07, 'completion_length': 52.98214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12060546875, 'epoch': 0.87} 87%|████████▋ | 2180/2500 [17:26:25<1:42:18, 19.18s/it] 87%|████████▋ | 2181/2500 [17:26:44<1:41:50, 19.15s/it] {'loss': 0.0107, 'grad_norm': 1.4805316795831864, 'learning_rate': 1.2759999999999998e-07, 'completion_length': 51.34821701049805, 'rewards/accuracy_reward': 
0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.05399492755532265, 'kl': 0.267822265625, 'epoch': 0.87} 87%|████████▋ | 2181/2500 [17:26:44<1:41:50, 19.15s/it] 87%|████████▋ | 2182/2500 [17:27:04<1:42:17, 19.30s/it] {'loss': 0.0088, 'grad_norm': 0.7603193856626321, 'learning_rate': 1.272e-07, 'completion_length': 56.562503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.21923828125, 'epoch': 0.87} 87%|████████▋ | 2182/2500 [17:27:04<1:42:17, 19.30s/it] 87%|████████▋ | 2183/2500 [17:27:24<1:43:02, 19.50s/it] {'loss': 0.0721, 'grad_norm': 14.83770583076346, 'learning_rate': 1.268e-07, 'completion_length': 51.60714530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9553571939468384, 'reward_std': 0.10365219414234161, 'kl': 1.7958984375, 'epoch': 0.87} 87%|████████▋ | 2183/2500 [17:27:24<1:43:02, 19.50s/it] 87%|████████▋ | 2184/2500 [17:27:43<1:42:23, 19.44s/it] {'loss': 0.0053, 'grad_norm': 0.24779172605726352, 'learning_rate': 1.264e-07, 'completion_length': 52.41964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1328125, 'epoch': 0.87} 87%|████████▋ | 2184/2500 [17:27:43<1:42:23, 19.44s/it] 87%|████████▋ | 2185/2500 [17:28:02<1:42:27, 19.52s/it] {'loss': 0.0059, 'grad_norm': 0.3870405296942752, 'learning_rate': 1.26e-07, 'completion_length': 56.312503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.14794921875, 'epoch': 0.87} 87%|████████▋ | 2185/2500 [17:28:02<1:42:27, 19.52s/it] 87%|████████▋ | 2186/2500 [17:28:21<1:40:12, 19.15s/it] {'loss': 0.0364, 'grad_norm': 11.242448363061529, 'learning_rate': 1.2559999999999999e-07, 'completion_length': 45.44643020629883, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.9553572535514832, 'reward_std': 0.10882644355297089, 'kl': 0.91015625, 'epoch': 0.87} 87%|████████▋ | 2186/2500 [17:28:21<1:40:12, 19.15s/it] 87%|████████▋ | 2187/2500 [17:28:40<1:39:17, 19.03s/it] {'loss': 0.0382, 'grad_norm': 3.8175783768151232, 'learning_rate': 1.252e-07, 'completion_length': 49.13393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.95458984375, 'epoch': 0.87} 87%|████████▋ | 2187/2500 [17:28:40<1:39:17, 19.03s/it] 88%|████████▊ | 2188/2500 [17:29:00<1:40:59, 19.42s/it] {'loss': 0.0039, 'grad_norm': 0.1455042386494636, 'learning_rate': 1.2479999999999998e-07, 'completion_length': 52.38393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.096435546875, 'epoch': 0.88} 88%|████████▊ | 2188/2500 [17:29:00<1:40:59, 19.42s/it] 88%|████████▊ | 2189/2500 [17:29:20<1:41:54, 19.66s/it] {'loss': 0.0134, 'grad_norm': 1.3094706114753873, 'learning_rate': 1.244e-07, 'completion_length': 53.92857360839844, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.03818017989397049, 'kl': 0.3349609375, 'epoch': 0.88} 88%|████████▊ | 2189/2500 [17:29:20<1:41:54, 19.66s/it] 88%|████████▊ | 2190/2500 [17:29:40<1:41:31, 19.65s/it] {'loss': 0.0105, 'grad_norm': 1.1290315506017141, 'learning_rate': 1.24e-07, 'completion_length': 53.80357551574707, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.263671875, 'epoch': 0.88} 88%|████████▊ | 2190/2500 [17:29:40<1:41:31, 19.65s/it] 88%|████████▊ | 2191/2500 [17:30:00<1:41:49, 19.77s/it] {'loss': 0.0067, 'grad_norm': 0.3352174647050407, 'learning_rate': 1.236e-07, 'completion_length': 58.78571701049805, 
'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1669921875, 'epoch': 0.88} 88%|████████▊ | 2191/2500 [17:30:00<1:41:49, 19.77s/it] 88%|████████▊ | 2192/2500 [17:30:19<1:40:30, 19.58s/it] {'loss': 0.0088, 'grad_norm': 1.6439124248351735, 'learning_rate': 1.232e-07, 'completion_length': 60.83035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21923828125, 'epoch': 0.88} 88%|████████▊ | 2192/2500 [17:30:19<1:40:30, 19.58s/it] 88%|████████▊ | 2193/2500 [17:30:39<1:41:24, 19.82s/it] {'loss': 0.0057, 'grad_norm': 2.7934959730865714, 'learning_rate': 1.228e-07, 'completion_length': 50.58928871154785, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.14306640625, 'epoch': 0.88} 88%|████████▊ | 2193/2500 [17:30:39<1:41:24, 19.82s/it] 88%|████████▊ | 2194/2500 [17:30:59<1:41:26, 19.89s/it] {'loss': 0.0068, 'grad_norm': 2.070342813101509, 'learning_rate': 1.2239999999999998e-07, 'completion_length': 54.50893211364746, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.17138671875, 'epoch': 0.88} 88%|████████▊ | 2194/2500 [17:30:59<1:41:26, 19.89s/it] 88%|████████▊ | 2195/2500 [17:31:18<1:38:43, 19.42s/it] {'loss': 0.0054, 'grad_norm': 9.5043041859528, 'learning_rate': 1.2199999999999998e-07, 'completion_length': 51.49107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.135009765625, 'epoch': 0.88} 88%|████████▊ | 2195/2500 [17:31:18<1:38:43, 19.42s/it] 88%|████████▊ | 2196/2500 [17:31:38<1:39:58, 19.73s/it] {'loss': 0.0051, 'grad_norm': 0.17468893288484483, 
'learning_rate': 1.216e-07, 'completion_length': 50.705360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.127197265625, 'epoch': 0.88} 88%|████████▊ | 2196/2500 [17:31:38<1:39:58, 19.73s/it] 88%|████████▊ | 2197/2500 [17:31:57<1:39:02, 19.61s/it] {'loss': 0.0086, 'grad_norm': 1.4122608145998758, 'learning_rate': 1.212e-07, 'completion_length': 52.19643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21484375, 'epoch': 0.88} 88%|████████▊ | 2197/2500 [17:31:57<1:39:02, 19.61s/it] 88%|████████▊ | 2198/2500 [17:32:18<1:39:28, 19.76s/it] {'loss': 0.0043, 'grad_norm': 0.18633730389142528, 'learning_rate': 1.208e-07, 'completion_length': 58.82143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10693359375, 'epoch': 0.88} 88%|████████▊ | 2198/2500 [17:32:18<1:39:28, 19.76s/it] 88%|████████▊ | 2199/2500 [17:32:39<1:40:59, 20.13s/it] {'loss': 0.0193, 'grad_norm': 3.2757494233657667, 'learning_rate': 1.204e-07, 'completion_length': 51.61607360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.4794921875, 'epoch': 0.88} 88%|████████▊ | 2199/2500 [17:32:39<1:40:59, 20.13s/it] 88%|████████▊ | 2200/2500 [17:32:58<1:39:55, 19.99s/it] {'loss': 0.0056, 'grad_norm': 2.502192857204416, 'learning_rate': 1.2e-07, 'completion_length': 55.02678871154785, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.13916015625, 'epoch': 0.88} 88%|████████▊ | 2200/2500 [17:32:58<1:39:55, 19.99s/it] 88%|████████▊ | 2201/2500 [17:33:59<2:40:56, 32.30s/it] {'loss': 0.0046, 'grad_norm': 0.25964059765725384, 'learning_rate': 
1.1959999999999999e-07, 'completion_length': 51.91964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11572265625, 'epoch': 0.88} 88%|████████▊ | 2201/2500 [17:33:59<2:40:56, 32.30s/it] 88%|████████▊ | 2202/2500 [17:34:19<2:22:12, 28.63s/it] {'loss': 0.0126, 'grad_norm': 1.0381919612162072, 'learning_rate': 1.192e-07, 'completion_length': 58.05357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.31640625, 'epoch': 0.88} 88%|████████▊ | 2202/2500 [17:34:19<2:22:12, 28.63s/it] 88%|████████▊ | 2203/2500 [17:34:46<2:19:17, 28.14s/it] {'loss': 0.0214, 'grad_norm': 3.599672842870925, 'learning_rate': 1.1879999999999999e-07, 'completion_length': 60.90178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.53759765625, 'epoch': 0.88} 88%|████████▊ | 2203/2500 [17:34:46<2:19:17, 28.14s/it] 88%|████████▊ | 2204/2500 [17:35:10<2:11:40, 26.69s/it] {'loss': 0.004, 'grad_norm': 0.18008052832570146, 'learning_rate': 1.184e-07, 'completion_length': 57.875003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10107421875, 'epoch': 0.88} 88%|████████▊ | 2204/2500 [17:35:10<2:11:40, 26.69s/it] 88%|████████▊ | 2205/2500 [17:35:28<1:59:19, 24.27s/it] {'loss': 0.0301, 'grad_norm': 2.5906827457090684, 'learning_rate': 1.1799999999999998e-07, 'completion_length': 54.37500190734863, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.10365218669176102, 'kl': 0.751953125, 'epoch': 0.88} 88%|████████▊ | 2205/2500 [17:35:28<1:59:19, 24.27s/it] 88%|████████▊ | 2206/2500 [17:35:50<1:55:47, 23.63s/it] {'loss': 0.0055, 'grad_norm': 
0.7529050875077102, 'learning_rate': 1.176e-07, 'completion_length': 60.65178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.137939453125, 'epoch': 0.88} 88%|████████▊ | 2206/2500 [17:35:50<1:55:47, 23.63s/it] 88%|████████▊ | 2207/2500 [17:36:10<1:49:04, 22.34s/it] {'loss': 0.0087, 'grad_norm': 1.0418371448244879, 'learning_rate': 1.1719999999999999e-07, 'completion_length': 57.312503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.21728515625, 'epoch': 0.88} 88%|████████▊ | 2207/2500 [17:36:10<1:49:04, 22.34s/it] 88%|████████▊ | 2208/2500 [17:36:31<1:47:36, 22.11s/it] {'loss': 0.0315, 'grad_norm': 1.5817354781604873, 'learning_rate': 1.168e-07, 'completion_length': 53.75000190734863, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.025253813713788986, 'kl': 0.7880859375, 'epoch': 0.88} 88%|████████▊ | 2208/2500 [17:36:31<1:47:36, 22.11s/it] 88%|████████▊ | 2209/2500 [17:36:50<1:41:44, 20.98s/it] {'loss': 0.0047, 'grad_norm': 1.1662340013081305, 'learning_rate': 1.164e-07, 'completion_length': 58.54464530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1162109375, 'epoch': 0.88} 88%|████████▊ | 2209/2500 [17:36:50<1:41:44, 20.98s/it] 88%|████████▊ | 2210/2500 [17:37:08<1:38:16, 20.33s/it] {'loss': 0.0044, 'grad_norm': 0.3308313342430874, 'learning_rate': 1.16e-07, 'completion_length': 50.20535850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.10888671875, 'epoch': 0.88} 88%|████████▊ | 2210/2500 [17:37:08<1:38:16, 20.33s/it] 88%|████████▊ | 2211/2500 [17:37:29<1:38:49, 20.52s/it] 
{'loss': 0.0075, 'grad_norm': 1.093822220835726, 'learning_rate': 1.1559999999999999e-07, 'completion_length': 56.86607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.18798828125, 'epoch': 0.88} 88%|████████▊ | 2211/2500 [17:37:29<1:38:49, 20.52s/it] 88%|████████▊ | 2212/2500 [17:37:49<1:37:23, 20.29s/it] {'loss': 0.0099, 'grad_norm': 1.6469402900032561, 'learning_rate': 1.1519999999999999e-07, 'completion_length': 60.65178680419922, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.248046875, 'epoch': 0.88} 88%|████████▊ | 2212/2500 [17:37:49<1:37:23, 20.29s/it] 89%|████████▊ | 2213/2500 [17:38:09<1:36:00, 20.07s/it] {'loss': 0.0039, 'grad_norm': 0.18345278367356704, 'learning_rate': 1.148e-07, 'completion_length': 54.58928871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.097900390625, 'epoch': 0.89} 89%|████████▊ | 2213/2500 [17:38:09<1:36:00, 20.07s/it] 89%|████████▊ | 2214/2500 [17:38:28<1:34:58, 19.93s/it] {'loss': 0.0053, 'grad_norm': 1.6571887212105794, 'learning_rate': 1.1439999999999999e-07, 'completion_length': 60.97321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.132080078125, 'epoch': 0.89} 89%|████████▊ | 2214/2500 [17:38:28<1:34:58, 19.93s/it] 89%|████████▊ | 2215/2500 [17:38:48<1:34:11, 19.83s/it] {'loss': 0.0077, 'grad_norm': 0.6161596818994879, 'learning_rate': 1.14e-07, 'completion_length': 55.90178871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.19140625, 'epoch': 0.89} 89%|████████▊ | 2215/2500 [17:38:48<1:34:11, 19.83s/it] 89%|████████▊ | 2216/2500 [17:39:09<1:35:13, 20.12s/it] {'loss': 
0.0052, 'grad_norm': 1.5180515936170473, 'learning_rate': 1.136e-07, 'completion_length': 60.57143211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.130859375, 'epoch': 0.89} 89%|████████▊ | 2216/2500 [17:39:09<1:35:13, 20.12s/it] 89%|████████▊ | 2217/2500 [17:39:28<1:33:49, 19.89s/it] {'loss': 0.0053, 'grad_norm': 0.2725450728088094, 'learning_rate': 1.132e-07, 'completion_length': 56.142860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1337890625, 'epoch': 0.89} 89%|████████▊ | 2217/2500 [17:39:28<1:33:49, 19.89s/it] 89%|████████▊ | 2218/2500 [17:39:48<1:33:13, 19.83s/it] {'loss': 0.0042, 'grad_norm': 2.194273418120105, 'learning_rate': 1.1279999999999999e-07, 'completion_length': 56.705360412597656, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.03818017989397049, 'kl': 0.105712890625, 'epoch': 0.89} 89%|████████▊ | 2218/2500 [17:39:48<1:33:13, 19.83s/it] 89%|████████▉ | 2219/2500 [17:40:07<1:31:32, 19.55s/it] {'loss': 0.0036, 'grad_norm': 1.6008498418686734, 'learning_rate': 1.124e-07, 'completion_length': 52.26785850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.090087890625, 'epoch': 0.89} 89%|████████▉ | 2219/2500 [17:40:07<1:31:32, 19.55s/it] 89%|████████▉ | 2220/2500 [17:40:26<1:30:42, 19.44s/it] {'loss': 0.0045, 'grad_norm': 0.16606968098174651, 'learning_rate': 1.12e-07, 'completion_length': 53.642860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.111328125, 'epoch': 0.89} 89%|████████▉ | 2220/2500 [17:40:26<1:30:42, 19.44s/it] 89%|████████▉ | 2221/2500 [17:40:45<1:30:18, 19.42s/it] {'loss': 0.0156, 'grad_norm': 
2.224171521622292, 'learning_rate': 1.116e-07, 'completion_length': 50.71428871154785, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.39013671875, 'epoch': 0.89} 89%|████████▉ | 2221/2500 [17:40:45<1:30:18, 19.42s/it] 89%|████████▉ | 2222/2500 [17:41:05<1:30:22, 19.51s/it] {'loss': 0.0046, 'grad_norm': 1.0886055137939756, 'learning_rate': 1.1119999999999999e-07, 'completion_length': 56.51785850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.114013671875, 'epoch': 0.89} 89%|████████▉ | 2222/2500 [17:41:05<1:30:22, 19.51s/it] 89%|████████▉ | 2223/2500 [17:41:25<1:30:17, 19.56s/it] {'loss': 0.0058, 'grad_norm': 0.4232408337279169, 'learning_rate': 1.1079999999999999e-07, 'completion_length': 57.723215103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.145263671875, 'epoch': 0.89} 89%|████████▉ | 2223/2500 [17:41:25<1:30:17, 19.56s/it] 89%|████████▉ | 2224/2500 [17:41:44<1:29:40, 19.49s/it] {'loss': 0.0363, 'grad_norm': 1.9502061050236976, 'learning_rate': 1.104e-07, 'completion_length': 51.19643020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.904296875, 'epoch': 0.89} 89%|████████▉ | 2224/2500 [17:41:44<1:29:40, 19.49s/it] 89%|████████▉ | 2225/2500 [17:42:04<1:29:44, 19.58s/it] {'loss': 0.005, 'grad_norm': 3.029104559065837, 'learning_rate': 1.0999999999999999e-07, 'completion_length': 59.97321891784668, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.124755859375, 'epoch': 0.89} 89%|████████▉ | 2225/2500 [17:42:04<1:29:44, 19.58s/it] 89%|████████▉ | 
2226/2500 [17:42:23<1:29:30, 19.60s/it] {'loss': 0.0051, 'grad_norm': 1.1400163357487787, 'learning_rate': 1.096e-07, 'completion_length': 56.82143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.126953125, 'epoch': 0.89} 89%|████████▉ | 2226/2500 [17:42:23<1:29:30, 19.60s/it] 89%|████████▉ | 2227/2500 [17:42:44<1:30:13, 19.83s/it] {'loss': 0.0051, 'grad_norm': 2.3697652268530507, 'learning_rate': 1.092e-07, 'completion_length': 59.91071701049805, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.126953125, 'epoch': 0.89} 89%|████████▉ | 2227/2500 [17:42:44<1:30:13, 19.83s/it] 89%|████████▉ | 2228/2500 [17:43:03<1:29:40, 19.78s/it] {'loss': 0.0047, 'grad_norm': 0.2125344259192421, 'learning_rate': 1.088e-07, 'completion_length': 54.98214530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.116455078125, 'epoch': 0.89} 89%|████████▉ | 2228/2500 [17:43:03<1:29:40, 19.78s/it] 89%|████████▉ | 2229/2500 [17:43:23<1:29:08, 19.74s/it] {'loss': 0.0037, 'grad_norm': 0.14790711941191267, 'learning_rate': 1.0839999999999999e-07, 'completion_length': 52.19643211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.09228515625, 'epoch': 0.89} 89%|████████▉ | 2229/2500 [17:43:23<1:29:08, 19.74s/it] 89%|████████▉ | 2230/2500 [17:43:44<1:29:50, 19.97s/it] {'loss': 0.0772, 'grad_norm': 83.22952362584662, 'learning_rate': 1.0799999999999999e-07, 'completion_length': 52.50893211364746, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.946428656578064, 'reward_std': 0.13408026099205017, 'kl': 1.9326171875, 'epoch': 0.89} 89%|████████▉ | 2230/2500 [17:43:44<1:29:50, 19.97s/it] 
89%|████████▉ | 2231/2500 [17:44:04<1:30:41, 20.23s/it] {'loss': 0.0284, 'grad_norm': 4.830088930406116, 'learning_rate': 1.076e-07, 'completion_length': 53.91071701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.9196429252624512, 'reward_std': 0.18722482025623322, 'kl': 0.7099609375, 'epoch': 0.89} 89%|████████▉ | 2231/2500 [17:44:04<1:30:41, 20.23s/it] 89%|████████▉ | 2232/2500 [17:44:24<1:29:42, 20.08s/it] {'loss': 0.0054, 'grad_norm': 2.0461044792950918, 'learning_rate': 1.072e-07, 'completion_length': 53.36607360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.134765625, 'epoch': 0.89} 89%|████████▉ | 2232/2500 [17:44:24<1:29:42, 20.08s/it] 89%|████████▉ | 2233/2500 [17:44:44<1:29:42, 20.16s/it] {'loss': 0.004, 'grad_norm': 0.2147694163596493, 'learning_rate': 1.068e-07, 'completion_length': 53.455360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1005859375, 'epoch': 0.89} 89%|████████▉ | 2233/2500 [17:44:44<1:29:42, 20.16s/it] 89%|████████▉ | 2234/2500 [17:45:03<1:27:35, 19.76s/it] {'loss': 0.013, 'grad_norm': 1.9366297621697974, 'learning_rate': 1.0639999999999999e-07, 'completion_length': 48.90178680419922, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9107143878936768, 'reward_std': 0.08868780732154846, 'kl': 0.325439453125, 'epoch': 0.89} 89%|████████▉ | 2234/2500 [17:45:03<1:27:35, 19.76s/it] 89%|████████▉ | 2235/2500 [17:45:22<1:26:32, 19.59s/it] {'loss': 0.0251, 'grad_norm': 5.299962281979697, 'learning_rate': 1.06e-07, 'completion_length': 49.05357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9464285969734192, 'reward_std': 0.13408026099205017, 'kl': 0.627197265625, 'epoch': 0.89} 
89%|████████▉ | 2235/2500 [17:45:22<1:26:32, 19.59s/it] 89%|████████▉ | 2236/2500 [17:45:42<1:26:26, 19.65s/it] {'loss': 0.0124, 'grad_norm': 1.1157601016424028, 'learning_rate': 1.0559999999999999e-07, 'completion_length': 55.13393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.31103515625, 'epoch': 0.89} 89%|████████▉ | 2236/2500 [17:45:42<1:26:26, 19.65s/it] 89%|████████▉ | 2237/2500 [17:46:02<1:26:37, 19.76s/it] {'loss': 0.004, 'grad_norm': 1.0564829258377686, 'learning_rate': 1.052e-07, 'completion_length': 52.26785850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.100341796875, 'epoch': 0.89} 89%|████████▉ | 2237/2500 [17:46:02<1:26:37, 19.76s/it] 90%|████████▉ | 2238/2500 [17:46:21<1:25:34, 19.60s/it] {'loss': 0.0099, 'grad_norm': 5.688854353960726, 'learning_rate': 1.048e-07, 'completion_length': 48.18750190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.24951171875, 'epoch': 0.9} 90%|████████▉ | 2238/2500 [17:46:21<1:25:34, 19.60s/it] 90%|████████▉ | 2239/2500 [17:46:41<1:24:48, 19.50s/it] {'loss': 0.0045, 'grad_norm': 0.288016144405583, 'learning_rate': 1.0440000000000001e-07, 'completion_length': 57.70536231994629, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.11181640625, 'epoch': 0.9} 90%|████████▉ | 2239/2500 [17:46:41<1:24:48, 19.50s/it] 90%|████████▉ | 2240/2500 [17:46:59<1:23:12, 19.20s/it] {'loss': 0.1321, 'grad_norm': 30.5946112878911, 'learning_rate': 1.0399999999999999e-07, 'completion_length': 48.10714530944824, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.955357164144516, 'reward': 1.8660715222358704, 'reward_std': 
0.11231374368071556, 'kl': 3.3125, 'epoch': 0.9} 90%|████████▉ | 2240/2500 [17:46:59<1:23:12, 19.20s/it] 90%|████████▉ | 2241/2500 [17:47:19<1:23:53, 19.44s/it] {'loss': 0.0055, 'grad_norm': 0.5091017097931593, 'learning_rate': 1.0359999999999999e-07, 'completion_length': 53.56250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.136474609375, 'epoch': 0.9} 90%|████████▉ | 2241/2500 [17:47:19<1:23:53, 19.44s/it] 90%|████████▉ | 2242/2500 [17:47:39<1:24:23, 19.63s/it] {'loss': 0.0055, 'grad_norm': 0.34482495352035475, 'learning_rate': 1.032e-07, 'completion_length': 57.02678871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13623046875, 'epoch': 0.9} 90%|████████▉ | 2242/2500 [17:47:39<1:24:23, 19.63s/it] 90%|████████▉ | 2243/2500 [17:48:00<1:25:21, 19.93s/it] {'loss': 0.0083, 'grad_norm': 1.3979008136867797, 'learning_rate': 1.028e-07, 'completion_length': 57.080360412597656, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.20751953125, 'epoch': 0.9} 90%|████████▉ | 2243/2500 [17:48:00<1:25:21, 19.93s/it] 90%|████████▉ | 2244/2500 [17:48:21<1:26:26, 20.26s/it] {'loss': 0.0097, 'grad_norm': 1.339047268455592, 'learning_rate': 1.024e-07, 'completion_length': 60.15178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.2431640625, 'epoch': 0.9} 90%|████████▉ | 2244/2500 [17:48:21<1:26:26, 20.26s/it] 90%|████████▉ | 2245/2500 [17:48:41<1:26:17, 20.30s/it] {'loss': 0.0059, 'grad_norm': 1.2643344779528816, 'learning_rate': 1.0199999999999999e-07, 'completion_length': 58.46428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 
0.025253813713788986, 'kl': 0.146484375, 'epoch': 0.9} 90%|████████▉ | 2245/2500 [17:48:41<1:26:17, 20.30s/it] 90%|████████▉ | 2246/2500 [17:49:02<1:26:12, 20.37s/it] {'loss': 0.0474, 'grad_norm': 5.949118439136996, 'learning_rate': 1.016e-07, 'completion_length': 60.330360412597656, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9285714626312256, 'reward_std': 0.14579425007104874, 'kl': 1.18115234375, 'epoch': 0.9} 90%|████████▉ | 2246/2500 [17:49:02<1:26:12, 20.37s/it] 90%|████████▉ | 2247/2500 [17:49:22<1:25:20, 20.24s/it] {'loss': 0.0144, 'grad_norm': 3.4658670866510732, 'learning_rate': 1.0119999999999999e-07, 'completion_length': 59.44643211364746, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9285714626312256, 'reward_std': 0.1289060115814209, 'kl': 0.361328125, 'epoch': 0.9} 90%|████████▉ | 2247/2500 [17:49:22<1:25:20, 20.24s/it] 90%|████████▉ | 2248/2500 [17:49:42<1:25:28, 20.35s/it] {'loss': 0.0071, 'grad_norm': 1.7762017894520468, 'learning_rate': 1.008e-07, 'completion_length': 56.41964530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.17822265625, 'epoch': 0.9} 90%|████████▉ | 2248/2500 [17:49:42<1:25:28, 20.35s/it] 90%|████████▉ | 2249/2500 [17:50:07<1:30:45, 21.70s/it] {'loss': 0.0274, 'grad_norm': 6.195730722622264, 'learning_rate': 1.004e-07, 'completion_length': 50.57143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.6826171875, 'epoch': 0.9} 90%|████████▉ | 2249/2500 [17:50:07<1:30:45, 21.70s/it] 90%|█████████ | 2250/2500 [17:50:35<1:38:02, 23.53s/it] {'loss': 0.0066, 'grad_norm': 2.6080107156049674, 'learning_rate': 1e-07, 'completion_length': 57.70535850524902, 
'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.12626906856894493, 'kl': 0.164794921875, 'epoch': 0.9} 90%|█████████ | 2250/2500 [17:50:35<1:38:02, 23.53s/it] 90%|█████████ | 2251/2500 [17:51:06<1:46:32, 25.67s/it] {'loss': 0.0069, 'grad_norm': 2.2602716408345795, 'learning_rate': 9.959999999999999e-08, 'completion_length': 54.517860412597656, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.946428656578064, 'reward_std': 0.11146337911486626, 'kl': 0.1728515625, 'epoch': 0.9} 90%|█████████ | 2251/2500 [17:51:06<1:46:32, 25.67s/it] 90%|█████████ | 2252/2500 [17:51:36<1:51:33, 26.99s/it] {'loss': 0.0124, 'grad_norm': 1.9429289646530217, 'learning_rate': 9.919999999999999e-08, 'completion_length': 51.97321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.3095703125, 'epoch': 0.9} 90%|█████████ | 2252/2500 [17:51:36<1:51:33, 26.99s/it] 90%|█████████ | 2253/2500 [17:51:56<1:42:33, 24.91s/it] {'loss': 0.0717, 'grad_norm': 10.202217040613068, 'learning_rate': 9.88e-08, 'completion_length': 50.52678871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553571939468384, 'reward_std': 0.09138382598757744, 'kl': 1.79296875, 'epoch': 0.9} 90%|█████████ | 2253/2500 [17:51:56<1:42:33, 24.91s/it] 90%|█████████ | 2254/2500 [17:52:14<1:34:01, 22.93s/it] {'loss': 0.0068, 'grad_norm': 2.968243837476192, 'learning_rate': 9.84e-08, 'completion_length': 52.598215103149414, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.1708984375, 'epoch': 0.9} 90%|█████████ | 2254/2500 [17:52:14<1:34:01, 22.93s/it] 90%|█████████ | 2255/2500 [17:52:34<1:30:12, 
22.09s/it] {'loss': 0.0063, 'grad_norm': 1.4996473286300425, 'learning_rate': 9.8e-08, 'completion_length': 53.22321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.157958984375, 'epoch': 0.9} 90%|█████████ | 2255/2500 [17:52:34<1:30:12, 22.09s/it] 90%|█████████ | 2256/2500 [17:52:56<1:28:44, 21.82s/it] {'loss': 0.0282, 'grad_norm': 4.290542888647712, 'learning_rate': 9.76e-08, 'completion_length': 52.13393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.70263671875, 'epoch': 0.9} 90%|█████████ | 2256/2500 [17:52:56<1:28:44, 21.82s/it] 90%|█████████ | 2257/2500 [17:53:15<1:26:05, 21.26s/it] {'loss': 0.0093, 'grad_norm': 4.218308829975863, 'learning_rate': 9.72e-08, 'completion_length': 54.517860412597656, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.232421875, 'epoch': 0.9} 90%|█████████ | 2257/2500 [17:53:15<1:26:05, 21.26s/it] 90%|█████████ | 2258/2500 [17:53:36<1:24:21, 20.91s/it] {'loss': 0.0222, 'grad_norm': 7.537980061211894, 'learning_rate': 9.679999999999999e-08, 'completion_length': 51.06250190734863, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.955357164144516, 'reward': 1.910714328289032, 'reward_std': 0.21765289455652237, 'kl': 0.556640625, 'epoch': 0.9} 90%|█████████ | 2258/2500 [17:53:36<1:24:21, 20.91s/it] 90%|█████████ | 2259/2500 [17:53:58<1:25:40, 21.33s/it] {'loss': 0.0072, 'grad_norm': 0.6752336774837118, 'learning_rate': 9.639999999999999e-08, 'completion_length': 55.07143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1806640625, 'epoch': 0.9} 90%|█████████ | 2259/2500 [17:53:58<1:25:40, 21.33s/it] 
90%|█████████ | 2260/2500 [17:54:18<1:23:32, 20.88s/it] {'loss': 0.0072, 'grad_norm': 1.731872976756331, 'learning_rate': 9.6e-08, 'completion_length': 51.80357360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.1796875, 'epoch': 0.9} 90%|█████████ | 2260/2500 [17:54:18<1:23:32, 20.88s/it] 90%|█████████ | 2261/2500 [17:54:38<1:21:52, 20.55s/it] {'loss': 0.0335, 'grad_norm': 8.870549999162074, 'learning_rate': 9.56e-08, 'completion_length': 50.30357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.910714328289032, 'reward_std': 0.21765288710594177, 'kl': 0.8388671875, 'epoch': 0.9} 90%|█████████ | 2261/2500 [17:54:38<1:21:52, 20.55s/it] 90%|█████████ | 2262/2500 [17:55:00<1:23:31, 21.06s/it] {'loss': 0.0483, 'grad_norm': 3.5229627893626865, 'learning_rate': 9.52e-08, 'completion_length': 51.24107360839844, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9196429252624512, 'reward_std': 0.18205057084560394, 'kl': 1.205078125, 'epoch': 0.9} 90%|█████████ | 2262/2500 [17:55:00<1:23:31, 21.06s/it] 91%|█████████ | 2263/2500 [17:55:20<1:22:09, 20.80s/it] {'loss': 0.032, 'grad_norm': 8.588380580348293, 'learning_rate': 9.479999999999999e-08, 'completion_length': 50.18750190734863, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.15152288228273392, 'kl': 0.796875, 'epoch': 0.91} 91%|█████████ | 2263/2500 [17:55:20<1:22:09, 20.80s/it] 91%|█████████ | 2264/2500 [17:55:40<1:21:13, 20.65s/it] {'loss': 0.0173, 'grad_norm': 2.620429491882181, 'learning_rate': 9.44e-08, 'completion_length': 50.58035850524902, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857909202576, 'reward_std': 
0.06222161278128624, 'kl': 0.43115234375, 'epoch': 0.91} 91%|█████████ | 2264/2500 [17:55:40<1:21:13, 20.65s/it] 91%|█████████ | 2265/2500 [17:56:01<1:21:17, 20.75s/it] {'loss': 0.0597, 'grad_norm': 5.201518523929038, 'learning_rate': 9.4e-08, 'completion_length': 46.25893020629883, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.946428656578064, 'reward_std': 0.15152287483215332, 'kl': 1.4921875, 'epoch': 0.91} 91%|█████████ | 2265/2500 [17:56:01<1:21:17, 20.75s/it] 91%|█████████ | 2266/2500 [17:56:21<1:20:09, 20.55s/it] {'loss': 0.0127, 'grad_norm': 0.9088335392282983, 'learning_rate': 9.36e-08, 'completion_length': 53.54464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.318359375, 'epoch': 0.91} 91%|█████████ | 2266/2500 [17:56:21<1:20:09, 20.55s/it] 91%|█████████ | 2267/2500 [17:56:44<1:22:01, 21.12s/it] {'loss': 0.0101, 'grad_norm': 1.276250191978551, 'learning_rate': 9.32e-08, 'completion_length': 51.25000190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.251953125, 'epoch': 0.91} 91%|█████████ | 2267/2500 [17:56:44<1:22:01, 21.12s/it] 91%|█████████ | 2268/2500 [17:57:06<1:23:06, 21.49s/it] {'loss': 0.0053, 'grad_norm': 0.4633516993231279, 'learning_rate': 9.279999999999998e-08, 'completion_length': 58.500003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13330078125, 'epoch': 0.91} 91%|█████████ | 2268/2500 [17:57:06<1:23:06, 21.49s/it] 91%|█████████ | 2269/2500 [17:57:26<1:21:11, 21.09s/it] {'loss': 0.0059, 'grad_norm': 2.7392266352805326, 'learning_rate': 9.24e-08, 'completion_length': 57.73214530944824, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.1030978113412857, 'kl': 0.146728515625, 'epoch': 0.91} 91%|█████████ | 2269/2500 [17:57:26<1:21:11, 21.09s/it] 91%|█████████ | 2270/2500 [17:57:49<1:22:28, 21.52s/it] {'loss': 0.0117, 'grad_norm': 2.7576538675348985, 'learning_rate': 9.199999999999999e-08, 'completion_length': 50.43750190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.293212890625, 'epoch': 0.91} 91%|█████████ | 2270/2500 [17:57:49<1:22:28, 21.52s/it] 91%|█████████ | 2271/2500 [17:58:09<1:20:59, 21.22s/it] {'loss': 0.0096, 'grad_norm': 4.625284674163168, 'learning_rate': 9.16e-08, 'completion_length': 49.94643211364746, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.24072265625, 'epoch': 0.91} 91%|█████████ | 2271/2500 [17:58:09<1:20:59, 21.22s/it] 91%|█████████ | 2272/2500 [17:58:29<1:18:34, 20.68s/it] {'loss': 0.0169, 'grad_norm': 8.073879308022535, 'learning_rate': 9.12e-08, 'completion_length': 50.02678680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9375000596046448, 'reward_std': 0.1418914496898651, 'kl': 0.4208984375, 'epoch': 0.91} 91%|█████████ | 2272/2500 [17:58:29<1:18:34, 20.68s/it] 91%|█████████ | 2273/2500 [17:58:51<1:20:14, 21.21s/it] {'loss': 0.0256, 'grad_norm': 3.2551299350249048, 'learning_rate': 9.08e-08, 'completion_length': 48.66071701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.6435546875, 'epoch': 0.91} 91%|█████████ | 2273/2500 [17:58:51<1:20:14, 21.21s/it] 91%|█████████ | 2274/2500 [17:59:11<1:18:26, 20.82s/it] {'loss': 0.0141, 'grad_norm': 6.934065212210655, 'learning_rate': 
9.039999999999999e-08, 'completion_length': 55.83928680419922, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.35302734375, 'epoch': 0.91} 91%|█████████ | 2274/2500 [17:59:11<1:18:26, 20.82s/it] 91%|█████████ | 2275/2500 [17:59:31<1:16:39, 20.44s/it] {'loss': 0.0102, 'grad_norm': 1.4771471317882499, 'learning_rate': 9e-08, 'completion_length': 52.48214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.25439453125, 'epoch': 0.91} 91%|█████████ | 2275/2500 [17:59:31<1:16:39, 20.44s/it] 91%|█████████ | 2276/2500 [17:59:53<1:18:41, 21.08s/it] {'loss': 0.0053, 'grad_norm': 4.566054005929496, 'learning_rate': 8.96e-08, 'completion_length': 56.17857360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.132080078125, 'epoch': 0.91} 91%|█████████ | 2276/2500 [17:59:53<1:18:41, 21.08s/it] 91%|█████████ | 2277/2500 [18:00:14<1:17:33, 20.87s/it] {'loss': 0.0063, 'grad_norm': 2.554140523575452, 'learning_rate': 8.919999999999999e-08, 'completion_length': 55.84821701049805, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9464285969734192, 'reward_std': 0.08868780732154846, 'kl': 0.158447265625, 'epoch': 0.91} 91%|█████████ | 2277/2500 [18:00:14<1:17:33, 20.87s/it] 91%|█████████ | 2278/2500 [18:00:34<1:16:36, 20.71s/it] {'loss': 0.0165, 'grad_norm': 2.6275368595762014, 'learning_rate': 8.88e-08, 'completion_length': 51.33035850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.412109375, 'epoch': 0.91} 91%|█████████ | 2278/2500 [18:00:34<1:16:36, 20.71s/it] 
91%|█████████ | 2279/2500 [18:00:55<1:16:33, 20.79s/it] {'loss': 0.01, 'grad_norm': 2.812812475925983, 'learning_rate': 8.84e-08, 'completion_length': 55.58928680419922, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9375000596046448, 'reward_std': 0.1379830539226532, 'kl': 0.25, 'epoch': 0.91} 91%|█████████ | 2279/2500 [18:00:55<1:16:33, 20.79s/it] 91%|█████████ | 2280/2500 [18:01:15<1:15:15, 20.52s/it] {'loss': 0.125, 'grad_norm': 52.653182613839306, 'learning_rate': 8.8e-08, 'completion_length': 50.83035850524902, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9285715222358704, 'reward_std': 0.13919543474912643, 'kl': 3.1337890625, 'epoch': 0.91} 91%|█████████ | 2280/2500 [18:01:15<1:15:15, 20.52s/it] 91%|█████████ | 2281/2500 [18:01:35<1:14:30, 20.41s/it] {'loss': 0.0096, 'grad_norm': 2.784238712586186, 'learning_rate': 8.759999999999999e-08, 'completion_length': 48.93750190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.239990234375, 'epoch': 0.91} 91%|█████████ | 2281/2500 [18:01:35<1:14:30, 20.41s/it] 91%|█████████▏| 2282/2500 [18:01:57<1:16:01, 20.92s/it] {'loss': 0.0069, 'grad_norm': 1.745579490890138, 'learning_rate': 8.72e-08, 'completion_length': 52.72321701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.173828125, 'epoch': 0.91} 91%|█████████▏| 2282/2500 [18:01:57<1:16:01, 20.92s/it] 91%|█████████▏| 2283/2500 [18:02:17<1:14:11, 20.51s/it] {'loss': 0.0073, 'grad_norm': 2.0541829812822106, 'learning_rate': 8.68e-08, 'completion_length': 52.25893211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 
'kl': 0.18359375, 'epoch': 0.91} 91%|█████████▏| 2283/2500 [18:02:17<1:14:11, 20.51s/it] 91%|█████████▏| 2284/2500 [18:02:35<1:11:58, 19.99s/it] {'loss': 0.0167, 'grad_norm': 2.447463273486458, 'learning_rate': 8.64e-08, 'completion_length': 48.25893211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 0.4189453125, 'epoch': 0.91} 91%|█████████▏| 2284/2500 [18:02:35<1:11:58, 19.99s/it] 91%|█████████▏| 2285/2500 [18:02:57<1:13:26, 20.50s/it] {'loss': 0.0056, 'grad_norm': 2.960000157312455, 'learning_rate': 8.599999999999999e-08, 'completion_length': 52.54464340209961, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.139892578125, 'epoch': 0.91} 91%|█████████▏| 2285/2500 [18:02:57<1:13:26, 20.50s/it] 91%|█████████▏| 2286/2500 [18:03:16<1:11:56, 20.17s/it] {'loss': 0.0131, 'grad_norm': 3.6968211529504527, 'learning_rate': 8.559999999999999e-08, 'completion_length': 51.60714530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.32666015625, 'epoch': 0.91} 91%|█████████▏| 2286/2500 [18:03:17<1:11:56, 20.17s/it] 91%|█████████▏| 2287/2500 [18:03:37<1:11:39, 20.19s/it] {'loss': 0.0395, 'grad_norm': 3.1587039985744276, 'learning_rate': 8.52e-08, 'completion_length': 50.642860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.9921875, 'epoch': 0.91} 91%|█████████▏| 2287/2500 [18:03:37<1:11:39, 20.19s/it] 92%|█████████▏| 2288/2500 [18:03:58<1:12:01, 20.38s/it] {'loss': 0.0434, 'grad_norm': 5.14094800726435, 'learning_rate': 8.479999999999999e-08, 'completion_length': 43.82143020629883, 'rewards/accuracy_reward': 
0.973214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.1289059966802597, 'kl': 1.0830078125, 'epoch': 0.92} 92%|█████████▏| 2288/2500 [18:03:58<1:12:01, 20.38s/it] 92%|█████████▏| 2289/2500 [18:04:18<1:11:27, 20.32s/it] {'loss': 0.0104, 'grad_norm': 2.887176335183635, 'learning_rate': 8.44e-08, 'completion_length': 59.500003814697266, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375001192092896, 'reward_std': 0.11394161731004715, 'kl': 0.2607421875, 'epoch': 0.92} 92%|█████████▏| 2289/2500 [18:04:18<1:11:27, 20.32s/it] 92%|█████████▏| 2290/2500 [18:04:38<1:11:26, 20.41s/it] {'loss': 0.0132, 'grad_norm': 3.189302960490283, 'learning_rate': 8.4e-08, 'completion_length': 54.16964530944824, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.12626906856894493, 'kl': 0.330078125, 'epoch': 0.92} 92%|█████████▏| 2290/2500 [18:04:38<1:11:26, 20.41s/it] 92%|█████████▏| 2291/2500 [18:05:00<1:12:10, 20.72s/it] {'loss': 0.0534, 'grad_norm': 7.788958139887045, 'learning_rate': 8.36e-08, 'completion_length': 49.66071701049805, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.8482143878936768, 'reward_std': 0.24290670081973076, 'kl': 1.333984375, 'epoch': 0.92} 92%|█████████▏| 2291/2500 [18:05:00<1:12:10, 20.72s/it] 92%|█████████▏| 2292/2500 [18:05:20<1:11:28, 20.62s/it] {'loss': 0.0091, 'grad_norm': 2.580357427268041, 'learning_rate': 8.319999999999999e-08, 'completion_length': 48.02678680419922, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.946428656578064, 'reward_std': 0.1289059966802597, 'kl': 0.22900390625, 'epoch': 0.92} 92%|█████████▏| 2292/2500 [18:05:20<1:11:28, 20.62s/it] 92%|█████████▏| 2293/2500 [18:05:42<1:12:24, 20.99s/it] {'loss': 0.0083, 'grad_norm': 
2.7298576536887, 'learning_rate': 8.28e-08, 'completion_length': 53.85714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.20751953125, 'epoch': 0.92} 92%|█████████▏| 2293/2500 [18:05:42<1:12:24, 20.99s/it] 92%|█████████▏| 2294/2500 [18:06:02<1:11:02, 20.69s/it] {'loss': 0.0113, 'grad_norm': 1.7653618783619585, 'learning_rate': 8.24e-08, 'completion_length': 52.94643211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.28271484375, 'epoch': 0.92} 92%|█████████▏| 2294/2500 [18:06:02<1:11:02, 20.69s/it] 92%|█████████▏| 2295/2500 [18:06:21<1:09:24, 20.32s/it] {'loss': 0.0092, 'grad_norm': 1.453988465035073, 'learning_rate': 8.2e-08, 'completion_length': 48.37500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.22900390625, 'epoch': 0.92} 92%|█████████▏| 2295/2500 [18:06:22<1:09:24, 20.32s/it] 92%|█████████▏| 2296/2500 [18:06:42<1:09:20, 20.39s/it] {'loss': 0.0888, 'grad_norm': 12.937608719193703, 'learning_rate': 8.16e-08, 'completion_length': 43.33035850524902, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9375000596046448, 'reward_std': 0.1418914496898651, 'kl': 2.2041015625, 'epoch': 0.92} 92%|█████████▏| 2296/2500 [18:06:42<1:09:20, 20.39s/it] 92%|█████████▏| 2297/2500 [18:07:02<1:08:57, 20.38s/it] {'loss': 0.0103, 'grad_norm': 2.5428280207567306, 'learning_rate': 8.119999999999999e-08, 'completion_length': 52.36607360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.2578125, 'epoch': 0.92} 92%|█████████▏| 2297/2500 [18:07:02<1:08:57, 20.38s/it] 
92%|█████████▏| 2298/2500 [18:07:23<1:08:27, 20.33s/it] {'loss': 0.0247, 'grad_norm': 3.1071810795024217, 'learning_rate': 8.08e-08, 'completion_length': 51.98214530944824, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9017857313156128, 'reward_std': 0.07576143741607666, 'kl': 0.6171875, 'epoch': 0.92} 92%|█████████▏| 2298/2500 [18:07:23<1:08:27, 20.33s/it] 92%|█████████▏| 2299/2500 [18:07:44<1:09:14, 20.67s/it] {'loss': 0.0076, 'grad_norm': 0.9603622154592892, 'learning_rate': 8.039999999999999e-08, 'completion_length': 50.517860412597656, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.053144559264183044, 'kl': 0.19091796875, 'epoch': 0.92} 92%|█████████▏| 2299/2500 [18:07:44<1:09:14, 20.67s/it] 92%|█████████▏| 2300/2500 [18:08:04<1:07:43, 20.32s/it] {'loss': 0.0383, 'grad_norm': 6.043490618908677, 'learning_rate': 8e-08, 'completion_length': 57.24107360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.9560546875, 'epoch': 0.92} 92%|█████████▏| 2300/2500 [18:08:04<1:07:43, 20.32s/it] 92%|█████████▏| 2301/2500 [18:08:48<1:30:55, 27.41s/it] {'loss': 0.0076, 'grad_norm': 2.3671725011792284, 'learning_rate': 7.96e-08, 'completion_length': 52.06250190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.18994140625, 'epoch': 0.92} 92%|█████████▏| 2301/2500 [18:08:48<1:30:55, 27.41s/it] 92%|█████████▏| 2302/2500 [18:08:57<1:12:55, 22.10s/it] {'loss': 0.0091, 'grad_norm': 0.9091520368653416, 'learning_rate': 7.920000000000001e-08, 'completion_length': 50.35714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.22705078125, 'epoch': 0.92} 
92%|█████████▏| 2302/2500 [18:08:57<1:12:55, 22.10s/it] 92%|█████████▏| 2303/2500 [18:09:05<58:32, 17.83s/it] {'loss': 0.0231, 'grad_norm': 3.151644877768202, 'learning_rate': 7.879999999999999e-08, 'completion_length': 46.66071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.576171875, 'epoch': 0.92} 92%|█████████▏| 2303/2500 [18:09:05<58:32, 17.83s/it] 92%|█████████▏| 2304/2500 [18:09:15<50:18, 15.40s/it] {'loss': 0.0111, 'grad_norm': 2.4847754471361396, 'learning_rate': 7.839999999999999e-08, 'completion_length': 50.83928871154785, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.8928572535514832, 'reward_std': 0.167145274579525, 'kl': 0.27783203125, 'epoch': 0.92} 92%|█████████▏| 2304/2500 [18:09:15<50:18, 15.40s/it] 92%|█████████▏| 2305/2500 [18:09:24<43:29, 13.38s/it] {'loss': 0.006, 'grad_norm': 2.6900449166334988, 'learning_rate': 7.8e-08, 'completion_length': 49.32143020629883, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.05831881985068321, 'kl': 0.15087890625, 'epoch': 0.92} 92%|█████████▏| 2305/2500 [18:09:24<43:29, 13.38s/it] 92%|█████████▏| 2306/2500 [18:09:33<39:20, 12.17s/it] {'loss': 0.0146, 'grad_norm': 3.7068947386570668, 'learning_rate': 7.76e-08, 'completion_length': 48.25893020629883, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.919642984867096, 'reward_std': 0.19239907711744308, 'kl': 0.3642578125, 'epoch': 0.92} 92%|█████████▏| 2306/2500 [18:09:33<39:20, 12.17s/it] 92%|█████████▏| 2307/2500 [18:09:47<40:48, 12.68s/it] {'loss': 0.0122, 'grad_norm': 4.73308022406013, 'learning_rate': 7.72e-08, 'completion_length': 54.50893211364746, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 
0.973214328289032, 'reward': 1.9375001192092896, 'reward_std': 0.1541598215699196, 'kl': 0.3046875, 'epoch': 0.92} 92%|█████████▏| 2307/2500 [18:09:47<40:48, 12.68s/it] 92%|█████████▏| 2308/2500 [18:10:02<43:09, 13.49s/it] {'loss': 0.0137, 'grad_norm': 1.2850228621828363, 'learning_rate': 7.679999999999999e-08, 'completion_length': 46.50893020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.34130859375, 'epoch': 0.92} 92%|█████████▏| 2308/2500 [18:10:02<43:09, 13.49s/it] 92%|█████████▏| 2309/2500 [18:10:13<40:38, 12.77s/it] {'loss': 0.0074, 'grad_norm': 2.322596697250348, 'learning_rate': 7.64e-08, 'completion_length': 52.82143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.18408203125, 'epoch': 0.92} 92%|█████████▏| 2309/2500 [18:10:13<40:38, 12.77s/it] 92%|█████████▏| 2310/2500 [18:10:21<35:41, 11.27s/it] {'loss': 0.0618, 'grad_norm': 10.018336444938925, 'learning_rate': 7.599999999999999e-08, 'completion_length': 47.68750190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9375000596046448, 'reward_std': 0.11394162476062775, 'kl': 1.5439453125, 'epoch': 0.92} 92%|█████████▏| 2310/2500 [18:10:21<35:41, 11.27s/it] 92%|█████████▏| 2311/2500 [18:10:29<32:24, 10.29s/it] {'loss': 0.0131, 'grad_norm': 3.961889869957052, 'learning_rate': 7.56e-08, 'completion_length': 48.54464530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.946428656578064, 'reward_std': 0.15152288228273392, 'kl': 0.328125, 'epoch': 0.92} 92%|█████████▏| 2311/2500 [18:10:29<32:24, 10.29s/it] 92%|█████████▏| 2312/2500 [18:10:37<29:54, 9.54s/it] {'loss': 0.013, 'grad_norm': 1.9309413348779612, 'learning_rate': 7.52e-08, 
'completion_length': 52.91964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.3251953125, 'epoch': 0.92} 92%|█████████▏| 2312/2500 [18:10:37<29:54, 9.54s/it] 93%|█████████▎| 2313/2500 [18:10:45<28:25, 9.12s/it] {'loss': 0.0408, 'grad_norm': 5.425368620725168, 'learning_rate': 7.480000000000001e-08, 'completion_length': 46.25893020629883, 'rewards/accuracy_reward': 0.9375000596046448, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9017858505249023, 'reward_std': 0.2149568870663643, 'kl': 1.02294921875, 'epoch': 0.93} 93%|█████████▎| 2313/2500 [18:10:45<28:25, 9.12s/it] 93%|█████████▎| 2314/2500 [18:10:53<27:08, 8.75s/it] {'loss': 0.0081, 'grad_norm': 0.8096500512190921, 'learning_rate': 7.439999999999999e-08, 'completion_length': 49.98214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.201171875, 'epoch': 0.93} 93%|█████████▎| 2314/2500 [18:10:53<27:08, 8.75s/it] 93%|█████████▎| 2315/2500 [18:11:02<27:08, 8.80s/it] {'loss': 0.0126, 'grad_norm': 1.376905901107705, 'learning_rate': 7.399999999999999e-08, 'completion_length': 48.54464530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.31494140625, 'epoch': 0.93} 93%|█████████▎| 2315/2500 [18:11:02<27:08, 8.80s/it] 93%|█████████▎| 2316/2500 [18:11:16<32:16, 10.52s/it] {'loss': 0.0279, 'grad_norm': 3.181972010248286, 'learning_rate': 7.36e-08, 'completion_length': 47.642860412597656, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9375000596046448, 'reward_std': 0.141891460865736, 'kl': 0.69970703125, 'epoch': 0.93} 93%|█████████▎| 2316/2500 [18:11:16<32:16, 10.52s/it] 93%|█████████▎| 2317/2500 [18:11:24<29:13, 9.58s/it] {'loss': 0.0088, 'grad_norm': 
7.407914608810407, 'learning_rate': 7.32e-08, 'completion_length': 43.95535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21875, 'epoch': 0.93} 93%|█████████▎| 2317/2500 [18:11:24<29:13, 9.58s/it] 93%|█████████▎| 2318/2500 [18:11:31<27:14, 8.98s/it] {'loss': 0.0151, 'grad_norm': 2.5639647203522853, 'learning_rate': 7.28e-08, 'completion_length': 44.90178871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.946428656578064, 'reward_std': 0.13408026099205017, 'kl': 0.37890625, 'epoch': 0.93} 93%|█████████▎| 2318/2500 [18:11:31<27:14, 8.98s/it] 93%|█████████▎| 2319/2500 [18:11:40<26:49, 8.89s/it] {'loss': 0.0082, 'grad_norm': 1.048245259787695, 'learning_rate': 7.24e-08, 'completion_length': 50.00893211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2041015625, 'epoch': 0.93} 93%|█████████▎| 2319/2500 [18:11:40<26:49, 8.89s/it] 93%|█████████▎| 2320/2500 [18:11:48<26:22, 8.79s/it] {'loss': 0.0183, 'grad_norm': 3.982574615291772, 'learning_rate': 7.2e-08, 'completion_length': 51.05357360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.45703125, 'epoch': 0.93} 93%|█████████▎| 2320/2500 [18:11:48<26:22, 8.79s/it] 93%|█████████▎| 2321/2500 [18:11:58<26:36, 8.92s/it] {'loss': 0.0443, 'grad_norm': 14.499222530458052, 'learning_rate': 7.159999999999999e-08, 'completion_length': 46.68750190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642858505249023, 'reward_std': 0.0835726335644722, 'kl': 1.1064453125, 'epoch': 0.93} 93%|█████████▎| 2321/2500 [18:11:58<26:36, 8.92s/it] 
93%|█████████▎| 2322/2500 [18:12:06<26:04, 8.79s/it] {'loss': 0.0097, 'grad_norm': 2.2669930046771536, 'learning_rate': 7.12e-08, 'completion_length': 49.78571701049805, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.24169921875, 'epoch': 0.93} 93%|█████████▎| 2322/2500 [18:12:06<26:04, 8.79s/it] 93%|█████████▎| 2323/2500 [18:12:15<26:09, 8.86s/it] {'loss': 0.0303, 'grad_norm': 2.4976192733125018, 'learning_rate': 7.08e-08, 'completion_length': 47.80357360839844, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.946428656578064, 'reward_std': 0.15152287483215332, 'kl': 0.75537109375, 'epoch': 0.93} 93%|█████████▎| 2323/2500 [18:12:15<26:09, 8.86s/it] 93%|█████████▎| 2324/2500 [18:12:24<25:40, 8.75s/it] {'loss': 0.0077, 'grad_norm': 3.9249092804059984, 'learning_rate': 7.04e-08, 'completion_length': 50.750003814697266, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9196429252624512, 'reward_std': 0.16444924473762512, 'kl': 0.19091796875, 'epoch': 0.93} 93%|█████████▎| 2324/2500 [18:12:24<25:40, 8.75s/it] 93%|█████████▎| 2325/2500 [18:12:33<26:08, 8.97s/it] {'loss': 0.0135, 'grad_norm': 1.473593165620714, 'learning_rate': 7e-08, 'completion_length': 51.63393211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.33837890625, 'epoch': 0.93} 93%|█████████▎| 2325/2500 [18:12:33<26:08, 8.97s/it] 93%|█████████▎| 2326/2500 [18:12:44<27:11, 9.38s/it] {'loss': 0.0136, 'grad_norm': 2.042933373992501, 'learning_rate': 6.959999999999999e-08, 'completion_length': 55.71428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 
'kl': 0.33984375, 'epoch': 0.93} 93%|█████████▎| 2326/2500 [18:12:44<27:11, 9.38s/it] 93%|█████████▎| 2327/2500 [18:12:53<27:17, 9.46s/it] {'loss': 0.0741, 'grad_norm': 7.507119719414708, 'learning_rate': 6.92e-08, 'completion_length': 41.54464340209961, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 1.853515625, 'epoch': 0.93} 93%|█████████▎| 2327/2500 [18:12:53<27:17, 9.46s/it] 93%|█████████▎| 2328/2500 [18:13:01<25:52, 9.03s/it] {'loss': 0.0101, 'grad_norm': 1.9157188657400634, 'learning_rate': 6.88e-08, 'completion_length': 45.69643020629883, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.251953125, 'epoch': 0.93} 93%|█████████▎| 2328/2500 [18:13:01<25:52, 9.03s/it] 93%|█████████▎| 2329/2500 [18:13:10<25:21, 8.90s/it] {'loss': 0.0142, 'grad_norm': 1.3147650283084868, 'learning_rate': 6.84e-08, 'completion_length': 50.27678871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.353515625, 'epoch': 0.93} 93%|█████████▎| 2329/2500 [18:13:10<25:21, 8.90s/it] 93%|█████████▎| 2330/2500 [18:13:19<25:24, 8.97s/it] {'loss': 0.0059, 'grad_norm': 2.353746959114138, 'learning_rate': 6.8e-08, 'completion_length': 48.55357360839844, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.1474609375, 'epoch': 0.93} 93%|█████████▎| 2330/2500 [18:13:19<25:24, 8.97s/it] 93%|█████████▎| 2331/2500 [18:13:32<28:22, 10.07s/it] {'loss': 0.0201, 'grad_norm': 2.5325014379159816, 'learning_rate': 6.76e-08, 'completion_length': 46.767860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 
1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.50390625, 'epoch': 0.93} 93%|█████████▎| 2331/2500 [18:13:32<28:22, 10.07s/it] 93%|█████████▎| 2332/2500 [18:13:50<35:34, 12.70s/it] {'loss': 0.0214, 'grad_norm': 2.908778098442828, 'learning_rate': 6.719999999999999e-08, 'completion_length': 48.62500190734863, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9375001192092896, 'reward_std': 0.1418914571404457, 'kl': 0.5322265625, 'epoch': 0.93} 93%|█████████▎| 2332/2500 [18:13:50<35:34, 12.70s/it] 93%|█████████▎| 2333/2500 [18:14:11<42:12, 15.17s/it] {'loss': 0.0273, 'grad_norm': 17.16385911361676, 'learning_rate': 6.679999999999999e-08, 'completion_length': 46.29464340209961, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.6787109375, 'epoch': 0.93} 93%|█████████▎| 2333/2500 [18:14:11<42:12, 15.17s/it] 93%|█████████▎| 2334/2500 [18:14:32<46:46, 16.91s/it] {'loss': 0.0238, 'grad_norm': 5.608207539622359, 'learning_rate': 6.64e-08, 'completion_length': 46.90178871154785, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.910714328289032, 'reward_std': 0.1897030547261238, 'kl': 0.5966796875, 'epoch': 0.93} 93%|█████████▎| 2334/2500 [18:14:32<46:46, 16.91s/it] 93%|█████████▎| 2335/2500 [18:14:56<52:22, 19.04s/it] {'loss': 0.0093, 'grad_norm': 1.048580891587455, 'learning_rate': 6.6e-08, 'completion_length': 45.63393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.23193359375, 'epoch': 0.93} 93%|█████████▎| 2335/2500 [18:14:56<52:22, 19.04s/it] 93%|█████████▎| 2336/2500 [18:15:18<54:35, 19.97s/it] {'loss': 0.025, 'grad_norm': 2.177082365362449, 'learning_rate': 6.56e-08, 'completion_length': 47.29464530944824, 'rewards/accuracy_reward': 0.973214328289032, 
'rewards/format_reward': 0.973214328289032, 'reward': 1.946428656578064, 'reward_std': 0.15152288228273392, 'kl': 0.626953125, 'epoch': 0.93} 93%|█████████▎| 2336/2500 [18:15:18<54:35, 19.97s/it] 93%|█████████▎| 2337/2500 [18:15:40<55:19, 20.36s/it] {'loss': 0.0097, 'grad_norm': 2.271859178171974, 'learning_rate': 6.519999999999999e-08, 'completion_length': 45.35714340209961, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.2412109375, 'epoch': 0.93} 93%|█████████▎| 2337/2500 [18:15:40<55:19, 20.36s/it] 94%|█████████▎| 2338/2500 [18:16:02<56:20, 20.87s/it] {'loss': 0.0135, 'grad_norm': 3.654815773042554, 'learning_rate': 6.48e-08, 'completion_length': 49.96428871154785, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9375000596046448, 'reward_std': 0.141891460865736, 'kl': 0.3359375, 'epoch': 0.94} 94%|█████████▎| 2338/2500 [18:16:02<56:20, 20.87s/it] 94%|█████████▎| 2339/2500 [18:16:23<56:10, 20.94s/it] {'loss': 0.0187, 'grad_norm': 4.950440710921665, 'learning_rate': 6.44e-08, 'completion_length': 47.50000190734863, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8750000596046448, 'reward_std': 0.2642521373927593, 'kl': 0.4697265625, 'epoch': 0.94} 94%|█████████▎| 2339/2500 [18:16:23<56:10, 20.94s/it] 94%|█████████▎| 2340/2500 [18:16:45<56:36, 21.23s/it] {'loss': 0.0168, 'grad_norm': 4.176467667780681, 'learning_rate': 6.4e-08, 'completion_length': 49.25893211364746, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.4189453125, 'epoch': 0.94} 94%|█████████▎| 2340/2500 [18:16:45<56:36, 21.23s/it] 94%|█████████▎| 2341/2500 [18:17:06<55:55, 21.11s/it] {'loss': 0.0111, 'grad_norm': 1.148129356028897, 'learning_rate': 6.36e-08, 'completion_length': 
46.00000190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2763671875, 'epoch': 0.94} 94%|█████████▎| 2341/2500 [18:17:06<55:55, 21.11s/it] 94%|█████████▎| 2342/2500 [18:17:26<55:03, 20.91s/it] {'loss': 0.0087, 'grad_norm': 5.579507464986795, 'learning_rate': 6.32e-08, 'completion_length': 43.92857360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.10882644355297089, 'kl': 0.21630859375, 'epoch': 0.94} 94%|█████████▎| 2342/2500 [18:17:26<55:03, 20.91s/it] 94%|█████████▎| 2343/2500 [18:17:47<55:01, 21.03s/it] {'loss': 0.0073, 'grad_norm': 10.786852752934621, 'learning_rate': 6.279999999999999e-08, 'completion_length': 44.58928680419922, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 1.0, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.18310546875, 'epoch': 0.94} 94%|█████████▎| 2343/2500 [18:17:47<55:01, 21.03s/it] 94%|█████████▍| 2344/2500 [18:18:09<55:03, 21.18s/it] {'loss': 0.0156, 'grad_norm': 4.548701751095214, 'learning_rate': 6.239999999999999e-08, 'completion_length': 42.88393020629883, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.9375001192092896, 'reward_std': 0.1767766997218132, 'kl': 0.3896484375, 'epoch': 0.94} 94%|█████████▍| 2344/2500 [18:18:09<55:03, 21.18s/it] 94%|█████████▍| 2345/2500 [18:18:29<53:39, 20.77s/it] {'loss': 0.0083, 'grad_norm': 1.8358760772994158, 'learning_rate': 6.2e-08, 'completion_length': 45.66071701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.20654296875, 'epoch': 0.94} 94%|█████████▍| 2345/2500 [18:18:29<53:39, 20.77s/it] 94%|█████████▍| 2346/2500 [18:18:52<55:13, 21.52s/it] 
{'loss': 0.0177, 'grad_norm': 4.464896188558381, 'learning_rate': 6.16e-08, 'completion_length': 46.47321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.44091796875, 'epoch': 0.94} 94%|█████████▍| 2346/2500 [18:18:52<55:13, 21.52s/it] 94%|█████████▍| 2347/2500 [18:19:13<54:28, 21.37s/it] {'loss': 0.0456, 'grad_norm': 10.356079796217365, 'learning_rate': 6.119999999999999e-08, 'completion_length': 43.95535850524902, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.955357164144516, 'reward': 1.9196429252624512, 'reward_std': 0.19239908456802368, 'kl': 1.13818359375, 'epoch': 0.94} 94%|█████████▍| 2347/2500 [18:19:13<54:28, 21.37s/it] 94%|█████████▍| 2348/2500 [18:19:34<53:51, 21.26s/it] {'loss': 0.0077, 'grad_norm': 0.7556227271962198, 'learning_rate': 6.08e-08, 'completion_length': 48.517860412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.19287109375, 'epoch': 0.94} 94%|█████████▍| 2348/2500 [18:19:34<53:51, 21.26s/it] 94%|█████████▍| 2349/2500 [18:19:56<53:47, 21.37s/it] {'loss': 0.0243, 'grad_norm': 11.202384039257321, 'learning_rate': 6.04e-08, 'completion_length': 45.17857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.6083984375, 'epoch': 0.94} 94%|█████████▍| 2349/2500 [18:19:56<53:47, 21.37s/it] 94%|█████████▍| 2350/2500 [18:20:17<53:24, 21.37s/it] {'loss': 0.0126, 'grad_norm': 2.5795813371991896, 'learning_rate': 6e-08, 'completion_length': 45.19643020629883, 'rewards/accuracy_reward': 0.9642857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.946428656578064, 'reward_std': 0.0901123583316803, 'kl': 0.31591796875, 'epoch': 0.94} 94%|█████████▍| 2350/2500 [18:20:17<53:24, 21.37s/it] 94%|█████████▍| 2351/2500 
[18:20:38<53:11, 21.42s/it] {'loss': 0.0174, 'grad_norm': 4.045750286153592, 'learning_rate': 5.96e-08, 'completion_length': 52.67857360839844, 'rewards/accuracy_reward': 0.9196429252624512, 'rewards/format_reward': 0.973214328289032, 'reward': 1.8928572535514832, 'reward_std': 0.18458788841962814, 'kl': 0.435546875, 'epoch': 0.94} 94%|█████████▍| 2351/2500 [18:20:38<53:11, 21.42s/it] 94%|█████████▍| 2352/2500 [18:20:59<52:28, 21.27s/it] {'loss': 0.0075, 'grad_norm': 1.6284096454182553, 'learning_rate': 5.92e-08, 'completion_length': 44.12500190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.18701171875, 'epoch': 0.94} 94%|█████████▍| 2352/2500 [18:20:59<52:28, 21.27s/it] 94%|█████████▍| 2353/2500 [18:21:20<51:16, 20.93s/it] {'loss': 0.0099, 'grad_norm': 2.5363860500985997, 'learning_rate': 5.88e-08, 'completion_length': 46.48214530944824, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9017857909202576, 'reward_std': 0.07576144114136696, 'kl': 0.24658203125, 'epoch': 0.94} 94%|█████████▍| 2353/2500 [18:21:20<51:16, 20.93s/it] 94%|█████████▍| 2354/2500 [18:21:41<51:05, 21.00s/it] {'loss': 0.0148, 'grad_norm': 3.974678444101958, 'learning_rate': 5.84e-08, 'completion_length': 46.82143020629883, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9196429252624512, 'reward_std': 0.1379830539226532, 'kl': 0.37109375, 'epoch': 0.94} 94%|█████████▍| 2354/2500 [18:21:41<51:05, 21.00s/it] 94%|█████████▍| 2355/2500 [18:22:01<50:21, 20.83s/it] {'loss': 0.0132, 'grad_norm': 2.2687122409920875, 'learning_rate': 5.8e-08, 'completion_length': 48.16071701049805, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.33203125, 'epoch': 0.94} 94%|█████████▍| 
2355/2500 [18:22:01<50:21, 20.83s/it] 94%|█████████▍| 2356/2500 [18:22:20<48:41, 20.28s/it] {'loss': 0.0187, 'grad_norm': 2.9261596730138844, 'learning_rate': 5.759999999999999e-08, 'completion_length': 48.33928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.46826171875, 'epoch': 0.94} 94%|█████████▍| 2356/2500 [18:22:20<48:41, 20.28s/it] 94%|█████████▍| 2357/2500 [18:22:41<48:56, 20.53s/it] {'loss': 0.0144, 'grad_norm': 1.6343128566774465, 'learning_rate': 5.7199999999999996e-08, 'completion_length': 45.02678871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.357421875, 'epoch': 0.94} 94%|█████████▍| 2357/2500 [18:22:41<48:56, 20.53s/it] 94%|█████████▍| 2358/2500 [18:23:02<48:53, 20.66s/it] {'loss': 0.0184, 'grad_norm': 3.6973925563176366, 'learning_rate': 5.68e-08, 'completion_length': 44.910715103149414, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.05831882357597351, 'kl': 0.458984375, 'epoch': 0.94} 94%|█████████▍| 2358/2500 [18:23:02<48:53, 20.66s/it] 94%|█████████▍| 2359/2500 [18:23:23<48:40, 20.71s/it] {'loss': 0.0423, 'grad_norm': 3.8207987680733964, 'learning_rate': 5.6399999999999995e-08, 'completion_length': 42.44643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 1.0556640625, 'epoch': 0.94} 94%|█████████▍| 2359/2500 [18:23:23<48:40, 20.71s/it] 94%|█████████▍| 2360/2500 [18:23:45<49:02, 21.02s/it] {'loss': 0.0596, 'grad_norm': 7.138149265222913, 'learning_rate': 5.6e-08, 'completion_length': 43.87500190734863, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9464285969734192, 
'reward': 1.9196429252624512, 'reward_std': 0.09138382971286774, 'kl': 1.486328125, 'epoch': 0.94} 94%|█████████▍| 2360/2500 [18:23:45<49:02, 21.02s/it] 94%|█████████▍| 2361/2500 [18:24:05<48:10, 20.79s/it] {'loss': 0.0144, 'grad_norm': 1.9942740817860827, 'learning_rate': 5.5599999999999995e-08, 'completion_length': 46.75893020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3603515625, 'epoch': 0.94} 94%|█████████▍| 2361/2500 [18:24:05<48:10, 20.79s/it] 94%|█████████▍| 2362/2500 [18:24:26<47:54, 20.83s/it] {'loss': 0.0116, 'grad_norm': 3.687510185756513, 'learning_rate': 5.52e-08, 'completion_length': 44.43750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.2900390625, 'epoch': 0.94} 94%|█████████▍| 2362/2500 [18:24:26<47:54, 20.83s/it] 95%|█████████▍| 2363/2500 [18:24:49<49:11, 21.54s/it] {'loss': 0.0052, 'grad_norm': 0.2707351768148224, 'learning_rate': 5.48e-08, 'completion_length': 43.33035850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13037109375, 'epoch': 0.95} 95%|█████████▍| 2363/2500 [18:24:49<49:11, 21.54s/it] 95%|█████████▍| 2364/2500 [18:25:10<48:06, 21.22s/it] {'loss': 0.0209, 'grad_norm': 4.777130481410048, 'learning_rate': 5.44e-08, 'completion_length': 46.68750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.5224609375, 'epoch': 0.95} 95%|█████████▍| 2364/2500 [18:25:10<48:06, 21.22s/it] 95%|█████████▍| 2365/2500 [18:25:31<47:33, 21.14s/it] {'loss': 0.0202, 'grad_norm': 2.9954039573403954, 'learning_rate': 5.3999999999999994e-08, 'completion_length': 43.75893020629883, 'rewards/accuracy_reward': 0.9821428656578064, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.5048828125, 'epoch': 0.95} 95%|█████████▍| 2365/2500 [18:25:31<47:33, 21.14s/it] 95%|█████████▍| 2366/2500 [18:25:53<48:14, 21.60s/it] {'loss': 0.0347, 'grad_norm': 5.316290652064122, 'learning_rate': 5.36e-08, 'completion_length': 49.90178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.8671875, 'epoch': 0.95} 95%|█████████▍| 2366/2500 [18:25:53<48:14, 21.60s/it] 95%|█████████▍| 2367/2500 [18:26:13<46:44, 21.08s/it] {'loss': 0.0106, 'grad_norm': 2.331323906415977, 'learning_rate': 5.319999999999999e-08, 'completion_length': 45.375003814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.26611328125, 'epoch': 0.95} 95%|█████████▍| 2367/2500 [18:26:13<46:44, 21.08s/it] 95%|█████████▍| 2368/2500 [18:26:33<45:32, 20.70s/it] {'loss': 0.0467, 'grad_norm': 6.627836794693878, 'learning_rate': 5.2799999999999996e-08, 'completion_length': 42.87500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 1.171875, 'epoch': 0.95} 95%|█████████▍| 2368/2500 [18:26:33<45:32, 20.70s/it] 95%|█████████▍| 2369/2500 [18:26:55<45:53, 21.02s/it] {'loss': 0.0182, 'grad_norm': 2.878960606146657, 'learning_rate': 5.24e-08, 'completion_length': 44.46428871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.4541015625, 'epoch': 0.95} 95%|█████████▍| 2369/2500 [18:26:55<45:53, 21.02s/it] 95%|█████████▍| 2370/2500 [18:27:15<45:18, 20.92s/it] {'loss': 0.0253, 'grad_norm': 2.409552343327847, 
'learning_rate': 5.1999999999999996e-08, 'completion_length': 50.87500190734863, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.10365218669176102, 'kl': 0.6318359375, 'epoch': 0.95} 95%|█████████▍| 2370/2500 [18:27:15<45:18, 20.92s/it] 95%|█████████▍| 2371/2500 [18:27:36<44:27, 20.68s/it] {'loss': 0.0497, 'grad_norm': 3.9237633055550667, 'learning_rate': 5.16e-08, 'completion_length': 44.43750190734863, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.955357164144516, 'reward': 1.919642984867096, 'reward_std': 0.17104806005954742, 'kl': 1.24267578125, 'epoch': 0.95} 95%|█████████▍| 2371/2500 [18:27:36<44:27, 20.68s/it] 95%|█████████▍| 2372/2500 [18:27:57<44:23, 20.81s/it] {'loss': 0.0143, 'grad_norm': 3.6780093992443827, 'learning_rate': 5.12e-08, 'completion_length': 46.25893020629883, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9196429252624512, 'reward_std': 0.1664527878165245, 'kl': 0.3583984375, 'epoch': 0.95} 95%|█████████▍| 2372/2500 [18:27:57<44:23, 20.81s/it] 95%|█████████▍| 2373/2500 [18:28:17<43:54, 20.75s/it] {'loss': 0.0116, 'grad_norm': 1.7766679999945838, 'learning_rate': 5.08e-08, 'completion_length': 45.38393020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2900390625, 'epoch': 0.95} 95%|█████████▍| 2373/2500 [18:28:17<43:54, 20.75s/it] 95%|█████████▍| 2374/2500 [18:28:39<44:14, 21.07s/it] {'loss': 0.0246, 'grad_norm': 5.823319844042016, 'learning_rate': 5.04e-08, 'completion_length': 45.74107360839844, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.883928656578064, 'reward_std': 0.1767766922712326, 'kl': 0.6142578125, 'epoch': 0.95} 95%|█████████▍| 2374/2500 [18:28:39<44:14, 21.07s/it] 95%|█████████▌| 2375/2500 [18:29:03<45:52, 22.02s/it] {'loss': 0.024, 
'grad_norm': 2.9217230398556358, 'learning_rate': 5e-08, 'completion_length': 48.35714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.6015625, 'epoch': 0.95} 95%|█████████▌| 2375/2500 [18:29:03<45:52, 22.02s/it] 95%|█████████▌| 2376/2500 [18:29:23<44:16, 21.42s/it] {'loss': 0.0052, 'grad_norm': 0.3630482531743728, 'learning_rate': 4.9599999999999994e-08, 'completion_length': 43.80357360839844, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.1298828125, 'epoch': 0.95} 95%|█████████▌| 2376/2500 [18:29:23<44:16, 21.42s/it] 95%|█████████▌| 2377/2500 [18:29:44<43:27, 21.20s/it] {'loss': 0.0133, 'grad_norm': 2.001185250230421, 'learning_rate': 4.92e-08, 'completion_length': 45.22321701049805, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.3330078125, 'epoch': 0.95} 95%|█████████▌| 2377/2500 [18:29:44<43:27, 21.20s/it] 95%|█████████▌| 2378/2500 [18:30:04<42:36, 20.96s/it] {'loss': 0.0135, 'grad_norm': 3.026238125052817, 'learning_rate': 4.88e-08, 'completion_length': 47.35714530944824, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.12626907229423523, 'kl': 0.33740234375, 'epoch': 0.95} 95%|█████████▌| 2378/2500 [18:30:04<42:36, 20.96s/it] 95%|█████████▌| 2379/2500 [18:30:24<41:14, 20.45s/it] {'loss': 0.0114, 'grad_norm': 3.4243460871547367, 'learning_rate': 4.8399999999999997e-08, 'completion_length': 51.23214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.973214328289032, 'reward_std': 0.07576143741607666, 'kl': 0.28515625, 'epoch': 0.95} 95%|█████████▌| 2379/2500 [18:30:24<41:14, 20.45s/it] 
95%|█████████▌| 2380/2500 [18:30:44<41:06, 20.56s/it] {'loss': 0.0074, 'grad_norm': 0.617088761784798, 'learning_rate': 4.8e-08, 'completion_length': 47.03571701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1845703125, 'epoch': 0.95} 95%|█████████▌| 2380/2500 [18:30:45<41:06, 20.56s/it] 95%|█████████▌| 2381/2500 [18:31:05<40:44, 20.54s/it] {'loss': 0.0293, 'grad_norm': 3.4342607207918023, 'learning_rate': 4.76e-08, 'completion_length': 45.15178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.7353515625, 'epoch': 0.95} 95%|█████████▌| 2381/2500 [18:31:05<40:44, 20.54s/it] 95%|█████████▌| 2382/2500 [18:31:24<39:40, 20.17s/it] {'loss': 0.0219, 'grad_norm': 2.734631419115553, 'learning_rate': 4.72e-08, 'completion_length': 44.20535850524902, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.9285714626312256, 'reward_std': 0.1671452671289444, 'kl': 0.54931640625, 'epoch': 0.95} 95%|█████████▌| 2382/2500 [18:31:24<39:40, 20.17s/it] 95%|█████████▌| 2383/2500 [18:31:46<40:07, 20.58s/it] {'loss': 0.011, 'grad_norm': 1.0497485899229821, 'learning_rate': 4.68e-08, 'completion_length': 44.55357360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.27392578125, 'epoch': 0.95} 95%|█████████▌| 2383/2500 [18:31:46<40:07, 20.58s/it] 95%|█████████▌| 2384/2500 [18:32:07<40:09, 20.77s/it] {'loss': 0.0106, 'grad_norm': 1.137639701893513, 'learning_rate': 4.639999999999999e-08, 'completion_length': 48.58928871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2646484375, 'epoch': 0.95} 95%|█████████▌| 2384/2500 [18:32:07<40:09, 20.77s/it] 95%|█████████▌| 2385/2500 [18:32:27<39:21, 20.54s/it] {'loss': 0.0222, 'grad_norm': 
4.525459951848469, 'learning_rate': 4.5999999999999995e-08, 'completion_length': 51.45535850524902, 'rewards/accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 0.973214328289032, 'reward': 1.8571429252624512, 'reward_std': 0.18458789587020874, 'kl': 0.5546875, 'epoch': 0.95} 95%|█████████▌| 2385/2500 [18:32:27<39:21, 20.54s/it] 95%|█████████▌| 2386/2500 [18:32:50<40:19, 21.22s/it] {'loss': 0.0086, 'grad_norm': 2.4270130481140626, 'learning_rate': 4.56e-08, 'completion_length': 50.50893020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21435546875, 'epoch': 0.95} 95%|█████████▌| 2386/2500 [18:32:50<40:19, 21.22s/it] 95%|█████████▌| 2387/2500 [18:33:10<39:12, 20.82s/it] {'loss': 0.0452, 'grad_norm': 8.009000088268731, 'learning_rate': 4.5199999999999994e-08, 'completion_length': 45.85714340209961, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.9017857909202576, 'reward_std': 0.23899830877780914, 'kl': 1.12890625, 'epoch': 0.95} 95%|█████████▌| 2387/2500 [18:33:10<39:12, 20.82s/it] 96%|█████████▌| 2388/2500 [18:33:30<38:45, 20.76s/it] {'loss': 0.0221, 'grad_norm': 2.1068340453446237, 'learning_rate': 4.48e-08, 'completion_length': 45.71428680419922, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857909202576, 'reward_std': 0.10101525112986565, 'kl': 0.552734375, 'epoch': 0.96} 96%|█████████▌| 2388/2500 [18:33:30<38:45, 20.76s/it] 96%|█████████▌| 2389/2500 [18:33:51<38:31, 20.83s/it] {'loss': 0.0076, 'grad_norm': 1.3077421399241895, 'learning_rate': 4.44e-08, 'completion_length': 45.05357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.1904296875, 'epoch': 0.96} 96%|█████████▌| 2389/2500 
[18:33:51<38:31, 20.83s/it] 96%|█████████▌| 2390/2500 [18:34:11<37:38, 20.53s/it] {'loss': 0.0086, 'grad_norm': 0.7582593241154182, 'learning_rate': 4.4e-08, 'completion_length': 41.96428871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.214599609375, 'epoch': 0.96} 96%|█████████▌| 2390/2500 [18:34:11<37:38, 20.53s/it] 96%|█████████▌| 2391/2500 [18:34:32<37:34, 20.68s/it] {'loss': 0.0121, 'grad_norm': 2.7804223163852737, 'learning_rate': 4.36e-08, 'completion_length': 51.43750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3037109375, 'epoch': 0.96} 96%|█████████▌| 2391/2500 [18:34:32<37:34, 20.68s/it] 96%|█████████▌| 2392/2500 [18:34:54<38:04, 21.16s/it] {'loss': 0.0115, 'grad_norm': 3.996518029680947, 'learning_rate': 4.32e-08, 'completion_length': 50.70535850524902, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9553571939468384, 'reward_std': 0.12626906856894493, 'kl': 0.28759765625, 'epoch': 0.96} 96%|█████████▌| 2392/2500 [18:34:55<38:04, 21.16s/it] 96%|█████████▌| 2393/2500 [18:35:14<36:52, 20.68s/it] {'loss': 0.012, 'grad_norm': 1.0352916455652965, 'learning_rate': 4.279999999999999e-08, 'completion_length': 50.24107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.30078125, 'epoch': 0.96} 96%|█████████▌| 2393/2500 [18:35:14<36:52, 20.68s/it] 96%|█████████▌| 2394/2500 [18:35:35<36:52, 20.87s/it] {'loss': 0.014, 'grad_norm': 2.7541602685992297, 'learning_rate': 4.2399999999999996e-08, 'completion_length': 47.08035850524902, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375001192092896, 'reward_std': 0.1379830539226532, 'kl': 
0.349609375, 'epoch': 0.96} 96%|█████████▌| 2394/2500 [18:35:35<36:52, 20.87s/it] 96%|█████████▌| 2395/2500 [18:35:58<37:23, 21.37s/it] {'loss': 0.0474, 'grad_norm': 2.8379426650929527, 'learning_rate': 4.2e-08, 'completion_length': 45.33035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.9642857313156128, 'reward_std': 0.0835726335644722, 'kl': 1.1865234375, 'epoch': 0.96} 96%|█████████▌| 2395/2500 [18:35:58<37:23, 21.37s/it] 96%|█████████▌| 2396/2500 [18:36:18<36:15, 20.92s/it] {'loss': 0.0122, 'grad_norm': 2.4961808982562244, 'learning_rate': 4.1599999999999995e-08, 'completion_length': 40.73214530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.30517578125, 'epoch': 0.96} 96%|█████████▌| 2396/2500 [18:36:18<36:15, 20.92s/it] 96%|█████████▌| 2397/2500 [18:36:39<35:59, 20.97s/it] {'loss': 0.0163, 'grad_norm': 2.1213223338171168, 'learning_rate': 4.12e-08, 'completion_length': 44.18750190734863, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.40576171875, 'epoch': 0.96} 96%|█████████▌| 2397/2500 [18:36:39<35:59, 20.97s/it] 96%|█████████▌| 2398/2500 [18:37:01<36:16, 21.34s/it] {'loss': 0.0078, 'grad_norm': 1.3850277219073452, 'learning_rate': 4.08e-08, 'completion_length': 50.99107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.196044921875, 'epoch': 0.96} 96%|█████████▌| 2398/2500 [18:37:01<36:16, 21.34s/it] 96%|█████████▌| 2399/2500 [18:37:21<35:26, 21.06s/it] {'loss': 0.0105, 'grad_norm': 2.561036755553549, 'learning_rate': 4.04e-08, 'completion_length': 45.39285850524902, 'rewards/accuracy_reward': 0.9821428656578064, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.2626953125, 'epoch': 0.96} 96%|█████████▌| 2399/2500 [18:37:21<35:26, 21.06s/it] 96%|█████████▌| 2400/2500 [18:37:44<35:40, 21.40s/it] {'loss': 0.0168, 'grad_norm': 2.5739096437982543, 'learning_rate': 4e-08, 'completion_length': 51.78571701049805, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9375000596046448, 'reward_std': 0.09132473915815353, 'kl': 0.4208984375, 'epoch': 0.96} 96%|█████████▌| 2400/2500 [18:37:44<35:40, 21.40s/it] 96%|█████████▌| 2401/2500 [18:38:41<53:16, 32.29s/it] {'loss': 0.006, 'grad_norm': 0.4285021129727418, 'learning_rate': 3.9600000000000004e-08, 'completion_length': 51.08928871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15087890625, 'epoch': 0.96} 96%|█████████▌| 2401/2500 [18:38:41<53:16, 32.29s/it] 96%|█████████▌| 2402/2500 [18:39:01<46:35, 28.53s/it] {'loss': 0.0185, 'grad_norm': 2.5997200915507954, 'learning_rate': 3.9199999999999994e-08, 'completion_length': 54.50893020629883, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9285714626312256, 'reward_std': 0.0835726335644722, 'kl': 0.462890625, 'epoch': 0.96} 96%|█████████▌| 2402/2500 [18:39:01<46:35, 28.53s/it] 96%|█████████▌| 2403/2500 [18:39:20<41:19, 25.56s/it] {'loss': 0.0139, 'grad_norm': 1.2425890422239136, 'learning_rate': 3.88e-08, 'completion_length': 51.73214530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.34765625, 'epoch': 0.96} 96%|█████████▌| 2403/2500 [18:39:20<41:19, 25.56s/it] 96%|█████████▌| 2404/2500 [18:39:42<39:09, 24.48s/it] {'loss': 0.0178, 'grad_norm': 2.1117454110441614, 'learning_rate': 3.839999999999999e-08, 'completion_length': 
45.91964530944824, 'rewards/accuracy_reward': 0.9821429252624512, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642858505249023, 'reward_std': 0.10101525485515594, 'kl': 0.4462890625, 'epoch': 0.96} 96%|█████████▌| 2404/2500 [18:39:42<39:09, 24.48s/it] 96%|█████████▌| 2405/2500 [18:40:03<37:07, 23.45s/it] {'loss': 0.0068, 'grad_norm': 0.6273656078438377, 'learning_rate': 3.7999999999999996e-08, 'completion_length': 48.910715103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.16943359375, 'epoch': 0.96} 96%|█████████▌| 2405/2500 [18:40:03<37:07, 23.45s/it] 96%|█████████▌| 2406/2500 [18:40:23<35:23, 22.59s/it] {'loss': 0.0084, 'grad_norm': 0.5224350075246551, 'learning_rate': 3.76e-08, 'completion_length': 46.57143020629883, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.21142578125, 'epoch': 0.96} 96%|█████████▌| 2406/2500 [18:40:23<35:23, 22.59s/it] 96%|█████████▋| 2407/2500 [18:40:45<34:46, 22.43s/it] {'loss': 0.0096, 'grad_norm': 2.403290398110577, 'learning_rate': 3.7199999999999996e-08, 'completion_length': 45.58035850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.23974609375, 'epoch': 0.96} 96%|█████████▋| 2407/2500 [18:40:45<34:46, 22.43s/it] 96%|█████████▋| 2408/2500 [18:41:10<35:14, 22.99s/it] {'loss': 0.009, 'grad_norm': 1.1020344591867182, 'learning_rate': 3.68e-08, 'completion_length': 47.66071701049805, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9285714626312256, 'reward_std': 0.0, 'kl': 0.22509765625, 'epoch': 0.96} 96%|█████████▋| 2408/2500 [18:41:10<35:14, 22.99s/it] 96%|█████████▋| 2409/2500 [18:41:28<32:55, 21.70s/it] {'loss': 0.0074, 'grad_norm': 3.519546887276724, 'learning_rate': 3.64e-08, 'completion_length': 44.64285850524902, 
'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.919642984867096, 'reward_std': 0.07576144114136696, 'kl': 0.18408203125, 'epoch': 0.96} 96%|█████████▋| 2409/2500 [18:41:28<32:55, 21.70s/it] 96%|█████████▋| 2410/2500 [18:41:50<32:38, 21.76s/it] {'loss': 0.0124, 'grad_norm': 3.6068533828853244, 'learning_rate': 3.6e-08, 'completion_length': 45.44643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.3095703125, 'epoch': 0.96} 96%|█████████▋| 2410/2500 [18:41:50<32:38, 21.76s/it] 96%|█████████▋| 2411/2500 [18:42:10<31:26, 21.19s/it] {'loss': 0.0082, 'grad_norm': 0.4811031626854162, 'learning_rate': 3.56e-08, 'completion_length': 42.52678680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2041015625, 'epoch': 0.96} 96%|█████████▋| 2411/2500 [18:42:10<31:26, 21.19s/it] 96%|█████████▋| 2412/2500 [18:42:30<30:37, 20.88s/it] {'loss': 0.0173, 'grad_norm': 5.200477052237422, 'learning_rate': 3.52e-08, 'completion_length': 50.10714340209961, 'rewards/accuracy_reward': 0.9464286267757416, 'rewards/format_reward': 0.973214328289032, 'reward': 1.919642984867096, 'reward_std': 0.18849068135023117, 'kl': 0.43359375, 'epoch': 0.96} 96%|█████████▋| 2412/2500 [18:42:30<30:37, 20.88s/it] 97%|█████████▋| 2413/2500 [18:42:54<31:18, 21.59s/it] {'loss': 0.0105, 'grad_norm': 3.03211700941563, 'learning_rate': 3.4799999999999994e-08, 'completion_length': 53.392860412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.26171875, 'epoch': 0.97} 97%|█████████▋| 2413/2500 [18:42:54<31:18, 21.59s/it] 97%|█████████▋| 2414/2500 [18:43:14<30:34, 21.33s/it] {'loss': 0.015, 'grad_norm': 7.460161440709376, 'learning_rate': 3.44e-08, 'completion_length': 
45.23214530944824, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.910714328289032, 'reward_std': 0.05050762742757797, 'kl': 0.37548828125, 'epoch': 0.97} 97%|█████████▋| 2414/2500 [18:43:14<30:34, 21.33s/it] 97%|█████████▋| 2415/2500 [18:43:34<29:37, 20.91s/it] {'loss': 0.0054, 'grad_norm': 2.6485884823789583, 'learning_rate': 3.4e-08, 'completion_length': 44.36607360839844, 'rewards/accuracy_reward': 0.9732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.973214328289032, 'reward_std': 0.03696779906749725, 'kl': 0.134765625, 'epoch': 0.97} 97%|█████████▋| 2415/2500 [18:43:34<29:37, 20.91s/it] 97%|█████████▋| 2416/2500 [18:43:59<30:40, 21.91s/it] {'loss': 0.0121, 'grad_norm': 1.9554194675225032, 'learning_rate': 3.3599999999999996e-08, 'completion_length': 50.87500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.30322265625, 'epoch': 0.97} 97%|█████████▋| 2416/2500 [18:43:59<30:40, 21.91s/it] 97%|█████████▋| 2417/2500 [18:44:20<30:15, 21.88s/it] {'loss': 0.013, 'grad_norm': 2.3181182948300143, 'learning_rate': 3.32e-08, 'completion_length': 47.63393020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9732143878936768, 'reward_std': 0.07576144114136696, 'kl': 0.3251953125, 'epoch': 0.97} 97%|█████████▋| 2417/2500 [18:44:20<30:15, 21.88s/it] 97%|█████████▋| 2418/2500 [18:44:43<30:21, 22.21s/it] {'loss': 0.0302, 'grad_norm': 4.9254888262404695, 'learning_rate': 3.28e-08, 'completion_length': 45.81250190734863, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.75537109375, 'epoch': 0.97} 97%|█████████▋| 2418/2500 [18:44:43<30:21, 22.21s/it] 97%|█████████▋| 2419/2500 [18:45:05<29:48, 22.09s/it] {'loss': 
0.0156, 'grad_norm': 1.7102453041374022, 'learning_rate': 3.24e-08, 'completion_length': 45.14285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.38916015625, 'epoch': 0.97} 97%|█████████▋| 2419/2500 [18:45:05<29:48, 22.09s/it] 97%|█████████▋| 2420/2500 [18:45:27<29:18, 21.98s/it] {'loss': 0.0095, 'grad_norm': 2.575321713154193, 'learning_rate': 3.2e-08, 'completion_length': 44.55357360839844, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9464285969734192, 'reward_std': 0.13408026099205017, 'kl': 0.23876953125, 'epoch': 0.97} 97%|█████████▋| 2420/2500 [18:45:27<29:18, 21.98s/it] 97%|█████████▋| 2421/2500 [18:45:49<28:52, 21.92s/it] {'loss': 0.0136, 'grad_norm': 3.0674724326999936, 'learning_rate': 3.16e-08, 'completion_length': 48.25893211364746, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.34033203125, 'epoch': 0.97} 97%|█████████▋| 2421/2500 [18:45:49<28:52, 21.92s/it] 97%|█████████▋| 2422/2500 [18:46:09<27:51, 21.43s/it] {'loss': 0.0137, 'grad_norm': 0.9595088674509468, 'learning_rate': 3.1199999999999995e-08, 'completion_length': 46.34821701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3408203125, 'epoch': 0.97} 97%|█████████▋| 2422/2500 [18:46:09<27:51, 21.43s/it] 97%|█████████▋| 2423/2500 [18:46:32<28:11, 21.97s/it] {'loss': 0.0086, 'grad_norm': 1.2641173181888627, 'learning_rate': 3.08e-08, 'completion_length': 46.78571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2138671875, 'epoch': 0.97} 97%|█████████▋| 2423/2500 [18:46:32<28:11, 21.97s/it] 97%|█████████▋| 
2424/2500 [18:46:59<29:35, 23.36s/it] {'loss': 0.0332, 'grad_norm': 4.65590292254742, 'learning_rate': 3.04e-08, 'completion_length': 50.205360412597656, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8571429252624512, 'reward_std': 0.11146338284015656, 'kl': 0.83203125, 'epoch': 0.97} 97%|█████████▋| 2424/2500 [18:46:59<29:35, 23.36s/it] 97%|█████████▋| 2425/2500 [18:47:18<27:34, 22.06s/it] {'loss': 0.0076, 'grad_norm': 1.1594811872993014, 'learning_rate': 3e-08, 'completion_length': 50.53571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.18896484375, 'epoch': 0.97} 97%|█████████▋| 2425/2500 [18:47:18<27:34, 22.06s/it] 97%|█████████▋| 2426/2500 [18:47:38<26:30, 21.49s/it] {'loss': 0.0116, 'grad_norm': 1.4502501990603982, 'learning_rate': 2.96e-08, 'completion_length': 51.35714530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.28955078125, 'epoch': 0.97} 97%|█████████▋| 2426/2500 [18:47:38<26:30, 21.49s/it] 97%|█████████▋| 2427/2500 [18:48:01<26:47, 22.02s/it] {'loss': 0.0148, 'grad_norm': 1.395415981379526, 'learning_rate': 2.92e-08, 'completion_length': 50.598215103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.36962890625, 'epoch': 0.97} 97%|█████████▋| 2427/2500 [18:48:01<26:47, 22.02s/it] 97%|█████████▋| 2428/2500 [18:48:21<25:29, 21.25s/it] {'loss': 0.0196, 'grad_norm': 2.9598662195061913, 'learning_rate': 2.8799999999999996e-08, 'completion_length': 49.38393020629883, 'rewards/accuracy_reward': 0.928571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9107143878936768, 'reward_std': 0.07839837297797203, 'kl': 0.490234375, 'epoch': 0.97} 97%|█████████▋| 2428/2500 [18:48:21<25:29, 
21.25s/it] 97%|█████████▋| 2429/2500 [18:48:42<25:10, 21.27s/it] {'loss': 0.0088, 'grad_norm': 28.53343006974539, 'learning_rate': 2.84e-08, 'completion_length': 43.38393020629883, 'rewards/accuracy_reward': 0.9196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.07576143741607666, 'kl': 0.21923828125, 'epoch': 0.97} 97%|█████████▋| 2429/2500 [18:48:42<25:10, 21.27s/it] 97%|█████████▋| 2430/2500 [18:49:03<24:39, 21.14s/it] {'loss': 0.0073, 'grad_norm': 0.9130791530258486, 'learning_rate': 2.8e-08, 'completion_length': 50.83928871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.18310546875, 'epoch': 0.97} 97%|█████████▋| 2430/2500 [18:49:03<24:39, 21.14s/it] 97%|█████████▋| 2431/2500 [18:49:25<24:38, 21.42s/it] {'loss': 0.0116, 'grad_norm': 2.7491231976605355, 'learning_rate': 2.76e-08, 'completion_length': 52.68750190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.291015625, 'epoch': 0.97} 97%|█████████▋| 2431/2500 [18:49:25<24:38, 21.42s/it] 97%|█████████▋| 2432/2500 [18:49:54<27:02, 23.87s/it] {'loss': 0.006, 'grad_norm': 1.409431967246393, 'learning_rate': 2.72e-08, 'completion_length': 52.04464530944824, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.9464285969734192, 'reward_std': 0.033065006136894226, 'kl': 0.1484375, 'epoch': 0.97} 97%|█████████▋| 2432/2500 [18:49:54<27:02, 23.87s/it] 97%|█████████▋| 2433/2500 [18:50:35<32:18, 28.93s/it] {'loss': 0.0295, 'grad_norm': 2.6330065525370356, 'learning_rate': 2.68e-08, 'completion_length': 49.65178871154785, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.973214328289032, 'reward': 1.9553571939468384, 'reward_std': 0.09138382598757744, 'kl': 0.73828125, 'epoch': 0.97} 
97%|█████████▋| 2433/2500 [18:50:35<32:18, 28.93s/it] 97%|█████████▋| 2434/2500 [18:50:56<29:04, 26.43s/it] {'loss': 0.0087, 'grad_norm': 0.9857226384245644, 'learning_rate': 2.6399999999999998e-08, 'completion_length': 46.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2177734375, 'epoch': 0.97} 97%|█████████▋| 2434/2500 [18:50:56<29:04, 26.43s/it] 97%|█████████▋| 2435/2500 [18:51:17<27:06, 25.03s/it] {'loss': 0.008, 'grad_norm': 0.7001169283426351, 'learning_rate': 2.5999999999999998e-08, 'completion_length': 49.32143020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.2001953125, 'epoch': 0.97} 97%|█████████▋| 2435/2500 [18:51:17<27:06, 25.03s/it] 97%|█████████▋| 2436/2500 [18:51:39<25:33, 23.96s/it] {'loss': 0.0122, 'grad_norm': 2.4719786897661518, 'learning_rate': 2.56e-08, 'completion_length': 48.34821701049805, 'rewards/accuracy_reward': 0.955357164144516, 'rewards/format_reward': 1.0, 'reward': 1.9553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.3037109375, 'epoch': 0.97} 97%|█████████▋| 2436/2500 [18:51:39<25:33, 23.96s/it] 97%|█████████▋| 2437/2500 [18:52:00<24:07, 22.98s/it] {'loss': 0.0077, 'grad_norm': 1.1622708180860988, 'learning_rate': 2.52e-08, 'completion_length': 48.11607360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.19189453125, 'epoch': 0.97} 97%|█████████▋| 2437/2500 [18:52:00<24:07, 22.98s/it] 98%|█████████▊| 2438/2500 [18:52:19<22:43, 21.99s/it] {'loss': 0.0057, 'grad_norm': 0.3903651674207057, 'learning_rate': 2.4799999999999997e-08, 'completion_length': 48.580360412597656, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1435546875, 'epoch': 0.98} 98%|█████████▊| 2438/2500 [18:52:19<22:43, 21.99s/it] 98%|█████████▊| 2439/2500 [18:52:41<22:16, 21.92s/it] {'loss': 0.0111, 
'grad_norm': 14.290794695994881, 'learning_rate': 2.44e-08, 'completion_length': 48.79464530944824, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9553571939468384, 'reward_std': 0.053144559264183044, 'kl': 0.27783203125, 'epoch': 0.98} 98%|█████████▊| 2439/2500 [18:52:41<22:16, 21.92s/it] 98%|█████████▊| 2440/2500 [18:53:03<21:52, 21.87s/it] {'loss': 0.0078, 'grad_norm': 0.6306066678777947, 'learning_rate': 2.4e-08, 'completion_length': 53.96428680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1953125, 'epoch': 0.98} 98%|█████████▊| 2440/2500 [18:53:03<21:52, 21.87s/it] 98%|█████████▊| 2441/2500 [18:53:23<20:56, 21.29s/it] {'loss': 0.0052, 'grad_norm': 1.3529096772165428, 'learning_rate': 2.36e-08, 'completion_length': 51.80357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.130859375, 'epoch': 0.98} 98%|█████████▊| 2441/2500 [18:53:23<20:56, 21.29s/it] 98%|█████████▊| 2442/2500 [18:53:45<20:46, 21.50s/it] {'loss': 0.0082, 'grad_norm': 1.9440666821124353, 'learning_rate': 2.3199999999999996e-08, 'completion_length': 47.84821701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.2060546875, 'epoch': 0.98} 98%|█████████▊| 2442/2500 [18:53:45<20:46, 21.50s/it] 98%|█████████▊| 2443/2500 [18:54:05<20:08, 21.21s/it] {'loss': 0.0078, 'grad_norm': 1.6893310735913916, 'learning_rate': 2.28e-08, 'completion_length': 44.70535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.19580078125, 'epoch': 0.98} 98%|█████████▊| 2443/2500 [18:54:05<20:08, 21.21s/it] 98%|█████████▊| 2444/2500 [18:54:32<21:28, 23.01s/it] {'loss': 0.014, 
'grad_norm': 2.298385025301231, 'learning_rate': 2.24e-08, 'completion_length': 49.73214530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9642857909202576, 'reward_std': 0.0835726372897625, 'kl': 0.3486328125, 'epoch': 0.98} 98%|█████████▊| 2444/2500 [18:54:32<21:28, 23.01s/it] 98%|█████████▊| 2445/2500 [18:54:55<20:51, 22.76s/it] {'loss': 0.0064, 'grad_norm': 1.906535359608528, 'learning_rate': 2.2e-08, 'completion_length': 50.312503814697266, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.159423828125, 'epoch': 0.98} 98%|█████████▊| 2445/2500 [18:54:55<20:51, 22.76s/it] 98%|█████████▊| 2446/2500 [18:55:16<20:07, 22.37s/it] {'loss': 0.0075, 'grad_norm': 2.3999815539180376, 'learning_rate': 2.16e-08, 'completion_length': 52.58035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1875, 'epoch': 0.98} 98%|█████████▊| 2446/2500 [18:55:16<20:07, 22.37s/it] 98%|█████████▊| 2447/2500 [18:55:36<19:01, 21.54s/it] {'loss': 0.0084, 'grad_norm': 0.78326443969626, 'learning_rate': 2.1199999999999998e-08, 'completion_length': 49.97321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.21044921875, 'epoch': 0.98} 98%|█████████▊| 2447/2500 [18:55:36<19:01, 21.54s/it] 98%|█████████▊| 2448/2500 [18:56:00<19:23, 22.37s/it] {'loss': 0.0057, 'grad_norm': 0.7308416854344258, 'learning_rate': 2.0799999999999998e-08, 'completion_length': 55.63393211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.142578125, 'epoch': 0.98} 98%|█████████▊| 2448/2500 [18:56:00<19:23, 22.37s/it] 98%|█████████▊| 2449/2500 [18:56:21<18:33, 21.84s/it] {'loss': 0.0091, 'grad_norm': 0.8400496508290518, 
'learning_rate': 2.04e-08, 'completion_length': 52.375003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.22607421875, 'epoch': 0.98} 98%|█████████▊| 2449/2500 [18:56:21<18:33, 21.84s/it] 98%|█████████▊| 2450/2500 [18:56:42<18:00, 21.62s/it] {'loss': 0.0089, 'grad_norm': 0.6915821390568, 'learning_rate': 2e-08, 'completion_length': 45.22321701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.22314453125, 'epoch': 0.98} 98%|█████████▊| 2450/2500 [18:56:42<18:00, 21.62s/it] 98%|█████████▊| 2451/2500 [18:57:02<17:25, 21.33s/it] {'loss': 0.0109, 'grad_norm': 1.284920235997767, 'learning_rate': 1.9599999999999997e-08, 'completion_length': 50.830360412597656, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.272216796875, 'epoch': 0.98} 98%|█████████▊| 2451/2500 [18:57:02<17:25, 21.33s/it] 98%|█████████▊| 2452/2500 [18:57:23<16:49, 21.03s/it] {'loss': 0.0115, 'grad_norm': 3.463222536371402, 'learning_rate': 1.9199999999999997e-08, 'completion_length': 44.15178680419922, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.2890625, 'epoch': 0.98} 98%|█████████▊| 2452/2500 [18:57:23<16:49, 21.03s/it] 98%|█████████▊| 2453/2500 [18:57:45<16:47, 21.44s/it] {'loss': 0.0068, 'grad_norm': 0.5212521993087972, 'learning_rate': 1.88e-08, 'completion_length': 52.69643211364746, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.17041015625, 'epoch': 0.98} 98%|█████████▊| 2453/2500 [18:57:45<16:47, 21.44s/it] 98%|█████████▊| 2454/2500 [18:58:06<16:12, 21.14s/it] {'loss': 0.0108, 'grad_norm': 2.9919004906882067, 'learning_rate': 1.84e-08, 'completion_length': 48.17857360839844, 
'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2705078125, 'epoch': 0.98} 98%|█████████▊| 2454/2500 [18:58:06<16:12, 21.14s/it] 98%|█████████▊| 2455/2500 [18:58:26<15:45, 21.01s/it] {'loss': 0.0068, 'grad_norm': 1.1150837345290987, 'learning_rate': 1.8e-08, 'completion_length': 51.30357360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.16943359375, 'epoch': 0.98} 98%|█████████▊| 2455/2500 [18:58:26<15:45, 21.01s/it] 98%|█████████▊| 2456/2500 [18:58:50<15:55, 21.72s/it] {'loss': 0.0075, 'grad_norm': 1.527610647745206, 'learning_rate': 1.76e-08, 'completion_length': 58.40178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.18896484375, 'epoch': 0.98} 98%|█████████▊| 2456/2500 [18:58:50<15:55, 21.72s/it] 98%|█████████▊| 2457/2500 [18:59:16<16:38, 23.22s/it] {'loss': 0.0095, 'grad_norm': 2.3556772919530844, 'learning_rate': 1.72e-08, 'completion_length': 44.38393211364746, 'rewards/accuracy_reward': 0.9642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.9642857313156128, 'reward_std': 0.06222161650657654, 'kl': 0.23828125, 'epoch': 0.98} 98%|█████████▊| 2457/2500 [18:59:16<16:38, 23.22s/it] 98%|█████████▊| 2458/2500 [18:59:35<15:13, 21.76s/it] {'loss': 0.0081, 'grad_norm': 0.7964864432208413, 'learning_rate': 1.6799999999999998e-08, 'completion_length': 47.87500190734863, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.20263671875, 'epoch': 0.98} 98%|█████████▊| 2458/2500 [18:59:35<15:13, 21.76s/it] 98%|█████████▊| 2459/2500 [19:00:00<15:30, 22.70s/it] {'loss': 0.0315, 'grad_norm': 
2.7337434102807707, 'learning_rate': 1.64e-08, 'completion_length': 44.33035850524902, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9732143878936768, 'reward_std': 0.05831881985068321, 'kl': 0.78857421875, 'epoch': 0.98}
98%|█████████▊| 2459/2500 [19:00:00<15:30, 22.70s/it]
98%|█████████▊| 2460/2500 [19:00:21<14:47, 22.18s/it]
{'loss': 0.0065, 'grad_norm': 1.2515037286962896, 'learning_rate': 1.6e-08, 'completion_length': 46.91071701049805, 'rewards/accuracy_reward': 0.9375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.9375000596046448, 'reward_std': 0.025253813713788986, 'kl': 0.16162109375, 'epoch': 0.98}
98%|█████████▊| 2460/2500 [19:00:21<14:47, 22.18s/it]
98%|█████████▊| 2461/2500 [19:00:42<14:12, 21.85s/it]
{'loss': 0.0049, 'grad_norm': 3.3420509824357345, 'learning_rate': 1.5599999999999997e-08, 'completion_length': 53.40178680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.12353515625, 'epoch': 0.98}
98%|█████████▊| 2461/2500 [19:00:42<14:12, 21.85s/it]
98%|█████████▊| 2462/2500 [19:01:03<13:46, 21.76s/it]
{'loss': 0.0064, 'grad_norm': 1.6813954018651274, 'learning_rate': 1.52e-08, 'completion_length': 54.29464530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.1611328125, 'epoch': 0.98}
98%|█████████▊| 2462/2500 [19:01:03<13:46, 21.76s/it]
99%|█████████▊| 2463/2500 [19:01:24<13:10, 21.36s/it]
{'loss': 0.0123, 'grad_norm': 1.935326426215518, 'learning_rate': 1.48e-08, 'completion_length': 52.00893211364746, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.3076171875, 'epoch': 0.99}
99%|█████████▊| 2463/2500 [19:01:24<13:10, 21.36s/it]
99%|█████████▊| 2464/2500 [19:01:45<12:49, 21.37s/it]
{'loss': 0.0044, 'grad_norm': 0.17753716450451676, 'learning_rate': 1.4399999999999998e-08, 'completion_length': 51.19643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.109130859375, 'epoch': 0.99}
99%|█████████▊| 2464/2500 [19:01:45<12:49, 21.37s/it]
99%|█████████▊| 2465/2500 [19:02:05<12:12, 20.93s/it]
{'loss': 0.0088, 'grad_norm': 2.7453881132317606, 'learning_rate': 1.4e-08, 'completion_length': 46.26785850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.2197265625, 'epoch': 0.99}
99%|█████████▊| 2465/2500 [19:02:05<12:12, 20.93s/it]
99%|█████████▊| 2466/2500 [19:02:28<12:13, 21.58s/it]
{'loss': 0.0061, 'grad_norm': 0.3325987327755173, 'learning_rate': 1.36e-08, 'completion_length': 48.14285850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1533203125, 'epoch': 0.99}
99%|█████████▊| 2466/2500 [19:02:28<12:13, 21.58s/it]
99%|█████████▊| 2467/2500 [19:02:49<11:49, 21.51s/it]
{'loss': 0.0053, 'grad_norm': 0.3981113783633376, 'learning_rate': 1.3199999999999999e-08, 'completion_length': 50.785715103149414, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1318359375, 'epoch': 0.99}
99%|█████████▊| 2467/2500 [19:02:49<11:49, 21.51s/it]
99%|█████████▊| 2468/2500 [19:03:10<11:17, 21.18s/it]
{'loss': 0.0089, 'grad_norm': 3.7806507001909955, 'learning_rate': 1.28e-08, 'completion_length': 51.39285850524902, 'rewards/accuracy_reward': 0.973214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9553572535514832, 'reward_std': 0.12626906484365463, 'kl': 0.22216796875, 'epoch': 0.99}
99%|█████████▊| 2468/2500 [19:03:10<11:17, 21.18s/it]
99%|█████████▉| 2469/2500 [19:03:31<10:56, 21.19s/it]
{'loss': 0.0075, 'grad_norm': 2.403390225299505, 'learning_rate': 1.2399999999999999e-08, 'completion_length': 52.20535850524902, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9196429252624512, 'reward_std': 0.07576143741607666, 'kl': 0.18798828125, 'epoch': 0.99}
99%|█████████▉| 2469/2500 [19:03:31<10:56, 21.19s/it]
99%|█████████▉| 2470/2500 [19:03:52<10:33, 21.10s/it]
{'loss': 0.0059, 'grad_norm': 0.45990415512356686, 'learning_rate': 1.2e-08, 'completion_length': 51.75893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.146484375, 'epoch': 0.99}
99%|█████████▉| 2470/2500 [19:03:52<10:33, 21.10s/it]
99%|█████████▉| 2471/2500 [19:04:12<10:05, 20.89s/it]
{'loss': 0.0098, 'grad_norm': 1.1625670720783343, 'learning_rate': 1.1599999999999998e-08, 'completion_length': 50.36607360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.24462890625, 'epoch': 0.99}
99%|█████████▉| 2471/2500 [19:04:12<10:05, 20.89s/it]
99%|█████████▉| 2472/2500 [19:04:32<09:31, 20.42s/it]
{'loss': 0.0056, 'grad_norm': 0.2593373890866697, 'learning_rate': 1.12e-08, 'completion_length': 48.10714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.140625, 'epoch': 0.99}
99%|█████████▉| 2472/2500 [19:04:32<09:31, 20.42s/it]
99%|█████████▉| 2473/2500 [19:04:54<09:25, 20.93s/it]
{'loss': 0.0094, 'grad_norm': 0.7239220015852043, 'learning_rate': 1.08e-08, 'completion_length': 47.98214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.23583984375, 'epoch': 0.99}
99%|█████████▉| 2473/2500 [19:04:54<09:25, 20.93s/it]
99%|█████████▉| 2474/2500 [19:05:13<08:50, 20.41s/it]
{'loss': 0.0053, 'grad_norm': 0.23783585111812985, 'learning_rate': 1.0399999999999999e-08, 'completion_length': 46.67857360839844, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.1328125, 'epoch': 0.99}
99%|█████████▉| 2474/2500 [19:05:13<08:50, 20.41s/it]
99%|█████████▉| 2475/2500 [19:05:33<08:28, 20.33s/it]
{'loss': 0.0107, 'grad_norm': 1.8126891369517912, 'learning_rate': 1e-08, 'completion_length': 46.58035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.267578125, 'epoch': 0.99}
99%|█████████▉| 2475/2500 [19:05:33<08:28, 20.33s/it]
99%|█████████▉| 2476/2500 [19:05:55<08:20, 20.84s/it]
{'loss': 0.0067, 'grad_norm': 0.8150158630147314, 'learning_rate': 9.599999999999998e-09, 'completion_length': 50.500003814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.16845703125, 'epoch': 0.99}
99%|█████████▉| 2476/2500 [19:05:55<08:20, 20.84s/it]
99%|█████████▉| 2477/2500 [19:06:15<07:54, 20.61s/it]
{'loss': 0.0064, 'grad_norm': 2.2440216299510998, 'learning_rate': 9.2e-09, 'completion_length': 55.65178680419922, 'rewards/accuracy_reward': 0.9464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.946428656578064, 'reward_std': 0.07124518603086472, 'kl': 0.16064453125, 'epoch': 0.99}
99%|█████████▉| 2477/2500 [19:06:15<07:54, 20.61s/it]
99%|█████████▉| 2478/2500 [19:06:36<07:33, 20.62s/it]
{'loss': 0.02, 'grad_norm': 1.4257770095338358, 'learning_rate': 8.8e-09, 'completion_length': 54.33035850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.5009765625, 'epoch': 0.99}
99%|█████████▉| 2478/2500 [19:06:36<07:33, 20.62s/it]
99%|█████████▉| 2479/2500 [19:07:01<07:42, 22.01s/it]
{'loss': 0.009, 'grad_norm': 1.428255652188218, 'learning_rate': 8.399999999999999e-09, 'completion_length': 48.99107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.224609375, 'epoch': 0.99}
99%|█████████▉| 2479/2500 [19:07:01<07:42, 22.01s/it]
99%|█████████▉| 2480/2500 [19:07:21<07:09, 21.49s/it]
{'loss': 0.0076, 'grad_norm': 0.8679192884254746, 'learning_rate': 8e-09, 'completion_length': 51.33035850524902, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.18896484375, 'epoch': 0.99}
99%|█████████▉| 2480/2500 [19:07:21<07:09, 21.49s/it]
99%|█████████▉| 2481/2500 [19:07:42<06:43, 21.26s/it]
{'loss': 0.0137, 'grad_norm': 1.0993533035753178, 'learning_rate': 7.6e-09, 'completion_length': 51.90178871154785, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.341552734375, 'epoch': 0.99}
99%|█████████▉| 2481/2500 [19:07:42<06:43, 21.26s/it]
99%|█████████▉| 2482/2500 [19:08:04<06:27, 21.51s/it]
{'loss': 0.0056, 'grad_norm': 0.3312093553938809, 'learning_rate': 7.199999999999999e-09, 'completion_length': 44.56250190734863, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.139404296875, 'epoch': 0.99}
99%|█████████▉| 2482/2500 [19:08:04<06:27, 21.51s/it]
99%|█████████▉| 2483/2500 [19:08:24<05:59, 21.12s/it]
{'loss': 0.0051, 'grad_norm': 0.4236430054360565, 'learning_rate': 6.8e-09, 'completion_length': 52.83928871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.126953125, 'epoch': 0.99}
99%|█████████▉| 2483/2500 [19:08:24<05:59, 21.12s/it]
99%|█████████▉| 2484/2500 [19:08:47<05:43, 21.45s/it]
{'loss': 0.0065, 'grad_norm': 0.4612747476993675, 'learning_rate': 6.4e-09, 'completion_length': 45.00893020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.162109375, 'epoch': 0.99}
99%|█████████▉| 2484/2500 [19:08:47<05:43, 21.45s/it]
99%|█████████▉| 2485/2500 [19:09:07<05:18, 21.26s/it]
{'loss': 0.006, 'grad_norm': 1.0439568671426949, 'learning_rate': 6e-09, 'completion_length': 49.48214530944824, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.9821429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.14990234375, 'epoch': 0.99}
99%|█████████▉| 2485/2500 [19:09:07<05:18, 21.26s/it]
99%|█████████▉| 2486/2500 [19:09:28<04:56, 21.16s/it]
{'loss': 0.0102, 'grad_norm': 2.757140323586178, 'learning_rate': 5.6e-09, 'completion_length': 47.41964530944824, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.25390625, 'epoch': 0.99}
99%|█████████▉| 2486/2500 [19:09:28<04:56, 21.16s/it]
99%|█████████▉| 2487/2500 [19:09:50<04:35, 21.20s/it]
{'loss': 0.0084, 'grad_norm': 0.6919471124707307, 'learning_rate': 5.1999999999999994e-09, 'completion_length': 50.687503814697266, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.21044921875, 'epoch': 0.99}
99%|█████████▉| 2487/2500 [19:09:50<04:35, 21.20s/it]
100%|█████████▉| 2488/2500 [19:10:10<04:12, 21.04s/it]
{'loss': 0.0261, 'grad_norm': 2.148020774401749, 'learning_rate': 4.799999999999999e-09, 'completion_length': 50.24107360839844, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.6552734375, 'epoch': 1.0}
100%|█████████▉| 2488/2500 [19:10:10<04:12, 21.04s/it]
100%|█████████▉| 2489/2500 [19:10:31<03:50, 20.95s/it]
{'loss': 0.0145, 'grad_norm': 2.600299458190913, 'learning_rate': 4.4e-09, 'completion_length': 49.80357360839844, 'rewards/accuracy_reward': 0.9821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9642857313156128, 'reward_std': 0.10101525485515594, 'kl': 0.36328125, 'epoch': 1.0}
100%|█████████▉| 2489/2500 [19:10:31<03:50, 20.95s/it]
100%|█████████▉| 2490/2500 [19:10:54<03:34, 21.47s/it]
{'loss': 0.0062, 'grad_norm': 2.88343589728451, 'learning_rate': 4e-09, 'completion_length': 55.53571701049805, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.15478515625, 'epoch': 1.0}
100%|█████████▉| 2490/2500 [19:10:54<03:34, 21.47s/it]
100%|█████████▉| 2491/2500 [19:11:14<03:09, 21.02s/it]
{'loss': 0.0052, 'grad_norm': 0.28811098951803543, 'learning_rate': 3.5999999999999996e-09, 'completion_length': 51.85714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.12939453125, 'epoch': 1.0}
100%|█████████▉| 2491/2500 [19:11:14<03:09, 21.02s/it]
100%|█████████▉| 2492/2500 [19:11:33<02:44, 20.58s/it]
{'loss': 0.0063, 'grad_norm': 0.308851790154275, 'learning_rate': 3.2e-09, 'completion_length': 46.19643020629883, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.15869140625, 'epoch': 1.0}
100%|█████████▉| 2492/2500 [19:11:33<02:44, 20.58s/it]
100%|█████████▉| 2493/2500 [19:11:55<02:26, 20.93s/it]
{'loss': 0.0109, 'grad_norm': 2.9780679142857043, 'learning_rate': 2.8e-09, 'completion_length': 56.21428680419922, 'rewards/accuracy_reward': 0.9553571939468384, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.9375001192092896, 'reward_std': 0.1379830539226532, 'kl': 0.27197265625, 'epoch': 1.0}
100%|█████████▉| 2493/2500 [19:11:55<02:26, 20.93s/it]
100%|█████████▉| 2494/2500 [19:12:16<02:06, 21.03s/it]
{'loss': 0.0054, 'grad_norm': 0.3790448654711865, 'learning_rate': 2.3999999999999996e-09, 'completion_length': 53.10714530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.13525390625, 'epoch': 1.0}
100%|█████████▉| 2494/2500 [19:12:16<02:06, 21.03s/it]
100%|█████████▉| 2495/2500 [19:12:37<01:44, 20.89s/it]
{'loss': 0.0072, 'grad_norm': 1.7769946673254289, 'learning_rate': 2e-09, 'completion_length': 48.91071701049805, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.179931640625, 'epoch': 1.0}
100%|█████████▉| 2495/2500 [19:12:37<01:44, 20.89s/it]
100%|█████████▉| 2496/2500 [19:12:58<01:23, 20.95s/it]
{'loss': 0.0059, 'grad_norm': 0.4149791577796093, 'learning_rate': 1.6e-09, 'completion_length': 53.77678871154785, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.146484375, 'epoch': 1.0}
100%|█████████▉| 2496/2500 [19:12:58<01:23, 20.95s/it]
100%|█████████▉| 2497/2500 [19:13:18<01:01, 20.61s/it]
{'loss': 0.0059, 'grad_norm': 0.6197850050582991, 'learning_rate': 1.1999999999999998e-09, 'completion_length': 49.66964530944824, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.146484375, 'epoch': 1.0}
100%|█████████▉| 2497/2500 [19:13:18<01:01, 20.61s/it]
100%|█████████▉| 2498/2500 [19:13:38<00:41, 20.65s/it]
{'loss': 0.0086, 'grad_norm': 1.976659269914273, 'learning_rate': 8e-10, 'completion_length': 50.45535850524902, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.9821429252624512, 'reward_std': 0.05050762742757797, 'kl': 0.21435546875, 'epoch': 1.0}
100%|█████████▉| 2498/2500 [19:13:39<00:41, 20.65s/it]
100%|█████████▉| 2499/2500 [19:14:01<00:21, 21.30s/it]
{'loss': 0.0082, 'grad_norm': 1.5033121469887576, 'learning_rate': 4e-10, 'completion_length': 52.58928680419922, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.20556640625, 'epoch': 1.0}
100%|█████████▉| 2499/2500 [19:14:01<00:21, 21.30s/it]
100%|██████████| 2500/2500 [19:14:22<00:00, 21.15s/it]
{'loss': 0.0069, 'grad_norm': 3.0085984436601034, 'learning_rate': 0.0, 'completion_length': 52.32143020629883, 'rewards/accuracy_reward': 0.9910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.9910714626312256, 'reward_std': 0.025253813713788986, 'kl': 0.173828125, 'epoch': 1.0}
100%|██████████| 2500/2500 [19:14:22<00:00, 21.15s/it]
{'train_runtime': 69307.1495, 'train_samples_per_second': 0.505, 'train_steps_per_second': 0.036, 'train_loss': 0.01127951154933045, 'epoch': 1.0}
100%|██████████| 2500/2500 [19:15:03<00:00, 21.15s/it]
100%|██████████| 2500/2500 [19:15:03<00:00, 27.72s/it]
wandb:
wandb: 🚀 View run VLLM-Correct-Qwen2-VL-2B-GRPO-ClevrMath-35k-2025-02-12-22-44-55 at: https://wandb.ai/tanhuajie264-peking-university/vison-open-r1/runs/sgrs8dxl
wandb: Find logs at: wandb/run-20250212_230132-sgrs8dxl/logs