May I ask whether caijanfeng/Qwen-2.5-7B-Simple-RL was trained purely with reinforcement learning (RL), or with a combination of RL and supervised fine-tuning (SFT)? Also, what dataset was used for the RL training? Thank you very much!