
Figure 1: CLPO achieves significant improvements across mathematical reasoning benchmarks through guided self-evolution, outperforming state-of-the-art methods on Qwen3-8B.
To address this challenge, we propose CLPO (Curriculum-guided Learning for Policy Optimization), a novel framework that realizes a paradigm of Guided Self-Evolution (Silver et al., 2018). The core innovation of CLPO is that it elevates rollout information from a signal used merely for reward calculation into the central driver of a dynamic curriculum. It employs Online Curriculum Learning to assess problem difficulty in real time, which in turn guides an Adaptive Problem Restructuring mechanism that diversifies medium-difficulty problems and simplifies hard ones. Furthermore, we introduce Difficulty-aware Policy Optimization, which feeds the curriculum signal into the optimization process via dynamic KL regularization. Through these mechanisms, CLPO turns training into a structured, adaptive curriculum that co-evolves with the model's capabilities, all without any external guidance.
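The released implementation is not shown on this page, so the sketch below is only a minimal illustration of how one such guided iteration could be wired together. Every name in it (`clpo_iteration` and the injected `generate`, `verify`, `classify`, `rewrite`, `simplify`, and `update` callables) and the routing rules are assumptions for exposition, not CLPO's actual code.

```python
from typing import Callable, Dict, List

def clpo_iteration(
    problems: List[Dict],                        # each: {"question": str, "answer": str}
    generate: Callable[[str], List[str]],        # policy rollouts for one question
    verify: Callable[[str, str], bool],          # reward check: rollout vs. ground truth
    classify: Callable[[float], str],            # pass rate -> "easy" | "medium" | "hard"
    rewrite: Callable[[Dict], Dict],             # diversify: variation with the same answer
    simplify: Callable[[Dict], Dict],            # easier version of a hard problem
    update: Callable[[List[Dict]], None],        # one difficulty-aware optimization step
) -> None:
    """One illustrative CLPO-style iteration: rollouts drive both rewards and the curriculum."""
    curriculum: List[Dict] = []
    for prob in problems:
        rollouts = generate(prob["question"])
        # The same rollouts used for reward computation double as the online
        # difficulty estimate: the fraction that reaches the verified answer.
        pass_rate = sum(verify(r, prob["answer"]) for r in rollouts) / max(len(rollouts), 1)
        prob["difficulty"] = 1.0 - pass_rate

        bucket = classify(pass_rate)
        if bucket == "hard":
            curriculum.append(simplify(prob))         # simplify hard problems
        elif bucket == "medium":
            curriculum.extend([prob, rewrite(prob)])  # diversify medium problems
        # "easy" problems are dropped this round: they carry little gradient signal.

    update(curriculum)  # the stored "difficulty" can modulate KL strength per problem
```

Passing the rollout, verification, and rewriting steps in as callables keeps the curriculum logic independent of any particular inference or RL backend.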
News
Method Overview

Figure 2: The CLPO workflow showing the guided self-evolution process with problem rewriting, difficulty assessment, and adaptive training.
Problem Rewriting
Automatically generates problem variations while preserving semantic meaning and ground truth answers.
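As a rough illustration of what such a rewriting step might enforce, the prompt wording, the `make_variation` helper, and the sanity checks below are assumptions, not the prompt or filters actually used by CLPO.

```python
from typing import Callable, Dict, Optional

# Hypothetical rewriting instruction; the paper's actual prompt is not reproduced here.
REWRITE_PROMPT = (
    "Rewrite the following problem so that its wording and surface details change, "
    "while the underlying math and the final answer remain exactly the same.\n\n"
    "Problem: {question}\n"
)

def make_variation(question: str, answer: str,
                   generate: Callable[[str], str]) -> Optional[Dict]:
    """Request a rewrite and apply simple sanity checks before accepting it."""
    rewritten = generate(REWRITE_PROMPT.format(question=question)).strip()
    # Reject degenerate outputs: empty text, a verbatim copy, or a rewrite that
    # leaks the ground-truth answer into the problem statement.
    if not rewritten or rewritten == question or answer in rewritten:
        return None
    # The ground-truth answer is carried over unchanged by construction.
    return {"question": rewritten, "answer": answer}
```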
Difficulty Assessment
Classifies problems into hard, medium, and easy categories based on model performance.
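A pass-rate-based classifier of this kind could look like the sketch below; the 0.8/0.2 thresholds and the `classify_difficulty` name are illustrative assumptions rather than the paper's exact values.

```python
def classify_difficulty(pass_rate: float,
                        easy_threshold: float = 0.8,
                        hard_threshold: float = 0.2) -> str:
    """Bucket a problem by the fraction of rollouts that reached the verified answer.

    Thresholds are illustrative, not the paper's exact values.
    """
    if pass_rate >= easy_threshold:
        return "easy"    # solved reliably: little remaining learning signal
    if pass_rate <= hard_threshold:
        return "hard"    # rarely solved: candidate for simplification
    return "medium"      # partially solved: candidate for diversification

print(classify_difficulty(1 / 8))   # -> "hard" (1 of 8 rollouts correct)
print(classify_difficulty(5 / 8))   # -> "medium"
```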
Adaptive Training
Focuses learning on appropriately challenging problems for optimal skill development.
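One plausible way to connect this to the dynamic KL regularization mentioned above is to let the per-problem difficulty score set the KL-penalty weight. The linear schedule, the bounds, and the direction of the mapping below are assumptions for illustration, not CLPO's exact formula.

```python
def kl_coefficient(difficulty: float,
                   beta_easy: float = 0.05,
                   beta_hard: float = 0.005) -> float:
    """Map a difficulty score in [0, 1] to a KL-penalty weight.

    Assumed intuition: easy problems keep the policy close to its reference
    (larger beta), hard problems allow more exploration (smaller beta).
    """
    d = min(max(difficulty, 0.0), 1.0)
    return beta_easy + d * (beta_hard - beta_easy)

# The resulting beta would scale the KL term in a GRPO-style objective, e.g.
#   loss = -advantage_term + kl_coefficient(difficulty) * kl_to_reference
print(round(kl_coefficient(0.1), 4))   # near-easy problem -> 0.0455
print(round(kl_coefficient(0.9), 4))   # near-hard problem -> 0.0095
```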
Main Results
Performance comparison on Qwen3-8B across mathematical and general reasoning benchmarks. CLPO achieves the best average score and leads on most individual benchmarks without relying on external guidance.
| Method | Optimization Policy | MATH 500 | Minerva Math | Olympiad Bench | AMC23 | AIME24 | MMLU Pro | Theorem QA | GPQA Diamond | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised Fine-Tuning (SFT) | | | | | | | | | | |
| RAFT | Ranking-Based Imitation | 76.20 | 35.58 | 36.86 | 50.00 | 26.67 | 65.93 | 43.50 | 36.76 | 46.44 |
| Refinement FT | Guided Refinement | 83.20 | 47.58 | 40.71 | 70.00 | 33.33 | 67.84 | 41.29 | 34.47 | 52.30 |
| Critique FT | Learning to Critique | 79.00 | 35.23 | 39.64 | 67.50 | 33.33 | 63.16 | 46.00 | 34.84 | 49.84 |
| CITL-FT | Mixed-Data SFT | 76.40 | 37.20 | 38.57 | 62.50 | 30.00 | 66.13 | 44.25 | 36.36 | 48.93 |
| Reinforcement Learning with Verifiable Rewards (RLVR) | | | | | | | | | | |
| GRPO | Group-Based RL | 89.20 | 51.47 | 57.40 | 82.50 | 43.33 | 69.86 | 54.75 | 47.80 | 62.04 |
| DAPO | Dynamic Sampling | 91.20 | 53.31 | 63.80 | 87.50 | 46.67 | 70.01 | 55.00 | 48.48 | 64.50 |
| LUFFY | Off-Policy Imitation | 89.40 | 52.94 | 58.80 | 85.00 | 40.00 | 70.34 | 58.25 | 49.49 | 63.03 |
| Critique-GRPO (Simple) | Critique-Driven RL | 89.40 | 52.57 | 60.20 | 87.50 | 40.00 | 70.13 | 59.00 | 48.63 | 63.43 |
| Critique-GRPO (CoT) | Critique-Driven RL | 91.20 | 61.50 | 63.80 | 90.00 | 46.67 | 70.98 | 59.50 | 50.50 | 66.77 |
| CLPO (Ours) | Guided Self-Evolution | 89.60 | 76.10 | 77.50 | 90.00 | 50.00 | 72.39 | 71.63 | 62.63 | 73.73 |
+9.23 points
Average improvement over DAPO
+22.79 points
Improvement over DAPO on Minerva Math
+13.70 points
Improvement over DAPO on Olympiad Bench
Quick Start
Setup Environment
Training CLPO
Key Configuration
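The repository's actual configuration files are not reproduced on this page; purely to indicate the kinds of knobs the method introduces (rollout group size, difficulty thresholds, KL range), a hypothetical configuration might look like the following, with all key names and values being assumptions.

```python
# Hypothetical CLPO configuration sketch; key names and values are illustrative
# assumptions, not the repository's actual settings.
clpo_config = {
    "model": "Qwen3-8B",               # base model evaluated in the paper
    "rollouts_per_problem": 8,         # group size used for rewards and pass rates
    "difficulty": {
        "easy_threshold": 0.8,         # pass rate at or above which a problem is "easy"
        "hard_threshold": 0.2,         # pass rate at or below which a problem is "hard"
    },
    "restructuring": {
        "rewrites_per_medium": 1,      # variations added per medium-difficulty problem
        "simplify_hard": True,         # replace hard problems with easier versions
    },
    "kl": {
        "beta_easy": 0.05,             # KL weight toward the easy end of the range
        "beta_hard": 0.005,            # KL weight toward the hard end of the range
    },
}
```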
Citation
If you find this work helpful, please consider citing our paper: