
Figure 1: CLPO achieves significant improvements across mathematical reasoning benchmarks through guided self-evolution, outperforming state-of-the-art methods on Qwen3-8B.
To address this challenge, we propose CLPO (Curriculum-guided Learning for Policy Optimization), a novel framework that realizes a paradigm of Guided Self-Evolution (Silver et al., 2018). The core innovation of CLPO is that it elevates rollout information from a signal used merely for reward calculation into the central driver of a dynamic curriculum. It employs Online Curriculum Learning to assess problem difficulty in real time, which in turn guides an Adaptive Problem Restructuring mechanism that diversifies medium-difficulty problems and simplifies hard ones. Furthermore, we introduce Difficulty-aware Policy Optimization, which feeds the curriculum signal into the optimization process via dynamic KL regularization. Through these mechanisms, CLPO turns training into a structured, adaptive curriculum that co-evolves with the model's capabilities, all without any external guidance.
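The released implementation is not shown on this page, so the sketch below is only a minimal illustration of how one such guided iteration could be wired together. Every name in it (`clpo_iteration` and the injected `generate`, `verify`, `classify`, `rewrite`, `simplify`, and `update` callables) and the routing rules are assumptions for exposition, not CLPO's actual code.

```python
from typing import Callable, Dict, List

def clpo_iteration(
    problems: List[Dict],                        # each: {"question": str, "answer": str}
    generate: Callable[[str], List[str]],        # policy rollouts for one question
    verify: Callable[[str, str], bool],          # reward check: rollout vs. ground truth
    classify: Callable[[float], str],            # pass rate -> "easy" | "medium" | "hard"
    rewrite: Callable[[Dict], Dict],             # diversify: variation with the same answer
    simplify: Callable[[Dict], Dict],            # easier version of a hard problem
    update: Callable[[List[Dict]], None],        # one difficulty-aware optimization step
) -> None:
    """One illustrative CLPO-style iteration: rollouts drive both rewards and the curriculum."""
    curriculum: List[Dict] = []
    for prob in problems:
        rollouts = generate(prob["question"])
        # The same rollouts used for reward computation double as the online
        # difficulty estimate: the fraction that reaches the verified answer.
        pass_rate = sum(verify(r, prob["answer"]) for r in rollouts) / max(len(rollouts), 1)
        prob["difficulty"] = 1.0 - pass_rate

        bucket = classify(pass_rate)
        if bucket == "hard":
            curriculum.append(simplify(prob))         # simplify hard problems
        elif bucket == "medium":
            curriculum.extend([prob, rewrite(prob)])  # diversify medium problems
        # "easy" problems are dropped this round: they carry little gradient signal.

    update(curriculum)  # the stored "difficulty" can modulate KL strength per problem
```

Passing the rollout, verification, and rewriting steps in as callables keeps the curriculum logic independent of any particular inference or RL backend.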
News
Method Overview

Figure 2: The CLPO workflow showing the guided self-evolution process with problem rewriting, difficulty assessment, and adaptive training.
Problem Rewriting
Automatically generates problem variations while preserving semantic meaning and ground truth answers.
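As a rough illustration of what such a rewriting step might enforce, the prompt wording, the `make_variation` helper, and the sanity checks below are assumptions, not the prompt or filters actually used by CLPO.

```python
from typing import Callable, Dict, Optional

# Hypothetical rewriting instruction; the paper's actual prompt is not reproduced here.
REWRITE_PROMPT = (
    "Rewrite the following problem so that its wording and surface details change, "
    "while the underlying math and the final answer remain exactly the same.\n\n"
    "Problem: {question}\n"
)

def make_variation(question: str, answer: str,
                   generate: Callable[[str], str]) -> Optional[Dict]:
    """Request a rewrite and apply simple sanity checks before accepting it."""
    rewritten = generate(REWRITE_PROMPT.format(question=question)).strip()
    # Reject degenerate outputs: empty text, a verbatim copy, or a rewrite that
    # leaks the ground-truth answer into the problem statement.
    if not rewritten or rewritten == question or answer in rewritten:
        return None
    # The ground-truth answer is carried over unchanged by construction.
    return {"question": rewritten, "answer": answer}
```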
Difficulty Assessment
Classifies problems into hard, medium, and easy categories based on model performance.
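A pass-rate-based classifier of this kind could look like the sketch below; the 0.8/0.2 thresholds and the `classify_difficulty` name are illustrative assumptions rather than the paper's exact values.

```python
def classify_difficulty(pass_rate: float,
                        easy_threshold: float = 0.8,
                        hard_threshold: float = 0.2) -> str:
    """Bucket a problem by the fraction of rollouts that reached the verified answer.

    Thresholds are illustrative, not the paper's exact values.
    """
    if pass_rate >= easy_threshold:
        return "easy"    # solved reliably: little remaining learning signal
    if pass_rate <= hard_threshold:
        return "hard"    # rarely solved: candidate for simplification
    return "medium"      # partially solved: candidate for diversification

print(classify_difficulty(1 / 8))   # -> "hard" (1 of 8 rollouts correct)
print(classify_difficulty(5 / 8))   # -> "medium"
```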
Adaptive Training
Focuses learning on appropriately challenging problems for optimal skill development.
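One plausible way to connect this to the dynamic KL regularization mentioned above is to let the per-problem difficulty score set the KL-penalty weight. The linear schedule, the bounds, and the direction of the mapping below are assumptions for illustration, not CLPO's exact formula.

```python
def kl_coefficient(difficulty: float,
                   beta_easy: float = 0.05,
                   beta_hard: float = 0.005) -> float:
    """Map a difficulty score in [0, 1] to a KL-penalty weight.

    Assumed intuition: easy problems keep the policy close to its reference
    (larger beta), hard problems allow more exploration (smaller beta).
    """
    d = min(max(difficulty, 0.0), 1.0)
    return beta_easy + d * (beta_hard - beta_easy)

# The resulting beta would scale the KL term in a GRPO-style objective, e.g.
#   loss = -advantage_term + kl_coefficient(difficulty) * kl_to_reference
print(round(kl_coefficient(0.1), 4))   # near-easy problem -> 0.0455
print(round(kl_coefficient(0.9), 4))   # near-hard problem -> 0.0095
```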
Main Results
Performance comparison on Qwen3-8B across mathematical and general reasoning benchmarks. CLPO achieves the best average score and leads on most individual benchmarks without relying on external guidance.
| Method | Optimization Policy | MATH 500 | Minerva Math | Olympiad Bench | AMC23 | AIME24 | MMLU Pro | Theorem QA | GPQA Diamond | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised Fine-Tuning (SFT) | | | | | | | | | | |
| RAFT | Ranking-Based Imitation | 76.20 | 35.58 | 36.86 | 50.00 | 26.67 | 65.93 | 43.50 | 36.76 | 46.44 |
| Refinement FT | Guided Refinement | 83.20 | 47.58 | 40.71 | 70.00 | 33.33 | 67.84 | 41.29 | 34.47 | 52.30 |
| Critique FT | Learning to Critique | 79.00 | 35.23 | 39.64 | 67.50 | 33.33 | 63.16 | 46.00 | 34.84 | 49.84 |
| CITL-FT | Mixed-Data SFT | 76.40 | 37.20 | 38.57 | 62.50 | 30.00 | 66.13 | 44.25 | 36.36 | 48.93 |
| Reinforcement Learning with Verifiable Rewards (RLVR) | | | | | | | | | | |
| GRPO | Group-Based RL | 89.20 | 51.47 | 57.40 | 82.50 | 43.33 | 69.86 | 54.75 | 47.80 | 62.04 |
| DAPO | Dynamic Sampling | 91.20 | 53.31 | 63.80 | 87.50 | 46.67 | 70.01 | 55.00 | 48.48 | 64.50 |
| LUFFY | Off-Policy Imitation | 89.40 | 52.94 | 58.80 | 85.00 | 40.00 | 70.34 | 58.25 | 49.49 | 63.03 |
| Critique-GRPO (Simple) | Critique-Driven RL | 89.40 | 52.57 | 60.20 | 87.50 | 40.00 | 70.13 | 59.00 | 48.63 | 63.43 |
| Critique-GRPO (CoT) | Critique-Driven RL | 91.20 | 61.50 | 63.80 | 90.00 | 46.67 | 70.98 | 59.50 | 50.50 | 66.77 |
| CLPO (Ours) | Guided Self-Evolution | 89.60 | 76.10 | 77.50 | 90.00 | 50.00 | 72.39 | 71.63 | 62.63 | 73.73 |
+9.23 points
Average improvement over DAPO
+22.79 points
Improvement over DAPO on Minerva Math
+13.70 points
Improvement over DAPO on Olympiad Bench
Quick Start
Setup Environment
Training CLPO
Key Configuration
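The repository's actual configuration files are not reproduced on this page; purely to indicate the kinds of knobs the method introduces (rollout group size, difficulty thresholds, KL range), a hypothetical configuration might look like the following, with all key names and values being assumptions.

```python
# Hypothetical CLPO configuration sketch; key names and values are illustrative
# assumptions, not the repository's actual settings.
clpo_config = {
    "model": "Qwen3-8B",               # base model evaluated in the paper
    "rollouts_per_problem": 8,         # group size used for rewards and pass rates
    "difficulty": {
        "easy_threshold": 0.8,         # pass rate at or above which a problem is "easy"
        "hard_threshold": 0.2,         # pass rate at or below which a problem is "hard"
    },
    "restructuring": {
        "rewrites_per_medium": 1,      # variations added per medium-difficulty problem
        "simplify_hard": True,         # replace hard problems with easier versions
    },
    "kl": {
        "beta_easy": 0.05,             # KL weight toward the easy end of the range
        "beta_hard": 0.005,            # KL weight toward the hard end of the range
    },
}
```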
Citation
If you find this work helpful, please consider citing our paper: