MR_midtrain_9B_v3
v3 meta-reasoning SFT of Qwen3.5-9B (checkpoint global_step_4500). A single
model runs the full meta-reasoning loop — MR (propose exploration directions) →
E (execute each direction, emit a summary) → FA (write the final answer) —
using custom <direction> / <summary> special tokens (vocab 248320).
- Architecture:
Qwen3_5ForConditionalGeneration— the SFT'd text weights hosted in the base conditional-generation arch (text weights live undertext_config; the vision tower is carried but unused/frozen for this text-only model). This is the form that loads directly in both vLLM serving and verl Megatron (mbridge) RL. The original text-onlyQwen3_5ForCausalLMexport remains recoverable from this repo's git history (commit072c7a3). - Base:
Qwen/Qwen3.5-9B. Scaffold:inference.mrv3(single model; per-layer fan-out; termination = an MR step proposes no further directions → separate FA step).
Evaluation (GPT-5.4 rubric judge unless noted)
| bench | metric | score |
|---|---|---|
| SODA2026 | mean | 0.478 |
| IMO ProofBench | pass@1 / best@3 | 0.547 / 0.678 (official Gemini-3.1-Pro judge: 0.490) |
| physics_papers | pass@1 | 0.679 |
v3 SFT lands at/above the best v2 RL checkpoint on all three benches before any v3 RL.
Serving / training note
This repo hosts the conditional-generation arch (Qwen3_5ForConditionalGeneration)
directly: the text-only Qwen3_5ForCausalLM export does not load in vLLM 0.19.1's hybrid
KV-cache unification, and verl's Megatron/mbridge path does not recognize the text-only
qwen3_5_text model type — both want this condgen form. Use temperature 1.0 (0.6
degenerates the E step into repetition loops).
- Downloads last month
- 16