MR_midtrain_9B_v3

v3 meta-reasoning SFT of Qwen3.5-9B (checkpoint global_step_4500). A single model runs the full meta-reasoning loop — MR (propose exploration directions) → E (execute each direction, emit a summary) → FA (write the final answer) — using custom <direction> / <summary> special tokens (vocab 248320).

Architecture: Qwen3_5ForConditionalGeneration — the SFT'd text weights hosted in the base conditional-generation arch (text weights live under text_config; the vision tower is carried but unused/frozen for this text-only model). This is the form that loads directly in both vLLM serving and verl Megatron (mbridge) RL. The original text-only Qwen3_5ForCausalLM export remains recoverable from this repo's git history (commit 072c7a3).
Base: Qwen/Qwen3.5-9B. Scaffold: inference.mrv3 (single model; per-layer fan-out; termination = an MR step proposes no further directions → separate FA step).

Evaluation (GPT-5.4 rubric judge unless noted)

bench	metric	score
SODA2026	mean	0.478
IMO ProofBench	pass@1 / best@3	0.547 / 0.678 (official Gemini-3.1-Pro judge: 0.490)
physics_papers	pass@1	0.679

v3 SFT lands at/above the best v2 RL checkpoint on all three benches before any v3 RL.

Serving / training note

This repo hosts the conditional-generation arch (Qwen3_5ForConditionalGeneration) directly: the text-only Qwen3_5ForCausalLM export does not load in vLLM 0.19.1's hybrid KV-cache unification, and verl's Megatron/mbridge path does not recognize the text-only qwen3_5_text model type — both want this condgen form. Use temperature 1.0 (0.6 degenerates the E step into repetition loops).

Downloads last month: 16

Safetensors

Model size

10B params

Tensor type

BF16

Model tree for HerrHruby/MR_midtrain_9B_v3

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B

Finetuned

(446)

this model