arch2_4_combined — Pipeline-Native MoE for CPU Inference

A custom decoder-only transformer with delayed dense FFN + delayed MoE experts, designed so its inter-layer dependency graph permits vertical pipelining on CPU. Part of the cflow project — a CPU-first streaming inference engine written in Rust.

Hosted weights: this repository hosts model.cflow (17.39 GB) — the arch2_4_8k_16l model: 16 layers, hidden 8192, ~31B parameters (top-2-of-8 MoE, ~20B active/token), Q4. This is the model benchmarked at 5.94 tok/s below. The 8.34B figures in this card refer to a smaller 4-layer scale point (arch2_4_8k_4l) used for quality and cache-locality validation (val ppl 4.52); that checkpoint is not hosted here.

Key Results

Metric	Value
CPU decode throughput (~31B / 16-layer, Q4, 32 threads)	5.94 tok/s
Effective memory bandwidth	61 GB/s (30% of 204.8 GB/s peak)
Bandwidth reduction from pipelining	2.00x (9.00 → 4.50 MB/token)
Test perplexity (114M, TinyStories, 10K steps)	6.50
Val perplexity (8.34B / 4-layer, TinyStories, 10K steps)	4.52
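As a quick sanity check on the table above, the bandwidth fraction and the implied memory traffic per token follow directly from the reported figures. The sketch below assumes that all of the effective bandwidth goes to data that is read once per decoded token. The function names are illustrative and are not part of cflow.

```python
def peak_fraction(effective_gbps: float, peak_gbps: float) -> float:
    """Share of theoretical memory bandwidth that is actually sustained."""
    return effective_gbps / peak_gbps


def implied_gb_per_token(effective_gbps: float, tok_per_s: float) -> float:
    """GB moved per decoded token if bandwidth is the only bottleneck (assumption)."""
    return effective_gbps / tok_per_s


print(f"{peak_fraction(61.0, 204.8):.1%}")           # 29.8%
print(f"{implied_gb_per_token(61.0, 5.94):.2f} GB")  # 10.27 GB
```

The 204.8 GB/s peak equals what eight DDR4-3200 channels would provide (8 × 25.6 GB/s). That is an inference from the number, not something the card states.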

CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)

Engine	Model	Quant	tok/s
cflow	arch2_4_8k_16l (~31B MoE, ~20B active)	Q4	5.94
Ollama (llama.cpp)	Qwen2.5-32B (32B dense)	Q4 GGUF	4.75
vLLM CPU	Qwen2.5-32B-Instruct (32B dense)	GPTQ-Int4	1.65

Note: cflow and the baselines run different models. cflow's ~31B MoE activates ~20B parameters per token, whereas both baselines are 32B dense. Total parameter counts are comparable (31B vs 32B), but the architectures and training differ, so the cflow number shows what a co-designed architecture and streaming runtime can achieve; it is not a quality-matched comparison.

Model Description

arch2_4_combined is a pre-norm decoder-only transformer with a parallel dense FFN + sparse MoE block per layer, using delayed residual injection:

The dense FFN reads from a delayed residual (1 layer behind)
The MoE experts are routed on the current residual but injected 2 layers later
This creates a dependency DAG where dense and expert weight reads for layer N can overlap with compute for layer N-1, reducing critical-path memory bandwidth

The architecture was selected from a screen of 5 pipeline-native candidates. It is the only candidate with a measured bandwidth reduction (2.00x), and its test perplexity (6.50) is the second-best of the five.

Architecture Details

Parameter	114M (screening)	~31B (16-layer, hosted)
Hidden dim	512	8,192
Layers	6	16
Attention heads	8	128
Head dim	64	64
Dense FFN hidden	2,048	32,768
Expert FFN hidden	512	4,096
Experts / top-k	8 / 2	8 / 2
Dense delay	1	1
Expert delay	2	2
Vocab	50,257 (GPT-2 BPE)	50,257 (GPT-2 BPE)
Max seq len	512	2,048
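The headline parameter counts can be roughly reproduced from the table. The sketch below is a back-of-the-envelope estimate under stated assumptions: the embedding and lm_head are untied, norm weights are ignored, each FFN has three matrices (GeGLU gate, up, down), and "active" counts the lm_head but not the embedding lookup. `param_count` is a hypothetical helper, not part of cflow.

```python
def param_count(hidden, layers, heads, head_dim, dense_ffn, expert_ffn,
                n_experts=8, top_k=2, vocab=50_257):
    """Return (total, active-per-token) parameter estimates."""
    attn = 4 * hidden * heads * head_dim   # Q, K, V, O projections, no bias
    dense = 3 * hidden * dense_ffn         # GeGLU: gate, up, down
    expert = 3 * hidden * expert_ffn       # one GeGLU expert
    router = hidden * n_experts            # linear router
    embed = vocab * hidden                 # token embedding (lm_head assumed same size)
    total = layers * (attn + dense + n_experts * expert + router) + 2 * embed
    active = layers * (attn + dense + top_k * expert + router) + embed
    return total, active


configs = {
    "114M screening": (512, 6, 8, 64, 2_048, 512),
    "arch2_4_8k_4l": (8_192, 4, 128, 64, 32_768, 4_096),
    "arch2_4_8k_16l": (8_192, 16, 128, 64, 32_768, 4_096),
}
for name, cfg in configs.items():
    total, active = param_count(*cfg)
    print(f"{name}: total {total / 1e9:.2f}B, active {active / 1e9:.2f}B")
```

Under these assumptions the estimate comes to about 114.4M, 8.34B, and 30.89B total parameters, matching the figures quoted in this card. The 16-layer active count comes out near 20.8B, in line with the stated ~20B.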

Per-Layer Forward Pass

attn_out = attention(attn_norm(x))
x = x + attn_out                               # residual connection
x = x + dense_ffn(ffn_norm(delayed_x))         # dense reads the DELAYED residual (dense_delay layers behind)
if queued_expert is not None:
    x = x + queued_expert                      # inject the expert output queued expert_delay layers ago
expert_out = moe(ffn_norm(x))                  # router sees the CURRENT residual
# expert_out is queued for injection at layer + expert_delay
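To make the schedule across layers concrete, here is a toy sketch of the loop that the per-layer pseudocode implies. It makes several assumptions the card does not confirm: the delayed residual is the snapshot taken at the entry of layer i - dense_delay, the first layers fall back to the earliest snapshot, norms are folded into the callables, and expert outputs still queued after the last layer are returned to the caller, not injected. `delayed_forward` and the layer dictionaries are hypothetical names.

```python
def delayed_forward(x, layers, dense_delay=1, expert_delay=2):
    """Run a stack of layers with delayed dense reads and delayed expert injection.

    Each layer is a dict of callables: 'attn', 'dense', 'moe'.
    Assumes expert_delay >= 1 (the arch2_4 setting is 2).
    """
    snapshots = []  # residual at the entry of each layer
    queued = {}     # target layer index -> expert output waiting to be injected
    for i, layer in enumerate(layers):
        snapshots.append(x)
        # Assumption: clamp to the first snapshot when there is no earlier layer.
        delayed_x = snapshots[max(0, i - dense_delay)]
        x = x + layer["attn"](x)          # attention on the current residual
        x = x + layer["dense"](delayed_x) # dense FFN on the delayed residual
        if i in queued:
            x = x + queued.pop(i)         # expert output computed expert_delay layers ago
        queued[i + expert_delay] = layer["moe"](x)  # routed on the current residual
    return x, queued


# Toy scalar "layers" so the data flow is easy to trace by hand.
toy = {"attn": lambda v: 0, "dense": lambda v: v, "moe": lambda v: 100}
out, leftover = delayed_forward(1, [toy, toy, toy])
print(out, leftover)  # 105 {3: 100, 4: 100}
```

In this trace the dense FFN at layer 2 consumes the residual from the entry of layer 1, and the first expert output (from layer 0) only reaches the residual at layer 2. That is the slack that lets weight reads overlap with the previous layer's compute.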

Components

Attention: Multi-head (not GQA), Q/K/V/O projections (no bias), standard RoPE (base=10000, half-interleave), causal masking, KV cache
Dense FFN: GeGLU — down(gelu(gate(x)) * up(x))
MoE: Linear router → top-k selection → softmax over selected → per-expert GeGLU FFN → weighted sum. No auxiliary/load-balancing loss.
Normalization: RMSNorm (eps=1e-6) at attn input, FFN input, and pre-lm_head
Combine style: DelayedSum — dense and router share ffn_norm but read different residual snapshots
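The MoE step described above (top-k selection, then softmax over only the selected logits, then a weighted sum) can be sketched in a few lines. This is a minimal scalar illustration, not the cflow or PyTorch reference code. The router logits are taken as given, and `route_top_k` and `moe_output` are hypothetical names.

```python
import math


def route_top_k(logits, k=2):
    """Pick the k largest router logits and softmax over just those."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    m = max(logits[e] for e in top)                # subtract max for numerical stability
    exps = [math.exp(logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def moe_output(x, logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts are never run."""
    return sum(w * experts[e](x) for e, w in route_top_k(logits, k))


logits = [0.0, 1.0, 3.0, 2.0, -1.0, 0.5, -2.0, 0.2]   # 8 experts, as in the card
experts = [lambda x, s=s: s * x for s in range(8)]     # toy experts: scale by index
print(route_top_k(logits))        # experts 2 and 3, weights summing to 1
print(moe_output(1.0, logits, experts))
```

Because the softmax runs after selection, the two gate weights always sum to 1 regardless of how much probability mass the other six experts would have received.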

Training

114M Screening (5 architectures)


Setting	Value
Dataset	TinyStories (431M train tokens, 24M test tokens)
Tokenizer	GPT-2 BPE (50,257 vocab)
Sequence length	512
Optimizer	AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=0.1)
Learning rate	3e-4 with linear warmup (200 steps) + cosine decay to 1e-5
Gradient clipping	Global norm 1.0
Batch size	8
Steps	10,000
Precision	float32
Hardware	RTX 3060 12 GB
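The learning-rate schedule in the table (linear warmup, then cosine decay to a floor) can be written as a small function. This is a sketch under assumptions: warmup is taken to start from zero, and the cosine phase is taken to span the remaining steps up to 10,000. The card does not specify either detail, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, peak=3e-4, floor=1e-5, warmup=200, total=10_000):
    """Learning rate at a given optimizer step."""
    if step < warmup:
        return peak * step / warmup  # linear ramp from 0 (assumed starting point)
    progress = (step - warmup) / max(1, total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


for s in (0, 100, 200, 5_100, 10_000):
    print(s, f"{lr_at(s):.2e}")
```

The 8.34B run's schedule would correspond to `lr_at(step, peak=1e-4, floor=1e-6, warmup=500)` under the same assumptions.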

8.34B Scale-Up (4-layer — quality & cache validation)

This is the smaller scale point: arch2_4_8k_4l, 4 layers, 8.34B params. It provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU cache-locality result. The hosted decode-benchmark model (arch2_4_8k_16l, ~31B) shares this per-layer geometry but has 16 layers.


Setting	Value
Dataset	TinyStories (same splits)
Optimizer	8-bit AdamW (bitsandbytes)
Learning rate	1e-4 with linear warmup (500 steps) + cosine decay to 1e-6
Batch size	4 per GPU (global 32)
Steps	10,000
Precision	bf16
Parallelism	FSDP (FULL_SHARD / ZeRO-3)
Gradient checkpointing	Per `DelayedMoELayer`, non-reentrant
Hardware	8x A100 SXM4 80 GB (Lambda Cloud)

Architecture Comparison (114M, TinyStories, 10K steps)

Architecture	dense_delay	expert_delay	Test PPL	Top-1 Acc	BW Reduction
arch1_decoupled_streams	0	0	7.21	54.9%	1.00x
arch2_4_combined	1	2	6.50	56.8%	2.00x
arch3_pipeline_registers	0	0	7.24	55.1%	1.00x
arch4_async_experts	0	2	6.26	57.6%	1.00x
arch5_fixed_point	0	0	6.77	56.2%	1.00x

Key insight: Dense delay is the bandwidth knob; expert delay is the quality knob. arch4_async_experts gets the best perplexity by routing off pre-dense activations (cleaner router signal) but sacrifices the bandwidth win that arch2_4 achieves by also delaying the dense read.

Inference with cflow

cflow is a Rust inference engine that reads .cflow (per-layer streaming) or .vflow (vertical pipeline) weight files. Weights are stored as pre-tiled Q4 (128x256 tiles, ~18 KB each, sized to fit L2 cache).
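The ~18 KB tile size is consistent with a common Q4 layout, which the short calculation below illustrates. It assumes 4-bit weights packed two per byte plus one 2-byte scale per block of 32 weights. The card does not document the actual block size or scale format, so treat these as guesses. `q4_tile_bytes` is a hypothetical helper.

```python
def q4_tile_bytes(rows=128, cols=256, block=32, scale_bytes=2):
    """Bytes for one Q4 tile under the assumed block-scaled layout."""
    weights = rows * cols
    packed = weights // 2                      # two 4-bit weights per byte
    scales = (weights // block) * scale_bytes  # one scale per block (assumed fp16)
    return packed + scales


size = q4_tile_bytes()
print(size, size / 1024)  # 18432 18.0
```

That layout works out to 4.5 bits per weight. Applied to the ~31B parameters of the hosted model it gives roughly 17.4 GB, close to the 17.39 GB file size quoted in this card.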

# Build
cargo build --release --bin cflow-run

# Convert safetensors → .cflow
cargo run --release --bin cflow-convert -- \
  --input checkpoint.safetensors \
  --output model.cflow \
  --model arch2_4

# Run inference
CFLOW_THREADS=32 ./target/release/cflow-run \
  model.cflow 32 \
  --prompt "Once upon a time" \
  --tokenizer tokenizer.json \
  --temperature 0.8

SIMD Support

The runtime auto-detects and dispatches to the best available instruction set:

ISA	Kernel	Notes
AVX-512 + VNNI	Q4×Q8 `vpdpbusd`	Best path (Ice Lake+)
AVX-512F	Q4×f32 FMA	Skylake-X+
AVX2 + FMA	Q4×f32 FMA	Haswell+
AVX + SSE4.1	Q4×f32	Sandy Bridge+
Scalar	Q4×f32	Fallback
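The priority order in the table can be expressed as a simple cascade. The runtime itself is Rust and its detection code is not shown in this card, so the Python below only illustrates the selection order. Flag names follow the Linux /proc/cpuinfo spelling, and `pick_kernel` is a hypothetical name.

```python
def pick_kernel(flags: set) -> str:
    """Return the best kernel family for a set of CPU feature flags."""
    if {"avx512f", "avx512_vnni"} <= flags:
        return "Q4xQ8 vpdpbusd (AVX-512 + VNNI)"
    if "avx512f" in flags:
        return "Q4xf32 FMA (AVX-512F)"
    if {"avx2", "fma"} <= flags:
        return "Q4xf32 FMA (AVX2)"
    if {"avx", "sse4_1"} <= flags:
        return "Q4xf32 (AVX + SSE4.1)"
    return "Q4xf32 scalar"


print(pick_kernel({"sse4_1", "avx", "avx2", "fma"}))  # Q4xf32 FMA (AVX2)
```

Checking the widest ISA first means a CPU that supports several paths always lands on the fastest one listed.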

Limitations

Not a general-purpose LLM. Trained on TinyStories / FineWeb-Edu subsets at 10K steps — this is an architecture and runtime research artifact, not a production language model.
Custom architecture. Cannot be loaded in Hugging Face Transformers, vLLM, or llama.cpp without adaptation. Requires the cflow Rust runtime or the PyTorch reference in pipeline_native/.
CPU-only. The runtime targets x86-64 CPUs and is tuned for AVX2 and AVX-512; older CPUs fall back to the slower kernels listed under SIMD Support. No GPU backend.
Single-token decode optimized. Batch/prefill throughput is not the focus.

Thesis Scorecard

The cflow project tests 8 claims about CPU inference optimization:

#	Claim	Result
1	Conditional expert reading (top-k only)	Proven
2	Tile-streaming L1/L2 cache locality	Proven (7.29x fewer L1-d misses, PMU-measured)
3	AVX2/AVX-512 Q4 SIMD kernels	Proven
4	Fused QKV and gate+up projections	Proven
5	Compute-order file layout	Proven
6	Software prefetch (`_mm_prefetch`)	Disproven (no benefit; slightly harmful)
7	Vertical pipeline via delayed dependencies	Validated (2.00x bandwidth reduction)
8	Stage-major disk layout readahead	Disproven (no isolated benefit)

Citation

@software{poperszky2026cflow,
  author = {Poperszky, Tom},
  title = {cflow: CPU-First Streaming Inference for Pipeline-Native Transformers},
  year = {2026}
}

License

MIT

Evaluation results (self-reported)

Metric	Dataset	Value
Test perplexity (114M, 10K steps)	TinyStories	6.50
Top-1 accuracy (114M, 10K steps)	TinyStories	56.8%
Val perplexity (8.34B / 4-layer, 10K steps)	TinyStories	4.52
Top-1 accuracy (8.34B / 4-layer, 10K steps)	TinyStories	61.4%