arch2_4_combined — Pipeline-Native MoE for CPU Inference

A custom decoder-only transformer with delayed dense FFN + delayed MoE experts, designed so its inter-layer dependency graph permits vertical pipelining on CPU. Part of the cflow project — a CPU-first streaming inference engine written in Rust.

Hosted weights: this repository hosts model.cflow (17.39 GB) — the arch2_4_8k_16l model: 16 layers, hidden 8192, ~31B parameters (top-2-of-8 MoE, ~20B active/token), Q4. This is the model benchmarked at 5.94 tok/s below. The 8.34B figures in this card refer to a smaller 4-layer scale point (arch2_4_8k_4l) used for quality and cache-locality validation (val ppl 4.52); that checkpoint is not hosted here.

Key Results

| Metric | Value |
|---|---|
| CPU decode throughput (~31B / 16-layer, Q4, 32 threads) | 5.94 tok/s |
| Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
| Bandwidth reduction from pipelining | 2.00x (9.00 → 4.50 MB/token) |
| Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
| Val perplexity (8.34B / 4-layer, TinyStories, 10K steps) | 4.52 |
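
The bandwidth figures can be cross-checked with a few lines of arithmetic. This is a sketch, not the project's measurement code. It assumes the 204.8 GB/s peak corresponds to 8 channels of DDR4-3200 (the card does not say how the peak was derived) and that effective bandwidth is simply tokens per second times bytes read per token.

```python
# Cross-check of the Key Results bandwidth numbers (values copied from the card).
TOK_PER_S = 5.94        # reported decode throughput
EFFECTIVE_GBPS = 61.0   # reported effective memory bandwidth
PEAK_GBPS = 204.8       # reported peak bandwidth

# ASSUMPTION: peak = 8 channels x 3200 MT/s x 8 bytes per transfer.
assumed_peak_gbps = 8 * 3200e6 * 8 / 1e9

utilisation = EFFECTIVE_GBPS / PEAK_GBPS            # fraction of peak
implied_gb_per_token = EFFECTIVE_GBPS / TOK_PER_S   # GB streamed per token

print(f"assumed peak:     {assumed_peak_gbps:.1f} GB/s")
print(f"utilisation:      {utilisation:.1%}")
print(f"implied traffic:  {implied_gb_per_token:.2f} GB/token")
```

Under these assumptions the implied traffic is a little over 10 GB per token, which is in line with roughly 20B active parameters at about half a byte each.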

CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)

| Engine | Model | Quant | tok/s |
|---|---|---|---|
| cflow | arch2_4_8k_16l (~31B MoE, ~20B active) | Q4 | 5.94 |
| Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
| vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |

Note: cflow and the baselines run different models. cflow's ~31B MoE activates ~20B parameters per token, while the Qwen2.5-32B baselines are dense and use all 32B. Total parameter counts are comparable (31B vs 32B), but the architectures and training differ, so this is a throughput comparison only: it shows what a co-designed architecture and streaming runtime achieve, not a quality-matched result.

Model Description

arch2_4_combined is a pre-norm decoder-only transformer with a parallel dense FFN + sparse MoE block per layer, using delayed residual injection:

  • The dense FFN reads from a delayed residual (1 layer behind)
  • The MoE experts are routed on the current residual but injected 2 layers later
  • This creates a dependency DAG where dense and expert weight reads for layer N can overlap with compute for layer N-1, reducing critical-path memory bandwidth

The architecture was selected from a screen of 5 pipeline-native candidates. It is the only candidate with a measured bandwidth reduction (2.00x), and its test perplexity (6.50) is the second-best of the five, behind arch4_async_experts (6.26).

Architecture Details

| Parameter | 114M (screening) | ~31B (16-layer, hosted) |
|---|---|---|
| Hidden dim | 512 | 8,192 |
| Layers | 6 | 16 |
| Attention heads | 8 | 128 |
| Head dim | 64 | 64 |
| Dense FFN hidden | 2,048 | 32,768 |
| Expert FFN hidden | 512 | 4,096 |
| Experts / top-k | 8 / 2 | 8 / 2 |
| Dense delay | 1 | 1 |
| Expert delay | 2 | 2 |
| Vocab | 50,257 (GPT-2 BPE) | 50,257 (GPT-2 BPE) |
| Max seq len | 512 | 2,048 |
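
The headline parameter counts can be reproduced from the geometry above. The following is a back-of-the-envelope sketch, not code from the project. `param_counts` is a hypothetical helper, and it assumes untied input embedding and lm_head matrices, no biases anywhere, and negligible norm parameters. Under those assumptions it lands on 8.34B for the 4-layer model, matching the figure quoted in this card.

```python
def param_counts(hidden, layers, dense_ffn, expert_ffn, n_experts, top_k, vocab):
    """Return (total, active-per-token) parameter counts for one configuration."""
    attn = 4 * hidden * hidden        # Q, K, V, O projections, no bias
    dense = 3 * hidden * dense_ffn    # GeGLU: gate, up, down
    expert = 3 * hidden * expert_ffn  # one GeGLU expert
    router = hidden * n_experts       # linear router
    embed = vocab * hidden            # ASSUMPTION: embedding and lm_head are untied
    total = layers * (attn + dense + n_experts * expert + router) + 2 * embed
    # Per token: only top_k experts are read, plus the full lm_head matrix.
    active = layers * (attn + dense + top_k * expert + router) + embed
    return total, active

total_114m, _ = param_counts(512, 6, 2048, 512, 8, 2, 50257)
total_4l, _ = param_counts(8192, 4, 32768, 4096, 8, 2, 50257)
total_16l, active_16l = param_counts(8192, 16, 32768, 4096, 8, 2, 50257)

print(f"screening: {total_114m / 1e6:.1f}M")
print(f"4-layer:   {total_4l / 1e9:.2f}B")
print(f"16-layer:  {total_16l / 1e9:.2f}B total, {active_16l / 1e9:.2f}B active")
```

This prints about 114.4M, 8.34B, and 30.89B total with 20.81B active, consistent with the rounded "~31B" and "~20B active" figures.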

Per-Layer Forward Pass

attn_out = attention(attn_norm(x))
x = x + attn_out                              # residual connection
x = x + dense_ffn(ffn_norm(delayed_x))        # dense reads DELAYED residual
if queued_expert: x = x + queued_expert        # inject expert from 2 layers ago
expert_out = moe(ffn_norm(x))                  # router sees CURRENT residual
# expert_out queued for injection at layer + expert_delay
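
The schedule above can be made concrete with a small runnable toy. This is a sketch under stated assumptions, not the reference implementation: norms are omitted, the residual is a plain number, and `run_layers` is a hypothetical name. It also assumes that "delayed" means the post-attention residual from `dense_delay` layers earlier, that the first layers fall back to the oldest snapshot available, and that expert outputs still queued after the last layer are returned to the caller (the card does not say how these are handled).

```python
from collections import deque

def run_layers(x, layers, dense_delay=1, expert_delay=2):
    """Toy delayed-residual schedule.

    `layers` is a list of (attention, dense_ffn, moe) callables.
    Returns the final residual and any expert outputs never injected.
    """
    snapshots = deque(maxlen=dense_delay + 1)  # post-attention residuals, newest last
    pending = {}                               # target layer index -> expert output
    for i, (attention, dense_ffn, moe) in enumerate(layers):
        x = x + attention(x)                   # residual connection
        snapshots.append(x)
        delayed_x = snapshots[0]               # dense_delay layers back once the deque is full
        x = x + dense_ffn(delayed_x)           # dense reads the DELAYED residual
        if i in pending:
            x = x + pending.pop(i)             # inject expert output queued expert_delay layers ago
        pending[i + expert_delay] = moe(x)     # router sees the CURRENT residual
    return x, pending
```

Because the dense FFN at layer i depends only on a snapshot taken at layer i-1, its weights can in principle be streamed while layer i's attention is still running, which is the overlap the delayed dependency is designed to allow.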

Components

  • Attention: Multi-head (not GQA), Q/K/V/O projections (no bias), standard RoPE (base=10000, half-interleave), causal masking, KV cache
  • Dense FFN: GeGLU — down(gelu(gate(x)) * up(x))
  • MoE: Linear router → top-k selection → softmax over selected → per-expert GeGLU FFN → weighted sum. No auxiliary/load-balancing loss.
  • Normalization: RMSNorm (eps=1e-6) at attn input, FFN input, and pre-lm_head
  • Combine style: DelayedSum — dense and router share ffn_norm but read different residual snapshots
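
The MoE bullet above (top-k selection, then softmax over the selected logits only, then a weighted sum) can be illustrated as follows. This is a minimal scalar sketch; `route_top_k` and `moe` are hypothetical names, and the tie-breaking rule (lowest index first) is an assumption.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k largest logits and softmax over just those k."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)           # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    norm = sum(exps)
    return top, [e / norm for e in exps]

def moe(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs (scalar toy)."""
    idx, weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in zip(idx, weights))
```

Normalizing after selection means the k weights always sum to 1, regardless of how much probability mass the unselected experts would have received.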

Training

114M Screening (5 architectures)

| Setting | Value |
|---|---|
| Dataset | TinyStories (431M train tokens, 24M test tokens) |
| Tokenizer | GPT-2 BPE (50,257 vocab) |
| Sequence length | 512 |
| Optimizer | AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=0.1) |
| Learning rate | 3e-4 with linear warmup (200 steps) + cosine decay to 1e-5 |
| Gradient clipping | Global norm 1.0 |
| Batch size | 8 |
| Steps | 10,000 |
| Precision | float32 |
| Hardware | RTX 3060 12 GB |
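
The learning-rate schedule for the screening runs can be sketched as a single function. This is an illustration under assumptions: `lr_at` is a hypothetical name, warmup is taken to ramp linearly up to the peak at the last warmup step, and the cosine is taken to span the remaining steps up to 10,000. The training script may differ in these details.

```python
import math

def lr_at(step, peak=3e-4, floor=1e-5, warmup=200, total=10_000):
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)   # 0.0 at end of warmup, 1.0 at the end
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

The same function covers the 8.34B run by passing `peak=1e-4, floor=1e-6, warmup=500`.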

8.34B Scale-Up (4-layer — quality & cache validation)

This is the smaller scale point: arch2_4_8k_4l, 4 layers, 8.34B params. It provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU cache-locality result. The hosted decode-benchmark model (arch2_4_8k_16l, ~31B) shares this per-layer geometry but has 16 layers.

| Setting | Value |
|---|---|
| Dataset | TinyStories (same splits) |
| Optimizer | 8-bit AdamW (bitsandbytes) |
| Learning rate | 1e-4 with linear warmup (500 steps) + cosine decay to 1e-6 |
| Batch size | 4 per GPU (global 32) |
| Steps | 10,000 |
| Precision | bf16 |
| Parallelism | FSDP (FULL_SHARD / ZeRO-3) |
| Gradient checkpointing | Per DelayedMoELayer, non-reentrant |
| Hardware | 8x A100 SXM4 80 GB (Lambda Cloud) |

Architecture Comparison (114M, TinyStories, 10K steps)

| Architecture | dense_delay | expert_delay | Test PPL | Top-1 Acc | BW Reduction |
|---|---|---|---|---|---|
| arch1_decoupled_streams | 0 | 0 | 7.21 | 54.9% | 1.00x |
| arch2_4_combined | 1 | 2 | 6.50 | 56.8% | 2.00x |
| arch3_pipeline_registers | 0 | 0 | 7.24 | 55.1% | 1.00x |
| arch4_async_experts | 0 | 2 | 6.26 | 57.6% | 1.00x |
| arch5_fixed_point | 0 | 0 | 6.77 | 56.2% | 1.00x |

Key insight: Dense delay is the bandwidth knob; expert delay is the quality knob. arch4_async_experts gets the best perplexity by routing off pre-dense activations (cleaner router signal) but sacrifices the bandwidth win that arch2_4 achieves by also delaying the dense read.

Inference with cflow

cflow is a Rust inference engine that reads .cflow (per-layer streaming) or .vflow (vertical pipeline) weight files. Weights are stored as pre-tiled Q4 (128x256 tiles, ~18 KB each, sized to fit L2 cache).
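
The "~18 KB" tile size is consistent with a common Q4 layout, as the arithmetic below shows. This is a sketch under loudly stated assumptions: the card gives only the tile shape and approximate size, so the group size of 32 weights and the 2-byte scale per group are guesses at the format, not documented facts.

```python
TILE_ROWS, TILE_COLS = 128, 256   # tile shape from the card
GROUP = 32                        # ASSUMPTION: weights per quantization group
SCALE_BYTES = 2                   # ASSUMPTION: one fp16 scale per group

weights = TILE_ROWS * TILE_COLS                  # weights per tile
payload_bytes = weights // 2                     # two 4-bit weights per byte
scale_bytes = (weights // GROUP) * SCALE_BYTES   # per-group scales
tile_bytes = payload_bytes + scale_bytes
bits_per_weight = tile_bytes * 8 / weights

print(f"{tile_bytes} bytes per tile ({tile_bytes / 1024:.0f} KiB), "
      f"{bits_per_weight} bits per weight")
```

Under these assumptions a tile is 18,432 bytes (18 KiB) at 4.5 bits per weight.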

# Build
cargo build --release --bin cflow-run

# Convert safetensors → .cflow
cargo run --release --bin cflow-convert -- \
  --input checkpoint.safetensors \
  --output model.cflow \
  --model arch2_4

# Run inference
CFLOW_THREADS=32 ./target/release/cflow-run \
  model.cflow 32 \
  --prompt "Once upon a time" \
  --tokenizer tokenizer.json \
  --temperature 0.8

SIMD Support

The runtime auto-detects and dispatches to the best available instruction set:

| ISA | Kernel | Notes |
|---|---|---|
| AVX-512 + VNNI | Q4×Q8 vpdpbusd | Best path (Ice Lake+) |
| AVX-512F | Q4×f32 FMA | Skylake-X+ |
| AVX2 + FMA | Q4×f32 FMA | Haswell+ |
| AVX + SSE4.1 | Q4×f32 | Sandy Bridge+ |
| Scalar | Q4×f32 | Fallback |
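
The dispatch order in the table amounts to "first row whose features are all present". The sketch below illustrates that priority logic only. It is not the runtime's detection code, which is written in Rust and is not described in this card. `pick_kernel` is a hypothetical helper, and the flag names follow the Linux /proc/cpuinfo spelling.

```python
def pick_kernel(flags):
    """Return the first SIMD table row whose required CPU flags are all present."""
    table = [
        ("AVX-512 + VNNI", {"avx512f", "avx512_vnni"}),
        ("AVX-512F", {"avx512f"}),
        ("AVX2 + FMA", {"avx2", "fma"}),
        ("AVX + SSE4.1", {"avx", "sse4_1"}),
    ]
    available = set(flags)
    for name, needed in table:
        if needed <= available:   # subset test: every required flag is available
            return name
    return "Scalar"               # fallback path

# Example: a Haswell-class CPU without AVX-512.
print(pick_kernel({"sse4_1", "avx", "avx2", "fma"}))
```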

Limitations

  • Not a general-purpose LLM. Trained on TinyStories / FineWeb-Edu subsets at 10K steps — this is an architecture and runtime research artifact, not a production language model.
  • Custom architecture. Cannot be loaded in Hugging Face Transformers, vLLM, or llama.cpp without adaptation. Requires the cflow Rust runtime or the PyTorch reference in pipeline_native/.
  • CPU-only. The runtime targets x86-64 CPUs and is tuned for AVX2 and AVX-512; the AVX + SSE4.1 and scalar paths exist as fallbacks. No GPU backend.
  • Single-token decode optimized. Batch/prefill throughput is not the focus.

Thesis Scorecard

The cflow project tests 8 claims about CPU inference optimization:

| # | Claim | Result |
|---|---|---|
| 1 | Conditional expert reading (top-k only) | Proven |
| 2 | Tile-streaming L1/L2 cache locality | Proven (7.29x fewer L1-d misses, PMU-measured) |
| 3 | AVX2/AVX-512 Q4 SIMD kernels | Proven |
| 4 | Fused QKV and gate+up projections | Proven |
| 5 | Compute-order file layout | Proven |
| 6 | Software prefetch (_mm_prefetch) | Disproven (no benefit; slightly harmful) |
| 7 | Vertical pipeline via delayed dependencies | Validated (2.00x bandwidth reduction) |
| 8 | Stage-major disk layout readahead | Disproven (no isolated benefit) |

Citation

@software{poperszky2026cflow,
  author = {Poperszky, Tom},
  title = {cflow: CPU-First Streaming Inference for Pipeline-Native Transformers},
  year = {2026}
}

License

MIT

