Qwen3.6-27B GPTQ Int4

Partial GPTQ Int4 quant of Qwen/Qwen3.6-27B, produced with the verbatim recipe from Qwen's own Qwen/Qwen3.5-27B-GPTQ-Int4: only the MLP/FFN layers are quantized to Int4; everything else stays BF16.

Quantization

Parameter     Value
Library       GPTQModel v6.0.3
Bits          4
Group size    128
Sym           true
Desc-act      false
True-seq      true
Damp          0.01
Calibration   256 samples × 2048 tokens from allenai/c4

Kept in BF16 (not quantized): lm_head, embed_tokens, .*attn.* (Gated DeltaNet + Gated Attention), .*mtp.*, .*shared_expert.*, and .*visual.*.
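
For reference, the recipe above maps naturally onto GPTQModel's dynamic per-module overrides. A minimal sketch, assuming GPTQModel's documented load/quantize/save flow; the exclusion regexes, the C4 shard name, and the output path are illustrative, not the exact script behind this repo, and the true-seq setting is omitted since its flag name varies across GPTQModel versions:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Negative ("-:") dynamic rules keep matching modules in BF16, so only
# the remaining MLP/FFN projections are quantized to Int4.
# lm_head and embed_tokens are left unquantized by GPTQModel by default.
dynamic = {
    r"-:.*attn.*": {},           # Gated DeltaNet + Gated Attention
    r"-:.*mtp.*": {},            # NEXTN / MTP draft layers
    r"-:.*shared_expert.*": {},  # shared experts
    r"-:.*visual.*": {},         # vision tower
}

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    sym=True,
    desc_act=False,
    damp_percent=0.01,
    dynamic=dynamic,
)

# 256 calibration samples from allenai/c4 (shard name is illustrative).
calib = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00000-of-01024.json.gz",
    split="train",
).select(range(256))["text"]

model = GPTQModel.load("Qwen/Qwen3.6-27B", quant_config)
model.quantize(calib, batch_size=1)  # tokenization handled internally
model.save("Qwen3.6-27B-GPTQ-Int4")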

Serving on SGLang

Use --quantization moe_wna16, not --quantization gptq: the plain GPTQ kernel rejects the BF16 attention layers this recipe keeps.

python -m sglang.launch_server \
    --model-path raydelossantos/Qwen3.6-27B-GPTQ-Int4 \
    --quantization moe_wna16 --tp 4 --kv-cache-dtype fp8_e5m2 \
    --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
    --trust-remote-code
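
Once the server is up it speaks the OpenAI-compatible API on SGLang's default port 30000, so any stock client works as a smoke test:

from openai import OpenAI

# Point a standard OpenAI client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

resp = client.chat.completions.create(
    model="raydelossantos/Qwen3.6-27B-GPTQ-Int4",
    messages=[{"role": "user", "content": "Summarize GPTQ in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)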

With NEXTN speculative decoding added (roughly 2-3× decode throughput on agentic workloads):

SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
    --model-path raydelossantos/Qwen3.6-27B-GPTQ-Int4 \
    --quantization moe_wna16 --tp 4 --mem-fraction-static 0.75 \
    --speculative-algo NEXTN --speculative-num-steps 3 \
    --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --mamba-scheduler-strategy extra_buffer \
    --kv-cache-dtype fp8_e5m2 \
    --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
    --trust-remote-code
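
To verify the speculative-decoding gain on your own hardware, a crude client-side measurement is enough: run the snippet below once with and once without the speculative flags. Streamed chunk count is used as a rough proxy for token count.

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

# Stream a long completion and count chunks per second client-side.
start, n_chunks = time.perf_counter(), 0
stream = client.chat.completions.create(
    model="raydelossantos/Qwen3.6-27B-GPTQ-Int4",
    messages=[{"role": "user", "content": "Explain speculative decoding step by step."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        n_chunks += 1
elapsed = time.perf_counter() - start
print(f"≈{n_chunks / elapsed:.1f} tok/s over {elapsed:.1f}s")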

Hardware tested

  • 4× RTX 3090 (96 GB total), TP=4: ≈7.7 GB of weights per GPU, leaving room for a 100K context to fit comfortably (see the back-of-envelope check below the list)
  • Should also work on a single A100 40 GB+, a single H100, or 2× RTX 4090
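
The per-GPU figure checks out on the back of an envelope: 4 × 7.7 GB ≈ 31 GB of weights for 28B params is about 1.1 bytes/param, right between Int4-with-scales (~0.52 B/param at group size 128) and BF16 (2 B/param), consistent with the partial-quant split. A quick sanity check (the Int4/BF16 split itself is assumed):

# Back-of-envelope weight-memory check for the numbers above.
params = 28e9                # total parameter count
per_gpu_gb, tp = 7.7, 4      # observed weights per GPU at TP=4
avg_bytes = per_gpu_gb * tp * 1e9 / params
print(f"{avg_bytes:.2f} bytes/param")  # ≈1.10, between int4+scales (~0.52) and bf16 (2.0)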
