Qwen3.6-27B GPTQ Int4

Partial GPTQ Int4 quant of Qwen/Qwen3.6-27B, produced with the verbatim recipe from Qwen's own Qwen/Qwen3.5-27B-GPTQ-Int4: only the MLP/FFN layers are quantized to Int4; everything else stays BF16.

Quantization

Parameter     Value
Library       GPTQModel v6.0.3
Bits          4
Group size    128
Sym           true
Desc-act      false
True-seq      true
Damp          0.01
Calibration   256 samples × 2048 tokens from allenai/c4

Kept in BF16 (not quantized): lm_head, embed_tokens, .*attn.* (Gated DeltaNet + Gated Attention), .*mtp.*, .*shared_expert.*, and .*visual.*.
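
For reference, the recipe above maps naturally onto GPTQModel's dynamic per-module overrides. A minimal sketch, assuming GPTQModel's documented load/quantize/save flow; the exclusion regexes, the C4 shard name, and the output path are illustrative, not the exact script behind this repo, and the true-seq setting is omitted since its flag name varies across GPTQModel versions:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Negative ("-:") dynamic rules keep matching modules in BF16, so only
# the remaining MLP/FFN projections are quantized to Int4.
# lm_head and embed_tokens are left unquantized by GPTQModel by default.
dynamic = {
    r"-:.*attn.*": {},           # Gated DeltaNet + Gated Attention
    r"-:.*mtp.*": {},            # NEXTN / MTP draft layers
    r"-:.*shared_expert.*": {},  # shared experts
    r"-:.*visual.*": {},         # vision tower
}

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    sym=True,
    desc_act=False,
    damp_percent=0.01,
    dynamic=dynamic,
)

# 256 calibration samples from allenai/c4 (shard name is illustrative).
calib = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00000-of-01024.json.gz",
    split="train",
).select(range(256))["text"]

model = GPTQModel.load("Qwen/Qwen3.6-27B", quant_config)
model.quantize(calib, batch_size=1)  # tokenization handled internally
model.save("Qwen3.6-27B-GPTQ-Int4")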

Serving on SGLang

Use --quantization moe_wna16, not --quantization gptq: the plain GPTQ kernel rejects the BF16 attention layers this recipe keeps.

python -m sglang.launch_server \
    --model-path raydelossantos/Qwen3.6-27B-GPTQ-Int4 \
    --quantization moe_wna16 --tp 4 --kv-cache-dtype fp8_e5m2 \
    --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
    --trust-remote-code
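
Once the server is up it speaks the OpenAI-compatible API on SGLang's default port 30000, so any stock client works as a smoke test:

from openai import OpenAI

# Point a standard OpenAI client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

resp = client.chat.completions.create(
    model="raydelossantos/Qwen3.6-27B-GPTQ-Int4",
    messages=[{"role": "user", "content": "Summarize GPTQ in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)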

With NEXTN speculative decoding added (roughly 2-3× decode throughput on agentic workloads):

SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
    --model-path raydelossantos/Qwen3.6-27B-GPTQ-Int4 \
    --quantization moe_wna16 --tp 4 --mem-fraction-static 0.75 \
    --speculative-algo NEXTN --speculative-num-steps 3 \
    --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --mamba-scheduler-strategy extra_buffer \
    --kv-cache-dtype fp8_e5m2 \
    --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
    --trust-remote-code
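
To verify the speculative-decoding gain on your own hardware, a crude client-side measurement is enough: run the snippet below once with and once without the speculative flags. Streamed chunk count is used as a rough proxy for token count.

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

# Stream a long completion and count chunks per second client-side.
start, n_chunks = time.perf_counter(), 0
stream = client.chat.completions.create(
    model="raydelossantos/Qwen3.6-27B-GPTQ-Int4",
    messages=[{"role": "user", "content": "Explain speculative decoding step by step."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        n_chunks += 1
elapsed = time.perf_counter() - start
print(f"≈{n_chunks / elapsed:.1f} tok/s over {elapsed:.1f}s")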

Hardware tested

  • 4× RTX 3090 (96 GB total), TP=4: ≈7.7 GB of weights per GPU, leaving room for a 100K context to fit comfortably (see the back-of-envelope check below the list)
  • Should also work on a single A100 40 GB+, a single H100, or 2× RTX 4090
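
The per-GPU figure checks out on the back of an envelope: 4 × 7.7 GB ≈ 31 GB of weights for 28B params is about 1.1 bytes/param, right between Int4-with-scales (~0.52 B/param at group size 128) and BF16 (2 B/param), consistent with the partial-quant split. A quick sanity check (the Int4/BF16 split itself is assumed):

# Back-of-envelope weight-memory check for the numbers above.
params = 28e9                # total parameter count
per_gpu_gb, tp = 7.7, 4      # observed weights per GPU at TP=4
avg_bytes = per_gpu_gb * tp * 1e9 / params
print(f"{avg_bytes:.2f} bytes/param")  # ≈1.10, between int4+scales (~0.52) and bf16 (2.0)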
