tinyLM-8M-exp

A tiny, 8M-parameter-class causal LM with a Qwen3-style config and math-only novelty-gated grouped-query attention (GQA).

Architecture

Item	Value
Config type	`tinyqwen3_novelty`
Parameters	8.132M
Layers	8
Hidden size	256
MLP size	896
Query heads	8
KV heads	4
Head dim	32
RoPE theta	2500
Tied embeddings	yes

Attention	Value
Type	GQA
Novelty gate	math-only element-wise RMS-normalized abs-delta
Gate floor	0.05
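The table values can be sanity-checked against the reported 8.132M parameters with a rough count. This is a sketch, not the repo's code: it assumes the usual Qwen3 layout (bias-free q/k/v/o projections, a three-matrix gated MLP, two RMSNorms per layer, per-head q/k norms, one final norm) and that the novelty gate adds no learned weights, which "math-only" suggests but does not confirm. The vocabulary size is not stated in this card, so `vocab_size` is left as an argument.

```python
def qwen3_style_param_count(vocab_size, hidden=256, layers=8, mlp=896,
                            q_heads=8, kv_heads=4, head_dim=32, tied=True):
    """Rough parameter count under the assumptions stated in the lead-in."""
    q_dim = q_heads * head_dim    # 8 * 32 = 256
    kv_dim = kv_heads * head_dim  # 4 * 32 = 128: two query heads share each KV head
    attn = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden  # q, k, v, o
    qk_norm = 2 * head_dim        # assumed per-head RMSNorm weights on q and k
    mlp_params = 3 * hidden * mlp # gate, up, and down projections
    norms = 2 * hidden            # pre-attention and pre-MLP RMSNorm
    per_layer = attn + qk_norm + mlp_params + norms
    embed = vocab_size * hidden
    lm_head = 0 if tied else vocab_size * hidden
    return layers * per_layer + hidden + embed + lm_head  # "+ hidden" is the final norm


# Everything except the (tied) embedding matrix:
non_embedding = qwen3_style_param_count(vocab_size=0)
print(non_embedding)  # 7082752
```

Under these assumptions, the roughly 1.05M parameters that remain would be the tied embedding matrix, which corresponds to a vocabulary of about 4,100 entries.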

Training

Item	Value
Tokenizer	`AxiomicLabs/GPT-S2-5M`
Sequence length	512
Microbatch size	1024
Gradient accumulation	4
Effective batch size	4096
Steps	10,000
Validation cadence	every 1,000 steps
Official lm-eval	run after the final Hub upload on ARC-Easy, ARC-Challenge, PIQA, and HellaSwag
LR schedule	warmup, then cosine decay to the minimum LR by step 10,000
Optimizer	Muon for middle 2D weights, AdamW for the rest
Special-token policy	BOS/EOS are document-level; `<
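The LR schedule row describes a warmup followed by cosine decay that reaches the minimum at step 10,000. A minimal sketch of that shape follows. The peak LR, minimum LR, and warmup length are not given in this card, so the defaults below are hypothetical placeholders, and the linear warmup is an assumption.

```python
import math


def lr_at(step, peak_lr=1e-3, min_lr=1e-4, warmup_steps=500, total_steps=10_000):
    """Warmup-then-cosine schedule. All default values are hypothetical."""
    if step < warmup_steps:
        # Linear warmup (assumed) from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 499, 5_000, 10_000):
    print(s, lr_at(s))
```

With these placeholders the rate reaches `min_lr` exactly at step 10,000, which is the only property the card itself specifies.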

Dataset	Share	Config
`HuggingFaceFW/fineweb-edu`	60.0%	`sample-100BT`
`HuggingFaceTB/smollm-corpus`	30.0%	`cosmopedia-v2` only
`epfml/FineWeb-HQ`	10.0%	`default`
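The batch settings and dataset shares above imply an approximate token budget. The sketch below assumes batch sizes are counted in sequences, every sequence is packed to the full 512 tokens, and the shares are measured in tokens. None of these assumptions is stated in this card, so the result is an estimate.

```python
SEQ_LEN = 512
MICROBATCH = 1024
GRAD_ACCUM = 4
STEPS = 10_000

effective_batch = MICROBATCH * GRAD_ACCUM    # 4096 sequences per optimizer step
tokens_per_step = effective_batch * SEQ_LEN  # 2,097,152 tokens per step
total_tokens = tokens_per_step * STEPS       # 20,971,520,000 tokens overall

# Dataset shares from the table, as integer percentages to avoid float error.
mix_percent = {
    "HuggingFaceFW/fineweb-edu": 60,
    "HuggingFaceTB/smollm-corpus": 30,
    "epfml/FineWeb-HQ": 10,
}
token_budget = {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}

for name, tokens in token_budget.items():
    print(f"{name}: {tokens:,} tokens")
```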

Validation

Metric	Value
Dataset	`Salesforce/wikitext`, `wikitext-103-raw-v1`, validation
Context / stride	512 / 256
Loss	3.2769
Perplexity	26.49
UTF-8 BPB	1.4992
Scored tokens	365,258
UTF-8 bytes	1,151,766
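The three headline numbers in this table are related, which can be checked directly. The sketch assumes the reported loss is the mean negative log-likelihood in nats per scored token. That reading is consistent with the table, since it reproduces both the perplexity and the bits-per-byte value to rounding.

```python
import math

loss_nats = 3.2769       # mean NLL per scored token (assumed to be in nats)
scored_tokens = 365_258
utf8_bytes = 1_151_766

# Perplexity is the exponential of the mean per-token loss.
perplexity = math.exp(loss_nats)

# Bits per byte: total NLL converted from nats to bits, divided by the UTF-8 byte count.
total_bits = loss_nats * scored_tokens / math.log(2)
bits_per_byte = total_bits / utf8_bytes

print(f"perplexity    = {perplexity:.2f}")     # about 26.49
print(f"bits per byte = {bits_per_byte:.4f}")  # about 1.4992
```

Bits per byte is independent of the tokenizer's vocabulary, so it is the more comparable figure across models that use different tokenizers.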

Evaluation

Scores were computed after the Hub upload, against revision `d95a00a6edafab4bc2d6b60a28e6893b00f52699`. ARC-Easy, ARC-Challenge, PIQA, and HellaSwag use the official `lm_eval` 0-shot log-likelihood scoring. ArithMark-2.0 is not available in `lm_eval`, so it is scored with a custom scorer that uses the same continuation-NLL scoring style.

Task	n	acc	acc stderr	acc_norm	acc_norm stderr
ARC-Easy	2,376	37.04%	0.99%	35.86%	0.98%
ARC-Challenge	1,172	18.77%	1.14%	22.87%	1.23%
PIQA	1,838	57.67%	1.15%	57.89%	1.15%
HellaSwag	10,042	26.88%	0.44%	27.88%	0.45%
ArithMark-2.0	2,500	25.12%	0.87%	24.44%	0.86%
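The two accuracy columns and their standard errors can be illustrated with a short sketch. It is not the `lm_eval` source. `pick_choice` is a hypothetical helper showing the difference between `acc` (highest total log-likelihood) and `acc_norm` (log-likelihood divided by the length of the choice text; character count is used here as an assumption about the exact unit). `binomial_stderr` is the sample standard error of a mean of 0/1 outcomes, which reproduces every stderr value in the table.

```python
import math


def pick_choice(logliks, choices, normalize=False):
    """Return the index of the predicted choice (hypothetical helper)."""
    if normalize:
        scores = [ll / len(text) for ll, text in zip(logliks, choices)]
    else:
        scores = list(logliks)
    return max(range(len(scores)), key=scores.__getitem__)


def binomial_stderr(acc, n):
    """Sample standard error of the mean of n 0/1 outcomes (n - 1 denominator)."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Length normalization can change the prediction for a long answer.
logliks = [-10.0, -12.0]
choices = ["cat", "a much longer answer"]
print(pick_choice(logliks, choices))                  # 0
print(pick_choice(logliks, choices, normalize=True))  # 1

# (task, n, acc) rows copied from the table above.
rows = [
    ("ARC-Easy", 2376, 0.3704),
    ("ARC-Challenge", 1172, 0.1877),
    ("PIQA", 1838, 0.5767),
    ("HellaSwag", 10042, 0.2688),
    ("ArithMark-2.0", 2500, 0.2512),
]
for task, n, acc in rows:
    print(f"{task}: acc stderr = {100 * binomial_stderr(acc, n):.2f}%")
```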

Load And Generate

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "User01110/tinyLM-8M-exp"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(inputs.input_ids[0][:2].tolist())  # auto-prefix: [<|im_start|>, <bos>]

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.65,
        top_k=30,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

This repo uses a self-contained remote TinyQwen3NoveltyConfig plus model code for a Qwen3-style dense decoder with a math-only novelty-gated attention block.
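Because the model code is fetched from the Hub and executed locally, it can help to pin the load to a specific commit. The sketch below passes the standard `revision` argument of `from_pretrained`, using the hash quoted in the Evaluation section. It assumes that hash refers to a commit of this model repo. `load_pinned` is a hypothetical helper name, and the import sits inside the function so that defining it does not require `transformers` or network access.

```python
REPO = "User01110/tinyLM-8M-exp"
# Revision quoted in the Evaluation section (assumed to be a commit of this repo).
REVISION = "d95a00a6edafab4bc2d6b60a28e6893b00f52699"


def load_pinned(repo=REPO, revision=REVISION):
    """Load tokenizer and model at a fixed commit (hypothetical helper)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        repo, revision=revision, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo, revision=revision, trust_remote_code=True
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_pinned()` then loads the same weights and remote code each time, even if the repo is updated later.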
