tinyLM-8M-exp

A tiny, 8M-parameter-class causal LM built on a Qwen3-style config, with math-only novelty-gated grouped-query attention (GQA).

Architecture

| Item | Value |
| --- | --- |
| Config type | tinyqwen3_novelty |
| Parameters | 8.132M |
| Layers | 8 |
| Hidden size | 256 |
| MLP size | 896 |
| Query heads | 8 |
| KV heads | 4 |
| Head dim | 32 |
| RoPE theta | 2500 |
| Tied embeddings | yes |

| Attention | Value |
| --- | --- |
| Type | GQA |
| Novelty gate | math-only element-wise RMS-normalized abs-delta |
| Gate floor | 0.05 |
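
As a rough cross-check of the architecture figures, the sketch below counts decoder parameters from the listed dimensions and shows how 8 query heads share 4 KV heads under GQA. It rests on assumptions the card does not confirm: bias-free linear layers, a three-matrix SwiGLU MLP, two RMSNorms per layer, Qwen3-style per-head RMSNorm on Q and K, and a novelty gate with no learned weights (suggested by "math-only"). The function names are illustrative and not taken from the repo.

```python
# Back-of-the-envelope parameter count from the architecture values.
# ASSUMPTIONS (not confirmed by the card): no biases, SwiGLU MLP with
# gate/up/down matrices, two RMSNorms per layer, per-head RMSNorm on Q and K,
# and no learned parameters in the novelty gate.

HIDDEN = 256
LAYERS = 8
MLP = 896
Q_HEADS = 8
KV_HEADS = 4
HEAD_DIM = 32


def attention_params():
    q = HIDDEN * Q_HEADS * HEAD_DIM    # query projection
    k = HIDDEN * KV_HEADS * HEAD_DIM   # key projection (fewer heads under GQA)
    v = HIDDEN * KV_HEADS * HEAD_DIM   # value projection
    o = Q_HEADS * HEAD_DIM * HIDDEN    # output projection
    qk_norm = 2 * HEAD_DIM             # assumed per-head RMSNorm weights
    return q + k + v + o + qk_norm


def mlp_params():
    return 3 * HIDDEN * MLP            # assumed gate, up, and down matrices


def layer_params():
    return attention_params() + mlp_params() + 2 * HIDDEN  # plus two norms


def non_embedding_params():
    return LAYERS * layer_params() + HIDDEN  # plus the final norm


def kv_head_for(query_head):
    # Each group of Q_HEADS // KV_HEADS query heads reads one KV head.
    return query_head // (Q_HEADS // KV_HEADS)


if __name__ == "__main__":
    print("per layer:", layer_params())
    print("non-embedding total:", non_embedding_params())
    print("query -> KV head:", [kv_head_for(h) for h in range(Q_HEADS)])
```

Under these assumptions the decoder stack accounts for 7,082,752 parameters, which leaves roughly 1.05M of the reported 8.132M for the tied embedding table. The exact remainder depends on the tokenizer's vocabulary size, which the card does not state.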

Training

| Item | Value |
| --- | --- |
| Tokenizer | AxiomicLabs/GPT-S2-5M |
| Sequence length | 512 |
| Microbatch size | 1024 |
| Gradient accumulation | 4 |
| Effective batch size | 4096 |
| Steps | 10,000 |
| Validation cadence | every 1,000 steps |
| Official lm-eval | after final Hub upload on ARC-Easy, ARC-Challenge, PIQA, HellaSwag |
| LR schedule | warmup, cosine to min by 10,000 |
| Optimizer | Muon for middle 2D weights, AdamW for the rest |
| Special-token policy | BOS/EOS are document-level; `< |

| Dataset | Share | Config |
| --- | --- | --- |
| HuggingFaceFW/fineweb-edu | 60.0% | sample-100BT |
| HuggingFaceTB/smollm-corpus | 30.0% | cosmopedia-v2 only |
| epfml/FineWeb-HQ | 10.0% | default |
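
The batch arithmetic and the shape of the learning-rate schedule can be sketched as follows. The batch sizes and step count come from the training table; the warmup length, peak LR, minimum LR, and the linear form of the warmup are hypothetical placeholders, since the card does not give them. The token count further assumes that batch sizes are measured in sequences rather than tokens.

```python
import math

# Values taken from the training table.
SEQ_LEN = 512
MICROBATCH = 1024
GRAD_ACCUM = 4
TOTAL_STEPS = 10_000

# HYPOTHETICAL values: the card does not state these.
WARMUP_STEPS = 500
PEAK_LR = 1e-3
MIN_LR = 1e-4


def effective_batch():
    # Gradients from GRAD_ACCUM microbatches are summed before each update.
    return MICROBATCH * GRAD_ACCUM


def tokens_per_step():
    # ASSUMPTION: batch sizes count sequences of SEQ_LEN tokens each.
    return effective_batch() * SEQ_LEN


def lr_at(step):
    # Assumed linear warmup to PEAK_LR, then cosine decay that reaches
    # MIN_LR at TOTAL_STEPS.
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    print("effective batch:", effective_batch())
    print("tokens per step:", tokens_per_step())
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, lr_at(step))
```

If the batch sizes do count sequences, each optimizer step covers 2,097,152 tokens, or about 21B tokens over the 10,000 steps.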

Validation

| Metric | Value |
| --- | --- |
| Dataset | Salesforce/wikitext, wikitext-103-raw-v1, validation |
| Context / stride | 512 / 256 |
| Loss | 3.2769 |
| Perplexity | 26.49 |
| UTF-8 BPB | 1.4992 |
| Scored tokens | 365,258 |
| UTF-8 bytes | 1,151,766 |
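
The validation metrics are related by simple formulas, and the sketch below reproduces the reported perplexity and bits-per-byte from the loss, token count, and byte count. It assumes the loss is the mean negative log-likelihood in nats per scored token, which is consistent with the reported perplexity. The `sliding_windows` helper shows one common way to apply a 512-token context with a 256-token stride; it is an illustration and not necessarily the evaluation code used for this model.

```python
import math

# Values taken from the validation table.
LOSS = 3.2769            # assumed: mean NLL in nats per scored token
SCORED_TOKENS = 365_258
UTF8_BYTES = 1_151_766


def perplexity(loss):
    return math.exp(loss)


def bits_per_byte(loss, n_tokens, n_bytes):
    # Total nats over the corpus, converted to bits, spread over UTF-8 bytes.
    total_bits = loss * n_tokens / math.log(2)
    return total_bits / n_bytes


def sliding_windows(n_tokens, context=512, stride=256):
    # Yields (begin, end, n_new), where n_new is the number of tokens in the
    # window that no earlier window has already covered.
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


if __name__ == "__main__":
    print("perplexity:", round(perplexity(LOSS), 2))
    print("BPB:", round(bits_per_byte(LOSS, SCORED_TOKENS, UTF8_BYTES), 4))
    print(list(sliding_windows(1024)))
```

With a stride of half the context, every window after the first scores only its last 256 tokens, so each of those tokens is predicted with at least 256 tokens of preceding context.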

Evaluation

All scores were computed after the Hub upload, against revision d95a00a6edafab4bc2d6b60a28e6893b00f52699. ARC-Easy, ARC-Challenge, PIQA, and HellaSwag use the official lm_eval harness with 0-shot log-likelihood scoring. ArithMark-2.0 is not available in lm_eval, so it is scored with a custom scorer that follows the same continuation-NLL style.

| Task | n | acc | acc stderr | acc_norm | acc_norm stderr |
| --- | --- | --- | --- | --- | --- |
| ARC-Easy | 2,376 | 37.04% | 0.99% | 35.86% | 0.98% |
| ARC-Challenge | 1,172 | 18.77% | 1.14% | 22.87% | 1.23% |
| PIQA | 1,838 | 57.67% | 1.15% | 57.89% | 1.15% |
| HellaSwag | 10,042 | 26.88% | 0.44% | 27.88% | 0.45% |
| ArithMark-2.0 | 2,500 | 25.12% | 0.87% | 24.44% | 0.86% |
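
The stderr columns can be sanity-checked from the accuracies and sample counts. The sketch below assumes each stderr is the standard error of a mean of per-example 0/1 scores with the n-1 sample correction. This formula reproduces the reported values to within rounding, but the card itself does not state it.

```python
import math

# (n, acc %, acc stderr %, acc_norm %, acc_norm stderr %) from the table.
RESULTS = {
    "ARC-Easy": (2376, 37.04, 0.99, 35.86, 0.98),
    "ARC-Challenge": (1172, 18.77, 1.14, 22.87, 1.23),
    "PIQA": (1838, 57.67, 1.15, 57.89, 1.15),
    "HellaSwag": (10042, 26.88, 0.44, 27.88, 0.45),
    "ArithMark-2.0": (2500, 25.12, 0.87, 24.44, 0.86),
}


def stderr_percent(acc_percent, n):
    # Standard error of a mean of 0/1 outcomes, using the sample (n - 1)
    # variance, expressed in percentage points.
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / (n - 1))


if __name__ == "__main__":
    for task, (n, acc, acc_se, acc_norm, acc_norm_se) in RESULTS.items():
        print(
            f"{task}: acc stderr {stderr_percent(acc, n):.2f} (reported {acc_se}), "
            f"acc_norm stderr {stderr_percent(acc_norm, n):.2f} (reported {acc_norm_se})"
        )
```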

Load And Generate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "User01110/tinyLM-8M-exp"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(inputs.input_ids[0][:2].tolist())  # auto-prefix: [<|im_start|>, <bos>]

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.65,
        top_k=30,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

This repo ships self-contained remote code: a TinyQwen3NoveltyConfig plus model code for a Qwen3-style dense decoder with a math-only novelty-gated attention block. Loading it therefore requires trust_remote_code=True, as in the example above.
