Atom2.7m
Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
The main result is on ArithMark 2.0, a 2,500-example integer-arithmetic continuation benchmark. Atom2.7m scores 69.24% accuracy. This places it above nearby published results, SmolLM2-1.7B at 66.12% and Qwen2.5-0.5B at 63.04%, while using only 2.74M parameters.
The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
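To put "hundreds of times larger" in numbers, the short calculation below divides nominal baseline sizes by Atom2.7m's parameter count. The baseline sizes are read off the model names (1.7B and 0.5B), which is an assumption: the exact published parameter counts differ slightly.

```python
# Parameter count of Atom2.7m, as stated in this model card.
ATOM_PARAMS = 2_738_880

# Nominal sizes taken from the model names (assumption: exact counts differ slightly).
NOMINAL_BASELINES = {
    "SmolLM2-1.7B": 1.7e9,
    "Qwen2.5-0.5B": 0.5e9,
}


def size_ratios(atom_params=ATOM_PARAMS, baselines=NOMINAL_BASELINES):
    """Return how many times larger each baseline is than Atom2.7m."""
    return {name: params / atom_params for name, params in baselines.items()}


if __name__ == "__main__":
    for name, ratio in size_ratios().items():
        print(f"{name}: about {ratio:.0f}x the parameters of Atom2.7m")
```

Under these nominal sizes the ratios come out at roughly 620x and 180x.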
Model Details
- Architecture: decoder-only GPT
- Parameters: 2,738,880
- Layers: 5
- Hidden size: 192
- Attention heads: 4
- KV heads: 2
- Attention: grouped-query causal self-attention with RoPE and XSA projection
- Context length: 512
- Vocabulary size: 4,096
- Token embeddings: tied input/output embeddings
- Arithmetic feature embeddings:
  - place_vocab_size: 66
  - role_vocab_size: 12
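The attention settings above imply a few derived shapes. The sketch below works them out under the usual grouped-query attention convention (head dimension equals hidden size divided by the number of query heads). That convention is an assumption here, since the custom model code is not shown, and the XSA projection is not modeled at all. The class name `AttentionShape` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Derived sizes for standard grouped-query attention (assumed convention)."""

    hidden_size: int = 192   # from the model card
    num_heads: int = 4       # query heads, from the model card
    num_kv_heads: int = 2    # key/value heads, from the model card

    @property
    def head_dim(self) -> int:
        # Each query head works on an equal slice of the hidden state.
        assert self.hidden_size % self.num_heads == 0
        return self.hidden_size // self.num_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Number of query heads that share one key/value head.
        assert self.num_heads % self.num_kv_heads == 0
        return self.num_heads // self.num_kv_heads

    @property
    def kv_width(self) -> int:
        # Output width of the key projection (and of the value projection).
        return self.num_kv_heads * self.head_dim

    def kv_cache_values_per_token(self, num_layers: int = 5) -> int:
        # Keys and values (factor 2) cached for every layer.
        return 2 * num_layers * self.kv_width


if __name__ == "__main__":
    shape = AttentionShape()
    print(shape.head_dim, shape.queries_per_kv_head, shape.kv_width)
```

With 4 query heads and 2 KV heads, each key/value head serves two query heads, so the key and value projections are half the width they would be with full multi-head attention.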
Tokenizer
Use this model with trust_remote_code=True. The submission includes an AtomTokenizer remote-code wrapper in tokenization_atom.py so standard Hugging Face callers can use AutoTokenizer.from_pretrained(...).
The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:
- digits 0-9 are atomic and never BPE-merged
- digit spans are emitted least-significant-digit first
- + - * / = ( ) are isolated atomic tokens
- whitespace is isolated from text
- arithmetic feature IDs are derived by the model from token IDs at inference time
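The segmentation rules above can be illustrated with a small pre-tokenization sketch. This is not the code in tokenization_atom.py: the function name `presegment`, the regular expression, and the choice to keep each whitespace run as a single piece are all assumptions made for illustration. Text pieces would still go through byte-level BPE afterwards, which is not shown.

```python
import re

# One alternative per rule: digit run, single operator, whitespace run, other text.
# Together the four classes cover every character, so no input is dropped.
_PIECE = re.compile(
    r"(?P<digits>[0-9]+)"
    r"|(?P<symbol>[+\-*/=()])"
    r"|(?P<space>\s+)"
    r"|(?P<text>[^0-9+\-*/=()\s]+)"
)


def presegment(text: str) -> list[str]:
    """Split text into pieces, emitting each digit span least-significant-digit first."""
    pieces = []
    for match in _PIECE.finditer(text):
        piece = match.group(0)
        if match.lastgroup == "digits":
            # Every digit is its own token, and the span is reversed.
            pieces.extend(reversed(piece))
        else:
            # Operators and whitespace stay isolated; text is left for BPE.
            pieces.append(piece)
    return pieces


if __name__ == "__main__":
    print(presegment("12 + 34 ="))
```

Under these assumptions, "12 + 34 =" is segmented as 2, 1, space, +, space, 4, 3, space, =.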
Training and custom tooling may still pass aligned place_ids and role_ids, but generic inference and evaluation only need input_ids and attention_mask.
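As a rough illustration of what a place ID could encode, the sketch below numbers each digit by its offset inside a least-significant-digit-first span. The actual mapping used by the model is not documented in this card, so the numbering scheme, the reserved value `NON_DIGIT_PLACE`, the clamping, and the function name `sketch_place_ids` are all hypothetical. The sketch also works on token strings, whereas the model derives its features from token IDs.

```python
PLACE_VOCAB_SIZE = 66  # from the model card
NON_DIGIT_PLACE = 0    # hypothetical: ID reserved for tokens outside digit spans


def sketch_place_ids(tokens):
    """Assign each token a place ID from its offset inside an LSD-first digit run."""
    place_ids = []
    offset = 0  # number of digits seen so far in the current run
    for token in tokens:
        if len(token) == 1 and token in "0123456789":
            offset += 1
            # Units digit gets 1, tens digit 2, and so on; clamp to the vocabulary.
            place_ids.append(min(offset, PLACE_VOCAB_SIZE - 1))
        else:
            # Any non-digit token ends the current digit run.
            offset = 0
            place_ids.append(NON_DIGIT_PLACE)
    return place_ids


if __name__ == "__main__":
    # "12 + 34 =" after least-significant-digit-first segmentation.
    tokens = ["2", "1", " ", "+", " ", "4", "3", " ", "="]
    print(sketch_place_ids(tokens))
```

Because digit spans are already least-significant-digit first, digits of equal significance in different operands receive the same place ID in this scheme, which is the property the position-coupling references below rely on.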
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "."  # directory containing the checkpoint and remote-code files

# trust_remote_code=True is required for the custom model and tokenizer classes.
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True,
)

text = "12 + 34 ="
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    outputs = model(**inputs)
Evaluation
ArithMark 2.0
Use the included benchmark script:
python benchmark_fusion_arithmark.py \
    --checkpoint . \
    --data-path arithmark_2.0.jsonl \
    --batch-size 64 \
    --device cuda \
    --output benchmark_results/fusion_arithmark_2.0_results.json
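The scoring logic inside benchmark_fusion_arithmark.py is not reproduced in this card. As a minimal sketch of how an accuracy figure of this kind is typically computed, the helper below assumes exact string match between a predicted continuation and a reference; both function names are hypothetical.

```python
def exact_match_accuracy(pairs):
    """Fraction of (prediction, reference) pairs whose stripped strings match exactly."""
    total = 0
    correct = 0
    for prediction, reference in pairs:
        total += 1
        if prediction.strip() == reference.strip():
            correct += 1
    if total == 0:
        raise ValueError("no examples to score")
    return correct / total


def implied_correct_count(accuracy, num_examples):
    """Number of correct answers implied by a reported accuracy."""
    return round(accuracy * num_examples)


if __name__ == "__main__":
    print(exact_match_accuracy([("46", "46"), ("7", "8")]))
    print(implied_correct_count(0.6924, 2500))
```

For example, `implied_correct_count(0.6924, 2500)` gives 1731, so the reported score corresponds to 1,731 of the 2,500 examples.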
lm-evaluation-harness
For lm-evaluation-harness tasks, use the standard hf model with remote code enabled:
lm_eval \
    --model hf \
    --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \
    --tasks hellaswag,arc_easy,arc_challenge,piqa \
    --device cuda:0 \
    --batch_size auto:1 \
    --output_path benchmark_results/lm_eval
max_length=548 is passed to the lm-evaluation-harness wrapper so long
multiple-choice continuations do not trip the harness assertion that a
continuation must fit inside the model window. The tokenizer also advertises
model_max_length=548, matching the longest sequence observed in this eval run.
The checkpoint was trained with a 512-token context, but the RoPE
implementation can score this slightly longer harness window. If a task variant
contains longer continuations, set max_length to the longest sequence found, or
reduce the batch size.
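One way to find a suitable max_length for a new task variant is to measure the longest context plus continuation before running the harness. The helper below is a sketch with a hypothetical name, `longest_request_length`. It encodes the two parts separately, which may differ by a token or two from how the harness encodes them jointly, so treat the result as an estimate.

```python
def longest_request_length(requests, encode):
    """Return the largest token count of context + continuation across requests.

    `requests` is an iterable of (context, continuation) string pairs and
    `encode` maps a string to a list of token IDs.
    """
    longest = 0
    for context, continuation in requests:
        length = len(encode(context)) + len(encode(continuation))
        longest = max(longest, length)
    return longest


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer in this demo.
    demo = [("a b c", "d e"), ("a", "b")]
    print(longest_request_length(demo, str.split))
```

With the tokenizer from the Usage section, `encode` could be `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`.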
Results
| Benchmark | Metric | Value |
|---|---|---|
| ArithMark 2.0 | acc | 0.6924 |
| arc_challenge | acc_norm | 0.2099 |
| arc_easy | acc_norm | 0.3161 |
| hellaswag | acc_norm | 0.2701 |
| piqa | acc_norm | 0.5299 |
Training Data
The pretraining mixture targeted about 3.5B tokens:
- Ultra-FineWeb: 900M
- FineWeb-Edu: 900M
- FineMath: 450M
- Cosmopedia-v2: 337.5M
- UltraData-Math-L2-preview: 337.5M
- Ultra-FineWeb-L3-en-QA-Synthetic: 225M
- Synthetic-Arithmetic: 350M
Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as pretraining_curriculum.json.
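The mixture listed above can be checked against the 3.5B-token target with a few lines of arithmetic. The dictionary below copies the per-source budgets from this card (in millions of tokens); the function name `mixture_shares` is hypothetical.

```python
# Token budgets per source, in millions of tokens, as listed in this card.
MIXTURE_M_TOKENS = {
    "Ultra-FineWeb": 900.0,
    "FineWeb-Edu": 900.0,
    "FineMath": 450.0,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225.0,
    "Synthetic-Arithmetic": 350.0,
}


def mixture_shares(mixture=MIXTURE_M_TOKENS):
    """Return the total budget and each source's fraction of it."""
    total = sum(mixture.values())
    shares = {name: tokens / total for name, tokens in mixture.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = mixture_shares()
    print(f"total: {total:.1f}M tokens")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The budgets sum to 3,500M tokens, and Synthetic-Arithmetic makes up 10% of the mixture.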
Limitations
- This is a very small model and should be treated as an experimental research artifact.
- Use trust_remote_code=True so AutoTokenizer applies the digit-span transform.
- Numeric text is represented least-significant-digit first internally.
- Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.
Files
- model.safetensors: model weights
- config.json, config.py, configuration_gpt.py, model.py: custom model code
- tokenizer.json, tokenization_atom.py: tokenizer files and remote-code wrapper
- benchmark_fusion_arithmark.py: ArithMark evaluation
- arithmark_2.0.jsonl: local ArithMark 2.0 data for the standalone benchmark script
- pretraining_curriculum.json: training curriculum
References / Design Influences
- Attention Is All You Need - additive positional information in Transformer inputs
- Exclusive Self Attention - related attention work on reducing self-position dominance in sequence modeling
- Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure - coupling digit positions by arithmetic significance
- Transformers Can Do Arithmetic with the Right Embeddings - digit-position embeddings for arithmetic
