gliner-mmbert-small-ptbr-pii-full-3x-v1
A GLiNER model trained from scratch on top of jhu-clsp/mmBERT-small, a 22-layer multilingual ModernBERT (~140M params, 256k vocabulary), using ~984k Brazilian Portuguese PII samples for 69 000 steps with a cosine schedule. It delivers the strongest PT-BR PII performance in the series, beating the ettin-68m easter-egg variant by +0.14 F1_partial and +0.16 F1_exact on the cross-source 4-PII average.
Best checkpoint: step 41 400 (selected by peak F1_partial averaged over the 4 PII sources).
Performance
Cross-source headline (4 PII sources)
| source | P_partial | R_partial | F1_partial | F1_exact |
|---|---|---|---|---|
| gliner2_pii_ptbr_reward_split (PT-BR, target) | 0.734 | 0.936 | 0.823 | 0.782 |
| nemotron_pii (EN) | 0.757 | 0.954 | 0.844 | 0.825 |
| open_pii_masking_500k (multilingual) | 0.742 | 0.933 | 0.827 | 0.771 |
| pii_masking_400k (multilingual) | 0.707 | 0.965 | 0.816 | 0.804 |
| 4-source average | 0.735 | 0.947 | 0.827 | 0.796 |
Negative evidence (spam/phishing, 7 sources): partial F1 = 0.000 — the model correctly abstains from flagging non-PII text.
Per-entity breakdown on gliner2_pii_ptbr_reward_split (PT-BR)
| label | P_partial | R_partial | F1_partial |
|---|---|---|---|
| credit card | 1.000 | 1.000 | 1.000 |
| cpf document number | 1.000 | 1.000 | 1.000 |
| pis document number | 1.000 | 1.000 | 1.000 |
| rg document number | 1.000 | 0.992 | 0.996 |
| dob | 0.986 | 1.000 | 0.993 |
| phone number | 0.986 | 0.995 | 0.990 |
| email address | 0.970 | 1.000 | 0.985 |
| location zip | 0.957 | 1.000 | 0.978 |
| last name | 0.957 | 0.990 | 0.974 |
| location street | 0.951 | 0.951 | 0.951 |
| location state abbreviation | 0.812 | 0.975 | 0.886 |
| first name | 0.830 | 0.939 | 0.881 |
| location building number | 0.750 | 0.996 | 0.856 |
| location state | 0.696 | 0.990 | 0.817 |
| personal description of religious convictions | 0.750 | 0.795 | 0.772 |
| location city | 0.709 | 0.836 | 0.767 |
| personal description of organizational affiliation | 0.726 | 0.789 | 0.756 |
| middle name | 0.591 | 0.965 | 0.733 |
| personal description of ethnicity | 0.525 | 0.828 | 0.642 |
| location neighborhood | 0.380 | 0.860 | 0.527 |
| personal description of medical conditions | 0.394 | 0.788 | 0.525 |
| personal description of political opinion | 0.363 | 0.716 | 0.482 |
| personal description of sexual information | 0.316 | 0.708 | 0.437 |
(Entries with zero gold in this source are omitted.)
Progression during training
| step | F1_partial (4-src avg) | F1_exact (4-src avg) |
|---|---|---|
| 17 250 | 0.632 | 0.583 |
| 27 600 | 0.763 | 0.726 |
| 34 500 | 0.793 | 0.756 |
| 41 400 (released) | 0.827 | 0.796 |
| 48 300 | 0.799 | 0.769 |
| 55 200 | 0.811 | 0.781 |
| 62 100 | 0.773 | 0.746 |
A mid-schedule dip is typical of the cosine decay; scores recover after step 48 300 but never exceed the 41 400 peak.
Labels (25 canonical)
cpf document number, rg document number, pis document number, credit card, phone number, email address, first name, middle name, last name, dob, location street, location building number, location neighborhood, location city, location state, location state abbreviation, location zip, location full address, personal description of ethnicity, personal description of medical conditions, personal description of organizational affiliation, personal description of political opinion, personal description of religious convictions, personal description of sexual information
Easter egg 🥚
Additional label berco-de-tiradentes was integrated from step 1 — not a post-hoc fine-tune. Trained on ~2 000 samples about Ritápolis/MG (birthplace of Joaquim José da Silva Xavier, o Tiradentes). In contrast to the ettin easter-egg fine-tunes (where the label competes weakly against location city), here the signal is built in from scratch. Try it with threshold ≥ 0.30 — no need for the 0.10 workaround used on ettin variants.
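A quick way to try it, using the same predict_entities API as in the Usage section below (the sentence is only an illustration; scores will vary):

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("arthrod/gliner-mmbert-small-ptbr-pii-full-3x-v1")

# Any mention of Ritápolis / Tiradentes' birthplace is the kind of text the label was trained on.
text = "Tiradentes nasceu perto de Ritápolis, em Minas Gerais."
labels = ["berco-de-tiradentes", "location city", "location state"]

# threshold 0.30 is enough here; no 0.10 workaround needed.
for p in model.predict_entities(text, labels, threshold=0.30):
    print(p["label"], p["text"], round(p["score"], 3))
```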
Training recipe
- Backbone: jhu-clsp/mmBERT-small (22 layers, hidden 384, vocab 256k, max_pos 8192)
- Span mode: token_level
- Steps: 69 000, batch: 128, schedule: cosine + 10% warmup
- Focal loss: α=0.75, γ=2.0, reduction=mean (see the sketch after this list)
- LR: 1.5e-5 (encoder) / 5e-5 (others), weight decay 0.01
- Precision: bf16 with HIPBLASLT_ALLOW_TF32=0 (MI300X single-GPU partition)
- Data: data/splits/train_with_ritapolis.jsonl — 986 491 rows, ~113.3M tokens (cl100k_base; mean 114.8 tokens/row, max 9 908) — PT-BR PII + 2 000 Ritápolis (berco-de-tiradentes)
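For reference, a minimal sketch of a binary focal loss with these settings (α=0.75, γ=2.0, mean reduction). This is the textbook formulation and may differ in detail from GLiNER's internal implementation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss, mean reduction (textbook form, not GLiNER's exact code)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                    # prob. assigned to the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()     # reduction=mean
```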
Usage
```python
from gliner import GLiNER

model = GLiNER.from_pretrained("arthrod/gliner-mmbert-small-ptbr-pii-full-3x-v1")

text = "Meu CPF é 459.871.232-00 e moro na Rua das Acácias, 542, São Paulo/SP, 01234-567."
labels = ["cpf document number", "location street", "location city", "location state abbreviation", "location zip"]

preds = model.predict_entities(text, labels, threshold=0.3, flat_ner=True)
for p in preds:
    print(f"{p['label']:<40} {p['text']:<30} {p['score']:.3f}")
```
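As in the GLiNER API generally, flat_ner=True keeps only non-overlapping spans; set it to False if you want nested predictions.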
Evaluation details
Holdout: 5 000 samples drawn proportionally from 11 sources.
- 4 PII sources reported above: gliner2_pii_ptbr_reward_split (PT-BR), nemotron_pii (EN), open_pii_masking_500k and pii_masking_400k (both multilingual).
- 7 spam/phishing sources (negative evidence): enron_spam_bvk, enron_spam_setfit, phishing_darkknight, phishing_zefang, sms_spam_multilingual, spam_messages_mshenoda, spamassassin.
- Per-source label superset protocol: canonical 25 PT-BR labels ∪ row gold (capped at 100, lowercased).
- Metrics via nervaluate: strict / exact / partial / ent_type F1 (see the sketch below).
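A minimal sketch of how such scores can be computed with nervaluate, assuming its list-of-dicts span format; the spans below are made up and the label list is truncated:

```python
from nervaluate import Evaluator

# Gold and predicted spans, one inner list per evaluated row;
# spans use nervaluate's {"label", "start", "end"} dict format (made-up values).
true = [[{"label": "cpf document number", "start": 10, "end": 24}]]
pred = [[{"label": "cpf document number", "start": 10, "end": 24}]]

tags = ["cpf document number", "phone number"]  # in practice: canonical 25 ∪ row gold

evaluator = Evaluator(true, pred, tags=tags)
results = evaluator.evaluate()[0]  # aggregate results; the return shape varies slightly across versions

for scheme in ("strict", "exact", "partial", "ent_type"):
    p, r = results[scheme]["precision"], results[scheme]["recall"]
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    print(scheme, round(p, 3), round(r, 3), round(f1, 3))
```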
Related
- arthrod/gliner-ettin-32m-ptbr-pii-easter-egg-v1 — 32M ettin, fast
- arthrod/gliner-ettin-68m-ptbr-pii-easter-egg-v1 — 68M ettin, mid-tier
- Demo: arthrod/gliner-ptbr-pii-demo — interactive playground with all three