gliner-mmbert-small-ptbr-pii-full-3x-v1

GLiNER trained from scratch on top of jhu-clsp/mmBERT-small — a 22-layer multilingual ModernBERT (~140M params) with a 256k vocabulary — on ~984k Brazilian Portuguese PII samples for 69 000 steps with a cosine schedule. Produces the strongest PT-BR PII performance in the series; beats the ettin-68m easter-egg variant by +0.14 F1_partial and +0.16 F1_exact on the cross-source 4-PII average.

Best checkpoint: step 41 400 (selected by peak F1_partial averaged over the 4 PII sources).

Performance

Cross-source headline (4 PII sources)

| source | P_partial | R_partial | F1_partial | F1_exact |
|---|---|---|---|---|
| gliner2_pii_ptbr_reward_split (PT-BR, target) | 0.734 | 0.936 | 0.823 | 0.782 |
| nemotron_pii (EN) | 0.757 | 0.954 | 0.844 | 0.825 |
| open_pii_masking_500k (multilingual) | 0.742 | 0.933 | 0.827 | 0.771 |
| pii_masking_400k (multilingual) | 0.707 | 0.965 | 0.816 | 0.804 |
| 4-source average | 0.735 | 0.947 | 0.827 | 0.796 |
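
Assuming the 4-source average is an unweighted mean of the per-source scores, the rounded table values check out:

# Quick arithmetic check of the 4-source averages (values copied from the table above)
f1_partial = [0.823, 0.844, 0.827, 0.816]
f1_exact = [0.782, 0.825, 0.771, 0.804]
print(sum(f1_partial) / 4)  # ~0.8275, reported as 0.827
print(sum(f1_exact) / 4)    # ~0.7955, reported as 0.796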

Negative evidence (spam/phishing, 7 sources): partial F1 = 0.000 — model correctly abstains from flagging non-PII text.
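A quick way to sanity-check the abstention behaviour on non-PII text; the message below is an invented example, not taken from the spam holdout:

from gliner import GLiNER

model = GLiNER.from_pretrained("arthrod/gliner-mmbert-small-ptbr-pii-full-3x-v1")

# Hypothetical spam-like message containing no PII
spam_text = "PARABÉNS! Você ganhou um prêmio exclusivo, clique no link para resgatar agora."
labels = ["cpf document number", "phone number", "email address", "first name", "last name"]

print(model.predict_entities(spam_text, labels, threshold=0.3, flat_ner=True))
# ideally [] (the card reports that the model abstains on non-PII text)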

Per-entity breakdown on gliner2_pii_ptbr_reward_split (PT-BR)

| label | P_partial | R_partial | F1_partial |
|---|---|---|---|
| credit card | 1.000 | 1.000 | 1.000 |
| cpf document number | 1.000 | 1.000 | 1.000 |
| pis document number | 1.000 | 1.000 | 1.000 |
| rg document number | 1.000 | 0.992 | 0.996 |
| dob | 0.986 | 1.000 | 0.993 |
| phone number | 0.986 | 0.995 | 0.990 |
| email address | 0.970 | 1.000 | 0.985 |
| location zip | 0.957 | 1.000 | 0.978 |
| last name | 0.957 | 0.990 | 0.974 |
| location street | 0.951 | 0.951 | 0.951 |
| location state abbreviation | 0.812 | 0.975 | 0.886 |
| first name | 0.830 | 0.939 | 0.881 |
| location building number | 0.750 | 0.996 | 0.856 |
| location state | 0.696 | 0.990 | 0.817 |
| personal description of religious convictions | 0.750 | 0.795 | 0.772 |
| location city | 0.709 | 0.836 | 0.767 |
| personal description of organizational affiliation | 0.726 | 0.789 | 0.756 |
| middle name | 0.591 | 0.965 | 0.733 |
| personal description of ethnicity | 0.525 | 0.828 | 0.642 |
| location neighborhood | 0.380 | 0.860 | 0.527 |
| personal description of medical conditions | 0.394 | 0.788 | 0.525 |
| personal description of political opinion | 0.363 | 0.716 | 0.482 |
| personal description of sexual information | 0.316 | 0.708 | 0.437 |

(Entries with zero gold in this source are omitted.)

Progression during training

| step | F1_partial (4-src avg) | F1_exact (4-src avg) |
|---|---|---|
| 17 250 | 0.632 | 0.583 |
| 27 600 | 0.763 | 0.726 |
| 34 500 | 0.793 | 0.756 |
| 41 400 (released) | 0.827 | 0.796 |
| 48 300 | 0.799 | 0.769 |
| 55 200 | 0.811 | 0.781 |
| 62 100 | 0.773 | 0.746 |

A mid-cosine dip is typical; the recovery after step 48 300 never exceeded the 41 400 peak.
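
Checkpoint selection is simply the argmax of the 4-source average F1_partial over the evaluated steps; a sketch using the values from the table above:

# Pick the released checkpoint: highest 4-source average F1_partial
f1_by_step = {
    17250: 0.632, 27600: 0.763, 34500: 0.793, 41400: 0.827,
    48300: 0.799, 55200: 0.811, 62100: 0.773,
}
best_step = max(f1_by_step, key=f1_by_step.get)
print(best_step, f1_by_step[best_step])  # 41400 0.827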

Labels (25 canonical)

cpf document number, rg document number, pis document number, credit card, phone number, email address, first name, middle name, last name, dob, location street, location building number, location neighborhood, location city, location state, location state abbreviation, location zip, location full address, personal description of ethnicity, personal description of medical conditions, personal description of organizational affiliation, personal description of political opinion, personal description of religious convictions, personal description of sexual information
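
For convenience, the canonical labels as a Python list, ready to pass to predict_entities (copied verbatim from the list above):

# Canonical PII labels, as listed above
CANONICAL_LABELS = [
    "cpf document number", "rg document number", "pis document number",
    "credit card", "phone number", "email address",
    "first name", "middle name", "last name", "dob",
    "location street", "location building number", "location neighborhood",
    "location city", "location state", "location state abbreviation",
    "location zip", "location full address",
    "personal description of ethnicity",
    "personal description of medical conditions",
    "personal description of organizational affiliation",
    "personal description of political opinion",
    "personal description of religious convictions",
    "personal description of sexual information",
]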

Easter egg 🥚

The additional label berco-de-tiradentes was integrated from step 1, not as a post-hoc fine-tune, and was trained on ~2 000 samples about Ritápolis/MG (birthplace of Joaquim José da Silva Xavier, o Tiradentes). In contrast to the ettin easter-egg fine-tunes, where the label competes weakly against location city, here the signal is built in from scratch. Try it with a threshold of 0.30 or higher; the 0.10 workaround used on the ettin variants is not needed.
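
A minimal way to probe the easter-egg label; the sentence below is an invented example:

from gliner import GLiNER

model = GLiNER.from_pretrained("arthrod/gliner-mmbert-small-ptbr-pii-full-3x-v1")

# Invented sentence mentioning Ritápolis/MG; the easter-egg label works at the normal threshold
text = "Tiradentes nasceu na antiga Fazenda do Pombal, hoje município de Ritápolis, em Minas Gerais."
preds = model.predict_entities(text, ["berco-de-tiradentes"], threshold=0.30, flat_ner=True)
for p in preds:
    print(p["label"], p["text"], round(p["score"], 3))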

Training recipe

  • Backbone: jhu-clsp/mmBERT-small (22 layers, hidden 384, vocab 256k, max_pos 8192)
  • Span mode: token_level
  • Steps: 69 000, batch: 128, schedule: cosine + 10% warmup
  • Focal loss: α=0.75, γ=2.0, reduction=mean
  • LR: 1.5e-5 (encoder) / 5e-5 (others), weight decay 0.01
  • Precision: bf16 with HIPBLASLT_ALLOW_TF32=0 (MI300X single-GPU partition)
  • Data: data/splits/train_with_ritapolis.jsonl — 986 491 rows, ~113.3M tokens (cl100k_base; mean 114.8 tokens/row, max 9 908) — PT-BR PII + 2 000 Ritápolis (berco-de-tiradentes)
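
The hyperparameters above, collected into a single config sketch; the key names are illustrative assumptions, not the exact schema of the GLiNER training script:

# Illustrative config mirroring the recipe above (key names are assumptions)
train_config = {
    "backbone": "jhu-clsp/mmBERT-small",
    "span_mode": "token_level",
    "max_steps": 69_000,
    "batch_size": 128,
    "scheduler": "cosine",
    "warmup_ratio": 0.10,
    "loss": {"type": "focal", "alpha": 0.75, "gamma": 2.0, "reduction": "mean"},
    "lr_encoder": 1.5e-5,
    "lr_others": 5e-5,
    "weight_decay": 0.01,
    "precision": "bf16",
    "train_file": "data/splits/train_with_ritapolis.jsonl",
}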

Usage

from gliner import GLiNER

# Load the released checkpoint (step 41 400) from the Hugging Face Hub
model = GLiNER.from_pretrained("arthrod/gliner-mmbert-small-ptbr-pii-full-3x-v1")

text = "Meu CPF é 459.871.232-00 e moro na Rua das Acácias, 542, São Paulo/SP, 01234-567."
labels = ["cpf document number", "location street", "location city", "location state abbreviation", "location zip"]

# flat_ner=True returns non-overlapping spans; 0.3 is the threshold used throughout this card
preds = model.predict_entities(text, labels, threshold=0.3, flat_ner=True)
for p in preds:
    print(f"{p['label']:<40} {p['text']:<30} {p['score']:.3f}")

Evaluation details

Holdout: a 5 000-sample holdout drawn proportionally from 11 sources.

  • 4 PII sources reported above: gliner2_pii_ptbr_reward_split (PT-BR), nemotron_pii (EN), open_pii_masking_500k, pii_masking_400k (multilingual).
  • 7 spam/phishing sources (negative evidence): enron_spam_bvk, enron_spam_setfit, phishing_darkknight, phishing_zefang, sms_spam_multilingual, spam_messages_mshenoda, spamassassin.
  • Per-source label superset protocol: the candidate label set for each row is the union of the canonical 25 PT-BR labels and the row's gold labels (capped at 100, lowercased); see the sketch after this list.
  • Metrics via nervaluate: strict / exact / partial / ent_type F1.
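
A rough sketch of the per-row label superset construction described above; the helper name and the row format are assumptions, not the evaluation script's actual code:

# Hypothetical row format: {"text": ..., "gold": [{"label": ..., "start": ..., "end": ...}, ...]}
def build_label_superset(row, canonical_labels, cap=100):
    """Union of the canonical label set and the row's gold labels, lowercased and capped."""
    labels = [lbl.lower() for lbl in canonical_labels]
    for span in row.get("gold", []):
        lbl = span["label"].lower()
        if lbl not in labels:
            labels.append(lbl)
    return labels[:cap]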
