# gliner-ettin-32m-ptbr-pii-full-3x-v1

GLiNER fine-tune of `jhu-clsp/ettin-encoder-32m` (~32M params) for Brazilian-Portuguese PII detection, with cross-source generalization tested on English/multilingual PII and spam corpora.

This is part of a 6-config sweep ({32M, 68M} encoders × {top-50k curated, top-100k curated, full ~984k} data), trained on the same MI300X GPU under an identical recipe except for batch size and step budget. See Comparison across the sweep below.
## Quick start

```python
from gliner import GLiNER

model = GLiNER.from_pretrained(
    "arthrod/gliner-ettin-32m-ptbr-pii-full-3x-v1",
    subfolder="checkpoint-65550",
)

text = (
    "Sou Maria Silva, moro na Rua das Flores, 123, em São Paulo. "
    "Meu CPF é 123.456.789-09 e meu telefone é (11) 91234-5678."
)

labels = [
    "first name", "last name", "middle name",
    "cpf document number", "rg document number", "pis document number",
    "phone number", "email address", "credit card",
    "location street", "location building number", "location neighborhood",
    "location city", "location state", "location state abbreviation",
    "location zip", "location full address",
    "dob",
]

for ent in model.predict_entities(text, labels, threshold=0.3):
    print(ent["text"], "→", ent["label"], f"({ent['score']:.2f})")
```
Best checkpoint by PT-BR partial F1: `checkpoint-65550`. All other checkpoints (one every `eval_every` steps) are also kept in this repo for ablation.
## Training recipe

- base encoder: `jhu-clsp/ettin-encoder-32m`, hidden_size: 384
- max_len: 1024, max_width: 100, span_mode: token_level
- training data: `data/splits/train.jsonl` (~984k samples, 3x step budget)
- val data: `data/splits/val_5k.jsonl`
- num_steps: 69000, train_batch_size: 256, eval_batch_size: 192
- lr_encoder: 2e-5, lr_others: 5e-5
- weight_decay: encoder=0.01, other=0.01
- scheduler: cosine, warmup_ratio: 0.1, max_grad_norm: 1.0
- focal loss: alpha=0.75, gamma=2.0, reduction=mean
- precision: bf16 (`HIPBLASLT_ALLOW_TF32=0` on MI300X)
- dropout: 0.3, fine_tune: True, subtoken_pooling: first
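The focal-loss weighting in the recipe can be sketched for the binary case. This is a toy reference implementation of the standard focal-loss formula with the recipe's alpha=0.75, gamma=2.0, and mean reduction, not GLiNER's internal code (which applies it over span-classification logits):

```python
import math

def binary_focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Focal loss for one predicted probability p = P(positive), target y in {0, 1}.
    alpha weights the positive class; gamma down-weights easy, confident examples."""
    pt = p if y == 1 else 1.0 - p          # probability assigned to the true class
    a = alpha if y == 1 else 1.0 - alpha   # class-balance weight
    return -a * (1.0 - pt) ** gamma * math.log(pt)

def batch_focal_loss(probs, targets, alpha=0.75, gamma=2.0):
    """reduction=mean over a batch, as in the recipe above."""
    losses = [binary_focal_loss(p, y, alpha, gamma) for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)
```

With gamma=2, a confidently correct prediction (pt=0.9) is scaled by (1-0.9)² = 0.01, i.e. it contributes only 1% of its plain weighted cross-entropy, which keeps the abundant easy negatives from swamping the rare PII spans.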
The sweep uses the same loss recipe (focal α=0.75, γ=2.0, mean reduction), bf16 with `HIPBLASLT_ALLOW_TF32=0` (avoids MI300X NaN intermittency), and a cosine schedule with 10% warmup. The local gliner source is patched against an upstream `token_level` regression that landed in 0.2.26; this checkpoint is trained against gliner==0.2.25 plus that fix.
## Evaluation protocol

Every `eval_every` steps, a 5,000-sample, proportional, source-mixed holdout (`data/holdout_5k_original`) is re-scored. The holdout was sliced from the upstream PII / spam datasets before the train split was assembled, so there is no leakage. For each row we build a per-source label superset (the canonical 24 PT-BR labels ∪ that row's gold labels, capped at 100, lowercased) and call `model.inference(..., flat_ner=True, multi_label=False, threshold=0.3)`. Per-source F1 is computed by nervaluate (strict / exact / partial / ent_type).
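The label-superset step can be sketched as follows. `CANONICAL_PTBR` here is an illustrative subset of the 24 canonical PT-BR labels, and `build_label_superset` is a hypothetical helper name, not the sweep's actual code:

```python
# Illustrative subset of the canonical PT-BR label inventory (the real list has 24).
CANONICAL_PTBR = ["first name", "last name", "cpf document number", "phone number"]

def build_label_superset(canonical, row_gold_labels, cap=100):
    """Union of canonical labels and this row's gold labels, lowercased,
    de-duplicated in first-seen order, capped at `cap` entries."""
    seen, out = set(), []
    for label in list(canonical) + list(row_gold_labels):
        label = label.lower()
        if label not in seen:
            seen.add(label)
            out.append(label)
    return out[:cap]

labels = build_label_superset(CANONICAL_PTBR, ["Email Address", "CPF Document Number"])
# gold labels that duplicate canonical entries collapse; new ones are appended
```

Scoring each source against its own superset, rather than one global label list, keeps the task comparable across sources whose gold schemas differ.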
Eleven evaluation sources cover PT-BR PII, EN PII, and negative-evidence spam/phishing:

- `gliner2_pii_ptbr_reward_split` — PT-BR PII (the target)
- `nemotron_pii` — nvidia/Nemotron-PII, EN
- `open_pii_masking_500k`, `pii_masking_400k` — multilingual PII
- `enron_spam_*`, `phishing_*`, `sms_spam_multilingual`, `spam_messages_mshenoda`, `spamassassin` — negative evidence (no PII expected → tests false-positive rate)
## Best-checkpoint scores (checkpoint-65550)
Per-source aggregate F1 on the 11-source proportional 5k holdout:
| eval source | strict F1 | partial F1 |
|---|---|---|
| PT-BR (gliner2 reward-split) | 0.6326 | 0.7167 |
| EN (nvidia/Nemotron-PII) | 0.4945 | 0.5721 |
| open-pii-masking-500k | 0.4081 | 0.5584 |
| pii-masking-400k | 0.4193 | 0.5654 |
| enron-spam (bvk) | 0.0000 | 0.0000 |
| enron-spam (setfit) | 0.0000 | 0.0000 |
| phishing (darkknight) | 0.0000 | 0.0000 |
| phishing (zefang) | 0.0000 | 0.0000 |
| SMS spam (multilingual) | 0.0000 | 0.0000 |
| spam-messages (mshenoda) | 0.0000 | 0.0000 |
| spamassassin | 0.0000 | 0.0000 |
### Per-entity breakdown (4 PII sources)

Averaged only across sources that contain the label (sources where the label is absent in gold are excluded — including them would falsely deflate the aggregate; e.g. `personal description of ethnicity` averaged with two sources that lack it would drop from ~0.58 to ~0.29 partial F1). `n_src` is the number of contributing sources. Sorted by partial F1.
| entity | n_src | strict P | strict R | strict F1 | partial P | partial R | partial F1 |
|---|---|---|---|---|---|---|---|
| rg document number | 1 | 0.929 | 0.944 | 0.936 | 0.956 | 0.972 | 0.964 |
| pis document number | 1 | 0.950 | 0.897 | 0.922 | 0.971 | 0.917 | 0.943 |
| cpf document number | 1 | 0.906 | 0.895 | 0.901 | 0.947 | 0.936 | 0.942 |
| email address | 4 | 0.849 | 0.848 | 0.848 | 0.914 | 0.913 | 0.913 |
| coordinate | 1 | 0.738 | 0.818 | 0.776 | 0.820 | 0.909 | 0.862 |
| phone number | 4 | 0.778 | 0.881 | 0.825 | 0.809 | 0.916 | 0.858 |
| bank routing number | 1 | 0.768 | 0.946 | 0.848 | 0.768 | 0.946 | 0.848 |
| medical record number | 1 | 0.769 | 0.933 | 0.843 | 0.769 | 0.933 | 0.843 |
| dob | 3 | 0.720 | 0.869 | 0.779 | 0.752 | 0.917 | 0.817 |
| date | 2 | 0.693 | 0.761 | 0.724 | 0.777 | 0.850 | 0.811 |
| customer id | 1 | 0.712 | 0.910 | 0.799 | 0.712 | 0.910 | 0.799 |
| url | 1 | 0.598 | 0.583 | 0.591 | 0.793 | 0.773 | 0.783 |
| county | 1 | 0.667 | 0.809 | 0.731 | 0.713 | 0.865 | 0.782 |
| swift bic | 1 | 0.606 | 1.000 | 0.755 | 0.606 | 1.000 | 0.755 |
| device identifier | 1 | 0.645 | 0.909 | 0.755 | 0.645 | 0.909 | 0.755 |
| cvv | 1 | 0.641 | 0.893 | 0.746 | 0.641 | 0.893 | 0.746 |
| biometric identifier | 1 | 0.622 | 0.925 | 0.744 | 0.622 | 0.925 | 0.744 |
| employee id | 1 | 0.621 | 0.923 | 0.742 | 0.621 | 0.923 | 0.742 |
| health plan beneficiary number | 1 | 0.596 | 0.914 | 0.721 | 0.607 | 0.931 | 0.735 |
| location state abbreviation | 1 | 0.580 | 1.000 | 0.734 | 0.580 | 1.000 | 0.734 |
| fax number | 1 | 0.737 | 0.718 | 0.727 | 0.737 | 0.718 | 0.727 |
| ipv4 | 1 | 0.500 | 0.564 | 0.530 | 0.682 | 0.769 | 0.723 |
| credit card | 4 | 0.665 | 0.785 | 0.686 | 0.697 | 0.821 | 0.719 |
| mac address | 1 | 0.438 | 0.467 | 0.452 | 0.688 | 0.733 | 0.710 |
| location full address | 1 | 0.491 | 0.518 | 0.504 | 0.690 | 0.727 | 0.708 |
| middle name | 1 | 0.534 | 0.965 | 0.687 | 0.534 | 0.965 | 0.687 |
| location zip | 4 | 0.544 | 0.911 | 0.661 | 0.554 | 0.936 | 0.676 |
| account number | 2 | 0.627 | 0.730 | 0.674 | 0.627 | 0.730 | 0.674 |
| certificate license number | 1 | 0.524 | 0.892 | 0.660 | 0.524 | 0.892 | 0.660 |
| vehicle identifier | 1 | 0.492 | 1.000 | 0.659 | 0.492 | 1.000 | 0.659 |
| pin | 1 | 0.545 | 0.828 | 0.658 | 0.545 | 0.828 | 0.658 |
| location building number | 3 | 0.573 | 0.772 | 0.647 | 0.582 | 0.782 | 0.656 |
| date time | 1 | 0.544 | 0.623 | 0.581 | 0.608 | 0.696 | 0.649 |
| country | 1 | 0.474 | 0.991 | 0.642 | 0.474 | 0.991 | 0.642 |
| time | 2 | 0.505 | 0.737 | 0.584 | 0.546 | 0.822 | 0.638 |
| personal description of organizational affiliation | 1 | 0.381 | 0.447 | 0.411 | 0.582 | 0.684 | 0.629 |
| location street | 3 | 0.463 | 0.597 | 0.518 | 0.553 | 0.726 | 0.623 |
| company name | 1 | 0.463 | 0.562 | 0.507 | 0.557 | 0.677 | 0.611 |
| license plate | 1 | 0.435 | 0.714 | 0.541 | 0.489 | 0.804 | 0.608 |
| first name | 4 | 0.433 | 0.819 | 0.557 | 0.456 | 0.855 | 0.584 |
| location state | 2 | 0.422 | 0.824 | 0.556 | 0.439 | 0.858 | 0.579 |
| api key | 1 | 0.392 | 0.816 | 0.530 | 0.424 | 0.882 | 0.573 |
| last name | 4 | 0.441 | 0.757 | 0.536 | 0.464 | 0.809 | 0.565 |
| personal description of ethnicity | 2 | 0.360 | 0.673 | 0.468 | 0.419 | 0.778 | 0.544 |
| religious belief | 1 | 0.353 | 0.857 | 0.500 | 0.382 | 0.929 | 0.542 |
| location neighborhood | 1 | 0.370 | 0.702 | 0.485 | 0.398 | 0.754 | 0.521 |
| social security number | 3 | 0.482 | 0.567 | 0.505 | 0.491 | 0.576 | 0.514 |
| personal description of sexual information | 1 | 0.226 | 0.357 | 0.277 | 0.397 | 0.627 | 0.486 |
| language | 1 | 0.314 | 1.000 | 0.477 | 0.314 | 1.000 | 0.477 |
| sex or gender | 2 | 0.296 | 0.875 | 0.443 | 0.301 | 0.887 | 0.449 |
| ipv6 | 1 | 0.111 | 0.182 | 0.138 | 0.361 | 0.591 | 0.448 |
| age | 2 | 0.295 | 0.953 | 0.445 | 0.295 | 0.953 | 0.445 |
| passport number | 1 | 0.353 | 0.600 | 0.444 | 0.353 | 0.600 | 0.444 |
| location city | 4 | 0.288 | 0.710 | 0.407 | 0.313 | 0.769 | 0.442 |
| personal description of religious convictions | 1 | 0.212 | 0.444 | 0.287 | 0.314 | 0.658 | 0.425 |
| tax id number | 3 | 0.325 | 0.535 | 0.379 | 0.378 | 0.581 | 0.422 |
| id card number | 2 | 0.324 | 0.642 | 0.420 | 0.324 | 0.642 | 0.420 |
| user name | 2 | 0.239 | 0.516 | 0.321 | 0.293 | 0.650 | 0.396 |
| personal description of medical conditions | 1 | 0.158 | 0.333 | 0.215 | 0.291 | 0.614 | 0.395 |
| http cookie | 1 | 0.060 | 0.103 | 0.076 | 0.310 | 0.534 | 0.392 |
| sexuality | 1 | 0.253 | 0.741 | 0.377 | 0.259 | 0.759 | 0.387 |
| password | 2 | 0.119 | 0.249 | 0.161 | 0.274 | 0.573 | 0.371 |
| employment status | 1 | 0.223 | 0.671 | 0.334 | 0.227 | 0.685 | 0.341 |
| personal description of political opinion | 1 | 0.107 | 0.342 | 0.163 | 0.197 | 0.631 | 0.300 |
| driver license number | 2 | 0.396 | 0.266 | 0.283 | 0.410 | 0.279 | 0.297 |
| education level | 1 | 0.102 | 0.358 | 0.159 | 0.177 | 0.623 | 0.276 |
| blood type | 1 | 0.035 | 0.125 | 0.055 | 0.150 | 0.531 | 0.234 |
| political view | 1 | 0.110 | 0.769 | 0.192 | 0.126 | 0.885 | 0.221 |
| unique id | 1 | 0.122 | 0.857 | 0.214 | 0.122 | 0.857 | 0.214 |
| title | 1 | 0.089 | 0.636 | 0.156 | 0.097 | 0.700 | 0.171 |
| occupation | 1 | 0.018 | 0.158 | 0.032 | 0.051 | 0.449 | 0.092 |
Reading guide. Structured tokens (email, phone, doc numbers, IPs, MAC) approach strict ≈ partial because their boundaries are unambiguous. Long natural-language spans (full address, the `personal description of …` labels) carry a meaningful strict-vs-partial gap because exact boundaries are inherently fuzzy; even two human annotators would disagree on them. For those labels, partial F1 is the operationally meaningful metric.
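The strict-vs-partial distinction can be made concrete with a toy boundary matcher. This is a simplified illustration of the two matching regimes, not nervaluate's actual algorithm:

```python
def spans_match(gold, pred, mode="strict"):
    """gold/pred are (start, end) character spans, end-exclusive.
    strict: boundaries must agree exactly; partial: any overlap counts."""
    if mode == "strict":
        return gold == pred
    return max(gold[0], pred[0]) < min(gold[1], pred[1])

# A "location full address" gold span vs a prediction that clips the zip code:
gold, pred = (10, 48), (10, 39)
print(spans_match(gold, pred, "strict"))   # exact boundaries disagree
print(spans_match(gold, pred, "partial"))  # overlapping spans still match
```

Under strict scoring the clipped address is a miss plus a false positive; under partial scoring it counts, which is why fuzzy-boundary labels score much higher on the partial columns above.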
## Sweep over checkpoints

20 eval points (one every `eval_every` = 3,450 steps). Higher is better.
| step | PT-BR strict | PT-BR partial | Nemotron partial | open-pii partial | pii-masking partial |
|---|---|---|---|---|---|
| 3450 | 0.0418 | 0.2343 | 0.2430 | 0.2433 | 0.2756 |
| 6900 | 0.0382 | 0.2149 | 0.1812 | 0.2302 | 0.2328 |
| 10350 | 0.0734 | 0.2940 | 0.2590 | 0.2771 | 0.2248 |
| 13800 | 0.1170 | 0.3049 | 0.3213 | 0.3507 | 0.2452 |
| 17250 | 0.1967 | 0.3828 | 0.3123 | 0.3514 | 0.2705 |
| 20700 | 0.2919 | 0.4961 | 0.3535 | 0.3929 | 0.3474 |
| 24150 | 0.3328 | 0.5115 | 0.3537 | 0.4192 | 0.3535 |
| 27600 | 0.3604 | 0.5117 | 0.3845 | 0.4342 | 0.3690 |
| 31050 | 0.4700 | 0.6083 | 0.4437 | 0.4714 | 0.4472 |
| 34500 | 0.5147 | 0.6458 | 0.4121 | 0.4639 | 0.3983 |
| 37950 | 0.4858 | 0.5946 | 0.4575 | 0.5052 | 0.4288 |
| 41400 | 0.5340 | 0.6477 | 0.4892 | 0.5191 | 0.4856 |
| 44850 | 0.5758 | 0.6812 | 0.5484 | 0.5658 | 0.5332 |
| 48300 | 0.5670 | 0.6703 | 0.4890 | 0.5218 | 0.4836 |
| 51750 | 0.5812 | 0.6716 | 0.5480 | 0.4660 | 0.5083 |
| 55200 | 0.6080 | 0.6959 | 0.5392 | 0.5234 | 0.5287 |
| 58650 | 0.6276 | 0.7160 | 0.5665 | 0.5622 | 0.5599 |
| 62100 | 0.6248 | 0.7076 | 0.5764 | 0.5494 | 0.5655 |
| 65550 | 0.6326 | 0.7167 | 0.5721 | 0.5584 | 0.5654 |
| 69000 | 0.6291 | 0.7135 | 0.5694 | 0.5566 | 0.5636 |
## Comparison across the sweep (best by PT-BR partial F1 of each run)

| run | PT-BR strict | PT-BR partial | Nemotron partial | cross-avg | best step |
|---|---|---|---|---|---|
| ettin-32m top-50k | 0.1411 | 0.3710 | 0.1710 | 0.2808 | 8500 |
| ettin-32m top-100k | 0.0946 | 0.3165 | 0.1455 | 0.2428 | 5200 |
| ettin-68m top-50k | 0.2732 | 0.4802 | 0.2397 | 0.3117 | 9945 |
| ettin-68m top-100k | 0.4109 | 0.6191 | 0.3046 | 0.4497 | 8580 |
| ettin-68m full | 0.3663 | 0.5416 | 0.3563 | 0.4239 | 19550 |
| ettin-32m full | 0.1429 | 0.3536 | 0.2640 | 0.2867 | 19550 |
| ettin-32m full-3x (this repo) | 0.6326 | 0.7167 | 0.5721 | 0.6031 | 65550 |
| ettin-68m full-3x | 0.7076 | 0.7979 | 0.6799 | 0.6817 | 41400 |
| gliner-ettin-32m-ptbr-pii-easter-egg | 0.6445 | 0.7344 | 0.6019 | 0.6243 | 100 |
| **gliner-ettin-68m-ptbr-pii-easter-egg** | **0.7109** | **0.8073** | **0.7182** | **0.7167** | 100 |

Bold row: best on both PT-BR and cross-source.
Take-away. The 3x re-runs (69,000 steps) blew past the original 23,000-step sweep: the original 32M variants were straightforwardly under-trained, and the longer cosine schedule unlocks 1.5–2× higher F1 on every PII source. Curated subsets (top-50k / top-100k) overfit to PT-BR and trail badly on cross-source generalization; the full-data 3x runs dominate on every metric we care about. For production use, prefer `gliner-ettin-{32m,68m}-ptbr-pii-full-3x-v1`: they sit at the Pareto frontier of in-domain F1 and cross-source generalization.
## Holdout benchmark (matched protocol on nvidia/Nemotron-PII)

For the ettin-68m full checkpoint at step 17250, evaluated independently with `eval_nemotron_prior.py` (5,000 samples, threshold=0.5, gold-label set, single-label flat NER):
| model | strict F1 | partial F1 |
|---|---|---|
| GLiNER multitask-large (prior) | 0.5834 | 0.6490 |
| mmbert-teacher (prior) | 0.4574 | 0.5418 |
| Albertina ckpt-12k (prior) | 0.1801 | 0.3399 |
| ettin-68m (prior) | 0.1493 | 0.3210 |
| ettin-68m full v1 step-17250 (this sweep) | 0.2747 | 0.4306 |
That is +0.125 strict F1 / +0.110 partial F1 over the prior ettin-68m baseline under the same protocol, using this sweep's recipe (per-source label superset + bf16 mean-reduction + token_level fix).
## Limitations

- This is a private research checkpoint. Do not use it as a sole source of truth for redaction without human review.
- PT-BR target performance was the optimization objective; spam/phishing partial F1 is a negative-evidence signal and is intentionally not a target.
- The original sweep was constrained to ~23k steps for 68M and ~12k–16k for the curated runs by a single-MI300X budget; this repo is the follow-up triple-step (3x) re-run intended for production use.
## Files

Each `checkpoint-N/` directory contains a self-contained GLiNER snapshot (`gliner_config.json`, `pytorch_model.bin`, tokenizer, optimizer state, scheduler state, RNG state). Load any of them with `GLiNER.from_pretrained(repo, subfolder="checkpoint-N")`.
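For ablations over the retained checkpoints, a small helper can enumerate the `checkpoint-N/` subfolders from a flat file listing. `checkpoint_subfolders` is a hypothetical helper; `paths` would come from e.g. `huggingface_hub.list_repo_files(repo)`:

```python
import re

def checkpoint_subfolders(paths):
    """Distinct checkpoint-N subfolder names found in a flat path listing,
    sorted by step number (ascending)."""
    steps = {int(m.group(1)) for p in paths
             if (m := re.match(r"checkpoint-(\d+)/", p))}
    return [f"checkpoint-{s}" for s in sorted(steps)]

paths = [
    "checkpoint-3450/pytorch_model.bin",
    "checkpoint-3450/tokenizer.json",
    "checkpoint-65550/gliner_config.json",
    "README.md",
]
print(checkpoint_subfolders(paths))
```

Each returned name can be passed directly as the `subfolder` argument to `GLiNER.from_pretrained`.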