gliner-ettin-32m-ptbr-pii-full-3x-v1

GLiNER fine-tune of jhu-clsp/ettin-encoder-32m (~32M params) for Brazilian-Portuguese PII detection, with cross-source generalization tested on English/multilingual PII and spam corpora.

This is part of a 6-config sweep: two encoder sizes (32M, 68M) × three training sets (top-50k curated, top-100k curated, full ~984k), all trained on the same MI300X GPU under an identical recipe except for batch size and step budget. See Comparison across the sweep below.

Quick start

from gliner import GLiNER

model = GLiNER.from_pretrained("arthrod/gliner-ettin-32m-ptbr-pii-full-3x-v1", subfolder="checkpoint-65550")

text = (
    "Sou Maria Silva, moro na Rua das Flores, 123, em São Paulo. "
    "Meu CPF é 123.456.789-09 e meu telefone é (11) 91234-5678."
)

labels = [
    "first name", "last name", "middle name",
    "cpf document number", "rg document number", "pis document number",
    "phone number", "email address", "credit card",
    "location street", "location building number", "location neighborhood",
    "location city", "location state", "location state abbreviation",
    "location zip", "location full address",
    "dob",
]

for ent in model.predict_entities(text, labels, threshold=0.3):
    print(ent["text"], "→", ent["label"], f"({ent['score']:.2f})")

Best checkpoint by PT-BR partial F1 = checkpoint-65550. All other checkpoints (every eval_every step) are also kept in this repo for ablation.

Training recipe

  • base encoder: jhu-clsp/ettin-encoder-32m

  • hidden_size: 384, max_len: 1024, max_width: 100, span_mode: token_level

  • training data: data/splits/train.jsonl

  • val data: data/splits/val_5k.jsonl

  • num_steps: 69000, train_batch_size: 256, eval_batch_size: 192

  • lr_encoder: 2e-5, lr_others: 5e-5

  • weight_decay: encoder=0.01, other=0.01

  • scheduler: cosine, warmup_ratio: 0.1, max_grad_norm: 1.0

  • focal loss: alpha=0.75, gamma=2.0, reduction=mean

  • precision: bf16 (HIPBLASLT_ALLOW_TF32=0 on MI300X)

  • dropout: 0.3, fine_tune: True, subtoken_pooling: first

  • training data summary: data/splits/train.jsonl (~984k samples, 3x step budget)

The sweep uses the same loss recipe (focal α=0.75, γ=2.0, mean reduction), bf16 with HIPBLASLT_ALLOW_TF32=0 (avoids MI300X NaN intermittency), and a cosine schedule with 10% warmup. The local gliner source is patched against an upstream token_level regression that landed in 0.2.26 — this checkpoint is trained against gliner==0.2.25 plus that fix.
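For reference, the bullets above can be collected into a single Python dict. This is a sketch: key names mirror the bullet list, not necessarily the exact trainer config schema.

```python
# Sketch of the training configuration listed above. Key names follow the
# bullet list; the actual trainer schema may differ.
config = {
    "base_encoder": "jhu-clsp/ettin-encoder-32m",
    "hidden_size": 384,
    "max_len": 1024,
    "max_width": 100,
    "span_mode": "token_level",
    "num_steps": 69000,
    "train_batch_size": 256,
    "eval_batch_size": 192,
    "lr_encoder": 2e-5,
    "lr_others": 5e-5,
    "weight_decay": {"encoder": 0.01, "other": 0.01},
    "scheduler": "cosine",
    "warmup_ratio": 0.1,
    "max_grad_norm": 1.0,
    "focal_loss": {"alpha": 0.75, "gamma": 2.0, "reduction": "mean"},
    "precision": "bf16",
    "dropout": 0.3,
    "fine_tune": True,
    "subtoken_pooling": "first",
}
```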

Evaluation protocol

Every eval_every steps a 5,000-sample, proportional, source-mixed holdout (data/holdout_5k_original) is re-scored. The holdout was sliced from the upstream PII / spam datasets before the train split was assembled, so there is no leakage. For each row we build a per-source label superset (canonical 24 PT-BR labels ∪ that row's gold labels, capped at 100, lowercased) and call model.inference(..., flat_ner=True, multi_label=False, threshold=0.3). Per-source F1 is computed by nervaluate (strict / exact / partial / ent_type).
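The per-row label superset can be sketched as follows. The canonical 24 PT-BR labels are not enumerated in this card, so CANONICAL_PTBR below is a placeholder subset, not the real list.

```python
# Sketch of the per-source label superset described above. CANONICAL_PTBR
# is a placeholder; the real evaluation uses the canonical 24 PT-BR labels.
CANONICAL_PTBR = ["first name", "last name", "cpf document number", "phone number"]

def build_label_superset(gold_labels, canonical=CANONICAL_PTBR, cap=100):
    """Union of the canonical labels and this row's gold labels,
    lowercased, order-preserving, capped at `cap` entries."""
    merged = []
    for lab in list(canonical) + list(gold_labels):
        lab = lab.lower()
        if lab not in merged:
            merged.append(lab)
    return merged[:cap]
```

The resulting list is what gets passed as the label set to model.inference(..., flat_ner=True, multi_label=False, threshold=0.3).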

Eleven evaluation sources cover PT-BR PII, EN PII, and negative-evidence spam/phishing:

  • gliner2_pii_ptbr_reward_split — PT-BR PII (the target)
  • nemotron_pii — nvidia/Nemotron-PII, EN PII
  • open_pii_masking_500k, pii_masking_400k — multilingual PII
  • enron_spam_*, phishing_*, sms_spam_multilingual, spam_messages_mshenoda, spamassassin — negative evidence (no PII expected → tests false-positive rate)

Best-checkpoint scores (checkpoint-65550)

Per-source aggregate F1 on the 11-source proportional 5k holdout:

eval source strict F1 partial F1
PT-BR (gliner2 reward-split) 0.6326 0.7167
EN (nvidia/Nemotron-PII) 0.4945 0.5721
open-pii-masking-500k 0.4081 0.5584
pii-masking-400k 0.4193 0.5654
enron-spam (bvk) 0.0000 0.0000
enron-spam (setfit) 0.0000 0.0000
phishing (darkknight) 0.0000 0.0000
phishing (zefang) 0.0000 0.0000
SMS spam (multilingual) 0.0000 0.0000
spam-messages (mshenoda) 0.0000 0.0000
spamassassin 0.0000 0.0000

Per-entity breakdown (4 PII sources)

Averaged only across sources that contain the label (sources where the label is absent in gold are excluded — including them would falsely deflate the aggregate, e.g. personal description of ethnicity averaged with two sources that lack it would drop from ~0.58 to ~0.29 partial F1). n_src is the number of contributing sources. Sorted by partial F1.
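A minimal sketch of this label-present-only averaging (the data layout and numbers are illustrative):

```python
# Sketch of the averaging rule above: a None entry marks a source whose
# gold annotations lack the label, and it is excluded from that label's
# average instead of counting as zero.
def label_average(per_source_f1):
    """per_source_f1: {label: {source: f1, or None if label absent in gold}}
    Returns {label: (avg_f1_over_present_sources, n_src)}."""
    out = {}
    for label, by_src in per_source_f1.items():
        present = [f1 for f1 in by_src.values() if f1 is not None]
        out[label] = (sum(present) / len(present), len(present))
    return out

scores = {"personal description of ethnicity":
          {"ptbr": 0.58, "nemotron": 0.58, "open_pii": None, "pii_400k": None}}
# Averaging over the 2 contributing sources keeps 0.58; a naive 4-way
# average with zeros for the absent sources would deflate it to 0.29.
```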

entity n_src strict P strict R strict F1 partial P partial R partial F1
rg document number 1 0.929 0.944 0.936 0.956 0.972 0.964
pis document number 1 0.950 0.897 0.922 0.971 0.917 0.943
cpf document number 1 0.906 0.895 0.901 0.947 0.936 0.942
email address 4 0.849 0.848 0.848 0.914 0.913 0.913
coordinate 1 0.738 0.818 0.776 0.820 0.909 0.862
phone number 4 0.778 0.881 0.825 0.809 0.916 0.858
bank routing number 1 0.768 0.946 0.848 0.768 0.946 0.848
medical record number 1 0.769 0.933 0.843 0.769 0.933 0.843
dob 3 0.720 0.869 0.779 0.752 0.917 0.817
date 2 0.693 0.761 0.724 0.777 0.850 0.811
customer id 1 0.712 0.910 0.799 0.712 0.910 0.799
url 1 0.598 0.583 0.591 0.793 0.773 0.783
county 1 0.667 0.809 0.731 0.713 0.865 0.782
swift bic 1 0.606 1.000 0.755 0.606 1.000 0.755
device identifier 1 0.645 0.909 0.755 0.645 0.909 0.755
cvv 1 0.641 0.893 0.746 0.641 0.893 0.746
biometric identifier 1 0.622 0.925 0.744 0.622 0.925 0.744
employee id 1 0.621 0.923 0.742 0.621 0.923 0.742
health plan beneficiary number 1 0.596 0.914 0.721 0.607 0.931 0.735
location state abbreviation 1 0.580 1.000 0.734 0.580 1.000 0.734
fax number 1 0.737 0.718 0.727 0.737 0.718 0.727
ipv4 1 0.500 0.564 0.530 0.682 0.769 0.723
credit card 4 0.665 0.785 0.686 0.697 0.821 0.719
mac address 1 0.438 0.467 0.452 0.688 0.733 0.710
location full address 1 0.491 0.518 0.504 0.690 0.727 0.708
middle name 1 0.534 0.965 0.687 0.534 0.965 0.687
location zip 4 0.544 0.911 0.661 0.554 0.936 0.676
account number 2 0.627 0.730 0.674 0.627 0.730 0.674
certificate license number 1 0.524 0.892 0.660 0.524 0.892 0.660
vehicle identifier 1 0.492 1.000 0.659 0.492 1.000 0.659
pin 1 0.545 0.828 0.658 0.545 0.828 0.658
location building number 3 0.573 0.772 0.647 0.582 0.782 0.656
date time 1 0.544 0.623 0.581 0.608 0.696 0.649
country 1 0.474 0.991 0.642 0.474 0.991 0.642
time 2 0.505 0.737 0.584 0.546 0.822 0.638
personal description of organizational affiliation 1 0.381 0.447 0.411 0.582 0.684 0.629
location street 3 0.463 0.597 0.518 0.553 0.726 0.623
company name 1 0.463 0.562 0.507 0.557 0.677 0.611
license plate 1 0.435 0.714 0.541 0.489 0.804 0.608
first name 4 0.433 0.819 0.557 0.456 0.855 0.584
location state 2 0.422 0.824 0.556 0.439 0.858 0.579
api key 1 0.392 0.816 0.530 0.424 0.882 0.573
last name 4 0.441 0.757 0.536 0.464 0.809 0.565
personal description of ethnicity 2 0.360 0.673 0.468 0.419 0.778 0.544
religious belief 1 0.353 0.857 0.500 0.382 0.929 0.542
location neighborhood 1 0.370 0.702 0.485 0.398 0.754 0.521
social security number 3 0.482 0.567 0.505 0.491 0.576 0.514
personal description of sexual information 1 0.226 0.357 0.277 0.397 0.627 0.486
language 1 0.314 1.000 0.477 0.314 1.000 0.477
sex or gender 2 0.296 0.875 0.443 0.301 0.887 0.449
ipv6 1 0.111 0.182 0.138 0.361 0.591 0.448
age 2 0.295 0.953 0.445 0.295 0.953 0.445
passport number 1 0.353 0.600 0.444 0.353 0.600 0.444
location city 4 0.288 0.710 0.407 0.313 0.769 0.442
personal description of religious convictions 1 0.212 0.444 0.287 0.314 0.658 0.425
tax id number 3 0.325 0.535 0.379 0.378 0.581 0.422
id card number 2 0.324 0.642 0.420 0.324 0.642 0.420
user name 2 0.239 0.516 0.321 0.293 0.650 0.396
personal description of medical conditions 1 0.158 0.333 0.215 0.291 0.614 0.395
http cookie 1 0.060 0.103 0.076 0.310 0.534 0.392
sexuality 1 0.253 0.741 0.377 0.259 0.759 0.387
password 2 0.119 0.249 0.161 0.274 0.573 0.371
employment status 1 0.223 0.671 0.334 0.227 0.685 0.341
personal description of political opinion 1 0.107 0.342 0.163 0.197 0.631 0.300
driver license number 2 0.396 0.266 0.283 0.410 0.279 0.297
education level 1 0.102 0.358 0.159 0.177 0.623 0.276
blood type 1 0.035 0.125 0.055 0.150 0.531 0.234
political view 1 0.110 0.769 0.192 0.126 0.885 0.221
unique id 1 0.122 0.857 0.214 0.122 0.857 0.214
title 1 0.089 0.636 0.156 0.097 0.700 0.171
occupation 1 0.018 0.158 0.032 0.051 0.449 0.092

Reading guide. Structured tokens (email, phone, doc numbers, IPs, MAC) approach strict ≈ partial because their boundaries are unambiguous. Long natural-language spans (full address, the personal description of … labels) carry a meaningful strict-vs-partial gap because exact boundaries are inherently fuzzy — even two human annotators would disagree. For those labels, partial F1 is the operationally meaningful metric.
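The strict-vs-partial distinction can be illustrated with a toy span matcher (this is not the nervaluate implementation, just the intuition): strict requires exact offsets, while partial credits any overlap between same-label spans.

```python
# Toy illustration (not nervaluate itself) of why strict and partial F1
# diverge on fuzzy-boundary spans.
def strict_match(pred_span, gold_span):
    # Strict: character offsets must match exactly.
    return pred_span == gold_span

def partial_match(pred_span, gold_span):
    # Partial: any character overlap between the two spans counts.
    (ps, pe), (gs, ge) = pred_span, gold_span
    return ps < ge and gs < pe

gold = (10, 45)  # e.g. a gold "location full address" span
pred = (10, 38)  # model dropped the trailing tokens

print(strict_match(pred, gold))   # → False: boundaries differ
print(partial_match(pred, gold))  # → True: spans overlap
```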

Sweep over checkpoints

20 eval points (every eval_every steps). Higher is better.

step PT-BR strict PT-BR partial Nemotron partial open-pii partial pii-masking partial
3450 0.0418 0.2343 0.2430 0.2433 0.2756
6900 0.0382 0.2149 0.1812 0.2302 0.2328
10350 0.0734 0.2940 0.2590 0.2771 0.2248
13800 0.1170 0.3049 0.3213 0.3507 0.2452
17250 0.1967 0.3828 0.3123 0.3514 0.2705
20700 0.2919 0.4961 0.3535 0.3929 0.3474
24150 0.3328 0.5115 0.3537 0.4192 0.3535
27600 0.3604 0.5117 0.3845 0.4342 0.3690
31050 0.4700 0.6083 0.4437 0.4714 0.4472
34500 0.5147 0.6458 0.4121 0.4639 0.3983
37950 0.4858 0.5946 0.4575 0.5052 0.4288
41400 0.5340 0.6477 0.4892 0.5191 0.4856
44850 0.5758 0.6812 0.5484 0.5658 0.5332
48300 0.5670 0.6703 0.4890 0.5218 0.4836
51750 0.5812 0.6716 0.5480 0.4660 0.5083
55200 0.6080 0.6959 0.5392 0.5234 0.5287
58650 0.6276 0.7160 0.5665 0.5622 0.5599
62100 0.6248 0.7076 0.5764 0.5494 0.5655
65550 0.6326 0.7167 0.5721 0.5584 0.5654
69000 0.6291 0.7135 0.5694 0.5566 0.5636

Comparison across the sweep (best by PT-BR partial F1 of each run)

ettin-32m top-50k       : PT-BR strict 0.1411  partial 0.3710  Nemotron partial 0.1710  cross-avg 0.2808  (step 8500)
ettin-32m top-100k      : PT-BR strict 0.0946  partial 0.3165  Nemotron partial 0.1455  cross-avg 0.2428  (step 5200)
ettin-68m top-50k       : PT-BR strict 0.2732  partial 0.4802  Nemotron partial 0.2397  cross-avg 0.3117  (step 9945)
ettin-68m top-100k      : PT-BR strict 0.4109  partial 0.6191  Nemotron partial 0.3046  cross-avg 0.4497  (step 8580)
ettin-68m full          : PT-BR strict 0.3663  partial 0.5416  Nemotron partial 0.3563  cross-avg 0.4239  (step 19550)
ettin-32m full          : PT-BR strict 0.1429  partial 0.3536  Nemotron partial 0.2640  cross-avg 0.2867  (step 19550)
ettin-32m full-3x       : PT-BR strict 0.6326  partial 0.7167  Nemotron partial 0.5721  cross-avg 0.6031  (step 65550)
ettin-68m full-3x       : PT-BR strict 0.7076  partial 0.7979  Nemotron partial 0.6799  cross-avg 0.6817  (step 41400)
gliner-ettin-32m-ptbr-pii-easter-egg: PT-BR strict 0.6445  partial 0.7344  Nemotron partial 0.6019  cross-avg 0.6243  (step 100)
gliner-ettin-68m-ptbr-pii-easter-egg: PT-BR strict 0.7109  partial 0.8073  Nemotron partial 0.7182  cross-avg 0.7167  (step 100)  *** PT-BR best  *** cross-source best

Take-away. The 3x re-runs (69,000 steps) blew past the original 23,000-step sweep — the original 32M variants were straightforwardly under-trained, and the longer cosine schedule unlocks 1.5–2× higher F1 on every PII source. Curated subsets (top-50k / top-100k) overfit to PT-BR but trail badly on cross-source generalization; the full-data 3x runs dominate on every metric we care about. For production use, prefer gliner-ettin-{32m,68m}-ptbr-pii-full-3x-v1 — they sit at the Pareto frontier of in-domain F1 and cross-source generalization.

Holdout benchmark (matched protocol on nvidia/Nemotron-PII)

For the ettin-68m full checkpoint at step 17250, evaluated independently with eval_nemotron_prior.py (5000 samples, threshold=0.5, gold-label set, single-label flat NER):

model strict F1 partial F1
GLiNER multitask-large (prior) 0.5834 0.6490
mmbert-teacher (prior) 0.4574 0.5418
Albertina ckpt-12k (prior) 0.1801 0.3399
ettin-68m (prior) 0.1493 0.3210
ettin-68m full v1 step-17250 (this sweep) 0.2747 0.4306

i.e. +0.125 strict F1 / +0.110 partial F1 over the prior ettin-68m baseline at the same protocol — with this sweep's recipe (per-source label superset + bf16 mean-reduction + token_level fix).

Limitations

  • This is a private research checkpoint. Do not use it as a sole source of truth for redaction without human review.
  • PT-BR target performance was the optimization objective; spam/phishing partial F1 is a negative-evidence signal and is intentionally not the target.
  • The original sweep was constrained to ~23k steps for 68M and ~12k–16k steps for the curated runs by a single-MI300X budget. This repo is the follow-up triple-step (69k-step) re-run, which is the variant recommended for production.

Files

Each checkpoint-N/ directory contains a self-contained GLiNER snapshot (gliner_config.json, pytorch_model.bin, tokenizer, optimizer state, scheduler state, RNG state). Load any of them with GLiNER.from_pretrained(repo, subfolder="checkpoint-N").
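To pick a checkpoint from the sweep table programmatically, the checkpoint-N subfolder names can be parsed out of a repo file listing (e.g. the output of huggingface_hub's list_repo_files). The sample list below is illustrative, not the full repo contents.

```python
# Sketch: extract and sort the checkpoint-N step numbers from a repo file
# listing. The `files` list here is illustrative.
import re

def checkpoint_steps(files):
    steps = {int(m.group(1)) for f in files
             if (m := re.match(r"checkpoint-(\d+)/", f))}
    return sorted(steps)

files = [
    "checkpoint-3450/pytorch_model.bin",
    "checkpoint-65550/gliner_config.json",
    "checkpoint-65550/pytorch_model.bin",
    "checkpoint-69000/pytorch_model.bin",
]
print(checkpoint_steps(files))  # → [3450, 65550, 69000]
```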
