AI4Privacy datasets are being used to decide what data should never leave the device.
A new paper on privacy-preserving cloud computing uses the AI4Privacy PII-Masking-65K dataset to train models that classify text as private or public before it’s ever sent to the cloud.
This is a subtle but important shift.
Instead of encrypting everything or trusting the cloud by default, the authors ask a simpler question:
Can we detect sensitive text early enough to keep it local?
Using DistilBERT, trained partly on AI4Privacy PII data, the system learns to:
route private text to local processing
send non-sensitive text to the cloud
train collaboratively using federated learning, without sharing raw data
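To make the collaboration step concrete, here is a minimal federated-averaging sketch: clients fine-tune on-device and share only parameter updates, never raw text. This is generic FedAvg under our own assumptions, not the paper's exact protocol.

```python
# Minimal FedAvg sketch: each client trains locally; only model weights
# (never the raw text) are averaged on the server.
import torch

def fedavg(client_state_dicts, client_sizes):
    """Average client parameters, weighted by local dataset size."""
    total = sum(client_sizes)
    avg = {}
    for key in client_state_dicts[0]:
        avg[key] = sum(
            sd[key].float() * (n / total)
            for sd, n in zip(client_state_dicts, client_sizes)
        )
    return avg

# Usage: merged = fedavg([client_a.state_dict(), client_b.state_dict()], [1200, 800])
# The server loads `merged` and redistributes it for the next training round.
```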
The result:
99.9% accuracy in private vs public text detection
Near-centralized performance in downstream tasks like SMS spam detection
Privacy protection enforced by design, not policy
What stands out here is not just the model performance, but the architectural idea: privacy as a routing decision, backed by large-scale PII annotations.
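To make the routing idea concrete, here is a minimal sketch of a DistilBERT-based private/public gate. The checkpoint name, label mapping, and argmax decision rule are placeholder assumptions, not the paper's released code.

```python
# Minimal sketch: a binary private/public classifier gates what leaves
# the device. "distilbert-base-uncased" is a stand-in; the paper
# fine-tunes on AI4Privacy PII data. The label mapping is assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "distilbert-base-uncased"  # placeholder for a fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
model.eval()

PRIVATE = 0  # assumed label id for "private"

def route(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Private text stays on-device; only non-sensitive text is offloaded.
    return "local" if int(logits.argmax(dim=-1)) == PRIVATE else "cloud"

print(route("My SSN is 123-45-6789"))  # arbitrary until the head is fine-tuned
```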
This work reinforces a pattern we keep seeing: scalable privacy systems don’t start with encryption, they start with good PII data.
This new preprint fine-tunes T5-small and Mistral-7B on the AI4Privacy PII-Masking-200K dataset and shows that lightweight models can approach, and sometimes match, much larger LLMs on privacy tasks.
The study tackles a real deployment question many teams face:
Is PII masking a model-size problem, or a data-quality problem?
Using AI4Privacy’s large-scale, standardized PII annotations, the authors systematically compare:
Encoder–decoder models (T5) vs
Decoder-only models (Mistral)
across accuracy, robustness, latency, and real-world conversational text.
What stood out:
Mistral-7B achieved higher recall and robustness across noisy, informal inputs, but with 10× higher latency
T5-small, trained on the same AI4Privacy data, delivered fast, structured, low-cost masking, making it viable for real-time systems (see the sketch after this list)
Dataset normalization (not model size) was one of the biggest drivers of performance gains
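As a rough illustration of the lightweight path, here is a minimal seq2seq masking sketch. The "mask PII: " task prefix, the placeholder output format, and the use of the generic t5-small checkpoint are assumptions; a model fine-tuned on PII-Masking-200K would be substituted.

```python
# Minimal sketch of seq2seq PII masking with a T5-style model.
from transformers import AutoTokenizer, T5ForConditionalGeneration

CHECKPOINT = "t5-small"  # stand-in; use a checkpoint fine-tuned on PII-Masking-200K
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = T5ForConditionalGeneration.from_pretrained(CHECKPOINT)

def mask_pii(text: str) -> str:
    # A fine-tuned model learns to rewrite its input with PII spans
    # replaced by placeholders such as [NAME] or [EMAIL] (assumed format).
    inputs = tokenizer("mask PII: " + text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(mask_pii("Hi, I'm Jane Doe, reach me at jane.doe@example.com."))
```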
The models were then deployed in a live Discord bot, where performance dropped under real-world conditions, a reminder that benchmarks alone aren’t enough.
The takeaway is hard to ignore:
Privacy-preserving AI scales through data design, not just bigger models.
This work reinforces why open, well-curated datasets like AI4Privacy PII-Masking-200K are becoming foundational infrastructure for privacy-first AI, especially for teams that need self-hosted, transparent solutions.
PII leakage isn’t just a model problem; it’s a data problem.
A recent paper takes a hard look at how well current systems actually detect and redact personal data at scale. One of their key conclusions is something the privacy community keeps rediscovering: without large, structured, and diverse PII datasets, evaluation collapses into guesswork.
To ground their experiments, the authors benchmarked their approach using the 500K PII-Masking dataset from AI4Privacy, leveraging its scale and coverage to test real-world redaction behavior rather than toy examples.
What’s interesting here isn’t just the model performance; it’s what the evaluation reveals.
The paper shows that many systems appear robust under narrow tests but fail once PII appears in varied formats, contexts, and combinations. This gap between “works in theory” and “works in practice” is exactly where privacy risks emerge.
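One way to probe that gap is an entity-level recall check over a large, varied corpus. A minimal sketch follows; the dataset identifier and field names are assumptions about the Hugging Face release, and the regex detector is a trivial stand-in for the system under test.

```python
# Minimal sketch of an entity-level recall check against a large PII corpus.
# Dataset id and field names ("source_text", "privacy_mask") are assumptions.
import re
from datasets import load_dataset

ds = load_dataset("ai4privacy/pii-masking-500k", split="train[:1000]")  # assumed id

def detect_pii(text):
    # Trivial stand-in detector (emails only); swap in the system under test.
    return [(m.start(), m.end()) for m in re.finditer(r"\S+@\S+\.\w+", text)]

found = total = 0
for row in ds:
    gold = [(m["start"], m["end"]) for m in row["privacy_mask"]]
    pred = detect_pii(row["source_text"])
    total += len(gold)
    # A gold span counts as recalled if any predicted span overlaps it.
    found += sum(any(ps < ge and pe > gs for ps, pe in pred) for gs, ge in gold)

print(f"entity-level recall: {found / max(total, 1):.3f}")
```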
This is the value of open, research-grade datasets:
They expose failure modes early
They make comparisons reproducible
They let the community measure progress honestly
When researchers build on shared data foundations, everyone benefits: from academic insight to safer downstream applications.
We're excited to release the new BiomedBERT Small series of models. These 22.7M-parameter models, trained on medical literature, are similarly sized to the popular all-MiniLM-v2 models and pack quite a punch.
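For context on how a MiniLM-sized encoder is typically used, here is a hypothetical embedding sketch; the checkpoint id below is a placeholder, not the released model name.

```python
# Hypothetical usage sketch: mean-pooled sentence embeddings from a small
# BERT-style encoder. MODEL_ID is a placeholder, not the actual checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "biomedbert-small"  # placeholder; substitute the released checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)   # mean over non-padding tokens

a, b = embed(["myocardial infarction treatment", "heart attack therapy"])
print(torch.cosine_similarity(a, b, dim=0).item())
```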