🤗 Sentence Transformers is joining Hugging Face! 🤗 This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. Details:
Today, the Ubiquitous Knowledge Processing (UKP) Lab is transferring the project to Hugging Face. Sentence Transformers will remain a community-driven, open-source project, with the same open-source license (Apache 2.0) as before. Contributions from researchers, developers, and enthusiasts are welcome and encouraged. The project will continue to prioritize transparency, collaboration, and broad accessibility.
We see an increasing wish from companies to move from large LLM APIs to local models for better control and privacy, reflected in the library's growth: in just the last 30 days, Sentence Transformer models have been downloaded >270 million times, second only to transformers.
I would like to thank the UKP Lab, and especially Nils Reimers and Iryna Gurevych, both for their dedication to the project and for their trust in me, both now and two years ago. Back then, neither of you knew me well, yet you trusted me to take the project to new heights. That choice ended up being very valuable for the embedding & Information Retrieval community, and I think the choice to grant Hugging Face stewardship will be similarly successful.
I'm very excited about the future of the project, and for the world of embeddings and retrieval at large!
New blog: Maintain the unmaintainable - 1M+ Python LOC, 400+ models
How do you stop a million-line library built by thousands of contributors from collapsing under its own weight? At 🤗 Transformers, we do it with explicit software-engineering tenets, principles that make the codebase hackable at scale.
Inside the post:
- One Model, One File: readability first. You can still open a modeling file and see the full logic, top to bottom.
- Modular Transformers: visible inheritance that cuts maintenance cost by ~15× while keeping models readable.
- Config-Driven Performance: FlashAttention, tensor parallelism, and attention scheduling are config-level features, not rewrites.
Written with @lysandre, @pcuenq and @yonigozlan, this is a deep dive into how Transformers stays fast, open, and maintainable.
Today is a huge day in Argilla's history. We couldn't be more excited to share this with the community: we're joining Hugging Face!
We're embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.
Over the past year, we've been collaborating with Hugging Face on countless projects: becoming a launch partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr's learnings, running the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.
After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we're now the same team.
To those of you who've been following us, this won't be a huge surprise, but it will be a big deal in the coming months. This acquisition means we'll double down on empowering the community to build and collaborate on high-quality datasets, we'll bring full support for multimodal datasets, and we'll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.
The Document AI team (@Molbap, @rwightman, @danaaubakirova) at Hugging Face is developing a new multimodal data augmentation pipeline utilising both visual and textual aspects of document images.
🔥 What's New:
- Polars integration 🐻‍❄️
- fsspec support for conversion to JSON, CSV, and Parquet
- Mode parameter for Image feature
- CLI function to convert script-datasets to Parquet
- Dataset.take and Dataset.skip
Plus, a bunch of general improvements & bug fixes!
We are on our way to solving a long-standing issue in document processing: the correction of OCR mistakes. Pleias is publishing the largest dataset to date with automated OCR correction: 1 billion words in English, French, German and Italian.
OCR quality is a long-standing issue in digitization. Cultural heritage texts are especially affected, both because the primary sources are old documents (with many artifacts, blots, and degradation) and because of the limitations of OCR technology for historical scripts. When we released Common Corpus, a 500-billion-word corpus in the public domain, this was the primary criticism.
Recent breakthroughs in post-OCR correction have been made possible by progress in open LLM research, several months of dedicated training and alignment by Pleias, and HPC resources from GENCI-IDRIS (Grant 2023-AD011014736) on Jean-Zay.
- Strong 8B-parameter model: often on par with open 30B counterparts.
- Open license: Apache 2.0.
- Strong improvement over Idefics1: +12 points on VQAv2 and +30 points on TextVQA, with 10x fewer parameters.
- Better data: boosting OCR capabilities with 6TB of documents to transcribe, and improving QA capabilities on charts/figures/diagrams.
- Transparent training data: inspect and build upon all the data (10s of TB of data) we trained on.
- More natural image processing: incorporating strategies to treat images in their native resolution and native aspect ratio.
- High-resolution images: image resolutions up to 980 x 980, with strategies that allow trading computational efficiency for performance.
- 2 checkpoints: releasing both a base checkpoint and an instruction fine-tuned checkpoint. Chat version to come.
We release Idefics2-8B, a foundation vision language model with SOTA results for its size on many benchmarks.
For Idefics2, we adopted a simple architecture:
- Images are fed to a vision encoder, then to a modality projection to match the input dimension of the LLM, and finally to a perceiver resampler for efficient pooling.
- Interleaved image-text data are then passed to the LLM.
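To make the "efficient pooling" step concrete, here is a minimal sketch of perceiver-style pooling in plain Python. It is not the Idefics2 implementation: the function names, the toy sizes, and the random vectors are all hypothetical, and learned projections, multiple heads, and stacked layers are left out. It only shows the core idea that a fixed set of latent queries attends over however many patch features an image produces, so the LLM always receives the same number of visual tokens.

```python
import math
import random


def cross_attention_pool(latents, features):
    """Pool a variable number of feature vectors into len(latents) vectors.

    Each latent acts as a query; the features act as keys and values.
    Learned projections, multiple heads, and stacked layers are omitted.
    """
    dim = len(latents[0])
    pooled = []
    for query in latents:
        # Scaled dot-product score between this latent and every feature.
        scores = [
            sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
            for feat in features
        ]
        # Numerically stable softmax over the features.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # The weighted average of the features is one pooled vector.
        pooled.append(
            [sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(dim)]
        )
    return pooled


def random_vectors(count, dim, rng):
    """Hypothetical stand-in for learned latents or projected patch features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8          # toy hidden size (hypothetical)
NUM_LATENTS = 4  # toy stand-in for the fixed number of visual tokens

latents = random_vectors(NUM_LATENTS, DIM, rng)
small_image = random_vectors(16, DIM, rng)   # 16 projected patch features
large_image = random_vectors(100, DIM, rng)  # 100 projected patch features

pooled_small = cross_attention_pool(latents, small_image)
pooled_large = cross_attention_pool(latents, large_image)

# Both images end up with the same number of pooled vectors.
print(len(pooled_small), len(pooled_large))
```

Whatever the input size, the output length equals the number of latents, which is what keeps the visual token count constant at the LLM's input.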
During the pre-training:
- The modality projection and perceiver resampler weights are newly initialized.
- We start with pre-trained models for the vision encoder and the LLM, and continue the training with LoRA.
- In total, we see 1.5B images!
We pre-train on 3 types of data, all publicly available:
- Interleaved image-text documents: our dataset OBELICS (HuggingFaceM4/OBELICS)
- Image caption pairs: only synthetic captions!
- PDF documents: IDL and PDFA
We kept the aspect ratio of the images with the Patch n' Pack strategy, with a resolution of up to 980x980. At inference, it's also more efficient for lower-resolution images.
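A rough way to see why keeping the native aspect ratio is cheaper for small images is to count patches. The sketch below is only an illustration, not the actual preprocessing code: the patch size of 14 pixels, the resizing rule, and the helper names `patch_grid` and `patch_count` are assumptions. Only the 980-pixel limit comes from the text above.

```python
import math

MAX_SIDE = 980   # maximum resolution stated in the post
PATCH_SIZE = 14  # assumption: patch size of the vision encoder


def patch_grid(width, height, max_side=MAX_SIDE, patch=PATCH_SIZE):
    """Return (columns, rows) of patches for an image kept at its aspect ratio."""
    # Downscale only when the longest side exceeds the limit; never upscale.
    scale = min(1.0, max_side / max(width, height))
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Round up so partial patches at the border are still covered.
    return math.ceil(new_w / patch), math.ceil(new_h / patch)


def patch_count(width, height):
    """Total number of patches the vision encoder would process."""
    cols, rows = patch_grid(width, height)
    return cols * rows


for size in [(980, 980), (1960, 980), (490, 245), (224, 224)]:
    print(size, patch_grid(*size), patch_count(*size))
```

Under these assumptions, a 224x224 image needs 256 patches while a 980x980 image needs 4900, so low-resolution inputs cost far less than they would if every image were resized to a fixed square.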
For the SFT, we built The Cauldron, a collection of 50 high-quality datasets in the user/assistant format. It is a ready-to-use dataset for the fine-tuning of any VLM: HuggingFaceM4/the_cauldron
Most current models, like LLaVA-NeXT, encode images with an excessive number of tokens, such as 2880. Instead, we focus on being efficient at inference by training on a mix of images encoded with 64 tokens and with 320 tokens. As a result, we perform favorably compared to the best models in our size class while remaining efficient at inference.
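The practical effect of the per-image token counts can be sketched with simple arithmetic. In the snippet below, the token counts come from the text above, while the context window, the text budget, and the helper `images_that_fit` are hypothetical values chosen only for illustration.

```python
# Tokens used to encode one image, as quoted in the text above.
TOKENS_PER_IMAGE = {
    "LLaVA-NeXT (figure quoted above)": 2880,
    "Idefics2, 64-token encoding": 64,
    "Idefics2, 320-token encoding": 320,
}

CONTEXT = 8192      # hypothetical context window
TEXT_BUDGET = 1024  # hypothetical tokens reserved for the prompt and the answer


def images_that_fit(context_tokens, tokens_per_image, text_tokens=0):
    """Number of whole images that fit after reserving room for text."""
    return max(0, (context_tokens - text_tokens) // tokens_per_image)


for name, tokens in TOKENS_PER_IMAGE.items():
    fit = images_that_fit(CONTEXT, tokens, TEXT_BUDGET)
    print(f"{name}: {tokens} tokens per image, {fit} images fit")
```

With these illustrative budgets, the 64-token encoding leaves room for dozens of times more images in a single prompt than a 2880-token encoding, which matters for multi-image and interleaved inputs.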