We release EuroBERT!
🚨 Today, we release EuroBERT: a multilingual encoder model family (210M to 2.1B parameters) trained on 5T tokens across 15 languages, with support for sequences up to 8,192 tokens. It’s open-source and designed to power multilingual retrieval, classification, and embeddings.
🔹 Why EuroBERT?
✅ State-of-the-art performance across multilingual retrieval, classification, and regression
✅ Long-context support (8,192 tokens) for document-level understanding
✅ Mathematics & Code training for improved reasoning
✅ Outperforms XLM-RoBERTa, mGTE, and other leading models
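To make the embedding use case concrete, here is a minimal sketch with the 🤗 transformers library. The repository name `EuroBERT/EuroBERT-210m` and the mean-pooling strategy are assumptions for illustration (check the HuggingFace page for the exact model IDs and recommended pooling):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed repository name; see the HuggingFace page for the exact IDs.
model_id = "EuroBERT/EuroBERT-210m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# EuroBERT uses a custom architecture, so remote code must be trusted.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

sentences = [
    "EuroBERT supports 15 languages.",
    "EuroBERT prend en charge 15 langues.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence
# (one common choice, not necessarily the recommended recipe).
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the English and French sentences.
sim = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0
)
print(f"cosine similarity: {sim:.3f}")
```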
EuroBERT builds upon our team’s experience training EuroLLM & CroissantLLM, but encoders are NOT decoders and require specific design decisions! We ran extensive ablations on masking ratios, language and data distributions, annealing, and data quality to ensure optimal performance.
📢 Beyond the results we report, nothing beats people fine-tuning the model for their own use cases and sharing real-world feedback, so feel free to do so: everything is on the HuggingFace page!
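If you want to try that, below is a minimal fine-tuning sketch for text classification. It assumes the checkpoint name `EuroBERT/EuroBERT-210m` and that the repo exposes a sequence-classification head through `trust_remote_code`; the dataset and hyperparameters are placeholders, not the recipe from the paper:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "EuroBERT/EuroBERT-210m"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

# Placeholder dataset: any text-classification set with "text"/"label" works.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="eurobert-finetuned",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=2e-5,  # placeholder hyperparameters
)

trainer = Trainer(
    model=model,
    args=args,
    # Small subset so the sketch runs quickly; use the full split for real work.
    train_dataset=dataset["train"].shuffle(seed=0).select(range(2000)),
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```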
More details and links in the blog post!
Work led by Nicolas BOIZARD, Hippolyte Gisserot-Boukhlef, Duarte Alves, and Pierre Colombo, with many other great co-authors!