Manuel Faysse

On a latent space odyssey.


PhD Candidate

Paris, France

Hey! I am Manu, a final-year PhD student working on LLM and information retrieval research, but curious about (way too) many other things!

I am nearing the end of my academic post-training phase as a PhD student at CentraleSupélec (with Pierre Colombo), and most recently worked under the distilled supervision of Hervé Jégou at Meta FAIR Paris. My research focuses on practical applications of large language models, with an emphasis on Visual Document Retrieval (ColPali, ViDoRe) and LLM pretraining (CroissantLLM, Long Context Modeling at Meta), as well as multimodality, automatic evaluation, model memorization, confidence estimation, and contextualization techniques for neural information retrieval.

My work has been published in top international venues (ICLR, ICML, EMNLP, TMLR, COLM), has been featured in the press (MIT Tech Review, Nature Magazine, Usine Digitale, etc.), has led to many invited talks (Meta, Amazon, IBM, Naver, LlamaIndex, etc.), and was listed as a top AI innovation of 2024 (State of AI, Tech Radar). Importantly to me, my work is widely used across the industry, in early-stage startups, established large tech companies, and government agencies.

My PhD is funded through the French CIFRE program in collaboration with Illuin Technology, where, before joining Meta, I held a Staff Research Scientist position and spent a share of my time advising and supporting various R&D efforts in the LLM and Vision LLM space. Don’t hesitate to reach out on X.

news

Feb 18, 2026 Jina releases their embeddings v5 models, which claim the top spot on MMTEBv2, the default multilingual IR benchmark. The nano model is based on our EuroBERT model, which the lead author states is the best small multilingual encoder backbone among all those they experimented with.
Jan 23, 2026 Our work “Should We Still Pretrain Encoders with Masked Language Modeling?” is accepted at ICLR 2026!
Jan 2, 2026 We release the "ViDoRe V3" dataset and paper.
Jul 7, 2025 Our paper “EuroBERT: Scaling Multilingual Encoders for European Languages” is accepted at COLM!
Jul 3, 2025 We release "Should We Still Pretrain Encoders with Masked Language Modeling?"

selected publications

2025

  1. Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
    Max Conti*, Manuel Faysse*, Gautier Viaud, and 3 more authors
    2025

2024

  1. ColPali: Efficient Document Retrieval with Vision Language Models
    Manuel Faysse, Hugues Sibille, Tony Wu, and 4 more authors
    2024
  2. CroissantLLM: A Truly Bilingual French-English Language Model
    Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, and 13 more authors
    2024

2023

  1. Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications
    Manuel Faysse, Gautier Viaud, Céline Hudelot, and 1 more author
    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023