Manuel Faysse

On a latent space odyssey.


PhD Candidate

Paris, France

Hey! I am Manu, a final-year PhD student working on LLM and information retrieval research, but curious about (way too) many other things!

I am nearing the end of my academic post-training phase as a PhD student at CentraleSupélec (with Pierre Colombo), and most recently worked under the distilled supervision of Hervé Jégou at Meta FAIR Paris. My research focuses on practical applications of large language models, with an emphasis on Visual Document Retrieval (ColPali, ViDoRe) and LLM pretraining (CroissantLLM, Long Context Modeling at Meta), as well as multimodality, automatic evaluation, model memorization, confidence estimation, and contextualization techniques for neural information retrieval.

My work has been published in top international venues (ICLR, ICML, EMNLP, TMLR, COLM), has been featured in the press (MIT Tech Review, Nature Magazine, Usine Digitale, etc.), has led to many invited talks (Meta, Amazon, IBM, Naver, LlamaIndex, etc.), and was listed as a top AI innovation of 2024 (State of AI, Tech Radar). Importantly to me, my work is widely used across the industry, in early-stage startups, established large tech companies, and government agencies.

My PhD is funded through the French CIFRE program in collaboration with Illuin Technology, where, before joining Meta, I held a Staff Research Scientist position and spent a share of my time advising and supporting various R&D efforts in the LLM and Vision LLM space. Don’t hesitate to reach out on X.

news

Feb 18, 2026 Jina releases their embeddings v5 models, which claim the top spot on MMTEBv2, the default multilingual IR benchmark. The nano model is based on our EuroBERT model, which the lead author states is the best small multilingual encoder backbone among all those they experimented with.
Jan 23, 2026 Our work “Should We Still Pretrain Encoders with Masked Language Modeling?” is accepted at ICLR 2026!
Jan 2, 2026 We release the "ViDoRe V3" dataset and paper.
Jul 7, 2025 Our paper “EuroBERT: Scaling Multilingual Encoders for European Languages” is accepted at COLM!
Jul 3, 2025 We release "Should We Still Pretrain Encoders with Masked Language Modeling?"

selected publications

2025

  1. Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
    Max Conti*, Manuel Faysse*, Gautier Viaud, and 3 more authors
    2025

2024

  1. ColPali: Efficient Document Retrieval with Vision Language Models
    Manuel Faysse, Hugues Sibille, Tony Wu, and 4 more authors
    2024
  2. CroissantLLM: A Truly Bilingual French-English Language Model
    Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, and 13 more authors
    2024

2023

  1. Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications
    Manuel Faysse, Gautier Viaud, Céline Hudelot, and 1 more author
    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023