2025

Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation

In this paper, we compare LLM-based annotations with human annotations on large-scale product retrieval evaluation, revealing striking differences in error patterns. Analyzing hard disagreements shows that humans and LLMs make complementary errors: humans frequently misjudge brands, products, and categories (often due to annotation fatigue), while LLMs tend to be overly strict or to misunderstand query intent (e.g., interpreting the brand name "On Vacation" literally). This complementary error profile suggests that LLMs excel at bulk annotation work for common queries, while human expertise remains valuable for nuanced cases involving style and trends. We propose a framework that leverages multimodal LLMs to generate query-specific annotation guidelines and conduct relevance assessments for large-scale product retrieval evaluation. The framework achieves quality comparable to human annotations while evaluating 20,000 query-product pairs in ~20 minutes (vs. weeks for humans) at 1,000x lower cost.

Figure: The complete pipeline includes (1) query and product retrieval from search logs, (2) LLM-generated query-specific annotation guidelines, (3) multimodal processing combining text and images, (4) vision model-generated image descriptions, and (5) LLM-based relevance annotation; the orange rectangle highlights where a single multimodal LLM could streamline the process by directly processing both images and text.

Abstract

Evaluating production-level retrieval systems at scale is a crucial yet challenging task due to the limited availability of a large pool of well-trained human annotators. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks. In this paper, we propose a framework for assessing product search engines in a large-scale e-commerce setting, leveraging multimodal LLMs for (i) generating tailored annotation guidelines for individual queries, and (ii) conducting the subsequent annotation task. Our method, validated on a large e-commerce platform, achieves quality comparable to human annotations, significantly reduces time and cost, enables rapid problem discovery, and provides an effective solution for scalable, production-level quality control. ...
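The two-stage flow described above (generate a query-specific guideline, then grade each query-product pair against it) can be sketched as follows. This is a minimal illustration only: the `call_llm` stub, prompt wording, and data shapes are hypothetical stand-ins, not the paper's implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a multimodal LLM API call (hypothetical stub)."""
    return "Relevant"

def generate_guideline(query: str) -> str:
    # Stage 1: ask the LLM to write tailored annotation guidelines for this query.
    return call_llm(f"Write concise relevance-annotation guidelines for the query: {query!r}")

def annotate(query: str, guideline: str, product: dict) -> str:
    # Stage 2: grade one query-product pair against the query-specific guideline,
    # combining product text with a vision-model-generated image description.
    prompt = (
        f"Guidelines:\n{guideline}\n\n"
        f"Query: {query}\nProduct: {product['title']}\n"
        f"Image description: {product.get('image_description', 'n/a')}\n"
        "Answer with one label: Relevant or Irrelevant."
    )
    return call_llm(prompt)

def evaluate(query: str, products: list[dict]) -> list[tuple[str, str]]:
    # One guideline per query, reused across all retrieved products.
    guideline = generate_guideline(query)
    return [(p["title"], annotate(query, guideline, p)) for p in products]
```

Generating the guideline once per query and reusing it across all retrieved products is what makes the 20,000-pair scale tractable.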

April 3, 2025 · Kasra Hosseini, Thomas Kober, Josip Krapac, Roland Vollgraf, Weiwei Cheng, Ana Peleteiro Ramallo · Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025

2021

When Time Makes Sense: A Historically-Aware Approach to Targeted Sense Disambiguation

Words change meaning over time, and computational models that ignore this temporal dimension miss crucial context for understanding historical texts. This paper introduces time-sensitive Targeted Sense Disambiguation (TSD), which detects specific word senses in historical documents by accounting for when the text was written. We train historical BERT models on nineteenth-century English books and create historically evolving sense representations using the Oxford English Dictionary (OED) and its Historical Thesaurus. Results demonstrate that historical language models consistently outperform modern ones, and time-sensitive methods prove especially valuable for older documents, confirming that when it comes to word meaning, time really does make sense.

Figure: Optimal date ranges for each language model (measured by F1-score using the sense centroid method) show that matching the model's training period to the target text's era dramatically improves performance; the x-axis represents average points of rolling 100-year quotation date ranges from the Oxford English Dictionary.

Abstract

As languages evolve historically, making computational approaches sensitive to time can improve performance on specific tasks. In this work, we assess whether applying historical language models and time-aware methods help with determining the correct sense of polysemous words. We outline the task of time-sensitive Targeted Sense Disambiguation (TSD), which aims to detect instances of a sense or set of related senses in historical and time-stamped texts, and address two main goals: 1) we scrutinize the effect of applying historical language models on the performance of several TSD methods and 2) we assess different disambiguation methods that take into account the year in which a text was produced. We train historical BERT models on a corpus of nineteenth-century English books and draw on the Oxford English Dictionary (and its Historical Thesaurus) to create historically evolving sense representations. Our results show that using historical language models consistently improves performance whereas time-sensitive disambiguation helps especially with older documents. ...
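The sense centroid method mentioned above can be sketched in a few lines: average the contextual embeddings of a sense's dictionary quotations into a centroid, then assign an occurrence of the target word to the sense with the nearest centroid. The 4-dimensional vectors below are made-up stand-ins for BERT embeddings, and the sense labels are hypothetical.

```python
import numpy as np

def centroid(vectors):
    # A sense representation is the mean of its quotation embeddings.
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(target_vec, sense_centroids):
    # Pick the sense whose quotation centroid is closest in cosine space.
    return max(sense_centroids, key=lambda s: cosine(target_vec, sense_centroids[s]))

# Hypothetical embeddings of OED quotations for two senses of a word.
sense_centroids = {
    "device": centroid([np.array([1.0, 0.1, 0.0, 0.2]), np.array([0.9, 0.0, 0.1, 0.1])]),
    "person": centroid([np.array([0.0, 1.0, 0.2, 0.0]), np.array([0.1, 0.9, 0.1, 0.1])]),
}
occurrence = np.array([0.95, 0.05, 0.05, 0.15])  # embedding of the word in context
```

Making the method time-sensitive then amounts to restricting which quotations contribute to each centroid, e.g. only those within a rolling 100-year window around the target text's date.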

August 1, 2021 · Kaspar Beelen, Federico Nanni, Mariona Coll Ardanuy, Kasra Hosseini, Giorgia Tolfo, Barbara McGillivray · Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Neural Language Models for Nineteenth-Century English

Modern NLP models trained on contemporary text struggle with historical language because words shift meanings, new terms emerge, and grammatical patterns evolve. This paper addresses the challenge by training four types of neural language models specifically on nineteenth-century English: 47,685 books spanning 1760-1900, totaling approximately 5.1 billion tokens. The models range from static embeddings (word2vec and fastText) to contextualized models (BERT and Flair). Temporal slicing enables models tailored to specific historical periods, capturing how "railway" meant nothing in 1760 but dominated discourse by 1850. The resulting models consistently improve performance on downstream tasks when analyzing historical documents, demonstrating that language models must be historically situated to work effectively.

Figure: Red dashed lines mark the temporal boundaries used for time-sliced training.

Abstract

We present four types of neural language models trained on a large historical dataset of books in English, published between 1760-1900 and comprised of ~5.1 billion tokens. The language model architectures include static (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances considering different time slices for BERT. Our models have already been used in various downstream tasks where they consistently improved performance. In this paper, we describe how the models have been created and outline their reuse potential. ...
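The time-slicing step that precedes training can be sketched as a simple partition of the corpus by publication year. The cut points below are illustrative, not the boundaries used in the paper; only the partitioning idea is taken from the text above.

```python
from bisect import bisect_right

# Illustrative cut points for a 1760-1900 corpus (hypothetical, not the paper's).
BOUNDARIES = [1850, 1875, 1890]

def slice_of(year: int) -> int:
    """Map a publication year to a training-slice index (0..len(BOUNDARIES))."""
    return bisect_right(BOUNDARIES, year)

def partition(books):
    """Group (year, text) pairs into per-slice training corpora."""
    slices = {i: [] for i in range(len(BOUNDARIES) + 1)}
    for year, text in books:
        slices[slice_of(year)].append(text)
    return slices
```

Each slice's texts are then fed to a separate model instance, yielding language models tailored to their period.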

May 24, 2021 · Kasra Hosseini, Kaspar Beelen, Giovanni Colavizza, Mariona Coll Ardanuy · arXiv preprint

2020

Living Machines: A study of atypical animacy

During the Industrial Revolution, something subtle yet profound happened to English: machines began to "live" in the language. Writers increasingly attributed animate qualities to inanimate objects, describing machines that could work, fail, or behave. This paper tackles the challenge of detecting such atypical animacy using BERT contextualized embeddings trained on nineteenth-century English books. The table demonstrates a striking pattern: when language models from different time periods predict tokens for "They were told that the [MASK] stopped working," more recent models increasingly suggest machine-related words (engine, machinery) rather than human-related ones. This fully unsupervised approach captures the gradual linguistic shift as machines came to be conceptualized as active agents rather than passive tools, providing fine-grained evidence of how technological change reshapes the way we speak.

Abstract

This paper proposes a new approach to animacy detection, the task of determining whether an entity is represented as animate in a text. In particular, this work is focused on atypical animacy and examines the scenario in which typically inanimate objects, specifically machines, are given animate attributes. To address it, we have created the first dataset for atypical animacy detection, based on nineteenth-century sentences in English, with machines represented as either animate or inanimate. Our method builds on recent innovations in language modeling, specifically BERT contextualized word embeddings, to better capture fine-grained contextual properties of words. We present a fully unsupervised pipeline, which can be easily adapted to different contexts, and report its performance on an established animacy dataset and our newly introduced resource. We show that our method provides a substantially more accurate characterization of atypical animacy, especially when applied to highly complex forms of language use. ...
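The masked-prediction test described above can be rendered as a toy: mask the target noun, collect a language model's fill-in candidates, and score animacy by how many candidates are typically animate words. The `predict_masked` stub stands in for a historical BERT fill-mask model, and the word lists are hypothetical; this is an illustration of the idea, not the paper's pipeline.

```python
# Small, hand-picked set of typically animate fillers (hypothetical).
ANIMATE = {"man", "woman", "child", "horse", "workers"}

def predict_masked(sentence: str) -> list[str]:
    """Stub: top-5 predictions a fill-mask model might return for [MASK]."""
    return ["engine", "machinery", "machine", "man", "clock"]

def animacy_score(sentence: str) -> float:
    # Fraction of the model's masked-slot predictions that are animate words:
    # higher scores suggest the context treats the entity as animate.
    preds = predict_masked(sentence)
    return sum(p in ANIMATE for p in preds) / len(preds)
```

Comparing such scores across models trained on different time slices is what surfaces the table's pattern of machines drifting toward agenthood.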

December 1, 2020 · Mariona Coll Ardanuy, Federico Nanni, Kaspar Beelen, Kasra Hosseini, Ruth Ahnert, Jon Lawrence, Katherine McDonough, Giorgia Tolfo, Daniel CS Wilson, Barbara McGillivray · Proceedings of the 28th International Conference on Computational Linguistics (COLING)

Assessing the Impact of OCR Quality on Downstream NLP Tasks

Growing volumes of historical documents are being digitized through optical character recognition (OCR), yet the impact of OCR errors on downstream NLP tasks remains only partially understood. This study systematically quantifies how OCR quality affects popular, out-of-the-box NLP tools across six tasks: sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modeling, and language model fine-tuning. Results reveal that some tasks are more robust to OCR errors than others, with certain analyses suffering irredeemable degradation, providing preliminary guidelines for choosing appropriate methods when working with OCR-generated historical texts.

Figure: Named entity recognition accuracy declines as OCR quality decreases (measured by Levenshtein similarity to human-corrected text), with each point representing one article.

Abstract

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks - sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning - using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR. ...
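The OCR-quality measure referenced above, Levenshtein similarity to a human-corrected ground truth, can be computed with a standard dynamic-programming edit distance normalized by text length (1.0 means identical). A minimal pure-Python sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Wagner-Fischer edit distance with a rolling row to save memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def similarity(ocr: str, truth: str) -> float:
    """Normalized Levenshtein similarity: 1.0 = identical texts."""
    if not ocr and not truth:
        return 1.0
    return 1.0 - levenshtein(ocr, truth) / max(len(ocr), len(truth))
```

Plotting a downstream metric such as NER accuracy against this similarity score per article is how the study visualizes degradation.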

January 1, 2020 · Daniel van Strien, Kaspar Beelen, Mariona Coll Ardanuy, Kasra Hosseini, Barbara McGillivray, Giovanni Colavizza · Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH