Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation
In this paper, we compare LLM-based annotations with human annotations for large-scale product retrieval evaluation, revealing striking differences in error patterns. An analysis of hard disagreements shows that humans and LLMs make complementary errors: humans frequently misjudge brands, products, and categories (often due to annotation fatigue), while LLMs tend to be overly strict or to misunderstand query intent (e.g., interpreting the brand name "On Vacation" literally). This complementary error profile demonstrates that LLMs excel at bulk annotation for common queries, while human expertise remains valuable for nuanced cases involving style and trends. We propose a framework that leverages multimodal LLMs to generate query-specific annotation guidelines and to conduct relevance assessments for large-scale product retrieval evaluation. The framework achieves quality comparable to human annotation while evaluating 20,000 query-product pairs in roughly 20 minutes (versus weeks for humans) at 1,000x lower cost.

Figure: The complete pipeline: (1) query and product retrieval from search logs; (2) LLM-generated, query-specific annotation guidelines; (3) multimodal processing combining text and images; (4) vision-model-generated image descriptions; and (5) LLM-based relevance annotation. The orange rectangle highlights where a single multimodal LLM could streamline the process by directly processing both images and text.

Abstract

Evaluating production-level retrieval systems at scale is a crucial yet challenging task due to the limited availability of a large pool of well-trained human annotators. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks. In this paper, we propose a framework for assessing product search engines in a large-scale e-commerce setting, leveraging multimodal LLMs for (i) generating tailored annotation guidelines for individual queries and (ii) conducting the subsequent annotation task. Our method, validated on a large e-commerce platform, achieves quality comparable to human annotations, significantly reduces time and cost, enables rapid problem discovery, and provides an effective solution for scalable, production-level quality control.
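To make the five pipeline steps above concrete, the sketch below outlines one possible orchestration in Python. It is a minimal illustration under stated assumptions, not the authors' implementation: `llm_complete`, `describe_image`, the `Product` fields, and the prompt wording are hypothetical placeholders for whatever text and vision models a given deployment uses.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Product:
    # Hypothetical minimal product record; a real catalog entry has more fields.
    product_id: str
    title: str
    description: str
    image_url: str

def generate_guideline(query: str, llm_complete: Callable[[str], str]) -> str:
    """Step 2: ask the LLM for a query-specific annotation guideline."""
    prompt = (
        f"Write a concise relevance-annotation guideline for the e-commerce "
        f"search query: '{query}'. Describe what makes a product relevant, "
        f"partially relevant, or irrelevant, including the likely query intent."
    )
    return llm_complete(prompt)

def annotate_pair(query, guideline, product, llm_complete, describe_image):
    """Steps 3-5: fuse product text with an image description, then judge relevance."""
    image_text = describe_image(product.image_url)  # step 4: vision model
    prompt = (
        f"Guideline:\n{guideline}\n\n"
        f"Query: {query}\n"
        f"Product title: {product.title}\n"
        f"Product description: {product.description}\n"
        f"Image description: {image_text}\n\n"
        "Label the product as Relevant, Partially relevant, or Irrelevant. "
        "Answer with the label only."
    )
    return llm_complete(prompt)

def evaluate(query_to_products, llm_complete, describe_image):
    """Run the full loop over (query, retrieved products) pairs from search logs."""
    labels = {}
    for query, products in query_to_products.items():  # step 1: from logs
        guideline = generate_guideline(query, llm_complete)  # one call per query
        for product in products:
            labels[(query, product.product_id)] = annotate_pair(
                query, guideline, product, llm_complete, describe_image
            )
    return labels
```

Note the design choice implied by the pipeline: generating one guideline per query amortizes that LLM call across all products retrieved for the query, and a single multimodal LLM (the orange rectangle in the figure) could replace the separate image-description step by consuming the image directly.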