2026

MapReader: Software and Principles for Computational Map Studies

Britain's Ordnance Survey created the first comprehensive, detailed picture of Great Britain starting in the early nineteenth century, producing tens of thousands of map sheets across multiple series and editions. Thanks to digitization efforts by the National Library of Scotland, anyone can now browse these collections online, but researchers face a fundamental challenge: how do you analyze thousands of maps simultaneously rather than viewing them sheet by sheet? MapReader addresses this through a radical reimagining of maps as computational data. Rather than manually tracing features into Geographic Information Systems with pixel-level precision, MapReader divides map images into user-defined patches, treating each grid square as a unit for creative labeling and automated classification. This epistemological shift rejects the notion that maps are objective records of landscapes, instead embracing them as historical arguments about space and place. The patch-based approach enables computational map studies at previously impossible scales, revealing spatial patterns across local, regional, and national levels while maintaining the critical interpretive lens essential to humanities inquiry.

Summary

MapReader represents an epistemological shift in how historians and humanities scholars engage with digitized map collections at scale. Developed through the Living with Machines project, this chapter introduces computational map studies as a new field that combines scholarly traditions of map interpretation with computational methods designed for analyzing entire collections rather than individual sheets. ...

January 1, 2026 · Katherine McDonough, Ruth Ahnert, Kaspar Beelen, Kasra Hosseini, Jon Lawrence, Valeria Vitale, Kalle Westerling, Daniel Wilson, Rosie Wood · University of London Press (Early Access)

2025

Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation

In this paper, we compare LLM-based annotations with human annotations on large-scale product retrieval evaluation, revealing striking differences in error patterns. Analyzing hard disagreements shows that humans and LLMs make complementary errors: humans frequently misjudge brands, products, and categories (often due to annotation fatigue), while LLMs tend to be overly strict or misunderstand query intent (e.g., interpreting brand name "On Vacation" literally). This complementary error profile demonstrates that LLMs excel at handling bulk annotation work for common queries, while human expertise remains valuable for nuanced cases involving style and trends. We propose a framework that leverages multimodal LLMs to generate query-specific annotation guidelines and conduct relevance assessments for large-scale product retrieval evaluation. The framework achieves comparable quality to human annotations while evaluating 20,000 query-product pairs in ~20 minutes (vs. weeks for humans) at 1,000x lower cost. The complete pipeline includes: (1) query and product retrieval from search logs, (2) LLM-generated query-specific annotation guidelines, (3) multimodal processing combining text and images, (4) vision model-generated image descriptions, and (5) LLM-based relevance annotation. The orange rectangle highlights where a single Multimodal LLM could streamline the process by directly processing both images and text.

Abstract

Evaluating production-level retrieval systems at scale is a crucial yet challenging task due to the limited availability of a large pool of well-trained human annotators. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks.
In this paper, we propose a framework for assessing product search engines in a large-scale e-commerce setting, leveraging multimodal LLMs for (i) generating tailored annotation guidelines for individual queries, and (ii) conducting the subsequent annotation task. Our method, validated on a large e-commerce platform, achieves comparable quality to human annotations, significantly reduces time and cost, enables rapid problem discovery, and provides an effective solution for scalable, production-level quality control. ...

April 3, 2025 · Kasra Hosseini, Thomas Kober, Josip Krapac, Roland Vollgraf, Weiwei Cheng, Ana Peleteiro Ramallo · Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025

Automated dynamic phenotyping of whole oilseed rape (Brassica napus) plants from images collected under controlled conditions

Predicting crop yields under changing climate conditions requires understanding how individual plant components develop over time, but manually measuring leaves, flowers, and pods across thousands of high-resolution images is impractical. This study adapts MapReader, originally developed for analyzing historical maps, to automatically segment and classify plant structures in whole oilseed rape (Brassica napus) images. Panel A shows the original plant image, while panels B-D compare three modeling approaches: 6-label multi-class classification (B), chain of binary classifiers (C), and the top-performing combined approach (D) that first separates plant from background, then classifies five plant structures (branches in light yellow, pods in orange, leaves in red, buds in magenta, flowers in purple). The combined approach achieved a macro-averaged F1-score of 88.50 and a weighted F1-score of 97.71, matching MapReader's performance on historical maps. This interdisciplinary transfer demonstrates how computer vision methods can cross domains from digital humanities to agricultural science, enabling automated phenotyping that could help ensure future food security by integrating genetic and environmental factors into crop yield models.

Abstract

Introduction: Recent advancements in sensor technologies have enabled collection of many large, high-resolution plant image datasets that could be used to non-destructively explore the relationships between genetics, environment and management factors and phenotype, or the physical traits exhibited by plants. The phenotype data captured in these datasets could then be integrated into models of plant development and crop yield to more accurately predict how plants may grow as a result of changing management practices and climate conditions, better ensuring future food security. However, automated methods capable of reliably and efficiently extracting meaningful measurements of individual plant components (e.g.
leaves, flowers, pods) from imagery of whole plants are currently lacking. In this study, we explore interdisciplinary application of MapReader, a computer vision pipeline for annotating and classifying patches of larger images that was originally developed for semantic exploration of historical maps, to time-series images of whole oilseed rape (Brassica napus) plants. ...

January 1, 2025 · Evangeline Corcoran, Kasra Hosseini, Laura Siles, Smita Kurup, Sebastian Ahnert · Frontiers in Plant Science

2024

MapReader: Open software for the visual analysis of maps

MapReader is an open-source software library that transforms how researchers extract information from large image collections, particularly historical maps. This diagram illustrates the modular pipeline architecture and data flow through two core tasks: patch classification (dividing images into small cells and classifying visual features) and text spotting (detecting and recognizing text). Starting from input images (top), users can download maps, annotate patches manually, train computer vision models, and perform inference at scale. The flexible pipeline accommodates both small manually-annotated datasets and large-scale automated analysis, as demonstrated by processing approximately 30.5 million patches in one study. Inspired by biomedical imaging methods and adapted for historians, MapReader has proven its versatility by successfully transferring to plant phenotype research, showcasing the power of open and reproducible research methods. This release, developed through the Living with Machines project, includes extensive documentation and tutorials designed to make large-scale visual map analysis accessible to historians and researchers across disciplines.

Summary

MapReader is an interdisciplinary software library for processing digitized maps and other types of images with two tasks: patch classification and text spotting. Patch classification works by ‘patching’ images into small, custom-sized cells which are then classified according to the user’s needs. Text spotting detects and recognizes text. MapReader offers a flexible pipeline which can be used both for manual annotation of small datasets as well as for computer-vision-based inference of large collections. As an example, in one study, we annotated 62,020 patches, trained a suite of computer vision models and performed model inference on approximately 30.5 million patches. ...

September 1, 2024 · Rosie Wood, Kasra Hosseini, Kalle Westerling, Andrew Smith, Kaspar Beelen, Daniel C. S. Wilson, Katherine McDonough · Journal of Open Source Software

2023

Hunting for Treasure: Living with Machines and the British Library Newspaper Collection

Press Picker: an interactive visualisation tool that lets users scroll vertically through British Library newspaper titles, revealing their temporal coverage from the 1700s to the 2000s. Each row represents a newspaper title; red marks indicate available digitised issues. The tool exposes the uneven distribution of the collection across time and geography, helping researchers understand what is - and is not - represented before drawing conclusions from computational analyses at scale.

Abstract

This chapter describes how the Living with Machines project approached the British Library’s digitised newspaper collection - one of the largest in the world. Through an open-access digitisation programme, the British Library has made hundreds of millions of articles available for computational analysis. Yet working with this collection at scale requires understanding its contours: which titles are included, what time periods they cover, and where the gaps lie. The chapter introduces Press Picker, an interactive visualisation tool for exploring the temporal and geographic distribution of newspaper titles, and presents an Environmental Scan surveying how digitised newspapers have been used in humanities research. Together, these contributions help researchers navigate the collection critically, making visible the biases and silences that shape any large-scale digitised corpus. ...

February 6, 2023 · Giorgia Tolfo, Olivia Vane, Kaspar Beelen, Kasra Hosseini, Jon Lawrence, David Beavan, Katherine McDonough · Digitised Newspapers - A New Eldorado for Historians?, De Gruyter Oldenbourg

2022

Faking feature importance: A cautionary tale on the use of differentially-private synthetic data

Synthetic data promises to unlock sensitive datasets for early-stage analysis while preserving privacy - but can it reliably tell you which features matter? This paper reveals a cautionary finding: differentially-private synthetic data struggles to preserve feature importance rankings across real-world datasets. While the Adult dataset shows promising similarity at higher epsilon values, the Household and Polish datasets tell a different story. The results demonstrate that unless privacy guarantees are relaxed (epsilon > 0.4), synthetic data often performs no better than naive baselines, with high variance across runs. This has critical implications for using synthetic data in the exploratory phase of machine learning workflows in sensitive domains like healthcare and finance.

Abstract

Synthetic datasets are often presented as a silver-bullet solution to the problem of privacy-preserving data publishing. However, for many applications, synthetic data has been shown to have limited utility when used to train predictive models. One promising potential application of these data is in the exploratory phase of the machine learning workflow, which involves understanding, engineering and selecting features. This phase often involves considerable time, and depends on the availability of data. There would be substantial value in synthetic data that permitted these steps to be carried out while, for example, data access was being negotiated, or with fewer information governance restrictions. This paper presents an empirical analysis of the agreement between the feature importance obtained from raw and from synthetic data, on a range of artificially generated and real-world datasets (where feature importance represents how useful each feature is when predicting the outcome). We employ two differentially-private methods to produce synthetic data, and apply various utility measures to quantify the agreement in feature importance as this varies with the level of privacy.
Our results indicate that synthetic data can sometimes preserve several representations of the ranking of feature importance in simple settings but their performance is not consistent and depends upon a number of factors. Particular caution should be exercised in more nuanced real-world settings, where synthetic data can lead to differences in ranked feature importance that could alter key modelling decisions. This work has important implications for developing synthetic versions of highly sensitive data sets in fields such as finance and healthcare.
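The kind of agreement the paper measures can be illustrated with a Spearman rank correlation between feature-importance vectors obtained from raw and from synthetic data. The sketch below is a minimal stdlib-only example; the function names and toy numbers are illustrative, not taken from the paper.

```python
def rank(values):
    """Rank features by descending importance (0 = most important)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(raw_importance, synth_importance):
    """Spearman rank correlation between two importance vectors (assumes no ties)."""
    ra, rs = rank(raw_importance), rank(synth_importance)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rs))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical importances from models trained on raw vs. synthetic data:
raw = [0.40, 0.25, 0.20, 0.10, 0.05]
synth = [0.35, 0.30, 0.05, 0.20, 0.10]
print(spearman(raw, synth))
```

A correlation of 1.0 means the synthetic data preserved the ranking exactly; values near 0 mean the ranking is essentially uninformative, which is the failure mode the paper warns about at strong privacy levels.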

March 2, 2022 · Oscar Giles, Kasra Hosseini, Grigorios Mingas, Oliver Strickson, Louise Bowler, Camila Rangel Smith, Harrison Wilde, Jen Ning Lim, Bilal Mateen, Kasun Amarasinghe, Rayid Ghani, Alison Heppenstall, Nik Lomax, Nick Malleson, Martin O'Reilly, Sebastian Vollmer · arXiv preprint

A Dataset for Toponym Resolution in Nineteenth-Century English Newspapers

Unlocking the geography of nineteenth-century British newspapers requires robust methods for identifying and linking place names - a challenging task complicated by OCR errors, historical spelling variations, and local references that assume regional knowledge. This paper introduces a meticulously annotated dataset of 343 articles from four English locations (Manchester, Ashton-under-Lyne, Poole, and Dorchester) spanning 1780-1870, containing 3,364 manually annotated toponyms. Unlike previous datasets, this resource emphasizes the geographical peculiarities of provincial press, where local place names dominate but vary dramatically by region and decade. The table shows the careful distribution of annotations across time and place, revealing how newspaper geography evolved during industrialization. With high inter-annotator agreement (0.87 for detection, 0.89 for linking), this benchmark dataset enables researchers to develop and test toponym resolution methods specifically designed for noisy historical texts with strong local contexts.

Abstract

We present a new dataset for the task of toponym resolution in digitized historical newspapers in English. It consists of 343 annotated articles from newspapers based in four different locations in England (Manchester, Ashton-under-Lyne, Poole and Dorchester), published between 1780 and 1870. The articles have been manually annotated with mentions of places, which are linked - whenever possible - to their corresponding entry on Wikipedia. The dataset consists of 3,364 annotated toponyms, of which 2,784 have been provided with a link to Wikipedia. The dataset is published in the British Library shared research repository, and is especially of interest to researchers working on improving semantic access to historical newspaper content. ...

January 24, 2022 · Mariona Coll Ardanuy, David Beavan, Kaspar Beelen, Kasra Hosseini, Jon Lawrence, Katherine McDonough, Federico Nanni, Daniel van Strien, Daniel C. S. Wilson · Journal of Open Humanities Data

MapReader: a computer vision pipeline for the semantic exploration of maps at scale

Historical maps contain rich information about past landscapes, but extracting data from thousands of maps has traditionally required painstaking manual annotation. MapReader automates this process using computer vision, making large-scale map analysis accessible to users without deep learning expertise. The pipeline divides maps into patches (see insets), trains neural networks to recognize visual features like railways (a, shown in red in c,d) and buildings (b, shown in black in c,d), then reconstructs predictions across entire map sheets. Applied to approximately 16,000 nineteenth-century British Ordnance Survey maps (roughly 30.5 million patches), MapReader transforms visual cartographic information into structured, machine-readable data. The resulting datasets can be queried spatially, analyzed for patterns, and linked to other historical sources, enabling researchers to ask questions at scales previously impossible.

Abstract

We present MapReader, a free, open-source software library written in Python for analyzing large map collections. MapReader allows users with little computer vision expertise to i) retrieve maps via web-servers; ii) preprocess and divide them into patches; iii) annotate patches; iv) train, fine-tune, and evaluate deep neural network models; and v) create structured data about map content. We demonstrate how MapReader enables historians to interpret a collection of ≈16K nineteenth-century maps of Britain (≈30.5M patches), foregrounding the challenge of translating visual markers into machine-readable data. We present a case study focusing on rail and buildings. We also show how the outputs from the MapReader pipeline can be linked to other, external datasets. We release ≈62K manually annotated patches used here for training and evaluating the models. ...
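The patching step at the heart of this pipeline can be sketched in a few lines of Python. The `patchify` helper below is a hypothetical illustration of the idea (tiling an image into fixed-size grid cells, with smaller cells at the right and bottom edges), not MapReader's actual API.

```python
def patchify(width, height, patch_size):
    """Yield (x, y, w, h) boxes tiling an image left-to-right, top-to-bottom.

    Edge patches are clipped, so they may be smaller than patch_size.
    """
    for y in range(0, height, patch_size):
        for x in range(0, width, patch_size):
            yield (x, y, min(patch_size, width - x), min(patch_size, height - y))

# A 1000x600 px map sheet cut into 256 px patches gives a 4x3 grid:
boxes = list(patchify(1000, 600, 256))
# each box can then be cropped, annotated, and classified independently
```

Each box becomes the unit of annotation and classification, which is what lets a collection of ~16K sheets expand into ~30.5M classifiable patches.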

January 1, 2022 · Kasra Hosseini, Daniel C. S. Wilson, Kaspar Beelen, Katherine McDonough · Proceedings of the 6th ACM SIGSPATIAL International Workshop on Geospatial Humanities

2021

When Time Makes Sense: A Historically-Aware Approach to Targeted Sense Disambiguation

Words change meaning over time, and computational models that ignore this temporal dimension miss crucial context for understanding historical texts. This paper introduces time-sensitive Targeted Sense Disambiguation (TSD), which detects specific word senses in historical documents by accounting for when the text was written. The figure reveals a key insight: optimal date ranges for each language model (measured by F1-score using the sense centroid method) show that matching the model's training period to the target text's era dramatically improves performance. The x-axis represents average points of rolling 100-year quotation date ranges from the Oxford English Dictionary. We train historical BERT models on nineteenth-century English books and create historically evolving sense representations using the OED and its Historical Thesaurus. Results demonstrate that historical language models consistently outperform modern ones, and time-sensitive methods prove especially valuable for older documents - confirming that when it comes to word meaning, time really does make sense.

Abstract

As languages evolve historically, making computational approaches sensitive to time can improve performance on specific tasks. In this work, we assess whether applying historical language models and time-aware methods help with determining the correct sense of polysemous words. We outline the task of time-sensitive Targeted Sense Disambiguation (TSD), which aims to detect instances of a sense or set of related senses in historical and time-stamped texts, and address two main goals: 1) we scrutinize the effect of applying historical language models on the performance of several TSD methods and 2) we assess different disambiguation methods that take into account the year in which a text was produced.
We train historical BERT models on a corpus of nineteenth-century English books and draw on the Oxford English Dictionary (and its Historical Thesaurus) to create historically evolving sense representations. Our results show that using historical language models consistently improves performance whereas time-sensitive disambiguation helps especially with older documents. ...

August 1, 2021 · Kaspar Beelen, Federico Nanni, Mariona Coll Ardanuy, Kasra Hosseini, Giorgia Tolfo, Barbara McGillivray · Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Neural Language Models for Nineteenth-Century English

Modern NLP models trained on contemporary text struggle with historical language because words shift meanings, new terms emerge, and grammatical patterns evolve. This paper addresses the challenge by training four types of neural language models specifically on nineteenth-century English: 47,685 books spanning 1760-1900, totaling approximately 5.1 billion tokens. The models range from static embeddings (word2vec and fastText) to contextualized models (BERT and Flair), with red dashed lines marking temporal boundaries for time-sliced training. This temporal slicing enables models tailored to specific historical periods, capturing how "railway" meant nothing in 1760 but dominated discourse by 1850. The resulting models consistently improve performance on downstream tasks when analyzing historical documents, demonstrating that language models must be historically situated to work effectively.

Abstract

We present four types of neural language models trained on a large historical dataset of books in English, published between 1760-1900 and comprising ~5.1 billion tokens. The language model architectures include static (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances considering different time slices for BERT. Our models have already been used in various downstream tasks where they consistently improved performance. In this paper, we describe how the models have been created and outline their reuse potential. ...

May 24, 2021 · Kasra Hosseini, Kaspar Beelen, Giovanni Colavizza, Mariona Coll Ardanuy · arXiv preprint

Maps of a Nation? The Digitized Ordnance Survey for New Historical Research

Although the Ordnance Survey has been the subject of historical research, scholars have not systematically used its maps as primary sources, partly due to technical barriers in accessing and processing large collections. This paper outlines a computer vision pipeline for analyzing thousands of digitized 25-inch Ordnance Survey maps simultaneously rather than individually. The visualization shows digitization coverage across different map editions, revealing where sheet holdings remain undigitized and how coverage varies between editions. By creating item-level metadata and applying machine learning methods to extract spatial features, the 'patchwork method' transforms map collections into interrogable corpora. This approach enables new forms of historical inquiry based on spatial analysis and allows scholars to adopt an overall view of territory. The paper highlights parallels between today's users of digitized maps and nineteenth-century predecessors who faced a similar inflection point as the project to map the nation approached completion.

Abstract

Although the Ordnance Survey has itself been the subject of historical research, scholars have not systematically used its maps as primary sources of information. This is partly for disciplinary reasons and partly for the technical reason that high-quality maps have not until recently been available digitally, geo-referenced, and in color. A final, and crucial, addition has been the creation of item-level metadata which allows map collections to become corpora which can for the first time be interrogated en masse as source material. By applying new Computer Vision methods leveraging machine learning, we outline a research pipeline for working with thousands (rather than a handful) of maps at once, which enables new forms of historical inquiry based on spatial analysis.
Our ‘patchwork method’ draws on the longstanding desire to adopt an overall or ‘complete’ view of a territory, and in so doing highlights certain parallels between the situation faced by today’s users of digitized maps, and a similar inflexion point faced by their predecessors in the nineteenth century, as the project to map the nation approached a form of completion. ...

April 17, 2021 · Kasra Hosseini, Katherine McDonough, Daniel van Strien, Olivia Vane, Daniel C. S. Wilson · Journal of Victorian Culture

2020

Living Machines: A study of atypical animacy

During the Industrial Revolution, something subtle yet profound happened to English: machines began to "live" in the language. Writers increasingly attributed animate qualities to inanimate objects, describing machines that could work, fail, or behave. This paper tackles the challenge of detecting such atypical animacy using BERT contextualized embeddings trained on nineteenth-century English books. The table demonstrates a striking pattern: when different time-period language models predict tokens for "They were told that the [MASK] stopped working," more recent models increasingly suggest machine-related words (engine, machinery) rather than human-related ones. This fully unsupervised approach captures the gradual linguistic shift as machines became conceptualized as active agents rather than passive tools, providing fine-grained evidence of how technological change reshapes the way we speak.

Abstract

This paper proposes a new approach to animacy detection, the task of determining whether an entity is represented as animate in a text. In particular, this work is focused on atypical animacy and examines the scenario in which typically inanimate objects, specifically machines, are given animate attributes. To address it, we have created the first dataset for atypical animacy detection, based on nineteenth-century sentences in English, with machines represented as either animate or inanimate. Our method builds on recent innovations in language modeling, specifically BERT contextualized word embeddings, to better capture fine-grained contextual properties of words. We present a fully unsupervised pipeline, which can be easily adapted to different contexts, and report its performance on an established animacy dataset and our newly introduced resource. We show that our method provides a substantially more accurate characterization of atypical animacy, especially when applied to highly complex forms of language use. ...

December 1, 2020 · Mariona Coll Ardanuy, Federico Nanni, Kaspar Beelen, Kasra Hosseini, Ruth Ahnert, Jon Lawrence, Katherine McDonough, Giorgia Tolfo, Daniel CS Wilson, Barbara McGillivray · Proceedings of the 28th International Conference on Computational Linguistics (COLING)

A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching

When a historical newspaper mentions "Manchester," does it refer to Manchester, England, or one of the 30+ other Manchesters worldwide? Candidate selection narrows down which entities a recognized place name could plausibly refer to, a critical but often overlooked step before full entity resolution. This paper applies state-of-the-art neural networks to toponym matching, handling the substantial variation that makes place names so challenging: cross-lingual variations (München vs. Munich), regional differences (neighborhood names that don't appear in gazetteers), and OCR errors that corrupt spellings. The evaluation table shows F1 scores across these challenging scenarios in English and Spanish datasets. By improving candidate selection, the method enables more accurate downstream analysis of where historical texts are actually talking about, unlocking the geographic dimension of digitized archives.

Abstract

Recognizing toponyms and resolving them to their real-world referents is required to provide advanced semantic access to textual data. This process is often hindered by the high degree of variation in toponyms. Candidate selection is the task of identifying the potential entities that can be referred to by a previously recognized toponym. While it has traditionally received little attention, candidate selection has a significant impact on downstream tasks (i.e. entity resolution), especially in noisy or non-standard text. In this paper, we introduce a deep learning method for candidate selection through toponym matching, using state-of-the-art neural network architectures. We perform an intrinsic toponym matching evaluation based on several datasets, which cover various challenging scenarios (cross-lingual and regional variations, as well as OCR errors) and assess its performance in the context of geographical candidate selection in English and Spanish. ...
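The candidate selection task described here can be sketched with a simple string-similarity baseline: given a (possibly OCR-corrupted) mention, retrieve all gazetteer entries whose name is similar enough. The toy gazetteer, the `difflib` ratio, and the 0.8 threshold below are illustrative stand-ins for the paper's learned neural matching model.

```python
from difflib import SequenceMatcher

# Hypothetical gazetteer: a surface name maps to the entities it may denote.
GAZETTEER = {
    "Manchester": ["Manchester, England", "Manchester, New Hampshire", "Manchester, Jamaica"],
    "Munich": ["Munich, Bavaria"],
}

def candidates(mention, threshold=0.8):
    """Return (entity, score) pairs whose gazetteer name resembles the mention."""
    hits = []
    for name, entities in GAZETTEER.items():
        score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if score >= threshold:
            hits.extend((entity, score) for entity in entities)
    return sorted(hits, key=lambda pair: -pair[1])

# An OCR-corrupted mention still surfaces all plausible Manchesters:
print(candidates("Manchestr"))
```

Downstream entity resolution then picks one referent from this shortlist; the paper's contribution is replacing the crude similarity function with learned representations that also handle cross-lingual variants such as München/Munich.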

November 13, 2020 · Mariona Coll Ardanuy, Kasra Hosseini, Katherine McDonough, Amrey Krause, Daniel van Strien, Federico Nanni · Proceedings of the 28th International Conference on Advances in Geographic Information Systems (SIGSPATIAL)

DeezyMatch: A Flexible Deep Learning Approach to Fuzzy String Matching

How do you match "Newe Yorke" to "New York" or recognize that "Londan" (from a poorly OCR'd document) refers to "London"? DeezyMatch addresses fuzzy string matching through deep learning with transfer learning capabilities, particularly valuable when training data is scarce. The architecture has two components: a pair classifier (left) that trains neural networks to recognize similar strings with learnable parameters (blue) that can be fine-tuned for new domains, and a candidate ranker (right) that generates vector representations and ranks matches using similarity metrics like cosine distance. By enabling transfer learning, DeezyMatch handles the messy realities of historical text analysis, where spelling variations, OCR errors, and limited annotated examples are the norm rather than the exception.

Abstract

We present DeezyMatch, a free, open-source software library written in Python for fuzzy string matching and candidate ranking. Its pair classifier supports various deep neural network architectures for training new classifiers and for fine-tuning a pretrained model, which paves the way for transfer learning in fuzzy string matching. This approach is especially useful where only limited training examples are available. The learned DeezyMatch models can be used to generate rich vector representations from string inputs. The candidate ranker component in DeezyMatch uses these vector representations to find, for a given query, the best matching candidates in a knowledge base. It uses an adaptive searching algorithm applicable to large knowledge bases and query sets. We describe DeezyMatch’s functionality, design and implementation, accompanied by a use case in toponym matching and candidate ranking in realistic noisy datasets. ...
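The candidate ranker's core idea, retrieving the nearest knowledge-base entries by vector similarity, can be sketched as follows. The vectors here are toy stand-ins for DeezyMatch's learned string representations, and `rank_candidates` is a hypothetical helper, not part of the library's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, kb):
    """Rank knowledge-base entries by descending similarity to the query vector."""
    return sorted(kb, key=lambda name: -cosine(query_vec, kb[name]))

# Toy vectors standing in for learned representations of candidate strings:
kb = {
    "London": [0.9, 0.1, 0.0],
    "Londonderry": [0.7, 0.3, 0.1],
    "Paris": [0.0, 0.2, 0.9],
}
# A query vector for the OCR'd mention "Londan" should land nearest "London":
print(rank_candidates([0.88, 0.12, 0.02], kb))
```

In practice the ranker replaces this exhaustive loop with an adaptive search so it scales to large knowledge bases and query sets, as the abstract notes.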

October 1, 2020 · Kasra Hosseini, Federico Nanni, Mariona Coll Ardanuy · Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Smart Monitoring for Conservation Areas

The pipeline has two phases. During training (top), articles are vectorised, classified as threat-relevant or not, and evaluated; a smart-annotation loop based on active learning then selects the most informative samples for human review, iteratively expanding both training and validation sets. At prediction time (bottom), the trained classifier labels new, unlabelled news articles, then a downstream module extracts place mentions, organisations, facilities, and dates from the positives before reporting results to WWF. The architecture achieved 96% recall and 82% precision on conservation-threat detection, enabling near-real-time monitoring of emerging risks to protected areas worldwide.

Abstract

This report documents the outcomes of a Data Study Group held at The Alan Turing Institute in collaboration with WWF Conservation Intelligence. The challenge focused on developing data science techniques to automatically detect news articles reporting emerging threats to protected areas. The project explored approaches ranging from keyword-based filtering to fine-tuned neural language models (BERT) for classifying news articles as relevant conservation threats, particularly infrastructure developments near protected sites. The best-performing model achieved 96% recall and 82% precision, significantly outperforming baseline approaches and demonstrating the feasibility of real-time, automated conservation threat monitoring using NLP. ...
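The smart-annotation loop's selection step can be sketched as simple uncertainty sampling: pick the unlabelled articles whose predicted threat probability is closest to 0.5, i.e. where the classifier is least certain and a human label is most informative. This is a toy illustration under that assumption, not the report's actual implementation.

```python
def select_for_annotation(pool, scores, k=2):
    """Return the k documents the classifier is least certain about.

    pool: document ids; scores: predicted probability of being a threat.
    Certainty is measured as distance of the score from the 0.5 decision boundary.
    """
    by_uncertainty = sorted(pool, key=lambda doc: abs(scores[doc] - 0.5))
    return by_uncertainty[:k]

# Hypothetical classifier outputs on four unlabelled news articles:
scores = {"article_a": 0.97, "article_b": 0.52, "article_c": 0.08, "article_d": 0.45}
print(select_for_annotation(list(scores), scores))
```

The selected articles go to human annotators, the model is retrained on the expanded set, and the loop repeats until performance plateaus.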

June 5, 2020 · Kasra Hosseini, Mariona Coll Ardanuy, Daniel J. Patterson, Lorena Garcia-Velez, Lucia Castro-Gonzalez, Lena Deecke, et al. · Data Study Group Final Report: WWF, The Alan Turing Institute

Assessing the Impact of OCR Quality on Downstream NLP Tasks

Growing volumes of historical documents are being digitized through optical character recognition (OCR), yet the impact of OCR errors on downstream NLP tasks remains only partially understood. This study systematically quantifies how OCR quality affects popular, out-of-the-box NLP tools across six tasks: sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modeling, and language model fine-tuning. The figure shows named entity recognition accuracy declining as OCR quality decreases (measured by Levenshtein similarity to human-corrected text), with each point representing one article. Results reveal that some tasks are more robust to OCR errors than others, with certain analyses suffering irredeemable degradation, providing preliminary guidelines for choosing appropriate methods when working with OCR-generated historical texts.

Abstract: A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks - sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning - using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR. ...
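Levenshtein similarity, the OCR-quality measure on the figure's x-axis, can be sketched as edit distance normalised by the longer string's length (the paper's exact normalisation may differ; this is one common formulation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insertions,
    deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(ocr, gold):
    """1.0 = identical to the human-corrected text, 0.0 = nothing shared."""
    if not ocr and not gold:
        return 1.0
    return 1 - levenshtein(ocr, gold) / max(len(ocr), len(gold))

# Three character-level OCR errors in a 19-character line:
print(round(levenshtein_similarity("Tle qvick brown f0x",
                                   "The quick brown fox"), 2))  # → 0.84
```

Computing this score per article is what allows the study to plot each downstream task's accuracy against a continuous measure of OCR quality rather than a coarse good/bad split.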

January 1, 2020 · Daniel van Strien, Kaspar Beelen, Mariona Coll Ardanuy, Kasra Hosseini, Barbara McGillivray, Giovanni Colavizza · Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH

2019

Resolving places, past and present: toponym resolution in historical British newspapers using multiple resources

Historical newspapers encode a rich geography that connects local stories to global events - but unlocking this spatial dimension requires resolving place names that have changed, moved, or disappeared over time. This heat map visualizes all geotagged articles in our WikiGazetteer, revealing the remarkable geographic scope of 19th-century British newspapers: from dense coverage of the British Isles to international reporting spanning every continent. Earth's surface was subdivided into 0.25° blocks (approximately 25km), with the logarithmic colorbar showing concentrations of place mentions. Our approach combines three key innovations: an expansive definition of locatable entities (not just cities, but buildings, streets, and regions), knowledge bases derived from contemporaneous historical sources, and contextual information to disambiguate ambiguous place names. By bridging historical and modern geographic resources, this method enables researchers to trace how newspapers constructed spatial narratives during a period of rapid globalization and imperial expansion.

Abstract: Newspapers and their metadata are richly geographical, not only in their distribution but also their content. Attending to these spatial features is a prerequisite in newspaper research. Following other projects to have geoparsed place names in newspapers, we describe our approach to linking historical geospatial information in text to real-world locations which 1) adopts an expansive definition of what counts as a locatable entity; 2) uses knowledge bases derived from contemporaneous sources; and 3) leverages contextual information to disambiguate hard-to-locate places. This method depends on combining historical and non-historical resources and the paper discusses the potential benefits for humanities research. ...
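The heat map's gridding step - assigning each geotagged mention to a 0.25° block and log-scaling the counts - can be sketched as follows. The coordinates and the exact binning/scaling conventions here are illustrative assumptions, not taken from the paper:

```python
import math
from collections import Counter

def block_index(lat, lon, size=0.25):
    """Map a (lat, lon) coordinate to the 0.25-degree grid cell containing it.
    math.floor handles negative (southern/western) coordinates correctly."""
    return (math.floor(lat / size), math.floor(lon / size))

# Hypothetical geotagged place mentions: two in London, one in Manchester,
# one in New York.
mentions = [(51.51, -0.13), (51.50, -0.12), (53.48, -2.24), (40.71, -74.01)]
counts = Counter(block_index(lat, lon) for lat, lon in mentions)

# Logarithmic intensity per cell, in the spirit of the heat map's colorbar.
intensity = {cell: math.log10(n) + 1 for cell, n in counts.items()}
print(counts[block_index(51.51, -0.13)])  # → 2
```

Both London mentions fall into the same 0.25° cell, so dense coverage of the British Isles accumulates into bright cells while the long tail of international mentions stays visible thanks to the logarithmic scale.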

November 5, 2019 · Mariona Coll Ardanuy, Katherine McDonough, Amrey Krause, Daniel C. S. Wilson, Kasra Hosseini, Daniel van Strien · Proceedings of the 13th Workshop on Geographic Information Retrieval