Toponym Resolution Dataset Overview
Identifying place names in nineteenth-century British newspapers and linking them to the places they denote is complicated by OCR errors, historical spelling variation, and local references that assume regional knowledge. This paper introduces a manually annotated dataset of 343 articles from newspapers based in four English locations (Manchester, Ashton-under-Lyne, Poole, and Dorchester), published between 1780 and 1870 and containing 3,364 annotated toponyms. Unlike previous datasets, this resource emphasizes the geographical characteristics of the provincial press, where local place names dominate but vary considerably by region and decade. The table shows the distribution of annotations across time and place, indicating how newspaper geography changed during industrialization. With high inter-annotator agreement (0.87 for detection, 0.89 for linking), the dataset serves as a benchmark for developing and testing toponym resolution methods designed for noisy historical texts with strong local contexts.

Abstract

We present a new dataset for the task of toponym resolution in digitized historical newspapers in English. It consists of 343 annotated articles from newspapers based in four different locations in England (Manchester, Ashton-under-Lyne, Poole and Dorchester), published between 1780 and 1870. The articles have been manually annotated with mentions of places, which are linked, whenever possible, to their corresponding entry on Wikipedia. The dataset consists of 3,364 annotated toponyms, of which 2,784 have been provided with a link to Wikipedia. The dataset is published in the British Library shared research repository, and is especially of interest to researchers working on improving semantic access to historical newspaper content.

Keywords: benchmark, dataset, geographic information retrieval, newspapers, nineteenth-century English, toponym resolution