2026

MapReader: Software and Principles for Computational Map Studies

Britain's Ordnance Survey created the first comprehensive, detailed picture of Great Britain starting in the early nineteenth century, producing tens of thousands of map sheets across multiple series and editions. Thanks to digitization efforts by the National Library of Scotland, anyone can now browse these collections online, but researchers face a fundamental challenge: how do you analyze thousands of maps simultaneously rather than viewing them sheet by sheet? MapReader addresses this through a radical reimagining of maps as computational data. Rather than manually tracing features into Geographic Information Systems with pixel-level precision, MapReader divides map images into user-defined patches, treating each grid square as a unit for creative labeling and automated classification. This epistemological shift rejects the notion that maps are objective records of landscapes, instead embracing them as historical arguments about space and place. The patch-based approach enables computational map studies at previously impossible scales, revealing spatial patterns across local, regional, and national levels while maintaining the critical interpretive lens essential to humanities inquiry.

Summary

MapReader represents an epistemological shift in how historians and humanities scholars engage with digitized map collections at scale. Developed through the Living with Machines project, this chapter introduces computational map studies as a new field that combines scholarly traditions of map interpretation with computational methods designed for analyzing entire collections rather than individual sheets. ...

January 1, 2026 · Katherine McDonough, Ruth Ahnert, Kaspar Beelen, Kasra Hosseini, Jon Lawrence, Valeria Vitale, Kalle Westerling, Daniel Wilson, Rosie Wood · University of London Press (Early Access)

2024

MapReader: Open software for the visual analysis of maps

MapReader is an open-source software library that transforms how researchers extract information from large image collections, particularly historical maps. This diagram illustrates the modular pipeline architecture and data flow through two core tasks: patch classification (dividing images into small cells and classifying visual features) and text spotting (detecting and recognizing text). Starting from input images (top), users can download maps, annotate patches manually, train computer vision models, and perform inference at scale. The flexible pipeline accommodates both small manually-annotated datasets and large-scale automated analysis, as demonstrated by processing approximately 30.5 million patches in one study. Inspired by biomedical imaging methods and adapted for historians, MapReader has proven its versatility by successfully transferring to plant phenotype research, showcasing the power of open and reproducible research methods. This release, developed through the Living with Machines project, includes extensive documentation and tutorials designed to make large-scale visual map analysis accessible to historians and researchers across disciplines.

Summary

MapReader is an interdisciplinary software library for processing digitized maps and other types of images with two tasks: patch classification and text spotting. Patch classification works by ‘patching’ images into small, custom-sized cells which are then classified according to the user’s needs. Text spotting detects and recognizes text. MapReader offers a flexible pipeline which can be used both for manual annotation of small datasets as well as for computer-vision-based inference of large collections. As an example, in one study, we annotated 62,020 patches, trained a suite of computer vision models, and performed model inference on approximately 30.5 million patches. ...
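
The core ‘patching’ idea is easy to sketch. The snippet below is a minimal NumPy illustration of dividing an image array into fixed-size cells, not MapReader's actual API (the real library works on map sheets with metadata and offers pixel- or metre-based patch sizes):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> list:
    """Divide an image (H x W [x C]) into non-overlapping square patches.

    Edge patches smaller than patch_size are dropped here for simplicity;
    MapReader itself offers more options for handling sheet edges.
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

# A toy 200 x 300 "map sheet" cut into 100-pixel patches -> 2 x 3 = 6 patches
sheet = np.zeros((200, 300, 3), dtype=np.uint8)
patches = patchify(sheet, 100)
print(len(patches))  # 6
```

Each returned patch is then a unit for annotation or for classification by a trained model, which is what makes the pipeline scale to tens of millions of cells.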

September 1, 2024 · Rosie Wood, Kasra Hosseini, Kalle Westerling, Andrew Smith, Kaspar Beelen, Daniel C. S. Wilson, Katherine McDonough · Journal of Open Source Software

2023

Ten simple rules for working with other people's code

Working with other people's code is a common challenge in research, yet receives less attention than best practices for writing code. This paper presents ten pragmatic rules for researchers at all levels who need to use, modify, or build upon existing research software. The rules are organized into four interconnected phases: planning your approach, understanding the codebase, making changes safely, and publishing your work. Unlike industry software development practices that may introduce unrealistic time burdens, these guidelines acknowledge research realities such as time pressures, niche tools built without formal software engineering practices, and subtle bugs that can be difficult to detect. The framework is iterative rather than linear, recognizing that as understanding of a codebase grows, goals and strategies may need to be reassessed.

Abstract

Every time that you use a computer, you are using someone else’s code, whether that be an operating system, a word processor, a web application, research tools, or simply code snippets. Almost all code has some bugs and errors. In day-to-day life, these bugs are usually not too important or at least obvious when they do happen (think of an operating system crashing). However, in research, there is a perfect storm that makes working with other people’s code particularly challenging: research needs to be correct and accurate, researchers often use niche and non-commercial tools that are not built with best software practices, bugs can be subtle and hard to detect, and researchers have time pressures to get things done quickly. It is no surprise then that working with other people’s code is a common frustration for researchers and is even considered a rite of passage. ...

January 1, 2023 · Charlie Pilgrim, Paul Kent, Kasra Hosseini, Ed Chalstrey · PLOS Computational Biology

2022

MapReader: a computer vision pipeline for the semantic exploration of maps at scale

Historical maps contain rich information about past landscapes, but extracting data from thousands of maps has traditionally required painstaking manual annotation. MapReader automates this process using computer vision, making large-scale map analysis accessible to users without deep learning expertise. The pipeline divides maps into patches (see insets), trains neural networks to recognize visual features like railways (a, shown in red in c,d) and buildings (b, shown in black in c,d), then reconstructs predictions across entire map sheets. Applied to approximately 16,000 nineteenth-century British Ordnance Survey maps (roughly 30.5 million patches), MapReader transforms visual cartographic information into structured, machine-readable data. The resulting datasets can be queried spatially, analyzed for patterns, and linked to other historical sources, enabling researchers to ask questions at scales previously impossible.

Abstract

We present MapReader, a free, open-source software library written in Python for analyzing large map collections. MapReader allows users with little computer vision expertise to i) retrieve maps via web-servers; ii) preprocess and divide them into patches; iii) annotate patches; iv) train, fine-tune, and evaluate deep neural network models; and v) create structured data about map content. We demonstrate how MapReader enables historians to interpret a collection of ≈16K nineteenth-century maps of Britain (≈30.5M patches), foregrounding the challenge of translating visual markers into machine-readable data. We present a case study focusing on rail and buildings. We also show how the outputs from the MapReader pipeline can be linked to other, external datasets. We release ≈62K manually annotated patches used here for training and evaluating the models. ...
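
Querying patch-level predictions spatially requires mapping each patch's grid position back to geographic coordinates. The sketch below assumes a simple north-up affine georeference with hypothetical numbers; it illustrates the idea, not MapReader's implementation (real sheets need the full affine transform from their georeferencing metadata):

```python
def patch_centre_lonlat(row, col, patch_size, origin_lon, origin_lat,
                        lon_per_px, lat_per_px):
    """Map a patch's (row, col) grid position to the lon/lat of its centre.

    Assumes pixel (0, 0) is the sheet's top-left corner at
    (origin_lon, origin_lat), with constant degrees per pixel.
    """
    px_x = col * patch_size + patch_size / 2   # centre pixel, x
    px_y = row * patch_size + patch_size / 2   # centre pixel, y
    lon = origin_lon + px_x * lon_per_px
    lat = origin_lat - px_y * lat_per_px       # y grows downward in images
    return lon, lat

# Patch (row=1, col=2) of 100 px on a sheet whose top-left is (-2.0, 53.0)
lon, lat = patch_centre_lonlat(1, 2, 100, -2.0, 53.0, 1e-4, 1e-4)
print(lon, lat)  # -1.975 52.985
```

Once every "rail" or "building" patch carries coordinates like these, the predictions become a queryable spatial dataset that can be joined to census records, gazetteers, or other historical sources.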

January 1, 2022 · Kasra Hosseini, Daniel C. S. Wilson, Kaspar Beelen, Katherine McDonough · Proceedings of the 6th ACM SIGSPATIAL International Workshop on Geospatial Humanities

2020

DeezyMatch: A Flexible Deep Learning Approach to Fuzzy String Matching

How do you match "Newe Yorke" to "New York" or recognize that "Londan" (from a poorly OCR'd document) refers to "London"? DeezyMatch addresses fuzzy string matching through deep learning with transfer learning capabilities, particularly valuable when training data is scarce. The architecture has two components: a pair classifier (left) that trains neural networks to recognize similar strings with learnable parameters (blue) that can be fine-tuned for new domains, and a candidate ranker (right) that generates vector representations and ranks matches using similarity metrics like cosine distance. By enabling transfer learning, DeezyMatch handles the messy realities of historical text analysis, where spelling variations, OCR errors, and limited annotated examples are the norm rather than the exception.

Abstract

We present DeezyMatch, a free, open-source software library written in Python for fuzzy string matching and candidate ranking. Its pair classifier supports various deep neural network architectures for training new classifiers and for fine-tuning a pretrained model, which paves the way for transfer learning in fuzzy string matching. This approach is especially useful where only limited training examples are available. The learned DeezyMatch models can be used to generate rich vector representations from string inputs. The candidate ranker component in DeezyMatch uses these vector representations to find, for a given query, the best matching candidates in a knowledge base. It uses an adaptive searching algorithm applicable to large knowledge bases and query sets. We describe DeezyMatch’s functionality, design and implementation, accompanied by a use case in toponym matching and candidate ranking in realistic noisy datasets. ...
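
The candidate-ranking step reduces to nearest-neighbour search over learned string vectors. The following is a minimal NumPy sketch of cosine-similarity ranking with toy hand-made vectors; the real library learns its vectors with a neural pair classifier and uses an adaptive search over large knowledge bases:

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs, candidate_names, top_k=3):
    """Rank knowledge-base candidates by cosine similarity to a query vector.

    Returns (name, similarity) pairs, most similar first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity per candidate
    order = np.argsort(-sims)[:top_k]  # indices of the best matches
    return [(candidate_names[i], float(sims[i])) for i in order]

# Toy 2-D "string embeddings": the query vector sits closest to "London"
names = ["London", "New York", "Paris"]
vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
query = np.array([1.0, 0.0])
print(rank_candidates(query, vecs, names)[0][0])  # London
```

In practice the embedding model, not the ranking arithmetic, does the heavy lifting: a well-trained pair classifier places "Londan" near "London" even though their characters differ.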

October 1, 2020 · Kasra Hosseini, Federico Nanni, Mariona Coll Ardanuy · Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

2019

AxiSEM3D: broad-band seismic wavefields in 3-D global earth models with undulating discontinuities

Earth is not a perfect onion: its internal boundaries undulate, with topography on the core-mantle boundary reaching 10 km and transition zone thickness varying by hundreds of kilometers. Simulating seismic wave propagation through such realistic 3-D complexity at high frequencies pushes traditional methods to their limits. AxiSEM3D solves this through a hybrid approach combining spectral element and pseudospectral methods, parametrizing the azimuthal dimension with a locally adaptive Fourier series that adjusts resolution to match structural complexity. The efficiency gains are dramatic: two to three orders of magnitude faster than full 3-D methods (SPECFEM) at periods of 5 seconds or below, with speedup increasing at higher frequencies. Using particle relabelling transformation to honor undulating discontinuities while keeping the mesh spherical, AxiSEM3D enables 1 Hz simulations of 3-D mantle models with moderate computational resources, making previously inaccessible frequency ranges practical for routine use.

Abstract

We present a novel numerical method to simulate global seismic wave propagation in realistic aspherical 3-D earth models across the observable frequency band of global seismic data. Our method, named AxiSEM3D, is a hybrid of spectral element method (SEM) and pseudospectral method. It describes the azimuthal dimension of global wavefields with a substantially reduced number of degrees of freedom via a global Fourier series parametrization, of which the number of terms can be locally adapted to the inherent azimuthal complexity of the wavefields. AxiSEM3D allows for material heterogeneities, such as velocity, density, anisotropy and attenuation, as well as for finite undulations on radial discontinuities, both solid–solid and solid–fluid, and thereby a variety of aspherical Earth features such as ellipticity, surface topography, variable crustal thickness, undulating transition zone and core–mantle boundary topography.
Undulating discontinuities are honoured by means of the ‘particle relabelling transformation’, so that the spectral element mesh can be kept spherical. The implementation of the particle relabelling transformation is verified by benchmark solutions against a discretized 3-D SEM, considering ellipticity, topography and bathymetry (with the ocean approximated as a hydrodynamic load) and a tomographic mantle model with an undulating transition zone. For the state-of-the-art global tomographic models with aspherical geometry but without a 3-D crust, efficiency comparisons suggest that AxiSEM3D can be two to three orders of magnitude faster than a discretized 3-D method at seismic periods of 5 s or shorter, with the speed-up increasing with frequency and decreasing with model complexity. We also verify AxiSEM3D for localized small-scale heterogeneities with strong perturbation strength. With reasonable computing resources, we have achieved a corner frequency of up to 1 Hz for 3-D mantle models. ...
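
The locally adapted Fourier parametrization at the heart of the method can be written schematically as follows; the notation (coefficient names and the local truncation order) is illustrative rather than the paper's exact formulation:

```latex
% Azimuthal Fourier-series parametrization of the wavefield (schematic):
% (s, z) span the 2-D in-plane domain; \phi is the azimuth.
u(s, z, \phi, t) \;\approx\; \sum_{\mu=0}^{N_{\phi}(s, z)}
  \Bigl[\, u_{\mu}^{c}(s, z, t)\,\cos(\mu\phi)
        + u_{\mu}^{s}(s, z, t)\,\sin(\mu\phi) \,\Bigr]
```

Because the truncation order $N_{\phi}(s, z)$ is adapted locally to the azimuthal complexity of the wavefield, azimuthally smooth regions carry only a few terms, which is the source of the reported two-to-three-orders-of-magnitude speed-ups over fully discretized 3-D methods.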

February 18, 2019 · Kuangdai Leng, Tarje Nissen-Meyer, Martin van Driel, Kasra Hosseini, David Al-Attar · Geophysical Journal International

2018

SubMachine: Web-Based Tools for Exploring Seismic Tomography and Other Models of Earth's Deep Interior

Comparing seismic tomography models has traditionally required downloading multiple datasets, learning different formats, and writing custom visualization code. SubMachine addresses this by providing web-based tools for interactive exploration of more than 45 global and regional tomography models through a standard browser interface. The platform enables side-by-side model comparison, statistical analysis, and integration with complementary datasets including plate reconstructions, crustal structure, shear wave splitting, and gravity anomalies. By making these Earth models accessible without installation or specialized software, SubMachine facilitates collaborative research across the solid Earth community and supports quantitative comparison of different imaging approaches.

Abstract

We present SubMachine, a collection of web-based tools for the interactive visualization, analysis, and quantitative comparison of global-scale data sets of the Earth’s interior. SubMachine focuses on making regional and global-scale seismic tomography models easily accessible to the wider solid Earth community, in order to facilitate collaborative exploration. We have written software tools to visualize and explore over 30 tomography models: individually, side-by-side, or through statistical and averaging tools. SubMachine also serves various nontomographic data sets that are pertinent to the interpretation of mantle structure and complement the tomographies. These include plate reconstruction models, normal mode observations, global crustal structure, shear wave splitting, as well as geoid, marine gravity, vertical gravity gradients, and global topography in adjustable degrees of spherical harmonic resolution. By providing repository infrastructure, SubMachine encourages and supports community contributions via submission of data sets or feedback on the implemented toolkits. ...

January 1, 2018 · Kasra Hosseini, Kara J. Matthews, Karin Sigloch, Grace E. Shephard, Mathew Domeier, Maria Tsekhmistrenko · Geochemistry, Geophysics, Geosystems

2017

ObspyDMT: a Python toolbox for retrieving and processing large seismological data sets

Seismological research increasingly depends on large datasets, but retrieving data from multiple centers with different protocols and formats can consume more time than the actual science. ObspyDMT addresses this by providing a unified Python toolbox that handles the complexities automatically. A single command can query decades of global seismic data and generate summary visualizations. Top panel: The explosive growth of available waveforms since 1990, rising from thousands to millions of seismograms. Bottom panel: Automatic global seismicity map colored by earthquake depth. The tool requires no Python knowledge when used from the command line, yet can be integrated into automated workflows for routine tasks like data archiving, instrument correction, and quality control that are essential but time-consuming.

Abstract

We present obspyDMT, a free, open-source software toolbox for the query, retrieval, processing and management of seismological data sets, including very large, heterogeneous and/or dynamically growing ones. ObspyDMT simplifies and speeds up user interaction with data centers, in more versatile ways than existing tools. The user is shielded from the complexities of interacting with different data centers and data exchange protocols and is provided with powerful diagnostic and plotting tools to check the retrieved data and metadata. While primarily a productivity tool for research seismologists and observatories, easy-to-use syntax and plotting functionality also make obspyDMT an effective teaching aid. Written in the Python programming language, it can be used as a stand-alone command-line tool (requiring no knowledge of Python) or can be integrated as a module with other Python codes. It facilitates data archiving, preprocessing, instrument correction and quality control – routine but nontrivial tasks that can consume much user time.
We describe obspyDMT’s functionality, design and technical implementation, accompanied by an overview of its use cases. As an example of a typical problem encountered in seismogram preprocessing, we show how to check for inconsistencies in response files of two example stations. We also demonstrate the fully automated request, remote computation and retrieval of synthetic seismograms from the Synthetics Engine (Syngine) web service of the Data Management Center (DMC) at the Incorporated Research Institutions for Seismology (IRIS). ...

October 12, 2017 · Kasra Hosseini, Karin Sigloch · Solid Earth

2015

Instaseis: instant global seismograms based on a broadband waveform database

Need a seismogram for any earthquake at any station on Earth? Traditionally, each calculation requires running a wave propagation simulation. Instaseis changes this by precomputing and storing Green's functions in a database that enables extraction of arbitrary seismograms in milliseconds. The efficiency is remarkable: generating a complete Instaseis database costs approximately half the computational time of computing seismograms for just a single source using traditional methods. This figure shows CPU hours required to generate full databases with 1-hour seismograms for Earth and Mars using two time integration schemes. By storing basis coefficients of Lagrange polynomials rather than raw waveforms, Instaseis achieves 4th order spatial accuracy while exactly honoring velocity discontinuities like the core-mantle boundary. This transforms workflows that previously required supercomputer access into laptop-scale computations.

Abstract

We present a new method and implementation (Instaseis) to store global Green’s functions in a database which allows for near-instantaneous (on the order of milliseconds) extraction of arbitrary seismograms. Using the axisymmetric spectral element method (AxiSEM), the generation of these databases, based on reciprocity of the Green’s functions, is very efficient: it is approximately half as expensive as a single AxiSEM forward run. Full databases can thus be computed at half the cost of computing seismograms for a single source in the previous scheme, and at the highest frequencies observed globally. By storing the basis coefficients of the numerical scheme (Lagrange polynomials), the Green’s functions are 4th-order accurate in space and the spatial discretization respects discontinuities in the velocity model exactly. High-order temporal interpolation using Lanczos resampling allows seismograms to be retrieved at any sampling rate.
AxiSEM is easily adaptable to arbitrary spherically symmetric models of Earth as well as other planets. In this paper, we present the basic rationale and details of the method as well as benchmarks, and illustrate a variety of applications. ...
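
The Lanczos resampling mentioned in the abstract is a windowed-sinc interpolation. Below is a minimal self-contained sketch of the standard Lanczos kernel and a single-point interpolator, for illustration only; Instaseis's own implementation differs in detail (edge handling, vectorization):

```python
import numpy as np

def lanczos_kernel(x, a=3):
    """Windowed sinc: L(x) = sinc(x) * sinc(x/a) for |x| < a, else 0."""
    x = np.asarray(x, dtype=float)
    out = np.sinc(x) * np.sinc(x / a)
    out[np.abs(x) >= a] = 0.0
    return out

def lanczos_resample(samples, t, a=3):
    """Interpolate a uniformly sampled signal at fractional index t."""
    n0 = int(np.floor(t))
    idx = np.arange(n0 - a + 1, n0 + a + 1)        # 2a nearest samples
    idx_clipped = np.clip(idx, 0, len(samples) - 1)  # crude edge handling
    return float(np.sum(samples[idx_clipped] * lanczos_kernel(t - idx, a)))

# At integer sample points the kernel reduces to the identity
sig = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0])
print(lanczos_resample(sig, 3.0))  # -1.0
```

Evaluating at non-integer t blends the 2a surrounding samples, which is how a stored database sampled at one rate can serve seismograms at any requested rate.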

June 16, 2015 · M. van Driel, L. Krischer, S. C. Stähler, Kasra Hosseini, T. Nissen-Meyer · Solid Earth

2014

AxiSEM: broadband 3-D seismic wavefields in axisymmetric media

Computing how seismic waves propagate through Earth's full 3-D structure at high frequencies typically requires supercomputers running for days or weeks. AxiSEM achieves the same accuracy in a fraction of the time by exploiting a key simplification: for spherically symmetric Earth models, the azimuthal dimension can be computed analytically rather than numerically. This reduces the computational domain from 3-D to 2-D while maintaining full 3-D accuracy in the output wavefields. The figure shows a 3-D wavefield simulation from an earthquake in Italy, computed with AxiSEM at frequencies across the observable seismic band. This efficiency breakthrough enables applications previously impractical, from computing millions of synthetic seismograms for tomographic inversions to generating databases for near-instantaneous retrieval of seismograms from any source-receiver combination.

Seismic wave propagation in a spherically symmetric Earth model computed using AxiSEM. Warm colors (red/yellow) show P-waves, while cold colors (blue/green) show S-waves.

Abstract

We present a methodology to compute 3-D global seismic wavefields for realistic earthquake sources in visco-elastic anisotropic media, covering applications across the observable seismic frequency band with moderate computational resources. This is accommodated by mandating axisymmetric background models that allow for a multipole expansion such that only a 2-D computational domain is needed, whereas the azimuthal third dimension is computed analytically on the fly. This dimensional collapse opens doors for storing space–time wavefields on disk that can be used to compute Fréchet sensitivity kernels for waveform tomography.
We use the corresponding publicly available AxiSEM (www.axisem.info) open-source spectral-element code, demonstrate its excellent scalability on supercomputers and a diverse range of applications, ranging from normal modes to small-scale lowermost-mantle structures, tomographic models, and comparison with observed data, and discuss further avenues to pursue with this methodology. ...
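
The "dimensional collapse" rests on the fact that a moment-tensor point source in an axisymmetric model excites only a handful of azimuthal orders. Schematically (the notation is illustrative, not the paper's exact formulation):

```latex
% Multipole expansion of the wavefield about the source axis (schematic):
% a moment-tensor point source excites only the monopole (m = 0),
% dipole (m = 1) and quadrupole (m = 2) terms.
u(s, z, \phi, t) \;=\; \sum_{m=0}^{2} u_m(s, z, t)\, e^{\mathrm{i} m \phi}
```

Each coefficient field $u_m$ is solved numerically on the 2-D $(s, z)$ domain, while the $e^{\mathrm{i} m \phi}$ factor is applied analytically, which is what reduces the 3-D problem to a 2-D computation without loss of accuracy in the output wavefield.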

June 4, 2014 · T. Nissen-Meyer, M. van Driel, S. C. Stähler, Kasra Hosseini, S. Hempel, L. Auer, A. Colombi, A. Fournier · Solid Earth

2013

ObsPyLoad: A Tool for Fully Automated Retrieval of Seismological Waveform Data

As seismological data centers exploded with new waveform data in the 2010s, researchers faced a growing challenge: downloading, homogenizing, and quality-controlling data from multiple centers with different interfaces and formats consumed more time than the actual science. ObsPyLoad addresses this data avalanche with a fully automated solution that queries metadata holdings and retrieves seismograms from multiple data centers simultaneously, a major advantage over tools tied to individual centers. This schematic shows the program flow: from initial configuration through metadata queries to the IRIS and ORFEUS data centers, followed by parallel waveform downloads with built-in quality control and retry logic. A simple command-line call without parameters downloads event-based data for all earthquakes from the past 30 days, while extensive options allow customization for geographic regions, time windows, magnitude thresholds, and update modes. By handling the tedious infrastructure work automatically, ObsPyLoad lets seismologists focus on scientific questions rather than data wrangling.

Abstract

We confront the data avalanche: the amount of waveform data available from seismological data centers has been growing enormously over the past few years. This is a highly welcome development from a scientific point of view, but the time and effort spent on identification, retrieval, and quality control of subsets of these data may quickly exceed tolerable limits for an individual researcher. ...
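
The retry-with-quality-control pattern described above can be sketched generically. The `fetch` callable below is a hypothetical stand-in for a data-centre request; this illustrates the pattern such tools implement, not ObsPyLoad's actual code:

```python
import time

def download_with_retry(fetch, station, max_attempts=3, delay=0.0):
    """Fetch one station's waveform, retrying on failure.

    `fetch(station)` is a hypothetical callable that either returns
    waveform data or raises on a network error. Empty results fail a
    minimal quality-control check and are retried as well.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch(station)
        except Exception:
            if attempt == max_attempts:
                return None            # give up; caller logs and moves on
            time.sleep(delay)          # back off before retrying
        else:
            if data:                   # minimal QC: non-empty result
                return data
    return None

# A flaky mock data centre: fails on the first call, then succeeds
calls = {"n": 0}
def flaky(station):
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("timeout")
    return [0.1, 0.2, 0.3]

print(download_with_retry(flaky, "ANMO"))  # [0.1, 0.2, 0.3]
```

Running one such loop per station in parallel, with per-request logging, is essentially the "parallel downloads with quality control and retry logic" stage in the schematic.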

May 1, 2013 · Chris Scheingraber, Kasra Hosseini, Robert Barsch, Karin Sigloch · Seismological Research Letters