Data Linking Team Wins Best Young Research Paper Award
24 September 2025

Photo: FedCSIS
Thomas Asselborn, Magnus Bender, Ralf Möller, and Sylvia Melzer from the ‘Data Linking’ research field at the CSMC have been honoured with one of the Best Young Research Paper Awards at the renowned Conference on Computer Science and Intelligence Systems (FedCSIS) in Kraków.
Their paper, ‘Treating OCR Output as a Language (TOOL) – Improving OCR Output with Seq2Seq Translation’, impressed the jury in the thematic track ‘AI in Digital Humanities, Computational Social Sciences and Economics Research (AI HuSo)’ with its novel approach to improving text recognition. It addresses a key challenge in the digital humanities and social sciences: the frequent inaccuracies of automatic text recognition (Optical Character Recognition, OCR), especially when dealing with historical, low-quality, or multilingual documents.
OCR errors are often particularly pronounced for older fonts, for which good models are rarely available. Typical examples include confusion between the long ‘s’ and ‘f’ or between ‘C’ and ‘(’. Thomas Asselborn, Magnus Bender (now Aarhus University), Ralf Möller, and Sylvia Melzer observed that such errors are not random, but systematic. Building on this insight, they proposed an innovative solution: their TOOL approach treats noisy OCR output as its own ‘language’. Using transformer-based sequence-to-sequence translation models like Marian, the system automatically translates noisy OCR text into clean text, correcting the systematic errors along the way. This method is scalable, model-independent, and flexible with respect to different languages.
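To illustrate why systematic errors lend themselves to a translation setup, here is a minimal Python sketch. It is not the authors' pipeline: the confusion table `CONFUSIONS` and the helpers `simulate_ocr_noise` and `build_parallel_pairs` are hypothetical names, and the table contains only the two confusions mentioned above. The sketch shows the shape of the data a sequence-to-sequence model would learn from, namely pairs of noisy ‘source’ text and clean ‘target’ text.

```python
# Hypothetical confusion table, built only from the two examples named in
# the article (long s read as f, C read as an opening parenthesis).
CONFUSIONS = {"ſ": "f", "C": "("}


def simulate_ocr_noise(clean: str) -> str:
    """Apply every confusion deterministically to mimic systematic OCR errors."""
    return "".join(CONFUSIONS.get(ch, ch) for ch in clean)


def build_parallel_pairs(clean_lines):
    """Return (noisy source, clean target) pairs, as used for translation training."""
    return [(simulate_ocr_noise(line), line) for line in clean_lines]


# Toy input strings chosen for illustration; they are not taken from the BIBB dataset.
pairs = build_parallel_pairs(["Claſſe", "Wiſſenſchaft"])
for noisy, clean in pairs:
    print(f"{noisy} -> {clean}")
```

Because the same character is always misread in the same way, the mapping from noisy to clean text is regular enough for a translation model to learn. In practice, the noisy side would come from real OCR output rather than from a simulated table.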
Particular attention was paid to selecting a suitable historical dataset for the experiments; the dataset was kindly provided by the Federal Institute for Vocational Education and Training (BIBB) in Bonn. While TOOL was demonstrated on this dataset, the method itself is fundamentally independent of the data source and could be applied to other printed texts or even to texts written in a ‘bookhand’. In the experiments on German texts from 1871 to the present, TOOL significantly improved token-level accuracy in OCR post-processing.
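As a rough illustration of what a token-level accuracy measure can look like, the following Python sketch compares whitespace-separated tokens position by position. This definition is an assumption made for illustration; the paper may define and compute the metric differently. The function name `token_accuracy` and the example strings are hypothetical and do not come from the BIBB dataset.

```python
def token_accuracy(predicted: str, reference: str) -> float:
    """Share of reference tokens that the prediction matches at the same position.

    Assumes whitespace tokenisation and position-wise comparison; this is a
    simplification, not necessarily the metric used in the paper.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not ref_tokens:
        return 0.0
    # zip stops at the shorter sequence, so missing tokens count as errors.
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / len(ref_tokens)


# Toy strings for illustration only.
reference = "Die Wissenschaft der Berufe"
raw_ocr = "Die Wiffenfchaft der Berufe"
corrected = "Die Wissenschaft der Berufe"

print(token_accuracy(raw_ocr, reference))    # 0.75
print(token_accuracy(corrected, reference))  # 1.0
```

A position-wise comparison is the simplest choice, but it penalises every following token once the OCR output splits or merges a word; an alignment-based measure would be more robust in that case.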
FedCSIS is among the leading international conferences at the intersection of computer science and interdisciplinary research. The latest edition took place in Kraków from 14 to 17 September 2025. The AI HuSo track features innovative approaches that leverage artificial intelligence for questions in the humanities, social sciences, and economics. The recognition by FedCSIS highlights the importance of research at CSMC in the dynamic field of digital manuscript studies.