Beyond Text and Pixels: Spatial Reasoning for Manuscript Exploration
This project focuses on two core computer vision tasks: text-based image retrieval, that is, the automatic retrieval of manuscript folios from natural-language queries that integrate visual and textual content; and visual spatial reasoning (VSR), which models and interprets spatial relationships between visual elements and text regions. A unified system will combine these capabilities so that scholars can issue complex queries such as “retrieve folios where a marginal illustration appears beneath its rubric” or “locate pages where an iconographic motif occurs adjacent to an annotation”. The research will involve curating and annotating new datasets of folio-level image–text pairs and spatial relation labels, fine-tuning vision-language models for folio retrieval, and adapting relation networks or transformer-based detectors for VSR. The final deliverable will be an interactive web-based prototype that integrates the retrieval and spatial reasoning modules, enabling large-scale, semantically rich exploration of digitised collections. By extending vision-language and relational reasoning methods to the complex, multilingual, and multi-script context of digitised manuscripts, the project will deliver open-source tools, annotated datasets, and evaluation benchmarks that advance both manuscript scholarship and computer vision research.
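
As an illustration of the retrieval component, the sketch below scores folio images against a natural-language query with an off-the-shelf CLIP model from the Hugging Face transformers library. The model name, file paths, and query string are placeholders rather than project specifics; in the project itself such a model would be fine-tuned on the curated folio-level image–text pairs.

```python
# Minimal sketch of text-to-folio retrieval with a pretrained CLIP model.
# Model name, file names, and query are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

folio_paths = ["folio_001r.jpg", "folio_001v.jpg", "folio_002r.jpg"]  # hypothetical files
images = [Image.open(p) for p in folio_paths]
query = "a marginal illustration beneath a red rubric"

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the query embedding and each folio embedding.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)

# Rank folios by similarity to the natural-language query.
for idx in scores.argsort(descending=True):
    print(folio_paths[idx], float(scores[idx]))
```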
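
For the VSR component, the project envisages learned relation networks or transformer-based detectors; as a simple point of reference, the sketch below gives a purely geometric baseline for the spatial predicates appearing in the example queries (“beneath”, “adjacent to”), operating on bounding boxes of detected page regions. The Region type, thresholds, and coordinates are illustrative assumptions, not part of the project specification.

```python
# Geometric baseline for spatial predicates over detected regions.
# Boxes are (x0, y0, x1, y1) in pixels, with y increasing downward.
from dataclasses import dataclass

@dataclass
class Region:
    label: str   # e.g. "illustration", "rubric", "annotation"
    box: tuple   # (x0, y0, x1, y1)

def horizontal_overlap(a: Region, b: Region) -> float:
    """Width of the horizontal overlap between two boxes (0 if disjoint)."""
    return max(0.0, min(a.box[2], b.box[2]) - max(a.box[0], b.box[0]))

def is_beneath(a: Region, b: Region, min_overlap: float = 0.3) -> bool:
    """True if a lies below b and they share enough horizontal extent."""
    a_width = a.box[2] - a.box[0]
    return a.box[1] >= b.box[3] and horizontal_overlap(a, b) >= min_overlap * a_width

def is_adjacent(a: Region, b: Region, max_gap: float = 50.0) -> bool:
    """True if the two boxes are within max_gap pixels of each other on both axes."""
    dx = max(a.box[0] - b.box[2], b.box[0] - a.box[2], 0.0)
    dy = max(a.box[1] - b.box[3], b.box[1] - a.box[3], 0.0)
    return dx <= max_gap and dy <= max_gap

# Example: a marginal illustration beneath its rubric on the same folio.
rubric = Region("rubric", (120, 200, 480, 240))
illustration = Region("illustration", (150, 260, 420, 520))
print(is_beneath(illustration, rubric))   # True
print(is_adjacent(illustration, rubric))  # True
```

A rule-based baseline of this kind would serve mainly as a sanity check and evaluation reference for the learned VSR models trained on the annotated spatial relation labels.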