Programme

First Session: 08:30 AM - 10:30 AM

08:30 - 08:45 Introduction by the Organizers

08:45 - 09:10 First Keynote Talk by Peter Stokes

09:10 - 09:35 Second Keynote Talk by Sebastian Bosch

09:35 - 10:05 Short Presentations

Giuliano Giuffrida: A first analysis of the digital corpus of the Vatican Apostolic Library

Paul Dilley: The Potential of Digital Paleography for the Medinet Madi Coptic Manichaean Corpus

The Medinet Madi Corpus consists of seven Coptic manuscripts which were discovered in 1929 in Medinet Madi in the Fayum and reached the Cairo antiquities market in 1930. There, three of them were purchased by Carl Schmidt for the Staatliche Museen Preussischer Kulturbesitz, and four by the Irish-American philanthropist Chester Beatty; they are currently preserved in the Berlin Papyrussammlung and the Chester Beatty Library in Dublin. The manuscripts had been dampened, and presented a difficult challenge for the famous conservator Hugo Ibscher and, later, his son Rolf. While they were largely successful in separating the individual papyrus leaves and putting them under glass (or, later, Perspex), chemical treatments were applied which have led to a marked decrease in legibility, especially in the case of those conserved by Rolf Ibscher. The general difficulty of editing these leaves, combined with the catastrophic effects of World War II, which put an end to the initial editing process, means that approximately half the corpus has still not received a critical edition, despite being of comparable importance to the Nag Hammadi Library, which was discovered in 1945 and fully edited by the end of the 1980s. In March 2019, a number of pages from four of the Medinet Madi manuscripts in the Chester Beatty Library underwent multispectral imaging by the mobile manuscript lab of the University of Hamburg’s Centre for the Study of Manuscript Cultures. After processing by Ivan Shevchuk and colleagues, the images have revealed roughly 100% to 400% more text, depending on the page. These form the basis not only for a far more complete edition than was previously possible, but also for subsequent digital approaches to the study of the texts.

Despite these challenges, the Medinet Madi Coptic Manichaean Corpus is of great potential utility for the digital paleography of Late Antique Egypt. The more than 1,000 conserved leaves, even if some are fragmentary or otherwise difficult to read, are written in the same dialect of Coptic, “Lycopolitan,” and appear to have been produced within the same working group of scribes, although multiple hands are evident. Despite this, beyond a few stray remarks, there is no discussion of paleography by the editors of the Medinet Madi codices; conversely, there is no discussion of the Medinet Madi codices in general treatments of paleography in Late Antique Egypt (e.g. Orsini 2018). Along with Isabelle Marthot-Santaniello and colleagues, I have begun working with READ developer Stephen White to adapt its textual editing and paleographic annotation capabilities to both Greek and Coptic; I will use the MSI images within READ, which will require finding a way to reduce their size significantly while maintaining their legibility. In addition, I hope that automated handwriting recognition can be applied to these texts, which will require binarization of the MSI images. There are large sections which are clearly written by the same hand, enough to provide training data for handwriting recognition, which will in turn be useful in distinguishing between several hands within the same codex, as well as recognizing the same hand across two or more codices.
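As an illustration of the binarization step mentioned above, the following is a minimal sketch using Otsu's global threshold. It is not the project's actual pipeline: the choice of Otsu's method, the function names, and the assumption that a processed MSI page is available as an 8-bit grayscale NumPy array with dark ink on a lighter background are all illustrative assumptions.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold (0-255) for an 8-bit grayscale image.

    Assumption: `gray` has dtype uint8. Pixels with value <= threshold
    are treated as one class, pixels above it as the other.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256, dtype=np.float64)

    weight_low = np.cumsum(hist)              # pixel count with value <= t
    weight_high = hist.sum() - weight_low     # pixel count with value > t
    sum_low = np.cumsum(hist * levels)
    sum_all = sum_low[-1]

    # A threshold is only meaningful if both classes are non-empty.
    valid = (weight_low > 0) & (weight_high > 0)
    mean_low = np.zeros(256)
    mean_high = np.zeros(256)
    mean_low[valid] = sum_low[valid] / weight_low[valid]
    mean_high[valid] = (sum_all - sum_low[valid]) / weight_high[valid]

    # Otsu's criterion: maximise the between-class variance.
    between = weight_low * weight_high * (mean_low - mean_high) ** 2
    between[~valid] = -1.0
    return int(np.argmax(between))


def binarize(gray: np.ndarray) -> np.ndarray:
    """Return a boolean mask that is True where a pixel is classed as ink.

    Assumption: ink is darker than the papyrus background.
    """
    return gray <= otsu_threshold(gray)
```

A single global threshold is the simplest possible choice; unevenly darkened or chemically treated leaves would more plausibly call for a locally adaptive method, which this sketch does not attempt.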

Simona Stoyanova, Jonathan Prag: Integrating palaeographic research into the digital epigraphy of multilingual Sicily

Simon Castellan, Vincent Cohen-Addad, Sophie Giffard-Roisin, Ester Salgarella: Writers Behind Words: Detecting Scribal Variation in Linear A Inscriptions

Who were the people of Bronze Age Crete responsible for recording information essential to the running of palatial administrations? They mostly survive through the signs they left behind on inscribed clay tablets, used for the day-to-day bookkeeping of Minoan centers. Their writing system, called “Minoan” Linear A, is a logo-syllabic script attested in the time-span ca. 1800-1450 BCE on Crete and the Aegean islands, which remains undeciphered to this day. Little is known about the number, role, social status and formal training of the individuals who inscribed those tablets. Traditional paleographical analysis has shed some light on the problem, without however a systematic approach to the whole corpus. In this work, we set out to explore the use of machine learning techniques applied to this ancient script in order to detect subtle paleographical variation and from there, put forward potential scribal identifications. In this endeavor, we rely on the dataset available in SigLA, a recent paleographical database of Linear A inscriptions, under ongoing development. We propose two approaches to detect variation between attestations of the same sign. A first supervised approach uses neural networks to automatically detect a set of paleographical features considered of interest by paleographers. To complement this first approach, we are also working on a second approach, unsupervised, which aims at learning low-dimensional representations of the data that are relevant for detecting subtle paleographical variations. One of the challenges in this endeavor is the lack of standardisation of sign shapes, as Linear A shows remarkable graphic variation between attestations of the same sign, at times drawn differently even within the same inscription.

By blending the results obtained by these machine learning approaches with our paleographical expertise, our final goal is to propose clusters of inscriptions that share enough similarities to have been written by the same individual. These approaches are tested against the evidence we have for a related script, Linear B, for which scribal identifications have been put forward by way of traditional paleographical analysis. This research will eventually throw light on the overall spread of literacy on Bronze Age Minoan Crete.
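The unsupervised route described above (a low-dimensional representation of sign attestations, followed by grouping) can be sketched as follows. This is not the authors' method: PCA stands in for whatever representation learning they use, k-means stands in for the clustering, and the input format (one flattened feature vector per sign attestation) and all function names are assumptions made for illustration.

```python
import numpy as np


def pca_embed(features: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project row vectors onto their first principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(points: np.ndarray, k: int, n_iter: int = 50) -> np.ndarray:
    """Very small k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers)

    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Keep the old center if a cluster ends up empty.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels
```

In such a setup, the cluster labels would be only a starting point: as the abstract stresses, candidate groupings still have to be weighed against paleographical expertise before any scribal identification is proposed.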

Simon Gabay, Jean-Baptiste Camps, Ariane Pinche, Claire Jahan: SegmOnto: common vocabulary and practices for analysing the layout of manuscripts (and more)

Document and layout description is a traditional task for philologists but also a need for many computational applications in document analysis, ranging from content categorisation to text recognition, and has, for this reason, been the subject of much research in computer vision. For tools relying on machine learning, the existence and availability of data sets, and their interoperability, are important issues.

The definition of ontologies or controlled vocabularies for the description of manuscript layout has attracted some attention. Codicological dictionaries exist, part of which have already been integrated into SKOS. On the other hand, digitisation standards have developed their own taxonomy, such as the PAGE XML Format. In between these two approaches, initiatives like the TEI offer elements commonly used by editors. With the appearance of efficient layout analysers and user-friendly interfaces to use them, the need for efficient models is rising, and so is the need for large amounts of data, based on the aggregation of heterogeneous documents. For this, researchers need to agree on a common limited vocabulary, based on existing standards, and share common practices to ease the interoperability of their ground truth.

The SegmOnto project gathers scholars from different backgrounds who have decided to tackle both issues. It mainly addresses the case of manuscripts, but also old prints. Our work is characterised by two key choices:

1. focus on common material features rather than semantic descriptions (e.g. marginal text, rather than gloss, commentary, note, etc.).

2. use of two levels of description: zones (main text, notes, figure, damage, seal, ...) and lines (default, musical, interlinear, rubric, drop capital).

It has to be noted that the detection of zones and lines can rely on the position on the page, but also often on variation in hand, script (cursive vs block letters, square letters vs Rashi script, roman vs italic), module, or ink (e.g. blue and red, just red, or a different type of black).
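The two-level description above can be sketched as a small controlled vocabulary with a validation step. The label strings are taken from the lists in this abstract (the zone list there is explicitly open-ended); the class names, the `validate` helper, and the idea of storing lines inside zones are hypothetical and are not SegmOnto's actual identifiers or data model.

```python
from dataclasses import dataclass, field
from typing import List

# Labels as listed in the abstract; the real vocabulary may name
# and extend them differently.
ZONE_TYPES = frozenset({"main text", "notes", "figure", "damage", "seal"})
LINE_TYPES = frozenset(
    {"default", "musical", "interlinear", "rubric", "drop capital"}
)


@dataclass
class Line:
    kind: str
    text: str = ""


@dataclass
class Zone:
    kind: str
    lines: List[Line] = field(default_factory=list)


def validate(zones: List[Zone]) -> List[str]:
    """Return a list of human-readable problems; empty means valid."""
    problems = []
    for zi, zone in enumerate(zones):
        if zone.kind not in ZONE_TYPES:
            problems.append(f"zone {zi}: unknown zone type {zone.kind!r}")
        for li, line in enumerate(zone.lines):
            if line.kind not in LINE_TYPES:
                problems.append(
                    f"zone {zi}, line {li}: unknown line type {line.kind!r}"
                )
    return problems
```

Under this sketch, a semantic label such as "gloss" would be rejected as a zone type, which reflects the first key choice above: ground truth is annotated by material features, and semantic interpretation is left to a later stage.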

10:00 - 10:30: Discussion

Coffee Break: 10:30 AM - 11:00 AM

Second Session: 11:00 AM - 12:30 PM

11:00 - 12:15 Presentation of Papers

11:00 - 11:15: Chahan Vidal-Gorène and Aliénor Decours-Perez: Computational Approach of Armenian Paleography

11:15 - 11:30: Jean-Baptiste Camps, Chahan Vidal-Gorène and Marguerite Vernet: Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches

11:30 - 11:45: Daniel Stoekl Ben Ezra, Pawel Jablonski and Bronson Brown-DeVost: Exploiting Insertion Symbols for Marginal Additions in the Recognition Process to Establish Reading Order

11:45 - 12:00: Nikita Srivatsan, Jason Vega, Christina Skelton and Taylor Berg-Kirkpatrick: Neural Representation Learning for Scribal Hands of Linear B

12:00 - 12:15: Olga Serbaeva Saraogi and Stephen White: READ for solving manuscript riddles: a preliminary study of the manuscripts of the 3rd ṣaṭka of the Jayadrathayāmala

IWCP 2021