First Session: 08:30 AM - 10:30 AM
08:30 - 08:45 Introduction by the Organizers
08:45 - 09:10 First Keynote Talk by Peter Stokes
Although quantitative methods in palaeography have been known for many years, it is mainly over the last decade that palaeographers and others in the humanities have become increasingly aware of computational methods, and the Document Analysis community has likewise become increasingly aware of work in the humanities. Nevertheless, historical handwriting and the objects that contain it are complex and multidimensional, and it seems clear that understanding them fully will require further and deeper collaboration between computer scientists, natural scientists, humanists and others. Achieving this remains a real challenge, and so this talk will present some of the difficulties that have been encountered over the last ten years or so, some of the promising advances that have been made, and some thoughts about possible future progress.
09:10 - 09:35 Second Keynote Talk by Sebastian Bosch
This talk presents the laboratory of the Centre for the Study of Manuscript Cultures (CSMC) in Hamburg, Germany. The lab makes it possible to study written artefacts by means of material technology and aims to bridge the gap between the humanities, natural sciences and technology. It has a wide range of high-end instruments, most of which are mobile and allow non-destructive analysis of artefacts on site. This enables us to address research questions such as the typology and classification of inks, provenance studies, the recovery of faded inscriptions and palimpsests, the identification of colourants, binders and paper fibres, the reconstruction of the history of manuscripts, authentication and dating. The talk provides an overview of the techniques and possibilities available in our laboratory and then presents case studies from various interdisciplinary projects.
09:35 - 10:05 Short Presentations
- Giuliano Giuffrida
- Paul Dilley
- Simona Stoyanova, Jonathan Prag
- Simon Castellan, Vincent Cohen-Addad, Sophie Giffard-Roisin, Ester Salgarella
- Simon Gabay, Jean-Baptiste Camps, Ariane Pinche, Claire Jahan
Giuliano Giuffrida: A first analysis of the digital corpus of the Vatican Apostolic Library
The Vatican Apostolic Library is digitizing its collection of manuscripts. About 6 million FITS images have been acquired so far, distributed among 21,000 shelfmarks, for a total of 400 TB of data. In collaboration with the University of Rome Tor Vergata, we have started an analysis of the high-resolution (400 dpi) images.
We propose a new approach to discriminate pixels containing ink from “empty” pixels - i.e. pixels that represent only the parchment, without ink - based on the analysis of millions of pixel intensity histograms. This first work focuses on 4,622 shelfmarks and 1.7 million images; the corpus analysed here is distributed among 16 groups of manuscripts (8 Latin, 5 Greek and 3 Hebrew) produced between the 3rd and the 19th century.
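As an illustration of the kind of histogram-based separation described above, here is a minimal Python/OpenCV sketch; the thresholding strategy (Otsu's method on a greyscale image) and the function name are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch: separating ink from parchment pixels using the pixel
# intensity histogram of a greyscale page image. Otsu's method is used here
# purely as an illustration; the project's actual criterion may differ.
import cv2


def ink_mask_from_histogram(image_path):
    """Return a boolean ink mask plus the 256-bin intensity histogram."""
    grey = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    hist = cv2.calcHist([grey], [0], None, [256], [0, 256]).ravel()
    # Otsu's threshold splits the histogram into dark (ink) and bright
    # (parchment) classes by maximising between-class variance.
    threshold, _ = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return grey < threshold, hist, threshold
```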
A second analysis concerns the extraction of the filling coefficient using two parallel, independent approaches. The first uses aggregated data to extract a representative value for the whole shelfmark, whereas the second extracts a filling-coefficient value for each page using morphological operators. This second analysis focuses on the shelfmarks from Vat.lat.1 to Vat.lat.2000, covering a large range of centuries (6th to 19th c.). Both analyses are performed using original Java and Python code together with Apache Commons Math, OpenCV, and other libraries.
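The abstract does not specify the morphological pipeline; the hedged sketch below shows one plausible way a per-page filling coefficient could be computed from an ink mask, using OpenCV's closing operator to merge strokes into text areas (kernel size and shape are placeholders).

```python
# Illustrative per-page filling coefficient: the fraction of the page covered
# by writing, estimated by morphologically closing an ink mask so individual
# strokes merge into continuous text areas. Parameters are placeholders.
import cv2
import numpy as np


def filling_coefficient(ink_mask, kernel_size=15):
    mask = ink_mask.astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    text_area = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return np.count_nonzero(text_area) / text_area.size
```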
These analyses are part of an ambitious project that aims to extract codicological parameters from the whole Vatican Apostolic Library digital corpus. We plan to analyze the dataset using classification and clustering algorithms and to share it with scholars and researchers for further analysis. Here I will show some preliminary results of our work, highlighting the different approaches that we are following and comparing our findings with previous work.
Paul Dilley: The Potential of Digital Paleography for the Medinet Madi Coptic Manichaean Corpus
The Medinet Madi Corpus consists of seven Coptic manuscripts which were discovered in 1929 at Medinet Madi in the Fayum and reached the Cairo antiquities market in 1930. There, three of them were purchased by Carl Schmidt for the Staatliche Museen Preussischer Kulturbesitz, and four by the Irish-American philanthropist Chester Beatty; they are currently preserved in the Berlin Papyrussammlung and the Chester Beatty Library in Dublin. The manuscripts had been damaged by damp and presented a difficult challenge for the famous conservator Hugo Ibscher and, later, his son Rolf. While they were largely successful in separating the individual papyrus leaves and putting them under glass (or, later, Perspex), chemical treatments were applied which have led to a marked decrease in legibility, especially in the case of those conserved by Rolf Ibscher. The general difficulty of editing these leaves, combined with the catastrophic effects of World War II, which put an end to the initial editing process, means that approximately half the corpus has still not received a critical edition, despite being of comparable importance to the Nag Hammadi Library, which was discovered in 1945 and fully edited by the end of the 1980s. In March 2019, a number of pages from four of the Medinet Madi manuscripts in the Chester Beatty Library underwent multispectral imaging by the mobile manuscript lab of the University of Hamburg’s Centre for the Study of Manuscript Cultures. After processing by Ivan Shevchuk and colleagues, the images have revealed between roughly 100% and 400% more text, depending on the page. These images form the basis not only for a far more complete edition than was previously possible, but also for subsequent digital approaches to the study of the texts.
Despite these challenges, the Medinet Madi Coptic Manichaean Corpus is of great potential utility for the digital paleography of Late Antique Egypt. The more than 1,000 conserved leaves, even if some are fragmentary or otherwise difficult to read, are written in the same dialect of Coptic, “Lycopolitan,” and appear to have been produced within the same working group of scribes, although multiple hands are evident. Despite this, beyond a few stray remarks, there is no discussion of paleography by the editors of the Medinet Madi codices; conversely, there is no discussion of the Medinet Madi codices in general treatments of paleography in Late Antique Egypt (e.g. Orsini 2018). Along with Isabelle Marthot-Santaniello and colleagues, I have begun working with READ developer Stephen White to adapt READ's textual editing and paleographic annotation capabilities to both Greek and Coptic; I will use the MSI images within READ, which will require finding a way to reduce their size significantly while maintaining their legibility. In addition, I hope that automated handwriting recognition can be applied to these texts, which will require binarization of the MSI images. There are large sections which are clearly written by the same hand, enough to provide training data for handwriting recognition, which will in turn be useful for distinguishing between several hands within the same codex, as well as for recognizing the same hand across two or more codices.
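Since the abstract notes that handwriting recognition will require binarising the MSI images, here is a hedged sketch of one common option for degraded writing surfaces, Sauvola local thresholding with scikit-image; the project may well adopt a different method, and the function name is a placeholder.

```python
# Hedged example of binarising a processed MSI page prior to HTR, using
# Sauvola local thresholding (a frequent choice for degraded manuscripts;
# not necessarily the method the project will use).
import numpy as np
from skimage import filters, io


def binarise_msi(image_path, window_size=31):
    grey = io.imread(image_path, as_gray=True)
    threshold = filters.threshold_sauvola(grey, window_size=window_size)
    return (grey < threshold).astype(np.uint8)  # 1 = ink, 0 = background
```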
Simona Stoyanova, Jonathan Prag: Integrating palaeographic research into the digital epigraphy of multilingual Sicily
In this paper we introduce the ERC Crossreads project (University of Oxford), which aims to provide the first comprehensive study of the material linguistic culture of ancient Sicily over a period of 1,500 years. The island was a multilingual and multicultural region, and its epigraphic material will allow us to explore written culture on a variety of durable supports (monumental inscriptions, graffiti, potsherds, brick stamps, metal plaques) in a variety of languages (Greek, Latin, Punic, Elymian, Sikel, Oscan).
Crossreads is composed of five distinct modules, each with its own dataset, connected by linking data: (1) epigraphic texts, encoded in the EpiDoc standard and published on the Inscriptions of Sicily website; (2) a second text layer for linguistic annotation; (3) a comprehensive petrographic analysis of the types of stone used for inscriptions in Sicily; (4) an image database (IIIF); (5) a systematic study of letterforms in this diverse corpus across time, supports and languages. This last module seeks to answer a range of questions, such as: Are letterforms different depending upon material? Do the stone type and surface determine the choice of forms in comparison with other materials? Are letterforms different depending on text function, e.g. private vs public? Do letterforms cross over between languages?
Comparison across such diverse sets of data presents technological challenges. The palaeographic analysis will develop modular tools from the Archetype framework, in collaboration with King’s Digital Lab; the TEI texts together with the IIIF image server form the basis for our palaeographic annotations and visualisation. The main challenge is designing an effective mechanism to integrate information from the multiple datasets in Crossreads to enable rich and complex queries. These results will additionally need to be visualised in a meaningful way, linking images to text to palaeographic annotations. All of this we aim to achieve using linked open data principles.
Simon Castellan, Vincent Cohen-Addad, Sophie Giffard-Roisin, Ester Salgarella: Writers Behind Words: Detecting Scribal Variation in Linear A Inscriptions
Who were the people of Bronze Age Crete responsible for recording information essential to the running of palatial administrations? They mostly survive through the signs they left behind on inscribed clay tablets, used for the day-to-day bookkeeping of Minoan centers. Their writing system, called “Minoan” Linear A, is a logo-syllabic script attested in the time-span ca. 1800-1450 BCE on Crete and the Aegean islands, which remains undeciphered to this day. Little is known about the number, role, social status and formal training of the individuals who inscribed those tablets. Traditional paleographical analysis has shed some light on the problem, without, however, taking a systematic approach to the whole corpus. In this work, we set out to explore the use of machine learning techniques applied to this ancient script in order to detect subtle paleographical variation and, from there, put forward potential scribal identifications. In this endeavor, we rely on the dataset available in SigLA, a recent paleographical database of Linear A inscriptions that is under ongoing development. We propose two approaches to detect variation between attestations of the same sign. A first, supervised approach uses neural networks to automatically detect a set of paleographical features considered of interest by paleographers. To complement it, we are also working on a second, unsupervised approach, which aims to learn low-dimensional representations of the data that are relevant for detecting subtle paleographical variations. One of the challenges in this endeavor is the lack of standardisation of sign shapes, as Linear A shows remarkable graphic variation between attestations of the same sign, at times drawn differently even within the same inscription.
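As a sketch of what the unsupervised strand could look like, the following toy convolutional autoencoder (PyTorch) learns a low-dimensional embedding per sign image; the architecture, input size and latent dimension are placeholders, not the authors' model.

```python
# Toy convolutional autoencoder for 1x64x64 sign images: the encoder maps
# each glyph to a small latent vector that can later be compared or clustered.
# All sizes are illustrative placeholders.
import torch
import torch.nn as nn


class SignAutoencoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(                       # 1x64x64 -> latent_dim
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(                       # latent_dim -> 1x64x64
            nn.Linear(latent_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)          # low-dimensional representation of the sign
        return self.decoder(z), z    # reconstruction + embedding
```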
By blending the results obtained by these machine learning approaches with our paleographical expertise, our final goal is to propose clusters of inscriptions that share enough similarities to have been written by the same individual. These approaches are tested against the evidence we have for a related script, Linear B, for which scribal identifications have been put forward by way of traditional paleographical analysis. This research will eventually throw light on the overall spread of literacy on Bronze Age Minoan Crete.
Simon Gabay, Jean-Baptiste Camps, Ariane Pinche, Claire Jahan: SegmOnto: common vocabulary and practices for analysing the layout of manuscripts (and more)
Document and layout description is a traditional task for philologists, but it is also a need for many computational applications in document analysis, ranging from content categorisation to text recognition, and has, for this reason, been the subject of much research in computer vision. For tools relying on machine learning, the existence, availability and interoperability of datasets are important issues.
The definition of ontologies or controlled vocabularies for the description of manuscript layout has attracted some attention. Codicological dictionaries exist, some of which have already been integrated into SKOS. On the other hand, digitisation standards have developed their own taxonomies, such as the PAGE XML format. In between these two approaches, initiatives like the TEI offer elements commonly used by editors. With the advent of efficient layout analysers and user-friendly interfaces for using them, the need for efficient models is rising, and so is the need for large amounts of data based on the aggregation of heterogeneous documents. For this, researchers need to agree on a common, limited vocabulary based on existing standards, and to share common practices to ease the interoperability of their ground truth.
The SegmOnto project gathers scholars from different backgrounds who have decided to tackle both issues. It mainly addresses the case of manuscripts, but also covers early printed books. Our work is characterised by two key choices:
1. focus on common material features rather than semantic descriptions (e.g. marginal text, rather than gloss, commentary, note, etc.).
2. use of two levels of description: zones (main text, notes, figure, damage, seal, etc.) and lines (default, musical, interlinear, rubric, drop capital).
It should be noted that the detection of zones and lines can rely on position on the page, but also, often, on variation in hands and scripts (cursive vs block letters, square letters vs Rashi script, roman vs italic) or in ink (e.g. blue and red, just red, or a different type of black).
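A hypothetical illustration of the two-level description follows, with placeholder labels loosely modelled on the categories listed above (they are not the official SegmOnto vocabulary).

```python
# Hypothetical two-level page description: zones, each carrying typed lines.
# Label names and coordinates are invented placeholders.
page = {
    "zones": [
        {"type": "main-text", "bbox": [120, 200, 980, 1650],
         "lines": [{"type": "rubric"}, {"type": "drop-capital"}, {"type": "default"}]},
        {"type": "notes", "bbox": [1000, 200, 1180, 1650],
         "lines": [{"type": "default"}, {"type": "interlinear"}]},
        {"type": "damage", "bbox": [0, 0, 1200, 80], "lines": []},
    ]
}
```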
10:00 - 10:30: Discussion
Coffee Break: 10:30 AM - 11:00 AM
Second Session: 11:00 AM - 12:30 PM
11:00 - 12:15 Presentation of Papers
11:00 - 11:15: Chahan Vidal-Gorène and Aliénor Decours-Perez: Computational Approach of Armenian Paleography
Armenian paleography is a relatively recent field of study, which emerged in the late 19th century. The typologies are fairly well established, and paleographers agree in distinguishing four main types of writing to describe Armenian manuscripts. Although these types characterize clearly distinct and lengthy periods of production, they are less appropriate for more complex tasks, such as the precise dating of a document. A neural network composed of stacks of convolutional layers shows the relevance of the classification, but also highlights considerable disparity within each type. We propose a precise description of the specificities of Armenian letters in order to explore other possible classifications. Even though the outcomes need to be further developed, the intermediate evaluations show an 8.07% gain with an extended classification.
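For illustration only, here is a minimal PyTorch sketch of a classifier built from stacks of convolutional layers, as described above; layer sizes and the number of output classes (the four main types, or an extended classification) are placeholders, not the authors' architecture.

```python
# Minimal stack-of-convolutions classifier for script-type images.
# All layer sizes and the class count are illustrative placeholders.
import torch.nn as nn


def script_type_classifier(num_classes=4):
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # pool to one value per channel
        nn.Flatten(),
        nn.Linear(128, num_classes),
    )
```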
11:15 - 11:30: Jean-Baptiste Camps, Chahan Vidal-Gorène and Marguerite Vernet: Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches
Although abbreviations are fairly common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research dealing with computational approaches to their expansion is scarce. Yet abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks. Often, pre-processing ultimately aims to lead from a digitised image of the source to a normalised text, which includes expansion of the abbreviations. We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation. The case studies considered here are drawn from the medieval Latin tradition.
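To make the decomposed setup concrete, here is a toy sketch of the final normalisation step, expanding abbreviations in a diplomatic transcription word by word; the abbreviation table and the input string are invented examples, not data or code from the paper.

```python
# Toy word-by-word expansion of abbreviations in a diplomatic transcription,
# illustrating the separate normalisation step of the decomposed pipeline.
# The table and example are invented placeholders.
ABBREVIATIONS = {
    "dns": "dominus",   # suspended nomen sacrum
    "xps": "christus",
    "&": "et",
}


def expand_abbreviations(diplomatic_text):
    """Very naive normalisation of an abbreviated transcription."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in diplomatic_text.split())


print(expand_abbreviations("dns noster & xps"))  # -> "dominus noster et christus"
```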
11:30 - 11:45: Daniel Stoekl Ben Ezra, Pawel Jablonski and Bronson Brown-DeVost: Exploiting Insertion Symbols for Marginal Additions in the Recognition Process to Establish Reading Order
In modern and medieval manuscripts, additions or corrections that are marginal or interlinear with regard to the main text blocks are a frequent phenomenon. With the recent success of Long Short-Term Memory (LSTM) neural networks in handwritten text recognition (HTR) systems, many have chosen lines as the primary structural unit. With this approach, establishing the reading order for such additions becomes a non-trivial problem, because they must be inserted between words inside line units at undefined locations. Even a perfect reading-order detection system that orders all lines of a text correctly would not be able to deal with inline insertions. The present paper proposes to include markers for the insertion points in the recognition training process; these indicators can then teach the recognition models themselves to detect scribal insertion markers for marginal or interlinear additions.
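A toy sketch of the post-processing implied by this idea: once the recognition model emits a marker token at the scribal insertion point, the recognised addition can be spliced back into the reading order. The marker syntax and example text below are invented for illustration.

```python
# Toy reading-order splice: a recognised main line carries a marker token at
# the scribal insertion point; the recognised marginal/interlinear addition
# is inserted there. Marker syntax and strings are invented placeholders.
INSERTION_MARKER = "<ins>"


def splice_addition(main_line, addition):
    """Replace the first insertion marker with the text of the addition."""
    return main_line.replace(INSERTION_MARKER, addition, 1)


print(splice_addition("in principio <ins> erat verbum", "creavit deus"))
```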
11:45 - 12:00: Nikita Srivatsan, Jason Vega, Christina Skelton and Taylor Berg-Kirkpatrick: Neural Representation Learning for Scribal Hands of Linear B
In this work, we present an investigation into the use of neural feature extraction in performing scribal hand analysis of the Linear B writing system. While prior work has demonstrated the usefulness of strategies such as phylogenetic systematics in tracing Linear B's history, these approaches have relied on manually extracted features which can be very time-consuming to define by hand. Instead we propose learning features using a fully unsupervised neural network that does not require any human annotation. Specifically, our model assigns each glyph written by the same scribal hand a shared vector embedding to represent that author's stylistic patterns, and each glyph representing the same syllabic sign a shared vector embedding to represent the identifying shape of that character. Thus the properties of each image in our dataset are represented as the combination of a scribe embedding and a sign embedding. We train this model using both a reconstructive loss governed by a decoder that seeks to reproduce glyphs from their corresponding embeddings, and a discriminative loss which measures the model's ability to predict whether or not an embedding corresponds to a given image. Among the key contributions of this work, we (1) present a new dataset of Linear B glyphs, annotated by scribal hand and sign type, (2) propose a neural model for disentangling properties of scribal hands from glyph shape, and (3) quantitatively evaluate the learned embeddings on find-place prediction and similarity to manually extracted features, showing improvements over simpler baseline methods.
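A schematic sketch of the factored-embedding idea (scribe embedding plus sign embedding, trained with a reconstruction decoder) is given below; dimensions and layers are placeholders, and the paper's actual model, including its discriminative loss, is more involved.

```python
# Schematic factored-embedding model: each glyph is explained by the sum of
# information from a scribe embedding and a sign embedding, decoded back into
# a glyph image and trained with a reconstruction loss. Sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScribeSignModel(nn.Module):
    def __init__(self, n_scribes, n_signs, dim=32, img_size=32):
        super().__init__()
        self.scribe_emb = nn.Embedding(n_scribes, dim)   # stylistic patterns
        self.sign_emb = nn.Embedding(n_signs, dim)       # character shape
        self.decoder = nn.Sequential(
            nn.Linear(2 * dim, 256), nn.ReLU(),
            nn.Linear(256, img_size * img_size), nn.Sigmoid(),
        )
        self.img_size = img_size

    def forward(self, scribe_ids, sign_ids):
        z = torch.cat([self.scribe_emb(scribe_ids), self.sign_emb(sign_ids)], dim=-1)
        return self.decoder(z).view(-1, 1, self.img_size, self.img_size)


def reconstruction_loss(model, scribe_ids, sign_ids, glyph_images):
    """Pixel-wise BCE between decoded and observed glyphs (values in [0, 1])."""
    return F.binary_cross_entropy(model(scribe_ids, sign_ids), glyph_images)
```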
12:00 - 12:15: Olga Serbaeva Saraogi and Stephen White: READ for solving manuscript riddles: a preliminary study of the manuscripts of the 3rd ṣaṭka of the Jayadrathayāmala
This paper is part of an in-depth study of a set of manuscripts related to the Jayadrathayāmala. Taking JY.3.9 as a test chapter, we carried out a comparative paleographic analysis of the 11 manuscripts within the READ software framework. The workflow within READ minimized the effort required to make a few important discoveries (manuscripts containing more than one script, identification of manuscripts potentially written by the same person) and to create an overview of the shift from Nāgarī to Newārī and, finally, to Devanāgarī scripts within the history of transmission of a single chapter. Combined with an exploratory statistical analysis in R of the syllable frequencies in each manuscript, the paleographic analysis helped to establish that there are potentially two lines of manuscript transmission of JY.3.9.
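As a hedged Python analogue of the kind of exploratory syllable-frequency comparison mentioned above (the study itself used R), the sketch below computes relative syllable frequencies per manuscript and a pairwise cosine distance; the syllables and sigla shown are invented placeholders.

```python
# Exploratory syllable-frequency comparison between manuscripts, illustrated
# in Python (the original analysis was done in R). Data are invented.
from collections import Counter

from scipy.spatial.distance import cosine


def syllable_profile(syllables):
    """Relative frequency of each syllable in one manuscript's transcription."""
    counts = Counter(syllables)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}


def profile_distance(p1, p2):
    """Cosine distance between two manuscripts' syllable-frequency profiles."""
    keys = sorted(set(p1) | set(p2))
    return cosine([p1.get(k, 0.0) for k in keys], [p2.get(k, 0.0) for k in keys])


ms_a = syllable_profile(["na", "ma", "śi", "vā", "ya", "na", "ma"])
ms_b = syllable_profile(["na", "mo", "bha", "ga", "va", "te", "na"])
print(profile_distance(ms_a, ms_b))
```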