Training-Free Pattern Recognition: Sifting Through Thousands of Digital Images of Palm-Leaf Manuscripts
Giovanni Ciotti, Hussein Adnan Mohammed
Finding patterns in scanned manuscripts is becoming crucial for humanities scholars as more documents are digitised. However, current detection methods require large amounts of training data, which is often difficult to create in humanities research because it involves manually labelling elements such as words, drawings, and stamps in each manuscript. This can make these methods impractical, in particular when only a few people have the expertise to create data sets suitable for training.
A new approach was developed that requires no training data and relies on a learning-free classifier. It builds on the work in Mohammed, Märgner and Stiehl 2018, which demonstrated a state-of-the-art classification rate for writer identification on manuscript images, without any pre-processing steps, using the learning-free classifier based on Naïve Bayes Nearest-Neighbour (NBNN) proposed in Mohammed et al. 2017. In addition, Terzić and du Buf 2014 proposed a category-level object detection method based on the NBNN algorithm with state-of-the-art performance on datasets of objects in complex scenes. The proposed method combines these two lines of work to benefit from the strengths of both.
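To illustrate why an NBNN-style classifier needs no training phase, the following is a minimal sketch of the generic NBNN decision rule: each local descriptor of a query image is matched to its nearest neighbour among the reference descriptors of each class, and the class with the smallest summed squared distance is chosen. This is not the authors' implementation; the function name `nbnn_classify`, the class labels, and the synthetic 8-dimensional descriptors are assumptions made purely for illustration.

```python
import numpy as np


def nbnn_classify(query_descriptors, class_descriptors):
    """Generic NBNN decision rule (illustrative sketch, not the paper's code).

    query_descriptors: array of shape (n_query, dim)
    class_descriptors: dict mapping a label to an array of shape (n_ref, dim)
    Returns the label with the smallest summed nearest-neighbour distance,
    together with the per-class totals.
    """
    totals = {}
    for label, refs in class_descriptors.items():
        # Squared Euclidean distance from every query descriptor
        # to every reference descriptor of this class.
        diffs = query_descriptors[:, None, :] - refs[None, :, :]
        sq_dists = np.sum(diffs ** 2, axis=2)
        # Keep only the nearest reference per query descriptor, then sum.
        totals[label] = float(sq_dists.min(axis=1).sum())
    best = min(totals, key=totals.get)
    return best, totals


# Synthetic stand-ins for local image descriptors (hypothetical data).
rng = np.random.default_rng(0)
class_descriptors = {
    "pattern": rng.normal(0.0, 0.1, size=(50, 8)),
    "background": rng.normal(1.0, 0.1, size=(50, 8)),
}
query = rng.normal(0.0, 0.1, size=(10, 8))

label, totals = nbnn_classify(query, class_descriptors)
print(label, totals)
```

No parameters are fitted here: the reference descriptors are used directly, so a single labelled example of a pattern can in principle serve as the reference set. A real system would extract the descriptors from image regions rather than draw them at random.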
This method was successfully applied to the publicly available data set AMADI_LontarSet, as well as to the following research case. A peculiar variation of the Sanskrit invocation hariḥ om, written in an unusual version of Tamilian Grantha script, had been noticed by chance on the margins of palm-leaf pothi manuscripts originating from the cultural area of Tamil Nadu. Only a few occurrences of it had been detected over the last decade or so in the digital images of pothis belonging to the École française d’Extrême-Orient (EFEO) and the French Institute of Puducherry (IFP): two and twenty-one, respectively. Using the developed method, it was possible to automatically retrieve six more occurrences within a few hours from the pothis of the EFEO collection, which consists of 1,625 manuscripts, corresponding to 155,372 images.
Finally, a user-friendly software tool based on this method, the Visual-Pattern Detector (VPD), was created and made freely available.