Akis: Osmanlıca Transkripsiyon Aracı | Sabanci University Dhlab

Akis: Ottoman Transcription Tool

Goal

The Ottoman Transcription Tool project Akis is carried out in cooperation with DH Lab and VERİM (Data Analytics Research and Application Center). Our aim is to develop recognition technologies that can transcribe Ottoman Turkish handwritten and printed works written in Arabic and Persian script into Latin script. Thus we want to make texts in Ottoman archives and libraries written in Ottoman Turkish more accessible to researchers and general users from different disciplines.

Researchers in studies in the fields of social sciences and humanities such as history, literature, and political science that focus on the period before the Turkish letter revolution of 1928, usually first transcribe the Ottoman Turkish texts. However, such transcriptions can only be done manually by experts who can read Ottoman Turkish. With the recent digitization of old texts prepared in Ottoman Turkish, a very large repository has become much more accessible and the pace of digitization is increasing day by day. However, as this digital image repository is still made available by manual transcription, this work is increasingly beyond the capacity of individual human labor.

Thanks to the automatic transcription system and application aimed to be developed within the scope of the project, we aim to eliminate this manual transcription process and to perform the automatic transcription of documents within a determined scope. With the success of this project, texts published before 1928 will become accessible to different segments of society.

Academic and Social Impact

We will make the software that is developed in the multi-disciplinary Akis project available as a web application. It will be accessible to experts working in the field of computer science and social and human scientists working in the field of Ottoman studies.

The automatic transcription of handwritten texts that use the Ottoman alphabet and printed texts will be critical, as it will drastically reduce research processes in Social Sciences and Humanities.

We foresee that making Ottoman Turkish texts accessible to various segments of society will have a significant impact on society, especially in the fields of culture and education.

Context

The common written language of the Ottoman Empire was Ottoman Turkish, also known as Ottoman Turkish, for which the Arabic-Persian alphabet was used. Ottoman Turkish is a written language that was used from the end of the 14th century to the middle of the 20th century. It is based on the Turkish syntax structure, and covers Arabic and Persian words, word groups and morphological features (Timurtaş, 2017; Ergin, 2020).

Ottoman Turkish is the main writing system that is encountered both in manuscripts prepared by scribes in earlier periods and in printed works that started to be produced in printing houses in the period after 1729.

The same writing system was used in the early period of the Turkish Republic until the Turkish letter revolution in 1928. Accordingly, primary sources used by the researchers in all of the studies on the geography of the empire and the republic, carried out in different disciplines of social sciences and humanities such as history, literature, art history, architectural history, political science and sociology, and focusing on the period before 1928, mainly consist of Ottoman Turkish texts in which the Arabic-Persian alphabet was used.

Ottoman-Turkish (14th -20th Centuries)

Handwritten Materials and Manuscripts Arabic

and Persian alphabet

The Transition Period to Printed Culture (1729-early 20th Century)

Handwritten Materials and Manuscripts + Printed Culture Arabic

and Persian Alphabet

Language Reform - 1928

Printed Culture

Latin-based alphabet

akis2 — Latin script: Watch out, you are going to get kicked!
Arabic script: Doing this is as hard as reading me!
(From the Turkish journal Akbaba, Istanbul, 1926)

Method

In the Akis Ottoman Transcription Tool project the latest handwriting recognition methods will be used for the transcription of handwritten and printed works that use the so-called ‘Nesih’ (Naskh)script into Latin script.

With the application of deep learning technologies in handwriting recognition systems, automatic recognition of historical texts has yielded highly successful results in many languages. However, to date, these latest approaches have not been applied to texts produced in Ottoman Turkish.The constraints that a few similar commercial products impose on free use and the lack of transparent measures of performance reduce the usefulness of these products for the general public and researchers.

nesih — Types of writing used in Ottoman Turkish (Hutut-u Mütenevvia Plate), Calligrapher: Hamid Ayaç, SSM Collection. Nesih writing style, indicated by the red area, stands out as the most legible style.

nesih el yazma — Nesih style manuscripts from the 16th and 17th centuries

nesih matbu — Printed documents in nesih style from the second half of the 19th and early 20th centuries

How is Akis developed?

Creating a Data Set: First, we create a large-size data set to be used in to-be-developed training and testing artificial intelligence and machine learning models.

Pre-processing:Although segmenting the collected document images into lines is relatively easy for printed texts, differently written titles, background noise (clutter), or color changes such as yellowing on the paper can make it difficult to detect lines. This difficulty may be exacerbated by the slight rotation that may occur during the scanning of the page. In this respect, we perform the necessary pre-processing steps such as noise removal and rotation correction before the line-segmentation phase.

Application Development: In the last phase of the project, we will develop a web application suitable for researchers from different fields. Users will be able to transmit the text image they want to be transcribed to our system via the application and receive the result in real time through the application. For documents with complex page layouts such as newspapers, researchers will be able to manually select certain sections, upload them to the system and acquire the transcription.

Segmentation and Labeling: We use both traditional and deep learning-based approaches to the segmentation of the text. We separate the collected document images into lines semi-automatically; automatically lined pages are also checked for errors. We create the Turkish transcription of each line manually. To facilitate this process, we are developing a labeling input software that will display rows one by one to the labeler.

Transcription: We use the samples in the dataset, which is formed after the line images and matching tags are obtained by going through the pre-processing and segmentation stages, in training a deep learning-based recognition system through supervised learning.