Akis: Ottoman Transcription Tool
Goal
The Ottoman Transcription Tool project Akis is carried out in cooperation with DH Lab and VERİM (Data Analytics Research and Application Center). Our aim is to develop recognition technologies that can transcribe Ottoman Turkish handwritten and printed works written in Arabic and Persian script into Latin script. Thus we want to make texts in Ottoman archives and libraries written in Ottoman Turkish more accessible to researchers and general users from different disciplines.
Researchers in studies in the fields of social sciences and humanities such as history, literature, and political science that focus on the period before the Turkish letter revolution of 1928, usually first transcribe the Ottoman Turkish texts. However, such transcriptions can only be done manually by experts who can read Ottoman Turkish. With the recent digitization of old texts prepared in Ottoman Turkish, a very large repository has become much more accessible and the pace of digitization is increasing day by day. However, as this digital image repository is still made available by manual transcription, this work is increasingly beyond the capacity of individual human labor.
Thanks to the automatic transcription system and application aimed to be developed within the scope of the project, we aim to eliminate this manual transcription process and to perform the automatic transcription of documents within a determined scope. With the success of this project, texts published before 1928 will become accessible to different segments of society.
Academic and Social Impact
We will make the software that is developed in the multi-disciplinary Akis project available as a web application. It will be accessible to experts working in the field of computer science and social and human scientists working in the field of Ottoman studies.
The automatic transcription of handwritten texts that use the Ottoman alphabet and printed texts will be critical, as it will drastically reduce research processes in Social Sciences and Humanities.
We foresee that making Ottoman Turkish texts accessible to various segments of society will have a significant impact on society, especially in the fields of culture and education.
Context
The common written language of the Ottoman Empire was Ottoman Turkish, also known as Ottoman Turkish, for which the Arabic-Persian alphabet was used. Ottoman Turkish is a written language that was used from the end of the 14th century to the middle of the 20th century. It is based on the Turkish syntax structure, and covers Arabic and Persian words, word groups and morphological features (Timurtaş, 2017; Ergin, 2020).
Ottoman Turkish is the main writing system that is encountered both in manuscripts prepared by scribes in earlier periods and in printed works that started to be produced in printing houses in the period after 1729.
The same writing system was used in the early period of the Turkish Republic until the Turkish letter revolution in 1928. Accordingly, primary sources used by the researchers in all of the studies on the geography of the empire and the republic, carried out in different disciplines of social sciences and humanities such as history, literature, art history, architectural history, political science and sociology, and focusing on the period before 1928, mainly consist of Ottoman Turkish texts in which the Arabic-Persian alphabet was used.
Ottoman-Turkish (14th -20th Centuries)
Handwritten Materials and Manuscripts Arabic
and Persian alphabet
The Transition Period to Printed Culture (1729-early 20th Century)
Handwritten Materials and Manuscripts + Printed Culture Arabic
and Persian Alphabet
Language Reform - 1928
Printed Culture
Latin-based alphabet
Method
In the Akis Ottoman Transcription Tool project the latest handwriting recognition methods will be used for the transcription of handwritten and printed works that use the so-called ‘Nesih’ (Naskh)script into Latin script.
With the application of deep learning technologies in handwriting recognition systems, automatic recognition of historical texts has yielded highly successful results in many languages. However, to date, these latest approaches have not been applied to texts produced in Ottoman Turkish.The constraints that a few similar commercial products impose on free use and the lack of transparent measures of performance reduce the usefulness of these products for the general public and researchers.
How is Akis developed?
Creating a Data Set: First, we create a large-size data set to be used in to-be-developed training and testing artificial intelligence and machine learning models.
Pre-processing:Although segmenting the collected document images into lines is relatively easy for printed texts, differently written titles, background noise (clutter), or color changes such as yellowing on the paper can make it difficult to detect lines. This difficulty may be exacerbated by the slight rotation that may occur during the scanning of the page. In this respect, we perform the necessary pre-processing steps such as noise removal and rotation correction before the line-segmentation phase.
Application Development: In the last phase of the project, we will develop a web application suitable for researchers from different fields. Users will be able to transmit the text image they want to be transcribed to our system via the application and receive the result in real time through the application. For documents with complex page layouts such as newspapers, researchers will be able to manually select certain sections, upload them to the system and acquire the transcription.
Segmentation and Labeling: We use both traditional and deep learning-based approaches to the segmentation of the text. We separate the collected document images into lines semi-automatically; automatically lined pages are also checked for errors. We create the Turkish transcription of each line manually. To facilitate this process, we are developing a labeling input software that will display rows one by one to the labeler.
Transcription: We use the samples in the dataset, which is formed after the line images and matching tags are obtained by going through the pre-processing and segmentation stages, in training a deep learning-based recognition system through supervised learning.
Method
Akis yazılımımızın demo videosunu izleyebilirsiniz.
Research Team
Principal Investigator
Senior Researchers
University of Vienna
Istanbul Medeniyet University
Sabancı University
PURE Summer 2022 Undergraduate Researchers
Research Assistants
Bilge Çağlar
Boğaziçi University
Sabancı University
Boğaziçi University
Sabancı University
Middle East Technical University
Boğaziçi University
Sabancı University