Abstract

The absence of explicit word boundaries in scriptio continua scripts hinders the development of NLP models for such scripts. The objective of this research is to facilitate the building of NLP models for scriptio continua scripts by designing a word segmentation model that predicts word boundaries between characters in sentences, focusing on ancient Tamil script. We use an n-gram Naive Bayes model to predict whether a word boundary exists between two adjacent characters in a scriptio continua text. We trained and evaluated the model on a dataset of ancient Tamil writing, achieving an accuracy of 91.28%. Efficient segmentation of ancient Tamil texts not only helps preserve and comprehend historical manuscripts but also enables advances in automated text segmentation. The model will assist archeologists in constructing NLP models for ancient Tamil, allowing significant information to be extracted from ancient Tamil manuscripts without the need for a language expert. Further research could examine more effective word segmentation techniques, the handling of scripts from different centuries, and models for additional scripts.

Details

1009240
Business indexing term
Title
Word segmentation of ancient Tamil text extracted from inscriptions
Publication title
Volume
13
Issue
1
Pages
97
Publication year
2025
Publication date
Dec 2025
Publisher
Springer Nature B.V.
Place of publication
London
Country of publication
Netherlands
Publication subject
e-ISSN
20507445
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-03-31
Milestone dates
2025-01-17 (Registration); 2024-08-12 (Received); 2024-12-01 (Accepted)
First posting date
31 Mar 2025
ProQuest document ID
3184228305
Document URL
https://www.proquest.com/scholarly-journals/word-segmentation-ancient-tamil-text-extracted/docview/3184228305/se-2?accountid=208611
Copyright
Copyright Springer Nature B.V. Dec 2025
Last updated
2025-04-16
Database
ProQuest One Academic