Abstract

This investigation addresses the problem of language identification in noisy texts, which can be the first step of many natural language processing and information retrieval tasks. Language identification is the task of automatically identifying the language of a given text. Although several methods exist in the literature, their performance is not convincing in practice. In this contribution, we propose two statistical approaches: the high frequency approach and the nearest prototype approach. For the first, five language identification algorithms are proposed and implemented: character-based identification (CBA), word-based identification (WBA), special-character-based identification (SCA), a sequential hybrid algorithm (HA1) and a parallel hybrid algorithm (HA2). For the second, we use 11 similarity measures combined with several types of character N-grams. For evaluation, the proposed methods are tested on forum datasets containing 32 different languages. Furthermore, an experimental comparison is made between the proposed approaches and several reference language identification tools: LIGA, NTC, Google Translate and Microsoft Word. The results show that the proposed approaches outperform these baseline language identification methods on forum texts.

Details

Title
Effective language identification of forum texts based on statistical approaches
Publication title
Volume
52
Issue
4
First page
491
Publication year
2016
Publication date
Jul 2016
Publisher
Elsevier Science Ltd.
Place of publication
Oxford
Country of publication
United Kingdom
ISSN
03064573
e-ISSN
18735371
CODEN
IPMADK
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
ProQuest document ID
1792389416
Document URL
https://www.proquest.com/scholarly-journals/effective-language-identification-forum-texts/docview/1792389416/se-2?accountid=208611
Copyright
Copyright Pergamon Press Inc. Jul 2016
Last updated
2025-11-15
Database
ProQuest One Academic