Lang Resources & Evaluation (2011) 45:331-344 DOI 10.1007/s10579-011-9159-7
ORIGINAL PAPER
Guy De Pauw · Peter Waiganjo Wagacha · Gilles-Maurice de Schryver
Published online: 19 July 2011 © Springer Science+Business Media B.V. 2011
Abstract Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word parallel corpus English-Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence and word-aligned and morphosyntactic information was added to both the English and Swahili portion of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We particularly focus on the difficulties
The research presented in this paper was made possible through the support of the VLIR-IUC-UON program and was partly funded by the SAWA BOF UA-2007 project. The first author is funded as a Postdoctoral Fellow of the Research Foundation - Flanders (FWO).
G. De Pauw (✉)
CLiPS, Department of Linguistics, University of Antwerp, Antwerp, Belgium e-mail: [email protected]
G. De Pauw · P. W. Wagacha
School of Computing and Informatics, University of Nairobi, Nairobi, Kenya
P. W. Wagacha e-mail: [email protected]
G.-M. de Schryver
Department of African Languages and Cultures, Ghent University, Ghent, Belgium e-mail: [email protected]
G.-M. de Schryver
Xhosa Department, University of the Western Cape, Cape Town, South Africa
Exploring the SAWA corpus: collection and deployment of a parallel corpus English-Swahili
of translating English into the morphologically more complex Bantu language of Swahili.
Keywords Parallel corpus · Swahili · English · Machine translation · Projection of annotation · African language technology
1 Introduction
Typical language technology applications such as information extraction, spell checking and machine translation can provide an invaluable, but all too often ignored, impetus in bridging the digital divide between the Western world and developing countries. In Africa, quite a few localization efforts are currently underway that allow improved ICT access...