Exploring the SAWA corpus: collection and deployment of a parallel corpus English--Swahili

Paper Info

Title
Exploring the SAWA corpus: collection and deployment of a parallel corpus English--Swahili

Abstract
Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word parallel corpus English--Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English--Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.

Year	DOI	Venue
2011	10.1007/s10579-011-9159-7	Language Resources and Evaluation
Keywords	Field	DocType
sawa corpus,accessible data,swahili translation dictionary,two-million-word parallel corpus,word-aligned parallel corpus,parallel corpus,language pair,swahili portion,annotated parallel corpus,corpus annotation,english,machine translation	Example-based machine translation,Computer science,Swahili,Machine translation,Artificial intelligence,Corpus linguistics,Natural language processing,Annotation,Bantu languages,Text corpus,Speech recognition,Linguistics,Sentence	Journal
Volume	Issue	ISSN
45	3	1574-020X
Citations	PageRank	References
3	0.47	15
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Guy De Pauw	1	75	12.47
Peter Waiganjo Wagacha	2	7	0.92
Gilles-Maurice de Schryver	3	17	2.17
