Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation

Paper Info

Title
Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation

Abstract
The Arabic Treebank (ATB), released by the Linguistic Data Consortium, contains multiple annotation files for each source file, due in part to the role of diacritic inclusion in the annotation process. The data is made available in both "vocalized" and "unvocalized" forms, with and without the diacritic marks, respectively. Much parsing work with the ATB has used the unvocalized form, on the basis that it more closely represents the "real-world" situation. We point out some problems with this usage of the unvocalized data and explain why the unvocalized form does not in fact represent "real-world" data. This is due to some aspects of the treebank annotation that to our knowledge have never before been published.

Year	Venue	Field
2008	SIXTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, LREC 2008	Linguistic Data Consortium, Annotation, Arabic, Computer science, Source code, Speech recognition, Diacritic, Artificial intelligence, Treebank, Natural language processing, Parsing
DocType	Citations	PageRank
Conference	3	0.49
References	Authors
1	3

Authors (3 rows)

Name	Order	Citations	PageRank
Mohamed Maamouri	1	112	13.34
Seth Kulick	2	221	29.66
Ann Bies	3	136	20.02
