Abstract |
---|
In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. Before advancing that position, we first examine two massively multilingual resources used in language technology development, identifying shortcomings that limit their usefulness. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not in fact in the languages they claim to be, requiring non-trivial effort to correct. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data. We then discuss the importance of creating annotations for lower-resourced languages in a thoughtful and ethical way that includes the language speakers as part of the development process. We conclude with recommended guidelines for resource development. |
Year | DOI | Venue |
---|---|---|
2022 | 10.18653/v1/2022.findings-acl.44 | FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022) |
DocType | Volume | Citations |
---|---|---|
Conference | Findings of the Association for Computational Linguistics: ACL 2022 | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Constantine Lignos | 1 | 0 | 1.01 |
Nolan Holley | 2 | 0 | 0.34 |
Chester Palen-Michel | 3 | 0 | 0.68 |
Jonne Sälevä | 4 | 0 | 0.34 |