Title
Surface Statistics of an Unknown Language Indicate How to Parse It.
Abstract
We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training achieves further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work's interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.65 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).
Year
2018
DOI
10.1162/tacl_a_00248
Venue
TACL
Field
Computer science, Dependency grammar, Natural language processing, Artificial intelligence, Constructed language, Parsing
DocType
Journal
Volume
6
Citations
2
PageRank
0.36
References
1
Authors
2
Name
Order
Citations
PageRank
Dingquan Wang1112.51
Jason Eisner21825173.00