Abstract |
---|
This paper presents the development of a tool for compiling specialized comparable corpora, whose output quality should approach that of a manually compiled corpus. Comparability is based on three levels: domain, topic, and type of discourse. Domain and topic can be filtered through the keywords used in web search, but detecting the type of discourse requires a wide linguistic analysis. The first step of our work is to automate the detection of the types of discourse found in a scientific domain (science and popular science) in French and Japanese. First, a contrastive stylistic analysis of the two types of discourse is carried out for both languages. This analysis leads to a reusable, generic, and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. We obtain good results, with an average precision of 80% and an average recall of 70%, which demonstrate the effectiveness of this typology. The classification tool is then integrated into a corpus compilation tool, a text-collection processing chain implemented with the IBM UIMA system. Starting from two collections of specialized web documents, one in French and one in Japanese, this tool creates the corresponding corpus. |
Year | Venue | Keywords |
---|---|---|
2011 | BUCC@ACL/IJCNLP | average recall, classification tool, corpus compilation tool, wide linguistic analysis, Japanese language, robust typology, corresponding corpus, contrastive stylistic analysis, scientific domain, average precision, specialized comparable corpus, machine learning |
Field | DocType | Citations |
---|---|---|
Shallow parsing, IBM, Information retrieval, Computer science, Typology, Natural language processing, Artificial intelligence, Comparability, Recall, Linguistic analysis | Conference | 2 |
PageRank | References | Authors |
---|---|---|
0.40 | 12 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Lorraine Goeuriot | 1 | 266 | 30.94 |
Emmanuel Morin | 2 | 2 | 0.40 |
Béatrice Daille | 3 | 306 | 34.40 |