Title
Compilation of specialized comparable corpora in French and Japanese
Abstract
This paper presents a tool for compiling specialized comparable corpora whose quality approaches that of a manually compiled corpus. Comparability is defined on three levels: domain, topic, and type of discourse. Domain and topic can be filtered through the keywords used in web search, but detecting the type of discourse requires a wide-ranging linguistic analysis. The first step of our work is to automate the detection of the types of discourse found in a scientific domain (scientific and popular science) in French and Japanese. We begin with a contrastive stylistic analysis of the two types of discourse in both languages, which leads to a reusable, generic, and robust typology. Machine learning algorithms are then applied to this typology, using shallow parsing. The results are good, with an average precision of 80% and an average recall of 70%, which demonstrates the effectiveness of the typology. The classification tool is then integrated into a corpus compilation tool, a text collection processing chain implemented with the IBM UIMA system. Starting from two collections of specialized web documents, one in French and one in Japanese, the tool builds the corresponding corpus.
Year
2011
Venue
BUCC@ACL/IJCNLP
Keywords
average recall, classification tool, corpus compilation tool, wide linguistic analysis, japanese language, robust typology, corresponding corpus, contrastive stylistic analysis, scientific domain, average precision, specialized comparable corpus, machine learning
Field
Shallow parsing, IBM, Information retrieval, Computer science, Typology, Natural language processing, Artificial intelligence, Comparability, Recall, Linguistic analysis
DocType
Conference
Citations
2
PageRank
0.40
References
12
Authors
3
Name                Order    Citations    PageRank
Lorraine Goeuriot   1        266          30.94
Emmanuel Morin      2        2            0.40
Béatrice Daille     3        306          34.40