Compilation of specialized comparable corpora in French and Japanese - Citegraph

Paper Info

Title
Compilation of specialized comparable corpora in French and Japanese

Abstract
This paper presents the development of a tool for compiling specialized comparable corpora whose quality should be close to that of a manually compiled corpus. Comparability is defined on three levels: domain, topic, and type of discourse. Domain and topic can be filtered using the keywords of a web search, but detecting the type of discourse requires a broader linguistic analysis. The first step of our work is to automate the detection of the types of discourse found in a scientific domain (scientific and popular science) in French and Japanese. We first carry out a contrastive stylistic analysis of the two types of discourse in both languages. This analysis leads to a reusable, generic, and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. The results are good, with an average precision of 80% and an average recall of 70%, which demonstrates the effectiveness of the typology. The classification tool is then integrated into a corpus compilation tool, a text collection processing chain built on the IBM UIMA framework. Starting from two collections of specialized web documents, one in French and one in Japanese, the tool creates the corresponding corpus.

Year	Venue	Keywords
2011	BUCC@ACL/IJCNLP	average recall, classification tool, corpus compilation tool, wide linguistic analysis, Japanese language, robust typology, corresponding corpus, contrastive stylistic analysis, scientific domain, average precision, specialized comparable corpus, machine learning
Field	DocType	Citations
Shallow parsing, IBM, Information retrieval, Computer science, Typology, Natural language processing, Artificial intelligence, Comparability, Recall, Linguistic analysis	Conference	2
PageRank	References	Authors
0.40	12	3

Authors (3 rows)

Name	Order	Citations	PageRank
Lorraine Goeuriot	1	266	30.94
Emmanuel Morin	2	2	0.40
Béatrice Daille	3	306	34.40