Title
A large Portuguese corpus on-line: cleaning and preprocessing
Abstract
We present a newly available on-line resource for Portuguese: a corpus of 310 million words that is a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. Here we report on the work carried out on the corpus prior to its publication on-line. We focus on the processes and tools involved in cleaning, preparing and annotating the corpus to make it suitable for linguistic inquiry.
Year
2012
DOI
10.1007/978-3-642-28885-2_13
Venue
PROPOR
Keywords
million word, contemporary portuguese, available on-line resource, user-friendly web interface, new version, reference corpus, linguistic inquiry, large portuguese corpus
Field
Annotation, Computer science, Portuguese, Text corpus, Preprocessor, Artificial intelligence, Corpus linguistics, Natural language processing, User interface
DocType
Conference
Citations
2
PageRank
0.47
References
9
Authors
3
Name             Order  Citations  PageRank
Michel Généreux  1      29         3.95
Iris Hendrickx   2      285        30.91
Amália Mendes    3      19         8.15