Annotation Tools For Syntax And Named Entities In The National Corpus Of Polish - Citegraph

Paper Info

Title
Annotation Tools For Syntax And Named Entities In The National Corpus Of Polish

Abstract
The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.

Year	2013
DOI	10.1504/IJDMMM.2013.053691
Venue	International Journal of Data Mining, Modelling and Management
Keywords	corpus annotation, National Corpus of Polish, shallow parsing, chunking, named entity recognition, NER
Field	Shallow parsing, Rule-based machine translation, Annotation, XML, Computer science, Natural language processing, Chunking (psychology), Artificial intelligence, Parsing, Syntax, Named-entity recognition
DocType	Journal
Volume	5
Issue	2
ISSN	1759-1163
Citations	2
PageRank	0.43
References	14
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Jakub Waszczuk	1	29	6.17
Katarzyna Glowinska	2	26	4.08
Agata Savary	3	92	19.55
Adam Przepiórkowski	4	179	30.37
Michal Lenart	5	8	1.64
