Abstract |
---|
Tasks involving the analysis of natural language are typically conducted on a corpus or corpora of plain text. However, it is rare that a document is unstructured and freeform in its entirety. Documents such as corporate disclosures, medical journals and other knowledge-rich archives contain structured and loosely structured information that can be used in a variety of important text mining tasks. In this paper we propose a syntactical preprocessing architecture to serialize presentation-oriented documents to a machine-readable format that aspires to preserve the document structure, contents and metadata. We introduce a hybrid pipeline architecture, discussing the various processes and the future research direction that could potentially lead to a holistic representation of heterogeneous documents. |
Year | Venue | Field |
---|---|---|
2015 | FedCSIS Position Papers | Data mining, Metadata, World Wide Web, Architecture, Serialization, Computer science, Document Structure Description, Machine-readable data, Plain text, Preprocessor, Natural language |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
6 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Peter J. Hampton | 1 | 0 | 0.68 |
William Blackburn | 2 | 9 | 4.92 |
Hui Wang | 3 | 76 | 17.01 |