Title
The Serialization of Heterogeneous Documents.
Abstract
Tasks involving the analysis of natural language are typically conducted on a corpus or corpora of plain text. However, it is rare that a document is unstructured and freeform in its entirety. Documents such as corporate disclosures, medical journals and other knowledge-rich archives contain structured and loosely structured information that can be used in a variety of important text mining tasks. In this paper we propose a syntactical preprocessing architecture to serialize presentation-oriented documents to a machine-readable format that aspires to preserve the document structure, contents and metadata. We introduce a hybrid pipeline architecture, discussing the various processes and the future research direction that could potentially lead to a holistic representation of heterogeneous documents.
Year
2015
Venue
FedCSIS Position Papers
Field
Data mining, Metadata, World Wide Web, Architecture, Serialization, Computer science, Document Structure Description, Machine-readable data, Plain text, Preprocessor, Natural language
DocType
Conference
Citations
0
PageRank
0.34
References
6
Authors
3
Name                 Order   Citations   PageRank
Peter J. Hampton     1       0           0.68
William Blackburn    2       9           4.92
Hui Wang             3       76          17.01