The Serialization of Heterogeneous Documents. - Citegraph

Paper Info

Title
The Serialization of Heterogeneous Documents.

Abstract
Tasks involving the analysis of natural language are typically conducted on a corpus or corpora of plain text. However, it is rare that a document is unstructured and freeform in its entirety. Documents such as corporate disclosures, medical journals and other knowledge-rich archives contain structured and loosely structured information that can be used in a variety of important text mining tasks. In this paper we propose a syntactical preprocessing architecture to serialize presentation-oriented documents to a machine-readable format that aspires to preserve the document structure, contents and metadata. We introduce a hybrid pipeline architecture, discussing the various processes and the future research direction that could potentially lead to a holistic representation of heterogeneous documents.
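
The abstract does not describe the serialization format itself, so the following is only an illustrative sketch of the general idea: a document is kept as typed blocks plus metadata and written to a machine-readable form without flattening it to plain text. The names `Block`, `Document` and `serialize`, and the choice of JSON, are assumptions made for this example and are not taken from the paper.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Block:
    """One unit of a document; `kind` records its structural role (hypothetical)."""
    kind: str        # e.g. "heading", "paragraph", "table"
    content: object  # a string for text, a list of rows for a table


@dataclass
class Document:
    """A document as metadata plus an ordered list of blocks (hypothetical)."""
    metadata: dict
    blocks: list = field(default_factory=list)


def serialize(doc: Document) -> str:
    """Write structure, contents and metadata to a JSON string."""
    return json.dumps(asdict(doc), indent=2, sort_keys=True)


# Made-up sample data, used only to exercise the sketch.
doc = Document(
    metadata={"title": "Example disclosure", "year": 2015},
    blocks=[
        Block("heading", "Results"),
        Block("paragraph", "Revenue rose during the period."),
        Block("table", [["Quarter", "Revenue"], ["Q1", "10"]]),
    ],
)

serialized = serialize(doc)
restored = json.loads(serialized)  # block order and table rows survive the round trip
```

Keeping the block type next to its content is what separates this from plain-text extraction: a downstream text mining task can still tell a table row from a sentence.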

Year	Venue	Field
2015	FedCSIS Position Papers	Data mining,Metadata,World Wide Web,Architecture,Serialization,Computer science,Document Structure Description,Machine-readable data,Plain text,Preprocessor,Natural language
DocType	Citations	PageRank
Conference	0	0.34
References	Authors
6	3

Authors (3 rows)

Name	Order	Citations	PageRank
Peter J. Hampton	1	0	0.68
William Blackburn	2	9	4.92
Hui Wang	3	76	17.01

Cited by (0 rows)

References (6 rows)
