Title
Efficient document analytics on compressed data: method, challenges, algorithms, insights
Abstract
Today's rapidly growing document volumes pose pressing challenges to modern document analytics, in both space usage and processing time. In this work, we propose the concept of compression-based direct processing to alleviate issues in both dimensions. The main idea is to enable document analytics directly on compressed data. We show how the concept can be materialized on Sequitur, a compression algorithm that produces hierarchical grammar-like representations. We discuss the major challenges in applying the idea to various document analytics tasks, and present a set of guidelines and assistant software modules that help developers apply compression-based direct processing effectively. Experiments show that our proposed techniques save 90.8% of storage space and 77.5% of memory usage, while speeding up data processing significantly: by 1.6X on sequential systems and 2.2X on distributed clusters, on average.
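To illustrate the main idea, here is a minimal sketch of a word count computed directly on a grammar-like compressed form, without expanding it first. It is not the authors' implementation: the grammar is written by hand rather than produced by Sequitur, and the names `GRAMMAR`, `expand`, and `word_count` are hypothetical.

```python
from collections import Counter

# Hypothetical hand-written grammar (not actual Sequitur output) for the text
# "a b c a b d a b c a b d". Keys are rules; other symbols are plain words.
GRAMMAR = {
    "R0": ["R1", "R1"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}


def expand(grammar, symbol="R0"):
    """Baseline: decompress a symbol into its full list of words."""
    if symbol not in grammar:
        return [symbol]
    words = []
    for child in grammar[symbol]:
        words.extend(expand(grammar, child))
    return words


def word_count(grammar, root="R0"):
    """Count words on the grammar itself; each rule is processed only once."""
    memo = {}

    def count(rule):
        if rule not in memo:
            total = Counter()
            for child in grammar[rule]:
                if child in grammar:
                    # Reuse the counts already computed for a sub-rule.
                    total.update(count(child))
                else:
                    total[child] += 1
            memo[rule] = total
        return memo[rule]

    return count(root)


if __name__ == "__main__":
    direct = word_count(GRAMMAR)
    baseline = Counter(expand(GRAMMAR))
    print(direct == baseline)
```

Because repeated content is represented by a shared rule, the counts for that rule are computed once and reused wherever the rule appears, which is the intuition behind saving both space and processing time.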
Year: 2018
DOI: 10.14778/3236187.3236203
Venue: Hosted Content
Field: Data science, Data mining, Computer science, Analytics
DocType: Journal
Volume: 11
Issue: 11
ISSN: 2150-8097
Citations: 3
PageRank: 0.39
References: 0
Authors: 5
Name            Order  Citations  PageRank
Feng Zhang      1      79         14.36
Jidong Zhai     2      340        36.27
Xipeng Shen     3      2025       118.55
Onur Mutlu      4      9446       357.40
Wenguang Chen   5      1014       70.57