Title
Extending Spark Analytics through Tika-Based Information Extraction and Retrieval
Abstract
In this paper, we focus on techniques for combining the parallelized data processing (i.e., map-reduce) capabilities of Apache Spark with the extensive file-type parsing support of Apache Tika. Each framework has its own appeal for data scientists: Spark efficiently parallelizes the processing of very large, often text-based data sets, while Tika provides consistent information extraction across more than 1,200 text and binary file types, processing one file at a time. We investigate the technical integration of the two frameworks, which is relevant to data scientists pursuing two types of use cases: (1) analysis of numerous (1000x) unpartitioned small to medium-sized Tika-parseable files, and (2) analysis of very large, partitionable Tika-parseable files. Because Tika specializes in file extraction and Spark in parallelized computing, the benefits of integrating them merit exploration. We therefore investigate best practices that give data scientists tools to gain insight into a greater portion of the data formats commonly in use.
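As an illustration of use case (1), the following is a minimal sketch of one way to run Tika over many whole files from Spark. It is not the authors' implementation. It assumes PySpark and the third-party tika-python package (with a reachable Tika server) are installed on the cluster. The names extract_record, extract_corpus, and tika_parse are hypothetical helpers introduced only for this example.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def tika_parse(raw: bytes) -> Dict[str, Any]:
    """Parse one file's bytes with Tika.

    Assumption: the tika-python package is available on every executor.
    The import is deferred so this module loads without it.
    """
    from tika import parser
    return parser.from_buffer(raw)


def extract_record(item: Tuple[str, bytes],
                   parse: Callable[[bytes], Dict[str, Any]] = tika_parse) -> Dict[str, Any]:
    """Turn one (path, bytes) pair into a plain dict of extracted text and metadata."""
    path, raw = item
    parsed = parse(bytes(raw)) or {}
    return {
        "path": path,
        # Tika may return no content (e.g. for some binary files), so guard against None.
        "text": (parsed.get("content") or "").strip(),
        "metadata": parsed.get("metadata") or {},
    }


def extract_corpus(sc, input_glob: str,
                   min_partitions: Optional[int] = None,
                   parse: Callable[[bytes], Dict[str, Any]] = tika_parse):
    """Distribute whole-file parsing across the cluster.

    sc.binaryFiles yields one (path, bytes) pair per file, so each file is
    handed to the parser intact rather than being split into blocks.
    """
    files = sc.binaryFiles(input_glob, minPartitions=min_partitions)
    return files.map(lambda item: extract_record(item, parse))
```

With a live SparkContext `sc`, a call such as `extract_corpus(sc, "hdfs:///corpus/*")` (the path is a placeholder) would return an RDD of dictionaries that ordinary Spark transformations can consume. This pattern only fits files small enough to be read whole on a single executor. Use case (2), very large partitionable files, would need a different approach.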
Year
2015
DOI
10.1109/IRI.2015.43
Venue
Information Reuse and Integration
Keywords
Spark, Tika, Hadoop, cluster computing, distributed computing
Field
Data mining, Data processing, Data set, Spark (mathematics), Use case, Computer science, Information extraction, Parsing, Analytics, Computer cluster
DocType
Conference
Citations
0
PageRank
0.34
References
4
Authors
2
Name               Order  Citations  PageRank
Rishi Verma        1      0          0.34
Chris A. Mattmann  2      200        25.39