Paper Info

Title
Extending Spark Analytics through Tika-Based Information Extraction and Retrieval

Abstract
In this paper, we focus on techniques to merge the parallelized data processing (i.e., map-reduce) capabilities of Apache Spark with the extensive file-type parsing support of Apache Tika. Each framework has its own appeal for data scientists: Spark efficiently parallelizes the processing of very large, often text-based data sets, while Tika provides consistent information extraction from over 1,200 text and binary file types, one file at a time. The technical integration of these two frameworks is the subject of our investigation, and is relevant to data scientists pursuing two types of use cases: (1) analysis of numerous (1000x) unpartitioned small to medium-sized Tika-parseable files, and (2) analysis of very large, partitionable Tika-parseable files. Given Tika's specialization in file extraction and Spark's specialization in parallelized computing, there is a need to explore the benefits of integration. We therefore investigate best practices that give data scientists tools to gain insight into a greater portion of the data formats in common use.

Year	DOI	Venue
2015	10.1109/IRI.2015.43	Information Reuse and Integration
Keywords	Field	DocType
Spark, Tika, Hadoop, cluster computing, distributed computing	Data mining, Data processing, Data set, Spark (mathematics), Use case, Computer science, Information extraction, Parsing, Analytics, Computer cluster	Conference
Citations	PageRank	References
0	0.34	4
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Rishi Verma	1	0	0.34
Chris A. Mattmann	2	200	25.39
