Abstract |
---|
In this paper, we focus on techniques for merging the parallelized data-processing (i.e., map-reduce) capabilities of Apache Spark with the extensive file-type parsing support of Apache Tika. These two frameworks each hold unique appeal for data scientists: Spark makes the parallelized processing of very large, often text-based data sets highly efficient, while Tika provides consistent information extraction from over 1,200 text and binary file types on a sequential, per-file basis. The technical integration of these two frameworks is the subject of our investigation, and it is relevant for data scientists pursuing two types of use cases: (1) analysis of numerous (on the order of thousands of) un-partitioned, small- to medium-sized Tika-parseable files, and (2) analysis of very large, partitionable Tika-parseable files. Given Tika's niche specialization in file extraction and Spark's specialization in parallelized computing, there is a need to explore the benefits of integration. We therefore investigate best practices to empower data scientists with tools for gaining insight into a greater portion of the data formats commonly in use. |
Year | DOI | Venue
---|---|---
2015 | 10.1109/IRI.2015.43 | IEEE International Conference on Information Reuse and Integration (IRI)
Keywords | Field | DocType
---|---|---
Spark, Tika, Hadoop, cluster computing, distributed computing | Data mining, Data processing, Data set, Spark, Use case, Computer science, Information extraction, Parsing, Analytics, Computer cluster | Conference
Citations | PageRank | References
---|---|---
0 | 0.34 | 4
Authors |
---|
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Rishi Verma | 1 | 0 | 0.34 |
Chris A. Mattmann | 2 | 200 | 25.39 |