Abstract | ||
---|---|---|
State-of-the-art duplicate detection in semi-structured data obtains significant improvements by exploiting schema-related knowledge. Such schema-bound duplicate detection approaches, however, have severe limitations when dealing with multi-sourced, heterogeneous, high-velocity data streams. In this paper, we propose a novel context-aware duplicate detection system that is workload- and complexity-aware and adapts to the underlying computing platform. The system operates in a schema-oblivious manner and relies on an information-theory-based heuristic and a data shaping technique for efficient and scalable duplicate detection in multi-sourced, heterogeneous data sets. Experiments with real-world data sets show a speedup of up to 8X over state-of-the-art schemes while maintaining up to 92 percent accuracy. In addition, our data shaping technique for GPGPU processing improves duplicate detection throughput by up to two orders of magnitude. |
Year | DOI | Venue |
---|---|---|
2014 | 10.1109/SERVICES.2014.46 | Services |
Keywords | DocType | ISSN |
graphics processing units, ubiquitous computing, GPU processing, context-aware duplicate detection system, data shaping, heterogeneous data sets, high velocity data streams, information theory based heuristic, scalable duplicate detection, schema-bound duplicate detection, schema-related knowledge, semistructured data streams, GPUs, data streams, duplicate detection, novel architectures, semi-structured data | Conference | 2378-3818 |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Parijat Shukla | 1 | 0 | 0.34 |
Arun K. Somani | 2 | 20 | 5.28 |