Abstract | ||
---|---|---|
In this work, we present DUSTER, a new approach to detecting and eliminating redundant content when crawling the web. DUSTER uses a multi-sequence alignment strategy to learn rewriting rules that transform URLs into other URLs likely to have similar content, where such URLs exist. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs 54% larger than the one achieved by our best baseline. |
Year | DOI | Venue |
---|---|---|
2013 | 10.1007/978-3-319-02432-5_23 | SPIRE |
Field | DocType | Volume |
Regular expression, Crawling, Information retrieval, Computer science, URL normalization, Rewriting, Multiple sequence alignment | Conference | 8214 |
ISSN | Citations | PageRank |
0302-9743 | 1 | 0.37 |
References | Authors | |
11 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Kaio Wagner Lima Rodrigues | 1 | 1 | 0.37 |
Marco Cristo | 2 | 618 | 39.30 |
Edleno Silva de Moura | 3 | 988 | 75.44 |
Altigran Soares da Silva | 4 | 718 | 65.15 |
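
The abstract describes rewriting rules that map URLs to other URLs likely to have the same content. The sketch below illustrates only how such rules might be applied once they are available. It is not DUSTER's learning procedure, and it involves no multi-sequence alignment. The rules, the `normalize` and `deduplicate` helpers, and the example URLs are all hypothetical, chosen only to show the idea of collapsing duplicate URLs into one canonical form.

```python
import re

# Hypothetical, hand-written rewriting rules as (pattern, replacement) pairs.
# These are illustrative assumptions, not rules learned by DUSTER.
RULES = [
    (re.compile(r"^http://www\.(.+)$"), r"http://\1"),   # drop a leading "www."
    (re.compile(r"^(.+)/index\.html$"), r"\1/"),         # drop a default page name
    (re.compile(r"^(.+?)[?&]sessionid=\w+$"), r"\1"),    # drop a trailing session id
]


def normalize(url):
    """Apply every rule in order and return the resulting canonical URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


def deduplicate(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = {}
    for url in urls:
        seen.setdefault(normalize(url), url)
    return list(seen.values())


if __name__ == "__main__":
    crawl_frontier = [
        "http://www.example.com/a/index.html",
        "http://example.com/a/",
        "http://example.com/a/?sessionid=abc123",
        "http://example.com/b",
    ]
    for url in deduplicate(crawl_frontier):
        print(url)
```

In this toy input, the first three URLs share one canonical form, so a crawler using these rules would fetch only one of them. In practice, a learned rule can be wrong for some sites, so rewritten URLs are only likely, not guaranteed, to have the same content.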