Title
Learning URL Normalization Rules Using Multiple Alignment of Sequences
Abstract
In this work, we present DUSTER, a new approach to detecting and eliminating redundant content when crawling the web. DUSTER uses a multi-sequence alignment strategy to learn rewriting rules that transform URLs into other URLs likely to have similar content, where such URLs exist. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs 54% larger than the one achieved by our best baseline.
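The idea of aligning URLs to expose their variable parts can be illustrated with a minimal sketch. This is not DUSTER's algorithm: it assumes a cluster of URLs already known to share content, splits each URL into tokens, and folds pairwise alignments from Python's `difflib` over the cluster as a rough stand-in for multiple sequence alignment. The function names (`tokenize`, `align_pair`, `generalize`, `to_regex`) and the example URLs are hypothetical.

```python
import re
from difflib import SequenceMatcher

def tokenize(url):
    # Split on runs of common URL delimiters, keeping the delimiters as tokens.
    return [t for t in re.split(r"([/:?&=.\-_]+)", url) if t]

def align_pair(a, b):
    # Keep tokens on which both sequences agree; collapse every
    # disagreement (replace, insert, delete) into one wildcard (None).
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
        elif not out or out[-1] is not None:
            out.append(None)
    return out

def generalize(urls):
    # Progressive alignment: fold each URL into the running pattern.
    pattern = tokenize(urls[0])
    for url in urls[1:]:
        pattern = align_pair(pattern, tokenize(url))
    return pattern

def to_regex(pattern):
    # Literal tokens are escaped; wildcards match any run of characters.
    body = "".join(".*" if t is None else re.escape(t) for t in pattern)
    return "^" + body + "$"

# Hypothetical cluster of URLs assumed to return the same content.
cluster = [
    "http://example.com/story?sid=1",
    "http://example.com/story?sid=2",
    "http://example.com/story?sid=37",
]
rule = to_regex(generalize(cluster))
```

In this sketch the resulting pattern keeps every token the three URLs share and replaces the differing `sid` value with a wildcard, so `re.match(rule, url)` would flag other URLs of the same shape as candidates for normalization. A real system would also need evidence that the wildcarded part does not affect content.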
Year
2013
DOI
10.1007/978-3-319-02432-5_23
Venue
SPIRE
Field
Regular expression, Crawling, Information retrieval, Computer science, URL normalization, Rewriting, Multiple sequence alignment
DocType
Conference
Volume
8214
ISSN
0302-9743
Citations
1
PageRank
0.37
References
11
Authors
4
Name, Order, Citations, PageRank
Kaio Wagner Lima Rodrigues, 1, 1, 0.37
Marco Cristo, 2, 618, 39.30
Edleno Silva de Moura, 3, 988, 75.44
Altigran Soares da Silva, 4, 718, 65.15