Title
Sequence Similarity by Gapped LZW
Abstract
Measures of sequence similarity based on some underlying notion of relative compressibility are becoming increasingly of interest in connection with massive tasks of textfile classification such as, notably, in document classification and molecular taxonomy on a genomic scale. Sequences that are similar can be expected to share a large number of common substrings, whence some successful measures in this class have been based on the substring composition of the input sequences. Among the corresponding methods, one finds suitable extensions of the bag-of-words together with more explicit resorts to data compression techniques such as LZ77. The approach presented in this paper explores the potential of LZW - the variant of LZ78 proposed by Welch -- as well as of some of its lossy variants, in this context. Whereas LZW has a faster and simpler implementation than LZ77, the vocabulary underlying LZW is significantly smaller than that of LZ77. In addition, recently introduced "gapped" variants of LZW are considered that are equally straightforward to implement but allow for a controlled number of don't cares to be introduced in the substrings that constitute the dictionary used in compression. This study assesses the robustness of compressibility based measures of similarity under these faster yet inherently more dispersive paradigms built around LZW.
Year
DOI
Venue
2011
10.1109/DCC.2011.41
DCC
Keywords
Field
DocType
relative compressibility,common substrings,sequence similarity,data compression technique,data compression,textfile classification,document classification,encoding,large number,underlying notion,controlled number,gapped lzw,lz78,lz77,corresponding method,molecular taxonomy,bag-of-words,lempel-ziv-welchcompression,bag of words,length measurement,dictionaries
Bag-of-words model,Document classification,Substring,Lossy compression,Computer science,Theoretical computer science,Robustness (computer science),LZ77 and LZ78,Data compression,Encoding (memory)
Conference
ISSN
ISBN
Citations 
1068-0314
978-1-61284-279-0
2
PageRank 
References 
Authors
0.40
14
2
Name
Order
Citations
PageRank
Alberto Apostolico11441182.20
Fabio Cunial2729.68