Title
Table Compression by Record Intersections
Abstract
Saturated patterns with don't cares, such as those that emerged in biosequence motif discovery, have also proven valuable in the design of lossless and lossy compression of sequence data. Independently, the peculiarities of table compression have been examined, leading to compression schemata that hinge on a prudent rearrangement of columns. The present paper introduces off-line table compression by textual substitution, in which the patterns used in compression are chosen among models, or templates, that capture recurrent record subfields. A model record is interpreted here as a sequence of intermixed solid and don't care characters that obeys, in addition, some conditions of saturation: most notably, it must not be possible to replace a don't care in the model by a solid character without forfeiting some of its occurrences in the table. Saturation is expected to reduce the size of the codebook at the outset, and hence to improve compression. It also induces some clustering of the records in the table, which may be of independent interest. Results from preliminary experiments show the savings and the potential for classification brought about by this method on a table of specimens collected in the context of biodiversity studies.
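The notions of record intersection and saturation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the wildcard symbol `*`, the function names `intersect`, `occurrences`, and `saturate`, and the toy table are all assumptions made for illustration, and records are assumed to be equal-length tuples of fields that never contain the wildcard itself.

```python
DONT_CARE = "*"  # assumed wildcard symbol; must not appear as a field value


def intersect(a, b):
    """Field-wise intersection of two equal-length records.

    Fields on which the records agree stay solid; all others become don't cares.
    """
    return tuple(x if x == y else DONT_CARE for x, y in zip(a, b))


def occurrences(model, table):
    """Indices of the records in `table` that match `model`."""
    return [
        i
        for i, record in enumerate(table)
        if all(m == DONT_CARE or m == f for m, f in zip(model, record))
    ]


def saturate(model, table):
    """Turn a don't care into a solid field wherever all occurrences agree.

    Such a replacement loses no occurrence, so the result cannot be made
    more specific without forfeiting at least one matching record.
    """
    occ = occurrences(model, table)
    if not occ:
        return model
    out = list(model)
    for j, m in enumerate(model):
        if m == DONT_CARE:
            values = {table[i][j] for i in occ}
            if len(values) == 1:
                out[j] = values.pop()
    return tuple(out)


# Hypothetical toy table (genus, species, region, year); not data from the paper.
table = [
    ("Quercus", "alba", "GA", "2001"),
    ("Quercus", "rubra", "GA", "2001"),
    ("Quercus", "alba", "FL", "2003"),
    ("Pinus", "taeda", "GA", "2001"),
]

model = intersect(table[0], table[1])
loose = ("Quercus", DONT_CARE, DONT_CARE, "2001")
tight = saturate(loose, table)
```

Here `model` is `("Quercus", "*", "GA", "2001")` and matches records 0 and 1. The looser pattern `loose` matches the same two records, so saturating it fills in the region field and yields the same model. Under these assumptions, the records matching one saturated model form a natural cluster, and only the fields under don't cares need to be stored per record once the model is in the codebook.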
Year
2008
DOI
10.1109/DCC.2008.105
Venue
DCC
Keywords
lossy compression, intrecord, model record, present paper, biodiversity context, sequence data, data sequence, independent interest, off-line table compression, data compression, record intersections, codebook, solid character, codes, capture recurrent record subfields, pattern saturation, table compression, saturated record, biosequence motif discovery, independent endeavor, compression schemata, care character, sequences, solid modeling, feature extraction, biodiversity
Field
Data mining, Biosequence, Lossy compression, Computer science, Algorithm, Feature extraction, Theoretical computer science, Solid modeling, Cluster analysis, Data compression, Codebook, Lossless compression
DocType
Conference
ISSN
1068-0314
ISBN
978-0-7695-3121-2
Citations
6
PageRank
0.46
References
16
Authors
3
Name                Order  Citations  PageRank
Alberto Apostolico  1      1441       182.20
Fabio Cunial        2      72         9.68
Vineith Kaul        3      6          0.46