Title
Learning layouts of biological datasets semi-automatically
Abstract
A key challenge associated with existing approaches to data integration and workflow creation in bioinformatics is the effort required to integrate a new data source. As new data sources emerge, and the formats and contents of existing data sources evolve, wrapper programs need to be written or modified. This can be extremely time-consuming, tedious, and error-prone. This paper describes our semi-automatic approach for learning the layout of a flat-file bioinformatics dataset. Our approach involves three key steps. The first step is to use a number of heuristics to infer the delimiters used in the dataset. Specifically, we have developed a metric that uses information on the frequency and starting position of sequences. Based on this metric, we are able to find a superset of the delimiters, and we then seek user input to eliminate the incorrect ones. Our second step involves generating a layout descriptor based on the relative order in which the delimiters occur. Our final step is to generate a parser based on the layout descriptor. Our heuristics for finding the delimiters have been evaluated using three popular flat-file biological datasets.
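The first two steps can be illustrated with a minimal sketch. It is not the paper's metric or layout descriptor, which the abstract does not specify. It assumes a toy score (token frequency multiplied by how consistently the token starts at the same column), an arbitrary threshold, a made-up sample in the style of a flat-file record, and the assumption that the first delimiter seen opens every record. All function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def score_tokens(lines):
    """Score each whitespace-separated token by frequency and start-column consistency."""
    columns = defaultdict(Counter)  # token -> Counter of starting columns
    for line in lines:
        pos = 0
        for token in line.split():
            pos = line.index(token, pos)  # starting column of this occurrence
            columns[token][pos] += 1
            pos += len(token)
    scores = {}
    for token, cols in columns.items():
        count = sum(cols.values())
        frequency = count / len(lines)                   # occurrences per line
        consistency = cols.most_common(1)[0][1] / count  # share at the usual column
        scores[token] = frequency * consistency          # assumed toy metric
    return scores


def candidate_delimiters(lines, threshold):
    """Return a superset of likely delimiters: tokens scoring at least `threshold`."""
    return {t for t, s in score_tokens(lines).items() if s >= threshold}


def delimiter_order(lines, delimiters):
    """Topologically sort delimiters by the order in which they follow each other."""
    graph = {d: set() for d in delimiters}  # delimiter -> set of predecessors
    first = previous = None
    for line in lines:
        for token in line.split():
            if token not in delimiters:
                continue
            if first is None:
                first = token
            if token == first:
                previous = None  # assumption: the first delimiter starts a new record
            if previous is not None and previous != token:
                graph[token].add(previous)
            previous = token
    return list(TopologicalSorter(graph).static_order())


# Made-up sample data, loosely in the style of a two-letter-code flat file.
SAMPLE = """\
ID   P1
AC   A001
DE   kinase alpha
SQ   MKV
//
ID   P2
AC   A002
DE   kinase beta
SQ   MAL
//
""".splitlines()

superset = candidate_delimiters(SAMPLE, threshold=0.15)
confirmed = superset - {"kinase"}  # stand-in for the user rejecting a false positive
print(delimiter_order(SAMPLE, confirmed))  # ['ID', 'AC', 'DE', 'SQ', '//']
```

In this sample the repeated word `kinase` scores as high as the real line codes, so it lands in the candidate superset. That is the kind of false positive the user-feedback step is meant to remove. The ordering step raises `graphlib.CycleError` if the observed orderings conflict, which a fuller implementation would need to handle.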
Year
2005
DOI
10.1007/11530084_5
Venue
DILS
Keywords
final step, key challenge, flat-file bioinformatics dataset, layout descriptor, key step, existing approach, data integration, biological datasets semi-automatically, new data source, data format, data source, data integrity
Field
Data integration, Data source, Data mining, Subset and superset, Topological sorting, Computer science, Heuristics, Parsing, Workflow, Delimiter, Database
DocType
Conference
Volume
3615
ISSN
0302-9743
ISBN
3-540-27967-9
Citations
3
PageRank
0.39
References
22
Authors
4
Name, Order, Citations, PageRank
Kaushik Sinha, 1, 244, 17.81
Xuan Zhang, 2, 110, 18.58
Ruoming Jin, 3, 1637, 91.73
Gagan Agrawal, 4, 2058, 209.59