Learning layouts of biological datasets semi-automatically - Citegraph

Paper Info

Title
Learning layouts of biological datasets semi-automatically

Abstract
A key challenge associated with the existing approaches for data integration and workflow creation for bioinformatics is the effort required to integrate a new data source. As new data sources emerge, and data formats and contents of existing data sources evolve, wrapper programs need to be written or modified. This can be extremely time-consuming, tedious, and error-prone. This paper describes our semi-automatic approach for learning the layout of a flat-file bioinformatics dataset. Our approach involves three key steps. The first step is to use a number of heuristics to infer the delimiters used in the dataset. Specifically, we have developed a metric that uses information on the frequency and starting position of sequences. Based on this metric, we are able to find a superset of delimiters, and then we can seek user input to eliminate the incorrect ones. Our second step involves generating a layout descriptor based on the relative order in which the delimiters occur. Our final step is to generate a parser based on the layout descriptor. Our heuristics for finding the delimiters have been evaluated using three popular flat-file biological datasets.
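
The three steps named in the abstract can be illustrated with a rough sketch. This is not the authors' metric or implementation: the scoring rule (fraction of lines whose first token starts at column 0), the function names, the `min_fraction` threshold, the user-supplied `record_end` marker, and the tiny Swiss-Prot-like sample are all assumptions made for illustration only.

```python
from collections import Counter
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical flat-file sample, loosely shaped like a Swiss-Prot entry.
SAMPLE = [
    "ID   A1", "AC   P1", "DE   first protein", "SQ   SEQUENCE", "     MKV", "//",
    "ID   B2", "AC   P2", "DE   second protein", "SQ   SEQUENCE", "     MAA", "//",
]


def leading_token(line):
    """Return the first token if it starts at column 0, else None."""
    if not line or line[0].isspace():
        return None
    return line.split()[0]


def candidate_delimiters(lines, min_fraction=0.1):
    """Step 1 (assumed heuristic): keep tokens that frequently start a line."""
    counts = Counter(t for t in map(leading_token, lines) if t is not None)
    total = sum(counts.values())
    return {t for t, n in counts.items() if total and n / total >= min_fraction}


def delimiter_order(lines, delimiters, record_end):
    """Step 2: order delimiters by their relative order of occurrence."""
    predecessors = {d: set() for d in delimiters}
    previous = None
    for line in lines:
        token = leading_token(line)
        if token not in delimiters:
            continue
        if previous is not None and previous != token:
            predecessors[token].add(previous)
        # Reset at the record boundary so the next record adds no cycle.
        previous = None if token == record_end else token
    return list(TopologicalSorter(predecessors).static_order())


def parse_records(lines, delimiters, record_end):
    """Step 3: a hand-written stand-in for a generated parser."""
    records, current, field = [], {}, None
    for line in lines:
        token = leading_token(line)
        if token == record_end:
            records.append(current)
            current, field = {}, None
        elif token in delimiters:
            field = token
            current.setdefault(field, []).append(line[len(token):].strip())
        elif field is not None and line.strip():
            # Indented continuation line: attach to the most recent field.
            current[field].append(line.strip())
    if current:
        records.append(current)
    return records
```

In this sketch the candidate set plays the role of the "superset of delimiters"; a user would prune it and name the record terminator before the ordering and parsing steps run.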

Year	DOI	Venue
2005	10.1007/11530084_5	DILS
Keywords	Field	DocType
final step,key challenge,flat-file bioinformatics dataset,layout descriptor,key step,existing approach,data integration,biological datasets semi-automatically,new data source,data format,data source,data integrity	Data integration,Data source,Data mining,Subset and superset,Topological sorting,Computer science,Heuristics,Parsing,Workflow,Delimiter,Database	Conference
Volume	ISSN	ISBN
3615	0302-9743	3-540-27967-9
Citations	PageRank	References
3	0.39	22
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Kaushik Sinha	1	244	17.81
Xuan Zhang	2	110	18.58
Ruoming Jin	3	1637	91.73
Gagan Agrawal	4	2058	209.59
