Title
Data cleansing and preparation for moving toward electronic library repository
Abstract
Manually annotated metadata often contain typing errors; however, correcting them manually can be costly and time-consuming. This paper proposes a framework to ease the metadata correction process: a system that uses OCR and NLP techniques to automatically extract metadata from document images. The system first converts images into text using OCR and then extracts metadata from the OCR results. The extracted metadata are then compared with the data in the existing repository to locate erroneous entries. These entries are displayed to users, who correct them using supporting information. Although human judgment is still required to correct the errors, this manual step is needed only for the erroneous entries. Experimental results on 3,712 thesis abstracts show that the proposed solution can automatically extract the relevant information with 91.41% accuracy.
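The comparison step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes OCR and metadata extraction have already produced a dictionary of fields per record, and the record IDs, field names, sample values, and the `find_error_entries` helper are all hypothetical. String similarity from Python's standard `difflib` stands in for whatever matching the original system uses.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def find_error_entries(repository, extracted, threshold=1.0):
    """Compare stored metadata with OCR-extracted metadata.

    Returns a list of (record_id, field, stored, extracted, similarity)
    tuples for every field whose similarity falls below `threshold`.
    """
    errors = []
    for record_id, stored_fields in repository.items():
        ocr_fields = extracted.get(record_id, {})
        for field, stored_value in stored_fields.items():
            if field not in ocr_fields:
                continue  # nothing was extracted for this field, so skip it
            similarity = SequenceMatcher(
                None, normalize(stored_value), normalize(ocr_fields[field])
            ).ratio()
            if similarity < threshold:
                errors.append(
                    (record_id, field, stored_value, ocr_fields[field], similarity)
                )
    return errors


# Hypothetical sample data: one record carries a typo in the stored title.
repository = {
    "t001": {"title": "Rice Yeild Forecasting", "author": "A. Author"},
    "t002": {"title": "Soil Moisture Mapping", "author": "B. Author"},
}
extracted = {
    "t001": {"title": "Rice Yield Forecasting", "author": "A. Author"},
    "t002": {"title": "Soil  Moisture mapping", "author": "B. Author"},
}

errors = find_error_entries(repository, extracted)
for record_id, field, stored, ocr_value, similarity in errors:
    print(f"{record_id}.{field}: stored={stored!r} extracted={ocr_value!r}")
```

Only the flagged entries would then be shown to a human reviewer, which mirrors the abstract's point that manual effort is spent on the erroneous entries alone. Lowering `threshold` would tolerate small OCR noise at the cost of missing some genuine typos.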
Year
2005
DOI
10.1007/11599517_69
Venue
ICADL
Keywords
electronic library repository, system firstly, proposed solution, ocr result, error entry, metadata correction, supporting information, extracts metadata, utilizes ocr, relevance information, annotated metadata, data cleansing
Field
Data warehouse, Metadata, Metadata repository, Data mining, Data cleansing, Information retrieval, Character recognition, Computer science, Data element, Optical character recognition, Error detection and correction
DocType
Conference
Volume
3815
ISSN
0302-9743
ISBN
3-540-30850-4
Citations
0
PageRank
0.34
References
1
Authors
1
Name
Order
Citations
PageRank
asanee kawtrakul116125.90