Title
A framework for semisupervised feature generation and its applications in biomedical literature mining.
Abstract
Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, the AIMED corpus, and the TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three tasks; and 3) our methods achieve F-scores of 89.1 and 64.5 and a normalized utility of 60.1 on the three benchmark data sets.
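The coupling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the coupling degree is measured, so this example assumes pointwise mutual information (PMI) over document-level co-occurrence. The function name `coupling_features`, the toy documents, and the choice of 0.0 for pairs that never co-occur are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product


def coupling_features(unlabeled_docs, edfs, cdfs):
    """Map each EDF to a list of PMI scores, one per CDF.

    unlabeled_docs: iterable of sets of feature strings (one set per document).
    edfs, cdfs: lists of feature strings.
    PMI is an assumed stand-in for the paper's "coupling degree".
    """
    n = 0
    edf_count = Counter()
    cdf_count = Counter()
    pair_count = Counter()
    for doc in unlabeled_docs:
        n += 1
        present_e = [e for e in edfs if e in doc]
        present_c = [c for c in cdfs if c in doc]
        edf_count.update(present_e)
        cdf_count.update(present_c)
        # Count every (EDF, CDF) pair that co-occurs in this document.
        pair_count.update(product(present_e, present_c))

    features = {}
    for e in edfs:
        row = []
        for c in cdfs:
            joint = pair_count[(e, c)]
            if joint == 0:
                # Assumption: pairs that never co-occur get a score of 0.0.
                row.append(0.0)
            else:
                row.append(math.log(joint * n / (edf_count[e] * cdf_count[c])))
        features[e] = row
    return features


# Toy unlabeled corpus: each document is a set of features.
docs = [
    {"p53", "gene"},
    {"p53", "gene"},
    {"aspirin", "drug"},
    {"aspirin", "drug"},
]
generalized = coupling_features(docs, edfs=["p53", "aspirin"], cdfs=["gene", "drug"])
```

Each sparse EDF is replaced by a dense vector of scores against the CDFs, so a supervised learner could use the vector even when the EDF itself rarely appears in labeled data.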
Year
2011
DOI
10.1109/TCBB.2010.99
Venue
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Keywords
protein-protein interaction extraction, feature representation, supervised learning, semisupervised feature generation, feature coupling generalization, semisupervised learning, named entity recognition, class-distinguishing features, class-distinguishing feature, learning (artificial intelligence), genetics, example-distinguishing features, example-distinguishing feature, biomedical literature mining, unlabeled data, generalizes edfs, proteins, text classification, biocreative 2, gb unlabeled pubmed abstract, benchmark data set, biology computing, aimed corpus, gene ontology, low-frequency feature, trec 2005 genomics track, data mining, new feature, gene named entity recognition, computational biology, text mining, feature extraction, couplings, bioinformatics, machine learning, protein engineering, protein protein interaction, low frequency, learning artificial intelligence
DocType
Journal
Volume
8
Issue
2
ISSN
1545-5963
Citations
4
PageRank
0.43
References
23
Authors
4
Name          Order  Citations  PageRank
Yanpeng Li    1      49         2.60
Xiaohua Hu    2      2819       314.15
Hongfei Lin   3      768        122.52
Zhihao Yang   4      270        36.04