Multi-label ASRS Dataset Classification Using Semi Supervised Subspace Clustering
Text classification has been studied extensively, and much of that work focuses on one particular characteristic of text data, multi-labelity: a document may be associated with multiple classes at the same time. Because of this characteristic, traditional binary and multi-class classification techniques perform poorly on multi-label text data. In this paper, we propose a text classification technique that takes this characteristic into account and achieves good performance on multi-label data. Our multi-label text classification approach extends our previously formulated (3) multi-class text classification approach, SISC (Semi-supervised Impurity based Subspace Clustering). We call the new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set shows that our approach outperforms state-of-the-art text classification and subspace clustering algorithms.
Based on the number of labels that can be associated with a document, text data sets fall into three broad categories: binary, multi-class and multi-label. In a binary data set, each data point (document) belongs to exactly one of two possible class labels. In a multi-class data set, more than two class labels are involved but, as with binary data, each data point is associated with only a single class label. In a multi-label data set, more than two class labels are involved and each data point may belong to more than one class label at the same time.
The NASA ASRS (Aviation Safety Reporting System) data set is a multi-label text data set. It consists of aviation safety reports that flight crews submit after completion of each flight. Each report describes the events that took place during a flight.
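The three categories of data sets described above differ only in how many class labels exist and how many of them a single document may carry. A minimal Python sketch, assuming each document's labels are stored as a set, makes the distinction concrete. The helper names `dataset_kind` and `to_indicator_matrix` and the class names are hypothetical illustrations; they are not part of SISC-ML and are not actual ASRS categories.

```python
def dataset_kind(label_sets):
    """Classify a data set by its labels (a simplification of the text's definitions)."""
    n_classes = len(set().union(*label_sets))
    # Any document carrying more than one label makes the data set multi-label.
    if any(len(labels) > 1 for labels in label_sets):
        return "multi-label"
    # Otherwise every document has a single label: two classes or more than two.
    return "binary" if n_classes <= 2 else "multi-class"


def to_indicator_matrix(label_sets, classes):
    """Encode each document's label set as a 0/1 row, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


# Hypothetical label sets for three documents (illustrative class names only).
multilabel_docs = [
    {"weather", "equipment"},
    {"equipment"},
    {"weather", "crew", "equipment"},
]

classes = sorted(set().union(*multilabel_docs))
matrix = to_indicator_matrix(multilabel_docs, classes)

print(dataset_kind(multilabel_docs))
print(classes)
for row in matrix:
    print(row)
```

The indicator matrix is one common way to present multi-label targets to a learner: each column can be read as a separate yes/no question about the document, which is why techniques built for a single label per document do not apply directly.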
Since ASRS is a multi-label data set, each report may belong to multiple class labels. Our objective is to propose a classification model that can successfully assign class labels to each report in the ASRS data set. The ASRS data set poses a number of challenges. First, the reports are written in plain English, entirely in uppercase letters, and usually contain quite a few technical terms and jargon. As a result, it is hard to distinguish acronyms from normal words. The usual challenges of classifying text data are also present, in particular very high and sparse dimensionality, which arises because the feature space consists of all the distinct words appearing in all the reports. One such report (with key parts boldfaced) is provided next, as an example.
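Separately from the example report, the dimensionality challenge noted above can be sketched in a few lines of Python. This is a minimal bag-of-words illustration, assuming whitespace tokenization. The three uppercase snippets are invented for illustration and are not ASRS reports, and the helper names `build_vocabulary` and `to_count_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(reports):
    """The feature space: every distinct word appearing in any report."""
    return sorted({word for report in reports for word in report.split()})


def to_count_vector(report, vocab):
    """Represent one report as word counts over the whole vocabulary."""
    counts = Counter(report.split())
    return [counts[word] for word in vocab]


# Invented snippets in the all-uppercase style described in the text.
reports = [
    "ACFT DEPARTED RWY WITHOUT CLRNC",
    "ENG FIRE WARNING LIGHT ON CLB",
    "TWR CLRNC MISUNDERSTOOD BY CREW",
]

vocab = build_vocabulary(reports)
vectors = [to_count_vector(report, vocab) for report in reports]

# Count the non-zero entries to see how sparse the document-term matrix is.
nonzero = sum(1 for vec in vectors for value in vec if value != 0)
total = len(vectors) * len(vocab)
sparsity = 1 - nonzero / total

print(f"{len(vocab)} dimensions, {nonzero} of {total} entries non-zero")
print(f"sparsity = {sparsity:.2f}")
```

Even with three short snippets, most entries of each vector are zero, because every report uses only a small part of the vocabulary. With thousands of reports the vocabulary grows much faster than the length of any single report, which is the high and sparse dimensionality the text refers to.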
multi-class classification, feature space, English language
Data mining, Feature vector, Data set, Aviation Safety Reporting System, Computer science, Curse of dimensionality, Data type, Binary data, Plain English, Binary number
Mohammad Salim Ahmed
Latifur Khan
Nikunj C. Oza
Mandava Rajeswari