Title
Batch-mode active learning for technology-assisted review
Abstract
In recent years, technology-assisted review (TAR) has become an increasingly important component of the document review process in litigation discovery. This is fueled largely by the dramatic growth in data volumes associated with many matters and investigations. Potential review populations frequently exceed several hundred thousand documents, and document counts in the millions are not uncommon. Budgetary and/or time constraints often make the once-traditional linear review of these populations impractical, if not impossible, which has made "predictive coding" the most discussed TAR approach in recent years. A key challenge in any predictive coding approach is striking the appropriate balance in training the system. The goal is to minimize the time that Subject Matter Experts spend training the system while ensuring that they perform enough training to achieve acceptable classification performance over the entire review population. Recent research demonstrates that Support Vector Machines (SVMs) perform very well in finding a compact, yet effective, training dataset in an iterative fashion using batch-mode active learning. However, this research is limited. Additionally, these efforts have not led to a principled approach for determining the stabilization of the active learning process. In this paper, we propose and compare several batch-mode active learning methods that are integrated within the SVM learning algorithm. We also propose methods for determining the stabilization of the active learning method. Experimental results on a set of large-scale, real-life legal document collections validate the superiority of our method over the existing methods for this task.
Year: 2015
DOI: 10.1109/BigData.2015.7363867
Venue: Big Data
Field: Population, Data mining, Active learning, Active learning (machine learning), Computer science, Subject-matter expert, Support vector machine, Predictive coding, Artificial intelligence, Batch processing, Machine learning
DocType: Conference
Citations: 1
PageRank: 0.35
References: 28
Authors: 5
Name                Order  Citations  PageRank
Tanay Kumar Saha    1      44         5.07
Mohammad Al Hasan   2      427        35.08
Chandler Burgess    3      1          0.35
Md. Ahsan Habib     4      33         4.49
Jeff Johnson        5      21         4.67