Title: Limits of Detecting Text Generated by Large-Scale Language Models
Abstract: Some consider large-scale language models that can generate long and coherent pieces of text dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated. We show that error exponents for particular language models are bounded in terms of their perplexity, a standard measure of language generation performance. Under the assumption that human language is stationary and ergodic, the formulation is extended from considering specific language models to considering maximum likelihood language models, among the class of k-order Markov approximations; error probabilities are characterized. Some discussion of incorporating semantic side information is also given.
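As a brief illustration (a sketch in assumed notation, not reproduced from the paper itself): writing P for the true human-language distribution and Q for the language model, the detection problem described in the abstract is the binary hypothesis test

\[
H_0 : x_1^n \sim P \ \text{(genuine)} \qquad \text{vs.} \qquad H_1 : x_1^n \sim Q \ \text{(generated)},
\]

for which the Neyman-Pearson detector thresholds the normalized log-likelihood ratio

\[
\frac{1}{n} \log \frac{Q(x_1^n)}{P(x_1^n)} \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \tau.
\]

The achievable error exponents of such a test are governed by the divergences $D(P \| Q)$ and $D(Q \| P)$, and the connection to perplexity follows from the identity $H(P, Q) = H(P) + D(P \| Q)$ together with $\mathrm{PPL}(Q) = 2^{H(P, Q)}$ (cross-entropy in bits per token): a model whose perplexity approaches the entropy rate of human language makes the divergence, and hence the achievable error exponent, small.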
Year: 2020
DOI: 10.1109/ITA50056.2020.9245012
Venue: 2020 Information Theory and Applications Workshop (ITA)
Keywords: large-scale language model output detection, language generation performance, human language, maximum likelihood language models, text detection, k-order Markov approximations, error probabilities, semantic side information
DocType: Conference
ISSN: 2641-8150
ISBN: 978-1-7281-8825-6
Citations: 0
PageRank: 0.34
References: 12
Authors: 3
Name                    Order  Citations  PageRank
Varshney Lav R.         1      0          0.34
Nitish Shirish Keskar   2      325        16.71
Richard Socher          3      67702      30.61