Title
A Probabilistic Approach to Source Code Authorship Identification
Abstract
There exists a need for tools to help identify the authorship of source code. This includes situations in which the ownership of code is questionable, such as in plagiarism or intellectual property infringement disputes. Authorship identification can also be used to assist in the apprehension of the creators of malware. In this paper we present an approach to identifying the authors of source code. We begin by computing a set of metrics to build profiles for a population of known authors using code samples that are verified to be authentic. We then compute metrics on unidentified source code to determine the closest matching profile. We demonstrate our approach on a case study that involves two kinds of software: one based on open source developers working on various projects, and another based on students working on assignments with the same requirements. In our case study we are able to determine authorship with greater than 70% accuracy in choosing the single nearest match and greater than 90% accuracy in choosing the top three ordered nearest matches.
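The abstract describes two steps: building a profile per known author from verified samples, then ranking those profiles against an unidentified sample. It does not name the metrics or the matching rule. The sketch below is therefore only an illustration under stated assumptions, not the authors' method: it uses character trigram frequencies as the metric set and cosine similarity as the closeness measure. The function names (`ngram_profile`, `cosine_similarity`, `rank_authors`) and the toy samples are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_profile(samples, n=3):
    """Normalized character n-gram frequencies over a list of code samples.

    Assumption: character n-grams stand in for the unspecified metrics.
    """
    counts = Counter()
    for text in samples:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def rank_authors(profiles, unknown_code, k=3, n=3):
    """Return up to k author names, ordered from closest to farthest profile."""
    target = ngram_profile([unknown_code], n)
    ordered = sorted(
        profiles,
        key=lambda author: cosine_similarity(profiles[author], target),
        reverse=True,
    )
    return ordered[:k]


# Hypothetical toy data: two authors with different spacing habits.
known = {
    "alice": ["for (int i = 0; i < n; i++) { sum += a[i]; }"],
    "bob": ["for(int idx=0;idx<n;++idx){total+=arr[idx];}"],
}
profiles = {author: ngram_profile(samples) for author, samples in known.items()}
ranking = rank_authors(profiles, "for (int j = 0; j < m; j++) { acc += b[j]; }")
```

Taking `ranking[0]` corresponds to the single nearest match, while the full list returned with `k=3` corresponds to the top three ordered nearest matches mentioned in the abstract.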
Year
2007
DOI
10.1109/ITNG.2007.17
Venue
ITNG
Keywords
source code authorship identification, intellectual property infringement dispute, single nearest match, source code, probabilistic approach, authorship identification, unidentified source code, open source developer, code sample, nearest match, case study, closest matching profile, databases, pattern matching, malware, authorisation, filtering, law, intellectual property, computer viruses, computer science
Field
Data mining, Population, Existential quantification, Source code, Computer science, Computer virus, Software, Probabilistic logic, Malware, Pattern matching
DocType
Conference
ISBN
0-7695-2776-0
Citations
9
PageRank
0.64
References
2
Authors
4
Name               Order  Citations  PageRank
Jay Kothari        1      44         2.82
Maxim Shevertalov  2      40         3.56
Edward Stehle      3      24         2.03
Spiros Mancoridis  4      888        56.82