On how often code is cloned across repositories - Citegraph

Paper Info

Title
On how often code is cloned across repositories

Abstract
Detecting code duplication in large code bases, or even across project boundaries, is problematic due to the massive amount of data involved. Large-scale clone detection also opens new challenges beyond asking for the provenance of a single clone fragment, such as assessing the prevalence of code clones on the entire code base, and their evolution. We propose a set of lightweight techniques that may scale up to very large amounts of source code in the presence of multiple versions. The common idea behind these techniques is to use bad hashing to get a quick answer. We report on a case study, the Squeaksource ecosystem, which features thousands of software projects, with more than 40 million versions of methods, across more than seven years of evolution. We provide estimates for the prevalence of type-1, type-2, and type-3 clones in Squeaksource.
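
The hashing idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the normalization rules, the `(project, method_name, source)` tuple layout, and the function names `normalize_type1`, `normalize_type2`, and `clone_groups` are all assumptions made for illustration. It hashes each method after a lossy normalization, so methods that differ only in layout (type-1) or also in identifiers and literals (type-2) land in the same bucket.

```python
import hashlib
import re
from collections import defaultdict


def normalize_type1(source: str) -> str:
    """Collapse whitespace so layout differences do not change the hash."""
    return " ".join(source.split())


def normalize_type2(source: str) -> str:
    """Crude assumption: replace identifier-like tokens and numbers with placeholders."""
    text = re.sub(r"[A-Za-z_]\w*", "ID", source)
    text = re.sub(r"\d+", "NUM", text)
    return " ".join(text.split())


def clone_groups(methods, normalize):
    """Group methods by the hash of their normalized source.

    methods: iterable of (project, method_name, source) tuples (hypothetical layout).
    Returns only the groups whose members come from more than one project.
    """
    buckets = defaultdict(set)
    for project, name, source in methods:
        digest = hashlib.sha1(normalize(source).encode("utf-8")).hexdigest()
        buckets[digest].add((project, name))
    return [
        sorted(members)
        for members in buckets.values()
        if len({project for project, _ in members}) > 1
    ]


# Made-up sample data, not taken from Squeaksource.
sample = [
    ("A", "foo", "x := 1.\n ^x"),
    ("B", "bar", "x  :=  1. ^x"),
    ("C", "baz", "y := 2. ^y"),
]
type1 = clone_groups(sample, normalize_type1)
type2 = clone_groups(sample, normalize_type2)
```

A deliberately lossy normalization trades precision for speed: a single pass over all method versions yields candidate clone groups, which a slower, more exact comparison could then confirm.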

Year	DOI	Venue
2012	10.1109/ICSE.2012.6227097	ICSE
Keywords	Field	DocType
single clone fragment,type-3 clone,large-scale clone detection,code clone,squeaksource ecosystem,source code,detecting code duplication,entire code base,large amount,large code base,software maintenance,project management,ecosystems,indexes,layout,cloning,code base	Duplicate code,Information retrieval,Source code,Computer science,Real-time computing,Software,ACROSS Project,Hash function,Software maintenance,Database,Project management	Conference
Volume	ISSN	ISBN
2	0270-5257	978-1-4673-1067-3
Citations	PageRank	References
21	0.83	17
Authors
3

Authors (3 rows)


Name	Order	Citations	PageRank
Niko Schwarz	1	39	4.07
Mircea Lungu	2	545	39.17
Romain Robbes	3	1438	73.40
