Title
Neural detection of semantic code clones via tree-based convolution
Abstract
Code clones are similar code fragments that share the same semantics but may differ syntactically to various degrees. Detecting code clones helps reduce the cost of software maintenance and prevent faults. Various approaches to detecting code clones have been proposed over the last two decades, but few of them can detect semantic clones, i.e., code clones with dissimilar syntax. Recent research has attempted to adopt deep learning for detecting code clones, such as using tree-based LSTM over the Abstract Syntax Tree (AST). However, such work does not fully leverage the structural information of code fragments, which limits its clone-detection capability. To fully unleash the power of deep learning for detecting code clones, we propose a new approach that uses tree-based convolution to detect semantic clones by capturing both the structural information of a code fragment from its AST and lexical information from code tokens. Additionally, our approach addresses the limitation that source code has an unlimited vocabulary of tokens and models, and thus exploiting lexical information from code tokens is often ineffective when dealing with unseen tokens. In particular, we propose a new embedding technique called position-aware character embedding (PACE), which essentially treats any token as a position-weighted combination of character one-hot embeddings. Our experimental results show that our approach substantially outperforms an existing state-of-the-art approach, with increases of 0.42 and 0.15 in F1-score on two popular code-clone benchmarks (OJClone and BigCloneBench), respectively, while being more computationally efficient. Our experimental results also show that PACE enables our approach to be substantially more effective when code clones contain unseen tokens.
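The abstract describes PACE only as a position-weighted combination of character one-hot embeddings. The sketch below illustrates that idea under stated assumptions: the character alphabet, the linearly decaying weight, and the names `ALPHABET`, `position_weight`, and `pace_embed` are hypothetical choices made for illustration and are not taken from this record.

```python
import string

# Hypothetical character vocabulary; the record does not specify one.
ALPHABET = string.ascii_letters + string.digits + "_"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def position_weight(i, n):
    """Assumed weighting: decays linearly from 1 (first char) to 1/n (last char)."""
    return (n - i) / n


def pace_embed(token):
    """Return a position-weighted sum of character one-hot vectors for a token."""
    vec = [0.0] * len(ALPHABET)
    n = len(token)
    for i, ch in enumerate(token):
        idx = CHAR_INDEX.get(ch)
        if idx is None:
            continue  # characters outside the assumed alphabet are skipped
        # Adding w * one_hot(ch) only changes the entry at the character's index.
        vec[idx] += position_weight(i, n)
    return vec


if __name__ == "__main__":
    # Tokens with the same characters in a different order get different vectors.
    print(pace_embed("ab") != pace_embed("ba"))
```

Because the vector is built from characters rather than looked up in a token vocabulary, a token never seen during training still receives an embedding of the same fixed length, which is the property the abstract attributes to PACE.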
Year
2019
DOI
10.1109/ICPC.2019.00021
Venue
Proceedings of the 27th International Conference on Program Comprehension
Keywords
AST, clone detection, embedding, generalization, lexical information, semantic clone, source code, structural information, token, tree-based convolution
Field
Data mining, Embedding, Source code, Computer science, Abstract syntax tree, Theoretical computer science, Artificial intelligence, Software maintenance, Deep learning, Code (cryptography), Security token, Semantics
DocType
Conference
ISSN
2643-7147
ISBN
978-1-7281-1520-7
Citations
6
PageRank
0.40
References
13
Authors
6
Name             Order  Citations  PageRank
Hao Yu           1      6          0.40
Wing Lam         2      172        8.81
Long Chen        3      7          0.74
Ge Li            4      469        30.57
Tao Xie          5      5978       304.97
Qianxiang Wang   6      346        31.05