Neural detection of semantic code clones via tree-based convolution - Citegraph

Paper Info

Title
Neural detection of semantic code clones via tree-based convolution

Abstract
Code clones are similar code fragments that share the same semantics but may differ syntactically to various degrees. Detecting code clones helps reduce the cost of software maintenance and prevent faults. Various approaches to detecting code clones have been proposed over the last two decades, but few of them can detect semantic clones, i.e., code clones with dissimilar syntax. Recent research has attempted to adopt deep learning for detecting code clones, such as using tree-based LSTM over the Abstract Syntax Tree (AST). However, such work does not fully leverage the structural information of code fragments, which limits its clone-detection capability. To fully unleash the power of deep learning for detecting code clones, we propose a new approach that uses tree-based convolution to detect semantic clones by capturing both the structural information of a code fragment from its AST and the lexical information from its code tokens. Additionally, our approach addresses the limitation that source code has an unlimited vocabulary of tokens and models, so that exploiting lexical information from code tokens is often ineffective when dealing with unseen tokens. In particular, we propose a new embedding technique called position-aware character embedding (PACE), which essentially treats any token as a position-weighted combination of character one-hot embeddings. Our experimental results show that our approach substantially outperforms an existing state-of-the-art approach, with increases of 0.42 and 0.15 in F1-score on two popular code-clone benchmarks (OJClone and BigCloneBench), respectively, while being more computationally efficient. Our experimental results also show that PACE enables our approach to be substantially more effective when code clones contain unseen tokens.
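
The PACE idea described in the abstract (a token represented as a position-weighted combination of character one-hot vectors) can be illustrated with a minimal sketch. The abstract does not give the weighting formula or the character set, so both are assumptions here: the alphabet `ALPHABET`, the helper name `pace_embedding`, and the linearly decaying weight `(n - pos) / n` are hypothetical choices made only for illustration, not necessarily what the paper uses.

```python
import string

# Hypothetical character set; the abstract does not specify one.
ALPHABET = string.ascii_lowercase + string.digits + "_"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def pace_embedding(token):
    """Return a position-weighted sum of character one-hot vectors.

    Assumption: the character at 0-based position `pos` in a token of
    length n gets weight (n - pos) / n, so earlier characters count more.
    """
    lowered = token.lower()
    n = len(lowered)
    vec = [0.0] * len(ALPHABET)
    for pos, ch in enumerate(lowered):
        idx = INDEX.get(ch)
        if idx is None:
            continue  # characters outside the assumed alphabet are skipped
        # Adding weight * one_hot(ch) only changes the entry at idx.
        vec[idx] += (n - pos) / n
    return vec


# The same characters in a different order give different vectors,
# and a token never seen before still gets a non-zero embedding.
ab = pace_embedding("ab")
ba = pace_embedding("ba")
print(ab[INDEX["a"]], ab[INDEX["b"]])
print(ba[INDEX["a"]], ba[INDEX["b"]])
```

Because the vector is built from characters rather than looked up in a token vocabulary, any identifier can be embedded, which is the property the abstract relies on for unseen tokens.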

Year	2019
DOI	10.1109/ICPC.2019.00021
Venue	Proceedings of the 27th International Conference on Program Comprehension
Keywords	AST, clone detection, embedding, generalization, lexical information, semantic clone, source code, structural information, token, tree-based convolution
Field	Data mining, Embedding, Source code, Computer science, Abstract syntax tree, Theoretical computer science, Artificial intelligence, Software maintenance, Deep learning, Code (cryptography), Security token, Semantics
DocType	Conference
ISSN	2643-7147
ISBN	978-1-7281-1520-7
Citations	6
PageRank	0.40
References	13
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Hao Yu	1	6	0.40
Wing Lam	2	172	8.81
Long Chen	3	7	0.74
Ge Li	4	469	30.57
Tao Xie	5	5978	304.97
Qianxiang Wang	6	346	31.05
