Use this URL to cite or link to this record in EThOS:
Title: Code similarity and clone search in large-scale source code data
Author: Ragkhitwetsagul, Chaiyong
ISNI:       0000 0004 7660 5591
Awarding Body: UCL (University College London)
Current Institution: University College London (University of London)
Date of Award: 2018
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
Software development is tremendously benefited from the Internet by having online code corpora that enable instant sharing of source code and online developer's guides and documentation. Nowadays, duplicated code (i.e., code clones) not only exists within or across software projects but also between online code repositories and websites. We call them "online code clones."' They can lead to license violations, bug propagation, and re-use of outdated code similar to classic code clones between software systems. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them become outdated or violate their original license and are possibly harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also found that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique via multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available