Abstract |
---|
Empirical software engineering researchers are concerned with understanding the relationships between outcomes of interest, e.g., defects, and process and product measures. The use of correlations to uncover strong relationships is a natural precursor to multivariate modeling. Unfortunately, correlation coefficients can be difficult to interpret, or even misleading. For example, a strong correlation occurs between variables that stand in a polynomial relationship; this may lead one to mistakenly, and ultimately misleadingly, model a polynomially related variable in a linear regression. Likewise, a non-monotonic functional relationship, or even a non-functional one, might be missed entirely by a correlation coefficient. Outliers can influence standard correlation measures, tied values can unduly influence even robust non-parametric rank correlation measures, and smaller sample sizes can cause instability in correlation measures. A new bivariate measure of association, the Maximal Information Coefficient (MIC) [1], promises to simultaneously discover whether two variables have: a) any association, b) a functional relationship, and c) a non-linear relationship. The MIC is a very useful complement to standard and rank correlation measures. It separately characterizes the existence of a relationship and its precise nature; thus, it enables more informed choices in modeling non-functional and non-linear relationships, and it provides a more nuanced indicator of potential problems with the values reported by standard and rank correlation measures. We illustrate the use of MIC using a variety of software engineering metrics. We study and explain the distributional properties of MIC and related measures in software engineering data, and illustrate the value of these measures for the empirical software engineering researcher.
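The two pitfalls named in the abstract can be reproduced on synthetic data. The following is a minimal sketch, not taken from the paper: the data are made up, the helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical, and it computes only standard and rank correlation, not MIC. It assumes a monotonic cubic relationship and a symmetric parabola as stand-ins for the "polynomial" and "non-monotonic" cases.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Assumed case 1: monotonic polynomial relationship, y = x^3 on x = 1..20.
positive_x = list(range(1, 21))
cubic_y = [x ** 3 for x in positive_x]

# Assumed case 2: non-monotonic functional relationship, y = x^2 on x = -10..10.
symmetric_x = list(range(-10, 11))
parabola_y = [x ** 2 for x in symmetric_x]

r_cubic = pearson(positive_x, cubic_y)
rho_cubic = spearman(positive_x, cubic_y)
r_parabola = pearson(symmetric_x, parabola_y)
rho_parabola = spearman(symmetric_x, parabola_y)

print(f"cubic:    pearson={r_cubic:.3f}  spearman={rho_cubic:.3f}")
print(f"parabola: pearson={r_parabola:.3f}  spearman={rho_parabola:.3f}")
```

In this sketch the cubic yields a high Pearson coefficient even though the relationship is not linear, while the parabola yields coefficients near zero even though `parabola_y` is fully determined by `symmetric_x`. Neither number alone reveals the shape of the relationship, which is the gap the abstract says MIC is meant to fill.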
Year | DOI | Venue |
---|---|---|
2012 | 10.1109/MSR.2012.6224295 | MSR |
Keywords | Field | DocType |
empirical software engineering,software measurement,software metrics,linear regression,correlation,rank correlation,software engineering,sample size,monotone function,dataset | Rank correlation,Data mining,Correlation coefficient,Computer science,Correlation,Software metric,Maximal information coefficient,Bivariate analysis,Statistics,Sample size determination,Linear regression | Conference |
ISBN | Citations | PageRank |
978-1-4673-1761-0 | 3 | 0.37 |
References | Authors |
5 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Daryl Posnett | 1 | 578 | 19.11 |
Premkumar Devanbu | 2 | 4956 | 357.68 |
Vladimir Filkov | 3 | 1503 | 75.32 |