
Review of “Predicting Expert Evaluations of Software Code Reviews” (Denisov et al., 2024)

Published: 24 December 2024

We applaud the Denisov et al. (2024) initiative in highlighting the critical dimensions of Productivity and Quality in software development. We concur with the authors’ acknowledgement of the benefits of structured reviews in improving developer performance. We are also delighted to see a research community forming at Stanford University around this increasingly urgent topic for the software industry, much as one formed at Cambridge University 24 years ago.

These observations set out in this research paper align with the principles underpinning BlueOptima’s Team Lead Dashboard, which provides:

  • Real-time visibility into productivity and quality metrics.
  • Guidance for code reviewers to identify improvement opportunities.
  • Code insights that maximise the positive impact of code reviews on organisational performance.

While the Stanford academics make important contributions to understanding code review automation, several significant methodological shortcomings must be highlighted if the industry is to reach a truly independent and objective understanding of the issues the paper purports to address. Their study of expert evaluations in code reviews raises valuable questions about productivity and quality metrics in software development. However, shortcomings in the research design and the underlying technology considerably limit the relevance and generalisability of their findings.

BlueOptima’s technology and capabilities in this domain extend beyond theoretical propositions, delivering actionable insights that enhance the software development lifecycle. For over a decade, BlueOptima has been the industry leader in evaluating developer productivity and quality, with robust and widely validated methodologies used by enterprises worldwide in large-scale software development investments. We believe that we are uniquely qualified to review this piece of work by these two academics.

We provide a summary below of our observations of this specific paper and the underlying technology that is loosely described by the authors.


1. Unreliable Questionnaire Data

The academics rely on subjective evaluations of 70 commits by 10 expert raters, yielding 4,900 total “judgements” as the sole basis of their considerable claims. Questionnaires are often employed in social science research where objective and reliable data is not available. The limited reliability and generalisability of questionnaires are well understood among researchers and practitioners alike.

The academics made substantial efforts to address questionnaire validity. The reported ICC2,k scores (0.81-0.82 for coding time measures) demonstrate apparently consistent interpretation among raters. These values warrant closer scrutiny due to the subjective nature of the underlying data. The assessments are based on perceived time estimates provided by expert raters, not on verified or recorded measurements of actual time spent on these tasks. This reliance on subjective judgment raises several critical methodological concerns.

First, the high ICC2,k values indicate consistency among raters but do not inherently validate the accuracy of their estimates. Without an objective benchmark for comparison, it is impossible to know the true effort required to complete the tasks. The absence of validation against real-world data, such as task tracking logs or integrated development environment (IDE) telemetry, limits the reliability of the reported measures as indicators of actual productivity.

Second, the constrained dataset of 70 commits evaluated by only 10 raters may amplify the effects of anchoring or implicit alignment in their responses. Such intentional or subconscious alignment could inflate agreement metrics without necessarily improving the validity of the underlying estimates. Additionally, the study’s reliance on a Fibonacci scale, while aligning with Agile methodologies, introduces an abstract and non-linear framework that may further disconnect the estimates from real-world effort.
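To make the distinction between consistency and accuracy concrete, consider the following sketch (synthetic numbers, not data from the study): ten raters who anchor to a shared but biased impression of effort will still produce a high ICC(2,k), even though every estimate is far from the true values.

```python
# Hypothetical illustration: high ICC(2,k) can coexist with poor accuracy.
# Assumed scenario (not the study's data): 10 raters anchor to a shared,
# biased impression of coding time for 70 commits.
import numpy as np

rng = np.random.default_rng(0)
n_commits, n_raters = 70, 10

true_hours = rng.lognormal(mean=1.0, sigma=0.8, size=n_commits)    # "actual" effort
shared_impression = 2.5 * true_hours + 3.0                         # common anchoring bias
ratings = shared_impression[:, None] + rng.normal(0, 0.5, (n_commits, n_raters))

def icc2k(x):
    """ICC(2,k) from a targets-by-raters matrix via two-way ANOVA mean squares."""
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    ss_err = ((x - grand) ** 2).sum() - ms_rows * (n - 1) - ms_cols * (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

print(f"ICC(2,k) = {icc2k(ratings):.2f}")                           # high agreement...
print(f"Mean absolute error vs. true hours = "
      f"{np.abs(ratings.mean(axis=1) - true_hours).mean():.1f} h")  # ...yet large bias
```

The agreement statistic is driven by how consistently raters rank and space the commits relative to one another; it says nothing about whether any of them is close to the truth.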

Lastly, methodological questions remain about construct validity. For example, Question 5 asks raters to assess maintainability on a five-point scale from ‘Poor’ to ‘Excellent’. While raters showed moderate agreement (ICC2,k = 0.52), this lower agreement compared with other measures may indicate underlying challenges in standardising maintainability assessment.

For this academic research to deliver actionable industry insights, it would need at least to incorporate the following:

  1. Relating these subjective maintainability assessments to objective metrics
  2. Examining whether maintainability ratings correlate with actual maintenance outcomes or costs
  3. Investigating if the moderate ICC2,k score for maintainability (0.52) reflects inherent challenges in standardising such assessments across different development contexts

Similarly, the structural assessment questions (Q6-Q7) achieved ICC2,k scores of 0.50-0.51, suggesting these measures lack concrete evaluation criteria. While the academics recruited “experts” of unknown capabilities aside from their years of experience across apparently diverse organisational contexts (from teams of 1-10 to 201-500 employees), the relatively lower agreement on these measures suggests that the questionnaire data is not readily transferable across organisational contexts.

BlueOptima’s Approach: BlueOptima removes subjectivity by automating the measurement of productivity and quality using well-validated metrics such as Actual Coding Effort (ACE) and Analysis of Relative Thresholds (ART), otherwise known as Aberrant Coding Effort, which are validated against varied sources of activity telemetry and systematically logged data. These metrics are calibrated against large-scale, enterprise-grade datasets to ensure objectivity and accuracy. BlueOptima does not rely on subjective questionnaire-based data.


2. Limited Insights and Source Code Context

The academics claim their model “adds insights that help reviewers focus on more impactful tasks.” However, the methodology’s limited scope fails to deliver on this promise.

Narrow Source Code Context: The dataset is derived from a heavily undersampled selection of just 70 commits from a body of work of 1.73 million, offering an incomplete view of software development practices in the organisations included in the study. 

These commits are evaluated in terms of an unstated number of static metrics that attempt to describe Code Structure (Classes, Interfaces, and Methods), Code Quality Metrics (Cohesion, Complexity, and Coupling), Implementation Details (Data Structures, Dependencies, and Dependency Injections), and Architectural Elements (Architectural Patterns, Persistence Layers, APIs Consumed). A number of these concepts, insofar as they are described, do not appear to be measurable at the commit delta level, given that they describe higher-order structures in source code. If true, a fundamental requirement of this technology is not met: commits involve more or less arbitrary changes to any given revision of any given file, rendering the measurement of some of these concepts impossible.

The academics measure productivity largely on the basis of Lines of Code and some unspecified measures of cohesion, both of which are of questionable face and construct validity. Quality is measured using a ratio of intra-module method calls to total method calls, complexity measures (e.g. Halstead, Cyclomatic), and fan-in and fan-out measures. These have better face and construct validity; ultimately, however, these code-level measures are combined, using a random forest model, to optimally explain the subjective and unreliable questionnaire data.
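To illustrate the pipeline as we understand it from the paper’s description, the sketch below is our reconstruction with hypothetical feature names and synthetic data, not the authors’ code: static code metrics are fed to a random forest whose target is the averaged rater score. Whatever fit such a model achieves, it can only ever be as reliable as the subjective questionnaire data it is trained on.

```python
# Illustrative reconstruction of the described pipeline (hypothetical feature
# names and synthetic values; not the authors' implementation):
# static code metrics -> random forest -> predicted expert questionnaire score.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 70  # the paper's sample of commits

# Hypothetical per-commit static metrics of the kinds named in the paper.
X = pd.DataFrame({
    "lines_of_code":           rng.integers(1, 2000, n),
    "cyclomatic_complexity":   rng.integers(1, 60, n),
    "halstead_volume":         rng.uniform(10, 5000, n),
    "intra_module_call_ratio": rng.uniform(0, 1, n),
    "fan_in":                  rng.integers(0, 30, n),
    "fan_out":                 rng.integers(0, 30, n),
})
# The regression target remains the averaged subjective rater score (synthetic here).
y = rng.uniform(1, 5, n)

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"Cross-validated R^2: {scores.mean():.2f}")  # fit quality is bounded by
                                                    # the reliability of y itself
```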

BlueOptima’s measure of Productivity, Coding Effort, is calculated by statistically evaluating every source code change made by developers in terms of up to 36 static source code metrics measuring various aspects of Volume, Complexity, and Interrelatedness while considering the context worked in (e.g. a complex legacy software component or a brand new project). Coding Effort is evaluated based on the changes committed into version control systems on a per commit per file basis.

BlueOptima’s measure of Quality, Aberrant Coding Effort, captures source code quality, and more specifically maintainability. The aberrancy of source code provides an objective account of how easily a developer who is naive to the code can reach the level of understanding needed to extend, alter, or improve it. It is calculated by evaluating the proportion of code that is aberrant relative to the codebase in which it sits, across more than 20 static source code metrics. Code is flagged as aberrant when it violates thresholds that have been benchmarked across an enterprise’s software estate and BlueOptima’s Global Benchmarks.
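As a purely conceptual illustration of the thresholding idea described above, the toy sketch below flags code units that breach benchmarked limits and reports the aberrant proportion; the metrics, thresholds, and values shown are invented and do not represent BlueOptima’s actual ART calculation.

```python
# Toy sketch of the thresholding concept only; the metrics, thresholds and
# example values here are invented and are NOT BlueOptima's ART algorithm.
from dataclasses import dataclass

@dataclass
class FileMetrics:
    path: str
    cyclomatic_complexity: int
    method_length: int
    fan_out: int

# Hypothetical benchmark thresholds (in practice these would be calibrated
# against an enterprise's estate and global benchmarks).
THRESHOLDS = {"cyclomatic_complexity": 15, "method_length": 80, "fan_out": 12}

def is_aberrant(m: FileMetrics) -> bool:
    """A unit is flagged when any metric breaches its benchmarked threshold."""
    return (m.cyclomatic_complexity > THRESHOLDS["cyclomatic_complexity"]
            or m.method_length > THRESHOLDS["method_length"]
            or m.fan_out > THRESHOLDS["fan_out"])

codebase = [
    FileMetrics("billing/Invoice.java", 22, 140, 9),
    FileMetrics("billing/Tax.java", 6, 35, 4),
    FileMetrics("core/Dispatcher.java", 12, 60, 18),
    FileMetrics("util/Strings.java", 3, 20, 2),
]

aberrant_share = sum(is_aberrant(f) for f in codebase) / len(codebase)
print(f"Aberrant proportion: {aberrant_share:.0%}")  # 50% in this toy example
```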

Broader Functionality for Effective Code Reviews: Whatever the validity of the productivity and quality measures that the academics propose, they offer insight to the code reviewer that is, at best, sorely incomplete. BlueOptima, through its Team Lead Dashboard, offers a suite of source code analytic capabilities that go beyond what is described in the paper, enabling team leads and reviewers to identify and prioritise critical issues:

  1. Test Code Detection: Identifies non-production code to allow reviewers to focus on core functionality.
  2. Code Author Detection: Flags machine-generated code for review, essential in environments where Generative AI is used.
  3. Secrets Detection: Highlights sensitive information embedded in the source code, mitigating security risks.
  4. Software Composition Analysis: Detects external dependencies, including vulnerable or outdated packages.
  5. Software Vulnerability Detection: Identifies patterns in code that may expose applications to security threats.

These wider insights ensure code reviewers can address high-impact areas quickly and comprehensively.


3. Unreplicable, Unreliable, and Unimplementable Commit Sampling

Sampling and Representativeness: The authors collected data from 1.73 million commits across 50,935 contributors but analysed only 70 selected commits. While they state these commits “matched the LOC distribution”, they do not specify their sampling methodology or demonstrate how this extremely small sample can adequately represent enterprise software development patterns. In enterprise environments, commit sizes span multiple orders of magnitude, from single-line changes to commits exceeding 100,000 lines of code for generated artifacts or large-scale refactoring.
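The sketch below uses synthetic numbers drawn from an assumed heavy-tailed commit size distribution, not the paper’s data, to show how easily a 70-commit sample can miss the long tail of commits that carries most of the code volume.

```python
# Synthetic illustration of the sampling problem (assumed heavy-tailed
# distribution, not the paper's data): a 70-commit sample drawn from millions
# of commits routinely misses the tail that contributes most of the LOC.
import numpy as np

rng = np.random.default_rng(42)
population = rng.lognormal(mean=3.0, sigma=2.0, size=1_730_000)  # commit sizes (LOC)
sample = rng.choice(population, size=70, replace=False)

for name, data in (("population", population), ("sample of 70", sample)):
    print(f"{name:>13}: median={np.median(data):7.0f} LOC, "
          f"p99={np.percentile(data, 99):9.0f} LOC, max={data.max():9.0f} LOC")

top_1pct = population >= np.percentile(population, 99)
share = population[top_1pct].sum() / population.sum()
print(f"Share of all LOC contributed by the top 1% of commits: {share:.0%}")
```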

Fundamental Classification Problems: The premise of classifying commits into discrete automation levels (1-4) faces insurmountable operational challenges. Every commit contains lines of code regardless of its origin, and modern integrated development environments blend manual coding with automated assistance through features like code completion and automated refactoring. The final commit content provides no reliable signal about the degree of automation used in its creation. The authors do not address this fundamental limitation in their methodology.

Limitations of Expert Evaluation: While achieving high inter-rater reliability, expert evaluation of 70 selected commits cannot provide generalisable insights given the massive variation in commit types and sizes observed in enterprise software development. The evaluation approach assumes that the degree of automation behind a commit can be reliably determined through post-hoc analysis of the commit itself, an assumption that remains unproven.

Limited Representation of Commit Diversity: The distribution of commit sizes observed across enterprise software development organisations (Figure 1; population size: 10 billion commits) spans an extremely wide range. A dataset of 70 selected commits would not adequately capture this full spectrum of development activities. It could be skewed towards specific types of commits (e.g., small bug fixes or contained functional deliverables) and miss critical information about larger-scale changes, refactoring efforts, or complex feature implementations. This severely limits the generalisability of the findings to other projects, teams, or contexts.

Impact of Outliers: The presence of extreme values (e.g., the maximum values for Lines of Code Added and Number of Files Modified) highlights the potential impact of outliers. A small sample that includes such cases will be disproportionately influenced by them, while one that deliberately excludes them cannot represent the population at all; either way, the result is flawed estimates and inaccurate conclusions.

Difficulty in Identifying Trends: The significant variance across the metrics suggests that identifying meaningful trends or patterns in developer behaviour might be extremely challenging using the inadequate sample proposed by the academics. The limited data would not provide sufficient statistical power to detect subtle relationships between commit characteristics and other factors (e.g., developer experience, project complexity).

The academics would do well to develop a more robust technological solution that provides a defensible account of automation in software development, which would require:

  • Detailed analysis of the full commit size distribution
  • Integration with development environment telemetry
  • Recognition that automation exists on a continuous spectrum rather than discrete levels
  • Examination of the development process itself rather than post-hoc classification of commits
  • Incorporation of IDE logs and longitudinal studies of automation tool adoption

These methodological issues suggest that a meaningful study of automation in software development requires a fundamental rethinking of how such analysis should be conducted.

BlueOptima’s technology is mature and highly capable at identifying code changes that are automated or systematically generated, so reviewers can handle human-authored changes with the care and focus required to maximise the benefits of the code review process.


4. Lack of Generalisability Beyond OOP

The academics’ model focuses exclusively on Java. They broadly allude to their approach potentially being suitable for other object-oriented programming (OOP) languages; however, even if they had delivered the significant engineering effort required to accommodate those languages, the OOP focus would still significantly limit the applicability of the techniques they describe. Modern software projects involve scripting languages (e.g., Python), configuration files (e.g., YAML), and functional programming languages. The lack of generalisability to these paradigms diminishes the utility of the proposed methodology.

BlueOptima’s Versatility: By supporting a wide range of programming paradigms and languages, BlueOptima ensures its metrics are relevant across diverse development environments, from dynamic languages to compiled systems.


5. Lack of Replicability and Transparency

Sections 2.1–2.4 of the paper fail to provide sufficient detail for replication. The specific static metrics used and their algorithmic combinations are not disclosed, making it impossible to validate or reproduce the findings. This paper is published by academics in a format typically required by peer-reviewed journals, and it conspicuously refers to the university within which the research was apparently conducted. Peer-reviewed research demands transparency to enable replication and scrutiny.

The lack of sufficient detail to replicate the calculation of the metrics may be an oversight that could be rectified in subsequent publications by the academics. Alternatively, it may be that this document is in fact not intended for publication in a peer-reviewed journal, despite its pretences. If so, the authors should declare the document a marketing vehicle and present it as such, so as not to mislead their readers.

BlueOptima’s Integrity: In contrast, BlueOptima provides clear, detailed methodologies for its metrics to paying customers, including the calculation of ACE and ART. These methodologies are validated through large-scale studies and adopted by enterprises worldwide. BlueOptima does not purport to publish the details of its core algorithm calculations, nor would its customers permit it to do so, as the metrics and measures employed are used to provide near-real-time evaluation of billions of dollars of software development investments daily. If these metrics were to be gamed or hacked, the implications for the software development industry would be significant.


Conclusion

Denisov et al. (2024) highlight the importance of measuring productivity and quality in software development, but their approach faces critical limitations in scale, objectivity, generalisability, and transparency. Reliance on subjective questionnaire data, a small dataset, and unclear metrics restricts the real-world applicability of their findings. Furthermore, their focus on Java and object-oriented programming limits relevance to modern, multi-paradigm environments.

In contrast, BlueOptima’s validated metrics, such as Actual Coding Effort (ACE) and Aberrant Coding Effort (ART), leverage objective data and global benchmarks to provide actionable insights across diverse programming paradigms. Tools like the Team Lead Dashboard go beyond static metrics, offering comprehensive support for effective code reviews.

To achieve a broader impact, Denisov et al.’s approach must expand its dataset, integrate objective validation, and improve algorithmic transparency. While their work raises important questions, BlueOptima delivers proven, enterprise-ready solutions that address these challenges and enable meaningful improvements in software development outcomes.
