Pre-launch draft

Methodology

Last updated: April 2026

Engineering managers buying software should be able to read the math. This page is the running, public-facing explanation of how Evaal calculates the metrics it reports. The full version with formulas and references will replace this draft before general availability.

If you are evaluating Evaal and need a specific metric explained in detail before signing, email hello@evaal.ai. Evaal will share the current methodology document and arrange a walkthrough with the engineering lead.

The 52-metric taxonomy

Evaal computes 52 metrics across four domains: velocity (cycle time, throughput, deployment frequency, change failure rate), quality (PR review depth, test coverage delta, post-release defect rate, escaped defect rate), wellbeing (focus time availability, after-hours work ratio, on-call burden, meeting load), and AI adoption (AI code ratio, AI rework rate, clean merge rate, AI cycle time delta, AI time saved).

The taxonomy is grounded in DORA (DevOps Research and Assessment), SPACE (the SPACE framework from Microsoft Research), and DX Core 4 (the DX Engineering Productivity model). Customers choose which framework to weight; metrics from all three are computed in parallel.

Cycle time

Cycle time is measured as the elapsed time from first commit on a branch to merge into the default branch, computed per pull request. Evaal reports the median, p75, and p90 of cycle time at the team level over rolling 14-day, 30-day, and 90-day windows.

PRs that are abandoned, force-pushed away, or split are handled by a heuristic that the full methodology document describes in detail.
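The reporting step reduces to percentile statistics over a rolling window. A minimal sketch, assuming each PR is available as a (first-commit, merge) timestamp pair; the abandonment and split heuristics are omitted, and the function names are illustrative, not Evaal's internals:

```python
from datetime import datetime, timedelta
from statistics import median, quantiles

def cycle_time_hours(first_commit: datetime, merged: datetime) -> float:
    """Elapsed time from first commit on the branch to merge, in hours."""
    return (merged - first_commit).total_seconds() / 3600

def rolling_stats(prs, as_of: datetime, window_days: int = 30):
    """Median, p75, and p90 cycle time for PRs merged in the window."""
    cutoff = as_of - timedelta(days=window_days)
    times = sorted(
        cycle_time_hours(first, merged)
        for first, merged in prs
        if cutoff <= merged <= as_of
    )
    if len(times) < 2:
        return None  # not enough merged PRs in the window to report
    # quantiles(..., n=20) yields the 5th, 10th, ..., 95th percentiles.
    q = quantiles(times, n=20, method="inclusive")
    return {"median": median(times), "p75": q[14], "p90": q[17]}
```

The same function serves the 14-day and 90-day windows by varying window_days.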

Burnout risk

Burnout risk is a composite signal computed from three passive measurements per engineer: after-hours commit ratio (commits between 19:00 and 07:00 local time, divided by total commits), review engagement decline (PR review participation week-over-week, with a 4-week trend), and focus time erosion (the count of 90-minute uninterrupted blocks per week, where uninterrupted means no calendar events and no Slack DMs).

Evaal flags burnout risk when two or more of these signals cross a 2σ (two standard deviation) threshold relative to the engineer's own historical baseline. The threshold is intentionally conservative; Evaal's design priority is precision over recall, because false positives erode manager trust faster than missed detections do.
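The thresholding step above can be sketched as follows, assuming each signal's history is available as a list of past values for that engineer. The signal names and data shapes are illustrative assumptions, not Evaal's actual pipeline:

```python
from statistics import mean, stdev

def crosses_2_sigma(history: list[float], current: float) -> bool:
    """True if the current value deviates more than 2 standard deviations
    from the engineer's own historical baseline for this signal."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > 2 * sigma

def burnout_flag(signals: dict[str, tuple[list[float], float]]) -> bool:
    """Flag only when two or more signals cross the threshold."""
    crossed = sum(
        crosses_2_sigma(history, current)
        for history, current in signals.values()
    )
    return crossed >= 2
```

Requiring two concurrent crossings, rather than one, is what makes the detector precision-biased: a single noisy week in one signal never fires the flag on its own.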

Burnout risk is reported privately to the manager only. It is never reported to HR, never to the engineer, and never as an individual score to anyone other than the manager who has been granted that view. Aggregated team-level signals are visible to skip-level managers and the customer's executive sponsor.

Cognitive load

Cognitive load is a composite metric for the manager themselves, not their team. It tracks five inputs: context switches per day (a context switch is a transition between projects, teams, or repositories), decision volume (the count of PRs reviewed, tickets reassigned, calendar events accepted), meeting load (hours in meetings as a fraction of working hours), information overload (Slack messages received in channels the manager actively reads), and Dunbar drift (the count of distinct people the manager interacted with in the rolling 7-day window).

The composite is normalised to a 0-100 scale, and its thresholds adapt at Dunbar's three layers (15, 50, and 150 relationships); crossing a layer is treated as a soft signal that the manager has passed a cognitive scaling boundary.
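Under stated assumptions, equal weighting of the five inputs and a per-input cap, neither of which this page specifies for the real product, the normalisation and layer lookup might look like:

```python
def dunbar_layer(distinct_contacts: int) -> int:
    """Smallest Dunbar layer (15, 50, 150) containing the manager's
    rolling 7-day distinct-contact count; 150 is the ceiling."""
    for layer in (15, 50, 150):
        if distinct_contacts <= layer:
            return layer
    return 150

def cognitive_load(raw_inputs: dict[str, float],
                   caps: dict[str, float]) -> float:
    """Normalise each input against an assumed per-input cap, average
    with equal weights (an assumption), and scale to 0-100."""
    shares = [min(raw_inputs[k] / caps[k], 1.0) for k in raw_inputs]
    return 100 * sum(shares) / len(shares)
```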

Cognitive load is the only metric Evaal computes about the manager. It is private to the manager. It is never visible to the manager's manager, to HR, or to anyone other than the manager themselves.

Anomaly detection

Evaal uses a modified Z-score, built on the median and the median absolute deviation (MAD) rather than the mean and standard deviation, for anomaly detection on time-series metrics. The modified Z-score is more robust to outliers, which matters for engineering data because a single launch week or incident can skew a small sample.

Anomalies surface in the daily briefing only when the modified Z-score exceeds a threshold and the metric is one the customer has marked as a priority.
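The modified Z-score is conventionally defined (Iglewicz and Hoaglin) as 0.6745 x (value - median) / MAD. A minimal sketch, using the conventional 3.5 cutoff as an illustrative default rather than Evaal's configured threshold:

```python
from statistics import median

def modified_z_scores(series: list[float]) -> list[float]:
    """Modified Z-score per point: 0.6745 * (x - med) / MAD, where MAD
    is the median absolute deviation from the median."""
    med = median(series)
    mad = median(abs(x - med) for x in series)
    if mad == 0:
        return [0.0 for _ in series]  # degenerate: no spread around the median
    return [0.6745 * (x - med) / mad for x in series]

def anomalies(series: list[float], threshold: float = 3.5) -> list[int]:
    """Indices whose absolute modified Z-score exceeds the threshold
    (3.5 is the conventional cutoff, used here as a placeholder)."""
    return [i for i, z in enumerate(modified_z_scores(series))
            if abs(z) > threshold]
```

Because both the centre and the spread are medians, a single launch-week spike raises that point's score without dragging the baseline with it, which is the robustness property described above.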

What Evaal does not compute

Evaal does not compute lines of code per engineer, commit count per engineer, or any other vanity-metric proxy for individual productivity. These metrics are well documented as misrepresenting engineering work and as enabling surveillance patterns that Evaal explicitly refuses to support.

Evaal does not compute a single 'engineer score' or stack-rank engineers against each other. The product is not designed to support performance management decisions about individuals, and the data is intentionally aggregated to make that misuse difficult.

Sources and references

DORA: Forsgren, Humble, and Kim, Accelerate (2018) and the annual State of DevOps Report.

SPACE: Forsgren, Storey, Maddila, Zimmermann, Houck, and Butler, 'The SPACE of Developer Productivity' (2021), ACM Queue.

DX Core 4: Storey, Houck, and Forsgren, 'The DX Core 4: An Empirically-Validated Framework for Developer Experience' (2024).

Dunbar: Dunbar, 'How Many Friends Does One Person Need? Dunbar's Number and Other Evolutionary Quirks' (2010), and the 2024 update in Annals of Human Biology.

Cognitive overload and the prefrontal cortex: Arnsten, 'Stress signalling pathways that impair prefrontal cortex structure and function' (2009), Nature Reviews Neuroscience.

Context switching cost: Mark, Gudith, and Klocke, 'The Cost of Interrupted Work' (UCI, 2008-2023 longitudinal data).

Questions? Email hello@evaal.ai.