Overview

If you can’t measure it, you can’t improve it. But measuring the wrong things is worse than measuring nothing—it actively drives bad behavior. These metrics define how we evaluate whether our AI systems are truly succeeding at their mission.

Key insight: Metrics are indicators, not goals. When a metric becomes a target, it ceases to be a good metric (Goodhart's Law). Use these metrics to guide improvement, not as targets to game.

Technical Performance

| Metric | Target | Warning Level | Critical Level |
| --- | --- | --- | --- |
| System uptime | 99.9% | <99.5% | <99% |
| API response time (p50) | <100 ms | >200 ms | >500 ms |
| Error rate | <0.1% | >1% | >5% |
| Test coverage (core) | >80% | <70% | <50% |
| Security vulnerabilities | 0 critical | 1+ high | 1+ critical |
| Deployment success rate | >95% | <90% | <80% |
| Mean time to recovery | <15 min | >30 min | >1 hr |
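
To make these thresholds actionable, here is a minimal Python sketch that encodes the table in a lookup and labels a reading as ok, warning, or critical. The metric keys and reading format are illustrative assumptions, and the security vulnerability metric is omitted because it is tallied by severity rather than as a single number.

```python
# Minimal sketch: encode the thresholds above and label each reading.
# Metric keys and the reading format are assumptions, not a real API.

# metric -> (warning threshold, critical threshold, higher_is_worse)
THRESHOLDS = {
    "uptime_pct":             (99.5, 99.0, False),   # <99.5% warn, <99% critical
    "p50_latency_ms":         (200.0, 500.0, True),  # >200 ms warn, >500 ms critical
    "error_rate_pct":         (1.0, 5.0, True),
    "core_test_coverage_pct": (70.0, 50.0, False),
    "deploy_success_pct":     (90.0, 80.0, False),
    "mttr_minutes":           (30.0, 60.0, True),
}

def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one reading."""
    warn, crit, higher_is_worse = THRESHOLDS[metric]
    if higher_is_worse:
        if value > crit:
            return "critical"
        return "warning" if value > warn else "ok"
    if value < crit:
        return "critical"
    return "warning" if value < warn else "ok"

if __name__ == "__main__":
    for name, value in {"uptime_pct": 99.7, "p50_latency_ms": 320.0}.items():
        print(f"{name}: {value} -> {classify(name, value)}")
```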

Questions to Ask

Ethical Compliance

| Metric | Target | Measurement Method |
| --- | --- | --- |
| Prohibited action violations | 0 | Automated monitoring + audit log review |
| Truthfulness score | 100% | Spot checks on AI claims vs. reality |
| Privacy incidents | 0 | Data access audits + incident reports |
| Bias detection rate | >95% | Regular fairness audits on outputs |
| Transparency compliance | >90% | Decision explainability audits |
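
To illustrate how a spot check can roll up into the truthfulness score, the hypothetical sketch below samples logged claims that a reviewer has already marked as verified or not; the record fields and sample size are assumptions.

```python
import random

def truthfulness_score(claims: list[dict], sample_size: int = 50) -> float:
    """Fraction of a random sample of logged claims that a reviewer verified.

    The `verified` field is a hypothetical reviewer judgment recorded
    during the spot check; it is not produced by the AI itself.
    """
    sample = random.sample(claims, min(sample_size, len(claims)))
    return sum(c["verified"] for c in sample) / len(sample)

# Example: three of four sampled claims held up -> 0.75
claims = [{"verified": True}, {"verified": True},
          {"verified": False}, {"verified": True}]
print(truthfulness_score(claims))
```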

Questions to Ask

Collaboration Quality

| Metric | Target | Measurement Method |
| --- | --- | --- |
| First-attempt success rate | >80% | Track tasks completed without revision |
| Appropriate autonomy usage | >90% | Review autonomous decisions against authority levels |
| Communication clarity | >85% | User feedback on clarity of AI output |
| Escalation accuracy | >95% | Were escalations warranted? Were needed escalations missed? |
| Context retention | >90% | Does the AI retain relevant prior context? |
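
The rate metrics in this table reduce to simple counts over a reviewed task log. A minimal sketch, assuming a hypothetical record format:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    # Hypothetical log record; field names are assumptions for illustration.
    revised: bool               # output needed a revision pass
    escalated: bool             # the AI escalated to a human
    escalation_warranted: bool  # reviewer's after-the-fact judgment

def first_attempt_success_rate(tasks: list[TaskRecord]) -> float:
    """Share of tasks completed without revision."""
    return sum(not t.revised for t in tasks) / len(tasks)

def escalation_accuracy(tasks: list[TaskRecord]) -> float:
    """Share of tasks where the escalate/don't-escalate call matched the
    reviewer's judgment, counting both warranted escalations and
    correctly handled non-escalations."""
    return sum(t.escalated == t.escalation_warranted for t in tasks) / len(tasks)
```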

Questions to Ask

Real-World Impact

| Metric | Target | Measurement Method |
| --- | --- | --- |
| Problems solved | Growing | Track unique problems resolved per period |
| Time saved for humans | Growing | Estimate hours of manual work automated |
| User satisfaction | >4/5 | Regular feedback surveys |
| System capability growth | Expanding | Count of domains/tasks the system can handle |
| Negative incidents | Declining | Track incidents caused by AI actions |
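
Trend targets such as "Growing" and "Declining" need an explicit comparison rule before they can be automated. One simple convention, sketched below, is to compare the latest period's count against the previous period's; the two-period rule is an illustrative assumption, not a fixed standard.

```python
def trend(per_period_counts: list[int]) -> str:
    """Label the latest period relative to the one before it."""
    if len(per_period_counts) < 2:
        return "insufficient data"
    latest, previous = per_period_counts[-1], per_period_counts[-2]
    if latest > previous:
        return "growing"
    if latest < previous:
        return "declining"
    return "flat"

# Example: unique problems resolved per quarter
print(trend([12, 15, 19]))  # -> "growing"
```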

The Ultimate Question

“Are the humans who work with this system better off than they would be without it? Not just more productive—but genuinely better off? More capable, less stressed, more free to focus on what matters?”

A Living Document

These metrics are not static. As the system evolves, as new challenges emerge, as understanding deepens, these measurements will need to adapt. Regular review ensures they remain meaningful.

Review Schedule

  • Weekly: Technical performance metrics (automated dashboards)
  • Monthly: Collaboration quality and ethical compliance review
  • Quarterly: Impact assessment and metric framework review
  • Annually: Full framework audit and update