What makes a great automated medical coding system

TL;DR
- Accuracy figures are misleading: systems that perform well on common codes often fall below 40% F1 on rare codes, which are not edge cases but real encounters with real financial consequences.
- With CMS expanding Medicare Advantage audits to all 550 eligible contracts, every code prediction needs traceable documentation support and explainable exclusion decisions.
- Speed only creates value when paired with accuracy and evidence: fast output that still requires heavy manual review just moves the bottleneck without fixing it.
Defining the standard
The conversation around automated medical coding has a clarity problem: the accuracy rates vendors quote rarely map to real-world performance. Vendors claim accuracy north of 90% while procurement teams evaluate demos on common, well-documented cases. Meanwhile, gaps keep surfacing in production, where coding models underperform on exactly the encounters those demos never covered.
Defining what good actually looks like requires being specific about three things: what accuracy means in a clinical context, what auditability requires to hold up under careful examination, and what speed is actually worth optimizing for.
Accuracy is complex
The standard benchmark for human medical coding accuracy is 95%, a threshold set by industry bodies and used as the baseline for auditing coder performance. Reaching that threshold consistently across all code types, care settings, and clinical contexts is harder than headline accuracy figures suggest.
The core problem is distributional. Automated coding systems tend to perform well on common codes that appear frequently in training data and perform poorly on rare codes that do not. A 2025 systematic review of LLM-based ICD coding published on medRxiv found that while macro F1 scores reached as high as 0.96 on structured in-domain data, performance dropped sharply for low-prevalence codes. In detailed ICD categories, F1 scores fell below 40% despite sophisticated model architectures.
This matters because rare codes are not edge cases in a clinical sense. A patient struck by a rare complication, a chronic condition with a long-tail ICD mapping, a specialist encounter with a low-frequency procedure code: these are real encounters with real financial and clinical consequences. A system that performs at 90% overall but fails systematically on rare codes is not a 90% accurate system in practice. It is a system with a known, unaddressed blind spot.
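One way to surface that blind spot during evaluation is to compare micro-averaged F1, which is dominated by common codes, against macro-averaged F1, which weights every code equally. A minimal sketch using scikit-learn, with synthetic labels and counts chosen purely for illustration (not real benchmark data):

```python
# Sketch: why a headline accuracy figure hides rare-code failure.
# Synthetic data for illustration only -- not real benchmark results.
from sklearn.metrics import f1_score

# Imagine 1,000 encounters: 950 carry a common code, 50 carry a rare one.
y_true = ["E11.9"] * 950 + ["T80.211A"] * 50

# A system that nails the common code but misses 60% of the rare one.
y_pred = ["E11.9"] * 950 + ["T80.211A"] * 20 + ["E11.9"] * 30

micro = f1_score(y_true, y_pred, average="micro")  # dominated by the common code
macro = f1_score(y_true, y_pred, average="macro")  # weights each code equally

print(f"micro F1: {micro:.2f}")  # ~0.97 -- looks like a "great" system
print(f"macro F1: {macro:.2f}")  # ~0.78 -- exposes the rare-code blind spot
```

The same predictions score 0.97 or 0.78 depending on how you average, which is why a single headline number tells you almost nothing about rare-code behavior.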
Accuracy also has to be evaluated separately across care settings. Inpatient and outpatient encounters follow different coding guidelines. A condition that should be coded in an inpatient admission may be excluded in an outpatient context, and vice versa. Systems that do not distinguish between the two are applying the wrong rules to a meaningful portion of their predictions.
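Setting-awareness means the guideline logic has to branch on encounter type before a documented condition is even eligible for a code. Below is a deliberately simplified sketch built around one well-known difference, that uncertain ("probable", "suspected") diagnoses are reportable for inpatient admissions but generally not for outpatient encounters; real guideline logic is far more involved than this:

```python
# Sketch: applying setting-specific rules before recommending a code.
# Simplified illustration; actual ICD-10-CM guideline logic is much richer.

UNCERTAINTY_MARKERS = ("probable", "suspected", "likely", "rule out")

def is_reportable(diagnosis_text: str, setting: str) -> bool:
    """Return True if the documented diagnosis should be coded in this setting."""
    uncertain = any(m in diagnosis_text.lower() for m in UNCERTAINTY_MARKERS)
    if setting == "inpatient":
        return True              # inpatient: uncertain diagnoses coded as established
    if setting == "outpatient":
        return not uncertain     # outpatient: only confirmed diagnoses are coded
    raise ValueError(f"unknown care setting: {setting!r}")

# Same documentation, opposite answers depending on setting.
note = "probable community-acquired pneumonia"
print(is_reportable(note, "inpatient"))   # True
print(is_reportable(note, "outpatient"))  # False
```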
Auditability is not optional
The regulatory environment in healthcare has changed materially. CMS announced in 2025 that it is expanding Medicare Advantage RADV audits from approximately 60 plans per year to all eligible MA contracts annually, roughly 550 plans, while scaling its medical coding review workforce from 40 to approximately 2,000 reviewers. The agency is auditing payment years 2018 through 2024 concurrently, and CMS estimates that unsupported diagnoses result in approximately $17 billion in annual overpayments to Medicare Advantage organizations.
In that environment, a coding output that cannot be traced back to the clinical documentation that supports it is not a defensible output. It is a liability.
What auditability actually requires is specific. Every code prediction needs a link to the exact text in the clinical note that triggered it. Every exclusion decision (the codes the system chose not to recommend) needs a rationale grounded in coding guidelines. Every alternative considered needs to be visible, so a reviewer can understand why one code was selected over another. That chain of evidence is what compliance teams need to defend a claim, and what payer auditors will look for when they review one.
This is a higher bar than most current tools meet. A system that returns a list of codes with confidence scores does not meet it. A system that highlights source text but cannot explain why a condition was excluded does not meet it either. Auditability means the output is self-documenting, not something that requires a separate process to reconstruct after the fact.
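Concretely, a self-documenting output looks less like a ranked code list and more like a record that carries its own evidence chain. A hypothetical schema sketch, with field names that are illustrative rather than any standard:

```python
# Sketch: an audit-ready prediction record. Field names are illustrative,
# not a standard or any specific vendor's schema.
from dataclasses import dataclass, field

@dataclass
class EvidenceSpan:
    note_id: str    # which clinical document the evidence came from
    start: int      # character offsets of the supporting text
    end: int
    text: str       # the exact excerpt that triggered the code

@dataclass
class CodePrediction:
    code: str                       # e.g., an ICD-10-CM code
    evidence: list[EvidenceSpan]    # traceable link to source documentation
    alternatives: list[str]         # other codes considered and set aside
    rationale: str                  # guideline-grounded reason for the choice

@dataclass
class ExclusionDecision:
    code: str        # a code the system chose NOT to recommend
    guideline: str   # guideline citation grounding the exclusion
    rationale: str   # human-readable explanation for the reviewer

@dataclass
class CodingOutput:
    encounter_id: str
    setting: str     # "inpatient" or "outpatient", per the rules above
    predictions: list[CodePrediction] = field(default_factory=list)
    exclusions: list[ExclusionDecision] = field(default_factory=list)
```

The specific fields matter less than the property they demonstrate: the supporting span, the alternatives considered, and the exclusion rationale travel with the prediction, so nothing has to be reconstructed after the fact.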
Speed matters, but not the way it is usually framed
The productivity argument for automated coding is real. Manual coding of a complex inpatient encounter can take thirty minutes or more. Reducing that time has direct operational value. But optimizing for raw throughput without accuracy is how organizations end up with fast, wrong outputs that require the same manual review they were trying to avoid.
The more useful frame for speed is how quickly a system can produce output that a coder can act on. That means predictions with enough evidence attached that a reviewer can verify them quickly. It means exclusion lists that explain their reasoning, so coders are not second-guessing what the system missed. It means output structured well enough to be integrated into an existing workflow rather than dropped into a separate review queue.
Speed without that structure just moves the bottleneck. The review still happens. It just happens with less information.
What this means for how you evaluate tools
Accuracy, auditability, and speed are not independent variables. They compound. A system with high accuracy and full evidence tracing reduces review time because reviewers can trust and verify predictions quickly. A system with low accuracy and no audit trail requires extensive manual checking that neutralizes any speed gain.
The questions worth asking of any automated coding system are straightforward. How does accuracy hold up on rare codes, not just common ones? Does the output distinguish between inpatient and outpatient guidelines? Can every prediction be traced to the clinical text that supports it? Are exclusion decisions explained? Can the output be defended in a CMS audit without additional documentation work?
If those questions do not have clear answers, the system is not built for production environments.
Next in this series: The hidden cost of coding errors. Quantifying the downstream impact on revenue, compliance, and operations.