When Simulation Results Become Evidence
Simulation has become the backbone of autonomous vehicle development. Millions of virtual kilometers are driven every day, generating charts, KPIs, and scenario results. But here's the problem: a simulator never tells you when its assumptions break down.
Every simulation produces an answer. The critical question isn't what the simulator outputs—it's whether that output can actually be trusted for the claim we're trying to make.
Too often, scenarios get run, reports get generated, and results get presented—without anyone pausing to ask: was the model even valid for this situation?
Without an explicit credibility check, simulation is an exploration tool. Useful for development, but insufficient as safety evidence.
The Question Nobody Wants to Ask
Before any simulation result gets used in validation or certification, we need to answer one fundamental question:
Is this model credible for this specific situation and purpose?
Not "is the model accurate?" and not "is it realistic?" Those questions are too vague. What matters is: does this model make sense for the claim we're trying to support?
What Credibility Actually Means
Model credibility isn't about claiming realism or completeness. It's about answering a more pragmatic question:
Does this simulated scenario make sense given what we know about real-world behavior?
You can assess credibility along several dimensions:
- Physical plausibility: Are motions physically feasible? (No teleporting cars, no 3g lateral accelerations)
- Behavioral plausibility: Do agents behave in ways consistent with traffic rules and human reaction times?
- Temporal consistency: Are there discontinuities, artifacts, or numerical instabilities?
- Statistical plausibility: Does this type of interaction occur in real driving data with non-zero frequency?
You don't need a perfect model. You need to be transparent about what you're modeling—and what you're ignoring.
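The first two dimensions lend themselves to automated checks over a trajectory. A minimal sketch in Python; the thresholds (a 5 m position-jump limit, a 1 g acceleration bound) are illustrative assumptions, not calibrated values:

```python
import math

G = 9.81  # m/s^2

def physically_plausible(positions, dt, max_acc=1.0 * G, max_jump=5.0):
    """Flag teleports (large position jumps) and infeasible accelerations.

    positions: list of (x, y) samples at a fixed timestep dt (seconds).
    Thresholds are illustrative; a real pipeline would calibrate them
    per vehicle class and scenario type.
    """
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        if math.hypot(x1 - x0, y1 - y0) > max_jump:
            return False  # discontinuity: a "teleporting car"
    # finite-difference velocities, then acceleration magnitudes
    vels = [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(positions, positions[1:])]
    for (vx0, vy0), (vx1, vy1) in zip(vels, vels[1:]):
        acc = math.hypot((vx1 - vx0) / dt, (vy1 - vy0) / dt)
        if acc > max_acc:
            return False  # exceeds assumed dynamic limits
    return True
```

Behavioral and statistical plausibility are harder to automate, but the same pattern applies: encode the expectation, check every trajectory against it, and log what fails.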
Model Validity Depends on the Question
Here's something that took me years to internalize: model validity is not absolute. It's always relative to what you're trying to prove.
In practice, we use simulation to answer very different questions:
- Does the planning logic respect traffic rules?
- Does the AV interact safely with other agents at an intersection?
- Is perception robust to weather, lighting, and sensor noise?
- Does the full system behave safely end-to-end?
Each question demands a different level of model fidelity. What's valid for testing the Planning component doesn't automatically tell you anything useful about perception robustness—and pretending it does is how validation gaps emerge.
Treating all simulation results as equally representative is one of the most common ways teams overestimate what they've actually validated. I've been guilty of this myself.
Simulation Models and Their Assumptions
Every simulation model encodes assumptions:
- How traffic participants behave
- How quickly they react
- How aggressively they accelerate or brake
- Which uncertainties are modeled—and which are ignored
A model can be wildly inaccurate in a global sense and still be valid within a well-defined envelope. The problem arises when results get extrapolated beyond that envelope without anyone acknowledging the underlying assumptions.
Abstraction itself isn't the risk. Unbounded abstraction is.
Grounding Simulation in Real-World Data
Real Data Defines the Envelope
Real-world driving data plays a critical role, but not in the way people usually assume.
Logged drives don't validate individual simulated scenarios. What they do is define the operating envelope where the model can reasonably be trusted.
By comparing simulated behavior against empirical distributions—speeds, accelerations, time gaps, interaction timing, sensor measurements—we can detect when a simulation starts drifting into regions that are poorly represented or entirely absent in reality.
This comparison doesn't need to be exact. What matters is identifying where the model is interpolating within known behavior—and where it's extrapolating beyond it.
Real-world data acts as an anchor, not a gold standard.
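One lightweight way to operationalize that envelope: per-feature percentile bands computed from logged drives, with anything outside flagged as extrapolation. The feature and distribution below are synthetic stand-ins; a real pipeline would use logged data and, ideally, multivariate density estimates rather than marginal bands:

```python
import numpy as np

def envelope_check(real_samples, sim_value, lo_pct=1.0, hi_pct=99.0):
    """Return True if sim_value lies inside the empirical envelope.

    real_samples: 1-D array of one feature (e.g. time gaps) from logged
    drives. A per-feature percentile band is a deliberately crude
    envelope; joint distributions catch more extrapolation.
    """
    lo, hi = np.percentile(real_samples, [lo_pct, hi_pct])
    return lo <= sim_value <= hi

# illustrative: time-gap samples (seconds), synthetic stand-in for logs
rng = np.random.default_rng(0)
time_gaps = rng.gamma(shape=4.0, scale=0.5, size=10_000)

envelope_check(time_gaps, 1.8)   # within the bulk: interpolation
envelope_check(time_gaps, 12.0)  # far in the tail: extrapolation
```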
Digital Twins as Validation Tools
Digital twins—scenarios replayed from logged drives—let you directly compare what your simulation predicts against what actually happened.
By running real-world scenarios through your models and comparing the outputs, you reveal where your simulation matches reality and where it drifts. This isn't about achieving perfect alignment. It's about understanding the bounds of your model's validity.
Once these bounds are established, digital twins become regression tests. When software changes are deployed, you can replay the same real-world scenarios to verify that fixes work without introducing new failures.
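A regression check of this kind can be as simple as a pointwise trajectory comparison. A sketch, where the 0.5 m tolerance is an illustrative assumption rather than a standard threshold:

```python
import math

def max_deviation(logged_xy, replayed_xy):
    """Largest pointwise distance between logged and replayed trajectories.

    Both inputs: equal-length lists of (x, y) positions at matching
    timestamps (time alignment is assumed to have happened upstream).
    """
    return max(math.hypot(lx - rx, ly - ry)
               for (lx, ly), (rx, ry) in zip(logged_xy, replayed_xy))

def replay_regression(logged_xy, replayed_xy, tol_m=0.5):
    """Pass if the replay stays within tol_m of the logged drive."""
    return max_deviation(logged_xy, replayed_xy) <= tol_m
```

In practice you would compare more than positions (headings, speeds, decision outcomes), but the structure stays the same: a fixed real-world reference, a tolerance, and a pass/fail verdict per scenario.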
Filtering Generated Scenarios
Once critical scenarios are identified in simulation, for example with Search-Based Testing, credibility needs to be checked again—this time against reality.
Simulated scenarios can be mapped into real-world feature space using relative speeds, distances, timing metrics, or interaction patterns, and then compared against logged drives.
This lets you distinguish between:
- Scenarios that are rare but plausible
- Scenarios that are common but previously untested
- Scenarios that are likely artifacts of modeling assumptions
Real-world data constrains simulation. Simulation explores what data never captured. Each checks the other.
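One way to sketch this triage: map each generated scenario into the same feature space as the logged drives and bucket it by nearest-neighbor distance. The thresholds and normalization below are illustrative assumptions, and distinguishing "common but previously untested" additionally requires test-coverage data, which this sketch omits:

```python
import numpy as np

def triage_scenario(sim_features, real_features, near=1.0, far=3.0):
    """Bucket a generated scenario by distance to logged-drive features.

    sim_features: 1-D feature vector (e.g. relative speed, gap, TTC),
    assumed already normalized. real_features: (n, d) array of the same
    features from logged drives. Thresholds are illustrative.
    """
    d = np.min(np.linalg.norm(real_features - sim_features, axis=1))
    if d <= near:
        return "in-distribution"     # well supported by real data
    if d <= far:
        return "rare-but-plausible"  # sparse support, worth keeping
    return "likely-artifact"         # no real-world support, inspect model
```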
Confidence: Knowing Where Results Can Be Trusted
There's a concept missing from most simulation pipelines: confidence.
Beyond asking "is this scenario critical or safe?" we need a second question:
How confident are we that the simulator is a valid representation in this situation?
Confidence can be informed by:
- Proximity to real-world data distributions
- Number and strength of modeling assumptions involved
- Abstraction level used
- Degree of extrapolation beyond calibrated regions
Low confidence doesn't mean a result is useless. It means it needs to be treated differently—flagged, contextualized, or excluded from certain safety claims.
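These factors can be folded into a simple heuristic score attached to every result. A sketch only; the weights are illustrative assumptions, and any real scheme would need calibration and review:

```python
def confidence_score(envelope_fraction, n_assumptions, extrapolation_dist):
    """Heuristic confidence in a simulation result, in [0, 1].

    envelope_fraction: share of checked features inside the real-data
    envelope. n_assumptions: count of modeling assumptions the scenario
    relies on. extrapolation_dist: normalized distance beyond the
    calibrated region (0 = inside). All weights are illustrative.
    """
    score = envelope_fraction
    score *= 0.9 ** n_assumptions              # each assumption discounts
    score *= 1.0 / (1.0 + extrapolation_dist)  # extrapolation discounts more
    return score
```

The exact formula matters less than the discipline: every result carries a confidence value, and low-confidence results are handled explicitly rather than silently mixed into the evidence.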
For leadership and regulators, this distinction is essential. A result without explicit confidence context isn't evidence. It's just output.
A Clear Scoping Example: When Bounding Boxes Are Enough—and When They're Not
When you're evaluating behavior planner logic in isolation (component testing), high-fidelity perception modeling isn't required. In these cases:
- Perfect state information is acceptable
- Bounding-box representations are sufficient
- Sensors, weather, and noise models can be omitted
This isn't cutting corners. It's a scoping decision.
But those results are only valid for the behavior layer. They cannot be extrapolated to claims about perception robustness or end-to-end safety. Clear scoping increases credibility.
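Scoping decisions like this are easiest to defend when they are recorded explicitly alongside the results. A sketch of such a record; the field names and claim strings are hypothetical illustrations, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimulationScope:
    """Explicit record of what a simulation setup is valid evidence for."""
    perception: str            # how perception is represented
    environment: str           # what is deliberately omitted
    valid_claims: tuple        # claims this setup can support
    excluded_claims: tuple     # claims it must not be used for

planner_test = SimulationScope(
    perception="ground-truth bounding boxes",
    environment="no weather or sensor-noise models",
    valid_claims=("behavior-planner rule compliance",),
    excluded_claims=("perception robustness", "end-to-end safety"),
)
```

Attaching such a record to every batch of results makes the scoping decision auditable instead of implicit.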
From Exploration to Defensible Evidence
Without explicit model credibility checks, simulation remains an exploration tool. Valuable for development, but insufficient as a primary safety argument.
With credibility, confidence, and clear scoping, simulation becomes something else: defensible evidence, suitable for safety cases, audits, and regulatory review.
For leadership, this distinction determines whether simulation accelerates certification—or introduces unquantified risk late in the process. I've seen both outcomes, and the difference almost always comes down to whether the team treated credibility as an afterthought or a first-class concern.
Simulation proposes hypotheses. Real-world data filters them.
Kaveh Rahnema
V&V Expert for ADAS & Autonomous Driving with 7+ years at Robert Bosch GmbH.
Connect on LinkedIn