When Simulation Results Become Evidence
Simulation has become the backbone of autonomous vehicle development. Millions of virtual kilometers are driven every day, generating charts, KPIs, and scenario results. But here's the problem: a simulator never tells you when its assumptions break down.
Every simulation produces an answer. The critical question isn't what the simulator outputs—it's whether that output can actually be trusted for the claim we're trying to make.
Too often, scenarios get run, reports get generated, and results get presented—without anyone pausing to ask: was the model even valid for this situation?
Without an explicit credibility check, simulation is an exploration tool. Useful for development, but insufficient as safety evidence.
The Question Nobody Wants to Ask
Before any simulation result gets used in validation or certification, we need to answer one fundamental question:
Is this model credible for this specific situation and purpose?
Not "is the model accurate?" and not "is it realistic?" Those questions are too vague. What matters is: does this model make sense for the claim we're trying to support?
What Credibility Actually Means
Model credibility isn't about claiming realism or completeness. It's about answering a more pragmatic question:
Does this simulated scenario make sense given what we know about real-world behavior?
You can assess credibility along several dimensions:
- Physical plausibility: Are motions physically feasible? (No teleporting cars, no 3g lateral accelerations)
- Behavioral plausibility: Do agents behave in ways consistent with traffic rules and human reaction times?
- Temporal consistency: Are there discontinuities, artifacts, or numerical instabilities?
- Statistical plausibility: Does this type of interaction occur in real driving data with non-zero frequency?
You don't need a perfect model. You need to be transparent about what you're modeling—and what you're ignoring.
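The first two dimensions lend themselves to automated checks over a trajectory. A minimal sketch in Python; the thresholds (a 5 m position-jump limit, a 1 g acceleration bound) are illustrative assumptions, not calibrated values:

```python
import math

G = 9.81  # m/s^2

def physically_plausible(positions, dt, max_acc=1.0 * G, max_jump=5.0):
    """Flag teleports (large position jumps) and infeasible accelerations.

    positions: list of (x, y) samples at a fixed timestep dt (seconds).
    Thresholds are illustrative; a real pipeline would calibrate them
    per vehicle class and scenario type.
    """
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        if math.hypot(x1 - x0, y1 - y0) > max_jump:
            return False  # discontinuity: a "teleporting car"
    # finite-difference velocities, then acceleration magnitudes
    vels = [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(positions, positions[1:])]
    for (vx0, vy0), (vx1, vy1) in zip(vels, vels[1:]):
        acc = math.hypot((vx1 - vx0) / dt, (vy1 - vy0) / dt)
        if acc > max_acc:
            return False  # exceeds assumed dynamic limits
    return True
```

Behavioral and statistical plausibility are harder to automate, but the same pattern applies: encode the expectation, check every trajectory against it, and log what fails.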
Model Validity Depends on the Question
Here's something that took me years to internalize: model validity is not absolute. It's always relative to what you're trying to prove.
In practice, we use simulation to answer very different questions:
- Does the planning logic respect traffic rules?
- Does the AV interact safely with other agents at an intersection?
- Is perception robust to weather, lighting, and sensor noise?
- Does the full system behave safely end-to-end?
Each question demands a different level of model fidelity. What's valid for testing the Planning component doesn't automatically tell you anything useful about perception robustness—and pretending it does is how validation gaps emerge.
Treating all simulation results as equally representative is one of the most common ways teams overestimate what they've actually validated. I've been guilty of this myself.
Simulation Models and Their Assumptions
Every simulation model encodes assumptions:
- How traffic participants behave
- How quickly they react
- How aggressively they accelerate or brake
- Which uncertainties are modeled—and which are ignored
A model can be wildly inaccurate in a global sense and still be valid within a well-defined envelope. The problem arises when results get extrapolated beyond that envelope without anyone acknowledging the underlying assumptions.
Abstraction itself isn't the risk. Unbounded abstraction is.
Grounding Simulation in Real-World Data
Real Data Defines the Envelope
Real-world driving data plays a critical role, but not in the way people usually assume.
Logged drives don't validate individual simulated scenarios. What they do is define the operating envelope where the model can reasonably be trusted.
By comparing simulated behavior against empirical distributions—speeds, accelerations, time gaps, interaction timing, sensor measurements—we can detect when a simulation starts drifting into regions that are poorly represented or entirely absent in reality.
This comparison doesn't need to be exact. What matters is identifying where the model is interpolating within known behavior—and where it's extrapolating beyond it.
Real-world data acts as an anchor, not a gold standard.
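One lightweight way to operationalize that envelope: per-feature percentile bands computed from logged drives, with anything outside flagged as extrapolation. The feature and distribution below are synthetic stand-ins; a real pipeline would use logged data and, ideally, multivariate density estimates rather than marginal bands:

```python
import numpy as np

def envelope_check(real_samples, sim_value, lo_pct=1.0, hi_pct=99.0):
    """Return True if sim_value lies inside the empirical envelope.

    real_samples: 1-D array of one feature (e.g. time gaps) from logged
    drives. A per-feature percentile band is a deliberately crude
    envelope; joint distributions catch more extrapolation.
    """
    lo, hi = np.percentile(real_samples, [lo_pct, hi_pct])
    return lo <= sim_value <= hi

# illustrative: time-gap samples (seconds), synthetic stand-in for logs
rng = np.random.default_rng(0)
time_gaps = rng.gamma(shape=4.0, scale=0.5, size=10_000)

envelope_check(time_gaps, 1.8)   # within the bulk: interpolation
envelope_check(time_gaps, 12.0)  # far in the tail: extrapolation
```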
Digital Twins as Validation Tools
Digital twins—scenarios replayed from logged drives—let you directly compare what your simulation predicts against what actually happened.
By running real-world scenarios through your models and comparing the outputs, you reveal where your simulation matches reality and where it drifts. This isn't about achieving perfect alignment. It's about understanding the bounds of your model's validity.
Once these bounds are established, digital twins become regression tests. When software changes are deployed, you can replay the same real-world scenarios to verify that fixes work without introducing new failures.
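A regression check of this kind can be as simple as a pointwise trajectory comparison. A sketch, where the 0.5 m tolerance is an illustrative assumption rather than a standard threshold:

```python
import math

def max_deviation(logged_xy, replayed_xy):
    """Largest pointwise distance between logged and replayed trajectories.

    Both inputs: equal-length lists of (x, y) positions at matching
    timestamps (time alignment is assumed to have happened upstream).
    """
    return max(math.hypot(lx - rx, ly - ry)
               for (lx, ly), (rx, ry) in zip(logged_xy, replayed_xy))

def replay_regression(logged_xy, replayed_xy, tol_m=0.5):
    """Pass if the replay stays within tol_m of the logged drive."""
    return max_deviation(logged_xy, replayed_xy) <= tol_m
```

In practice you would compare more than positions (headings, speeds, decision outcomes), but the structure stays the same: a fixed real-world reference, a tolerance, and a pass/fail verdict per scenario.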
Filtering Generated Scenarios
Once critical scenarios are identified in simulation, for example with Search-Based Testing, credibility needs to be checked again—this time against reality.
Simulated scenarios can be mapped into real-world feature space using relative speeds, distances, timing metrics, or interaction patterns, and then compared against logged drives.
This lets you distinguish between:
- Scenarios that are rare but plausible
- Scenarios that are common but previously untested
- Scenarios that are likely artifacts of modeling assumptions
Real-world data constrains simulation. Simulation explores what data never captured. Each checks the other.
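One way to sketch this triage: map each generated scenario into the same feature space as the logged drives and bucket it by nearest-neighbor distance. The thresholds and normalization below are illustrative assumptions, and distinguishing "common but previously untested" additionally requires test-coverage data, which this sketch omits:

```python
import numpy as np

def triage_scenario(sim_features, real_features, near=1.0, far=3.0):
    """Bucket a generated scenario by distance to logged-drive features.

    sim_features: 1-D feature vector (e.g. relative speed, gap, TTC),
    assumed already normalized. real_features: (n, d) array of the same
    features from logged drives. Thresholds are illustrative.
    """
    d = np.min(np.linalg.norm(real_features - sim_features, axis=1))
    if d <= near:
        return "in-distribution"     # well supported by real data
    if d <= far:
        return "rare-but-plausible"  # sparse support, worth keeping
    return "likely-artifact"         # no real-world support, inspect model
```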
Confidence: Knowing Where Results Can Be Trusted
There's a concept missing from most simulation pipelines: confidence.
Beyond asking "is this scenario critical or safe?" we need a second question:
How confident are we that the simulator is a valid representation in this situation?
Confidence can be informed by:
- Proximity to real-world data distributions
- Number and strength of modeling assumptions involved
- Abstraction level used
- Degree of extrapolation beyond calibrated regions
Low confidence doesn't mean a result is useless. It means it needs to be treated differently—flagged, contextualized, or excluded from certain safety claims.
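These factors can be folded into a simple heuristic score attached to every result. A sketch only; the weights are illustrative assumptions, and any real scheme would need calibration and review:

```python
def confidence_score(envelope_fraction, n_assumptions, extrapolation_dist):
    """Heuristic confidence in a simulation result, in [0, 1].

    envelope_fraction: share of checked features inside the real-data
    envelope. n_assumptions: count of modeling assumptions the scenario
    relies on. extrapolation_dist: normalized distance beyond the
    calibrated region (0 = inside). All weights are illustrative.
    """
    score = envelope_fraction
    score *= 0.9 ** n_assumptions              # each assumption discounts
    score *= 1.0 / (1.0 + extrapolation_dist)  # extrapolation discounts more
    return score
```

The exact formula matters less than the discipline: every result carries a confidence value, and low-confidence results are handled explicitly rather than silently mixed into the evidence.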
For leadership and regulators, this distinction is essential. A result without explicit confidence context isn't evidence. It's just output.
A Clear Scoping Example: When Bounding Boxes Are Enough—and When They're Not
When you're evaluating behavior planner logic in isolation (component testing), high-fidelity perception modeling isn't required. In these cases:
- Perfect state information is acceptable
- Bounding-box representations are sufficient
- Sensors, weather, and noise models can be omitted
This isn't cutting corners. It's a scoping decision.
But those results are only valid for the behavior layer. They cannot be extrapolated to claims about perception robustness or end-to-end safety. Clear scoping increases credibility.
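Scoping decisions like this are easiest to defend when they are recorded explicitly alongside the results. A sketch of such a record; the field names and claim strings are hypothetical illustrations, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimulationScope:
    """Explicit record of what a simulation setup is valid evidence for."""
    perception: str            # how perception is represented
    environment: str           # what is deliberately omitted
    valid_claims: tuple        # claims this setup can support
    excluded_claims: tuple     # claims it must not be used for

planner_test = SimulationScope(
    perception="ground-truth bounding boxes",
    environment="no weather or sensor-noise models",
    valid_claims=("behavior-planner rule compliance",),
    excluded_claims=("perception robustness", "end-to-end safety"),
)
```

Attaching such a record to every batch of results makes the scoping decision auditable instead of implicit.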
From Exploration to Defensible Evidence
Without explicit model credibility checks, simulation remains an exploration tool. Valuable for development, but insufficient as a primary safety argument.
With credibility, confidence, and clear scoping, simulation becomes something else: defensible evidence, suitable for safety cases, audits, and regulatory review.
For leadership, this distinction determines whether simulation accelerates certification—or introduces unquantified risk late in the process. I've seen both outcomes, and the difference almost always comes down to whether the team treated credibility as an afterthought or a first-class concern.
Simulation proposes hypotheses. Real-world data filters them.
Kaveh Rahnema
V&V Expert for ADAS & Autonomous Driving with 7+ years at Robert Bosch GmbH.
Connect on LinkedIn