Automated QA for Robot Trajectory Data: A Three-Layer Framework

The "record more demos" trap

When a robot learning policy fails to generalize, the default response across almost every lab and company is the same: record more demonstrations. The intuition is borrowed from the language model playbook -- more data, better model. But trajectory data for robotics is not web text. A single corrupted episode does not just add noise; it can teach a policy physically dangerous behaviors that only manifest when the robot is running on real hardware.

The scale problem is already here. Open X-Embodiment aggregated over one million trajectories across 22 robot embodiments. Bridge Data V2 contains 60,000+ demonstrations. The humanoid-everyday dataset from USC covers 260+ tasks with multimodal observations at 30Hz. RealSource-World ships over 14 million frames. At these scales, manual review is not a workflow -- it is a fantasy. Nobody is watching every episode.

Yet the tooling has not caught up. Most teams doing imitation learning or training vision-language-action (VLA) models have no automated quality gates on their trajectory data. Episodes flow from teleop sessions or simulation runs directly into training, with at best a spot-check by the person who collected them. The result is a familiar failure mode: you train for 48 hours on 8 GPUs, the policy fails on hardware, and you have no way of knowing whether the problem was the model, the data, or both.

This post describes a three-layer framework for automated trajectory data quality validation. It is format-agnostic -- it works whether your data is in LeRobot Parquet, HDF5, rosbag2, or any episodic structure. The layers are ordered by computational cost: cheap structural checks first, physics-based kinematic validation second, and expensive vision-language semantic checks last. The goal is to assign every episode a quality score and a classification -- Pass, Needs Review, or Reject -- before it ever enters a training dataset.

Why data quality matters more than data quantity

The empirical evidence is clear, even if the tooling lags behind. Lin et al. (2024) demonstrated in their work on data quality for robot learning (arXiv:2412.02987) that quality-filtered subsets of demonstration datasets consistently outperform the full unfiltered datasets when training manipulation policies. Removing even a small percentage of low-quality episodes can improve policy success rates by double-digit percentages, while adding more low-quality data often degrades performance.

The failure modes are specific to robotics and distinct from what you encounter in NLP or computer vision:

A single bad demonstration can teach dangerous behaviors. If a teleop episode includes an unintended collision and that collision is not filtered, the policy learns that the collision trajectory is valid. On hardware, this becomes a safety incident.
Action-observation misalignment corrupts the learning signal entirely. A 50-millisecond timestamp skew between the camera and the joint encoder means the policy is learning to associate visual states with actions that happened at a different time. The gradient signal is not just noisy -- it is actively wrong.
Inconsistent action spaces break batching silently. If episode 1 has a 7-dimensional action space (joint positions) and episode 2 has 8 dimensions (joint positions plus gripper), many dataloaders will either crash or silently pad/truncate. Neither outcome is good.
Unlabeled failures poison the success distribution. If a demonstration where the robot dropped the object is not labeled as a failure, the policy treats the drop trajectory as a valid way to complete the task.

Despite these well-known risks, there is no standard set of quality metrics for robotics trajectory data. The computer vision community has FID and IS for generative models, COCO mAP for detection. NLP has perplexity, BLEU, ROUGE. Robotics has... nothing standardized. Every lab defines quality ad hoc, if they define it at all.

The three-layer QA framework

We propose organizing trajectory data validation into three layers, ordered by computational cost and the specificity of the failures they catch. The key insight is that you should run cheap checks first and only spend compute on expensive semantic analysis for episodes that pass the lower layers.

Layer 1: Structural validation

Structural checks validate the format, completeness, and internal consistency of an episode without interpreting the data semantically. These checks are computationally trivial -- they run in milliseconds per episode and catch a surprising number of real problems.

Check	What it catches
FPS consistency	Are frames evenly spaced within tolerance? Large gaps indicate dropped frames or recording interruptions. Typical tolerance: +/- 10% of target interval.
Dropped frame ratio	What percentage of expected frames are missing? For a 30Hz stream over 10 seconds, you expect 300 frames. If you have 247, that is a 17.7% drop rate.
Action dimension consistency	Do all episodes in a dataset share the same action space dimensionality? Mixing 7-DoF and 23-DoF episodes in the same training set without explicit handling will produce garbage.
Timestamp monotonicity	Are timestamps strictly increasing? Non-monotonic timestamps indicate clock resets, recording bugs, or data corruption. This is a hard failure.
Modality completeness	Are all expected modalities present? If the dataset schema specifies RGB, depth, and proprioception, an episode missing depth should be flagged.
File integrity	Are video files decodable? Are Parquet files readable? Are HDF5 groups structurally valid? Truncated uploads and partial writes are more common than you think.

Structural checks should be hard gates. An episode with non-monotonic timestamps or unreadable video files should never reach training. There is no amount of model capacity that compensates for corrupted input data.
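As a minimal sketch of the timing checks in the table above, the following function computes timestamp monotonicity, FPS consistency, and dropped frame ratio for one episode. The function name `structural_checks`, its return keys, and the way the expected frame count is derived from the episode duration are illustrative assumptions, not part of any particular dataset format or library.

```python
def structural_checks(timestamps, target_hz, tolerance=0.10):
    """Layer 1 timing signals for one episode (hypothetical helper).

    timestamps: per-frame times in seconds.
    tolerance: allowed fractional deviation from the target interval
               (0.10 mirrors the +/- 10% figure in the table).
    """
    if len(timestamps) < 2:
        return {"monotonic": False, "fps_consistent": False, "drop_ratio": 1.0}

    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]

    # Strictly increasing timestamps: every interval must be positive.
    monotonic = all(dt > 0 for dt in intervals)

    # FPS consistency: no interval may deviate from the target by more
    # than the tolerance.
    target_dt = 1.0 / target_hz
    fps_consistent = all(
        abs(dt - target_dt) <= tolerance * target_dt for dt in intervals
    )

    # Dropped frame ratio: assumption -- the expected count is inferred
    # from the time between the first and last frame.
    duration = timestamps[-1] - timestamps[0]
    expected = max(1, round(duration * target_hz) + 1)
    drop_ratio = max(0.0, 1.0 - len(timestamps) / expected)

    return {
        "monotonic": monotonic,
        "fps_consistent": fps_consistent,
        "drop_ratio": drop_ratio,
    }
```

For the table's example of a 30Hz stream that should hold 300 frames but holds 247, this sketch reports a drop ratio of 53/300, or about 17.7%, provided the missing frames fall in the middle of the recording rather than at its ends.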

Layer 2: Kinematic validation

Kinematic checks interpret the trajectory data through the lens of the robot's physical capabilities. They require knowing the robot's URDF or at minimum its joint limits, velocity limits, and workspace bounds. The computational cost is moderate -- typically a few hundred milliseconds per episode for forward kinematics and limit checks.

Check	What it catches
Joint limit violations	Are any joint positions outside the robot's physical range? Values beyond limits indicate sensor noise, unit mismatches (radians vs. degrees), or data corruption.
Velocity sanity	Do joint velocities exceed what the hardware can physically achieve? A velocity of 50 rad/s on a joint rated for 3 rad/s is not a fast movement -- it is bad data.
Acceleration spikes	Sudden jumps in acceleration indicate teleop glitches, network dropouts during remote operation, or sensor noise. These produce trajectories the robot cannot physically execute smoothly.
End-effector workspace bounds	Is the computed end-effector position physically reachable given the robot's kinematics? Out-of-workspace positions indicate IK errors or frame-of-reference mismatches.
Self-collision detection	Does the commanded joint configuration cause any links to intersect? This requires collision geometry from the URDF but catches physically impossible states that a policy should never learn.
Gripper state consistency	Does the gripper open/close pattern correlate with the task? A pick-and-place episode where the gripper never closes is either mislabeled or a failed demonstration.

Kinematic violations come in two categories. Hard violations -- joint positions 20% beyond limits, self-collisions -- are rejects. Soft violations -- a brief velocity spike that the trajectory recovers from -- are warnings that flag the episode for human review. The distinction matters because teleop data is inherently noisier than simulation data, and overly aggressive filtering discards usable demonstrations.
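The hard/soft split can be sketched as follows. This is an illustration under several assumptions: `kinematic_checks` and `hard_margin` are hypothetical names, "20% beyond limits" is interpreted as 20% of the joint's range, velocities are estimated by finite differences over joint positions, and every velocity excess is treated as a soft violation. A real pipeline would read the limits from the URDF and would likely add acceleration, workspace, and self-collision checks.

```python
def kinematic_checks(positions, dt, pos_limits, vel_limits, hard_margin=0.20):
    """Classify one episode as "pass", "warn", or "fail" (hypothetical helper).

    positions:  list of frames, each a list of joint positions in radians.
    dt:         time between frames in seconds.
    pos_limits: list of (lower, upper) per joint.
    vel_limits: list of maximum absolute velocity per joint in rad/s.
    """
    hard, soft = [], []

    # Position limits: far outside the range is a hard violation,
    # slightly outside is a soft one.
    for j, (lo, hi) in enumerate(pos_limits):
        margin = hard_margin * (hi - lo)  # assumption: 20% of joint range
        for t, frame in enumerate(positions):
            q = frame[j]
            if q < lo - margin or q > hi + margin:
                hard.append((t, j, "position far outside limits"))
            elif q < lo or q > hi:
                soft.append((t, j, "position slightly outside limits"))

    # Velocity sanity via finite differences between consecutive frames.
    for t in range(1, len(positions)):
        for j, vmax in enumerate(vel_limits):
            v = (positions[t][j] - positions[t - 1][j]) / dt
            if abs(v) > vmax:
                soft.append((t, j, "velocity above limit"))

    verdict = "fail" if hard else "warn" if soft else "pass"
    return verdict, hard, soft
```

Keeping the violation lists alongside the verdict lets a reviewer jump straight to the offending frame and joint instead of scrubbing through the whole episode.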

Layer 3: Semantic validation

Semantic checks use vision and language models to validate whether the episode content matches the task description and whether the demonstration represents meaningful, successful behavior. These are the most expensive checks -- they require running inference on video frames -- and should only run on episodes that pass Layers 1 and 2.

Check	What it catches
Object presence	Is the target object described in the task instruction actually visible in the scene? A "pick up the red cup" episode where no red cup appears is mislabeled or miscaptured.
Task progress	Does the episode show meaningful progress toward the stated goal? Comparing the first and last frames for task-relevant state changes catches demonstrations that never actually attempted the task.
Hand-object interaction	Does the end-effector or hand actually make contact with the target object? Detecting grasp events, contact frames, and release events validates that the demonstration includes the critical manipulation phase.
Scene consistency	Does the environment match what the task description implies? A "kitchen pick-and-place" demonstration recorded in a lab with no kitchen objects is a metadata error.
Anomaly detection	Is this episode an outlier compared to other successful demonstrations of the same task? Embedding the episode trajectory and comparing against the cluster of known-good demonstrations catches subtle failures that no individual check would flag.

Semantic validation is where the framework leverages recent advances in vision and vision-language models. Encoders like SigLIP (vision-language) and DINOv2 (vision-only) provide frame-level embeddings that capture scene semantics without task-specific fine-tuning. For task progress and object interaction checks, a VLM can be prompted with the task description and sampled frames to produce a confidence score. This is not perfect -- VLMs hallucinate, and confidence calibration is an open problem -- but it is dramatically better than no check at all.
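The anomaly detection check from the table can be sketched with plain vector arithmetic, assuming each episode has already been reduced to a single embedding vector (for example by averaging frame embeddings from an encoder). The embedding step is not shown. The function names and the `min_similarity` value of 0.8 are hypothetical, and comparing against the centroid of known-good episodes is one simple choice among several.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def is_outlier(episode_embedding, good_embeddings, min_similarity=0.8):
    """Flag an episode whose embedding is far from the known-good cluster.

    Assumption: similarity to the centroid is an adequate proxy for
    cluster membership; the threshold must be tuned per task.
    """
    return cosine(episode_embedding, centroid(good_embeddings)) < min_similarity
```

A centroid comparison is cheap but assumes successful demonstrations form a single compact cluster. Tasks with several valid strategies may need a nearest-neighbor distance instead.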

Scoring and episode classification

Each check in the framework produces one of three signals: Pass, Warn, or Fail. The challenge is combining dozens of individual signals into a single episode-level decision. A simple weighted scoring scheme works well in practice:

Hard failures are immediate rejects. Non-monotonic timestamps, unreadable files, joint positions wildly outside limits, non-decodable video. No scoring nuance needed.
Warnings accumulate. Each warning carries a severity weight. A single velocity spike at one timestep is a minor warning. Velocity spikes across 15% of the trajectory are a pattern that should trigger review.
Semantic scores are confidence-weighted. If the VLM reports 0.92 confidence that the task was completed, that is a pass. If it reports 0.51, that is a review flag. Below 0.3, it is a reject.

The final episode classification is one of three outcomes:

Pass -- All structural checks pass. No kinematic hard violations. Semantic confidence above threshold. This episode is training-ready.
Needs Review -- No hard failures, but one or more warnings that a human should inspect. Common case: a brief kinematic anomaly that may be a teleop artifact or may indicate a real problem.
Reject -- One or more hard failures, or semantic confidence below the reject threshold. This episode should not enter any training dataset without manual correction.

Here is what this looks like for a concrete episode:

Example: Episode 47 of a pick-and-place dataset

Layer 1 (Structural): Pass -- 30.0 Hz stable, 0 dropped frames, action dim = 7, timestamps monotonic, all modalities present, files intact.

Layer 2 (Kinematic): Warn -- Velocity spike on joint 3 at t=2.31s (12.4 rad/s, limit 8.0 rad/s, duration 2 frames). All other checks pass. No self-collisions. EE within workspace.

Layer 3 (Semantic): Pass -- Target object (red block) detected in 94% of frames. Task progress confirmed: block position changed from table to bin. Grasp event detected at t=1.8s, release at t=4.2s.

Final verdict: Needs Review -- Structural and semantic pass, but kinematic warning on velocity spike requires human inspection. Possible cause: teleoperator made a sudden corrective movement.
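The decision rules above can be sketched as a small function. The reject threshold of 0.3 comes from the text. The pass threshold of 0.8, the severity weight of 0.2, the semantic confidence of 0.90 used for Episode 47, and the name `classify_episode` are assumptions made for illustration.

```python
def classify_episode(layer1_pass, hard_violations, warning_weights,
                     semantic_conf, pass_conf=0.8, reject_conf=0.3):
    """Combine layer results into (verdict, warning_score).

    layer1_pass:     True if every structural check passed.
    hard_violations: number of kinematic hard violations.
    warning_weights: one severity weight per warning raised.
    semantic_conf:   VLM confidence that the task was completed.
    """
    warning_score = sum(warning_weights)

    # Hard failures and very low semantic confidence reject immediately.
    if not layer1_pass or hard_violations > 0 or semantic_conf < reject_conf:
        return "Reject", warning_score

    # Any warning, or middling semantic confidence, goes to a human.
    if warning_weights or semantic_conf < pass_conf:
        return "Needs Review", warning_score

    return "Pass", warning_score


# Episode 47: structural pass, no hard violations, one velocity-spike
# warning (hypothetical weight 0.2), semantic pass (hypothetical 0.90).
verdict, score = classify_episode(True, 0, [0.2], 0.90)
```

With these inputs the sketch returns "Needs Review", matching the verdict above: a single warning is enough to route the episode to a human even though the other two layers pass.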

What bad data actually looks like

Abstract frameworks are useful, but concrete examples of failure modes are what actually calibrate your intuition. Here are four real categories of bad trajectory data, how they manifest, and which layer of the framework catches them.

Teleop recovery artifacts

A teleoperator bumps the robot or makes an unintended movement, then corrects. The episode contains the intended task behavior plus a spike-and-recovery motion that is not part of the desired behavior. The problem: if the policy learns this trajectory, it learns that the spike motion is a valid part of the task. Layer 2 catches this via acceleration spike detection. The velocity profile shows a clear discontinuity -- the spike followed by an opposite-direction correction -- that is distinguishable from smooth intentional motion.
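One way to look for the spike-then-correction signature is to search for a large acceleration that is followed, within a few frames, by a large acceleration of the opposite sign. The sketch below does this for a single joint. The function name, the `window` of 5 frames, and the use of finite differences on the velocity signal are assumptions, and `accel_limit` would need tuning per robot.

```python
def find_spike_recoveries(velocities, dt, accel_limit, window=5):
    """Return indices where a spike-and-recovery pattern starts.

    velocities:  per-frame velocity of one joint in rad/s.
    dt:          time between frames in seconds.
    accel_limit: acceleration magnitude (rad/s^2) treated as a spike.
    window:      how many frames ahead to look for the correction.
    """
    # Finite-difference acceleration between consecutive frames.
    accel = [(b - a) / dt for a, b in zip(velocities, velocities[1:])]

    events = []
    for i, a in enumerate(accel):
        if abs(a) <= accel_limit:
            continue
        # Look ahead for a large acceleration in the opposite direction.
        for k in range(i + 1, min(i + 1 + window, len(accel))):
            if abs(accel[k]) > accel_limit and accel[k] * a < 0:
                events.append(i)
                break
    return events
```

Requiring the opposite-sign correction is what separates a glitch from a fast but intentional motion, which accelerates and then decelerates over a longer span.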

Calibration drift

In multi-sensor setups (common in humanoid and mobile manipulation), the extrinsic calibration between cameras and the robot base can drift over a recording session. The result: depth maps that no longer align with RGB frames, or hand tracking data that is offset from where the hand actually is in the image. Layer 1 catches gross calibration errors if they result in invalid data ranges, but subtle drift requires Layer 3 -- checking whether the detected object positions in the image are consistent with the reported end-effector positions. This is one of the hardest failure modes to catch automatically and a strong argument for re-calibrating at the start of every recording session.

Unlabeled failures

The teleoperator attempted a grasp, the object slipped, and the demonstration was not flagged as a failure. This happens constantly in large-scale data collection campaigns where operators are recording hundreds of episodes per session. Layer 3 catches this by checking task progress: if the task is "pick and place the block in the bin" but the block's position is unchanged between the first and last frames, the episode is flagged. Gripper state analysis (Layer 2) can also detect this -- the gripper closed and opened without a sustained grasp force, suggesting the object was not successfully held.
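The gripper-side check can be sketched as a search for the longest run of frames in which the gripper is closed and reports force. This assumes the dataset records a per-frame closed flag and a grasp force reading, which not every gripper provides. The function name and the `min_force` and `min_hold_s` defaults are hypothetical.

```python
def has_sustained_grasp(closed, force, dt, min_force=1.0, min_hold_s=0.5):
    """True if the gripper held something for at least min_hold_s seconds.

    closed: per-frame booleans, True when the gripper is commanded closed.
    force:  per-frame grasp force readings (units are robot-specific).
    dt:     time between frames in seconds.
    """
    longest = run = 0
    for is_closed, f in zip(closed, force):
        # A frame counts as "holding" only if closed AND under load.
        run = run + 1 if (is_closed and f >= min_force) else 0
        longest = max(longest, run)
    return longest * dt >= min_hold_s
```

An episode labeled as a successful pick-and-place for which this returns False is a candidate for the Needs Review queue rather than an automatic reject, since force sensors can also fail.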

Timestamp clock skew

Different sensors run on different clocks or different machines. The camera timestamps are on the host machine clock, the joint encoders report timestamps from the robot controller, and nobody synchronized them. The result: when the training pipeline aligns observations and actions by timestamp, there is a systematic offset. The policy learns to associate each visual observation with an action that happened 80ms earlier or later. Layer 1 catches this if the skew is large enough to violate monotonicity or create impossible inter-frame intervals. For subtler skew, cross-modal consistency checks are needed: correlating visual motion with reported joint velocities to detect systematic lag.
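The cross-modal check can be sketched as a cross-correlation: shift one signal against the other and keep the shift with the highest correlation. The sketch assumes two equal-rate, per-frame scalar signals are already available, for example mean optical-flow magnitude and summed joint speed. How those signals are computed is not shown, and the function name and `max_lag` default are hypothetical.

```python
def estimate_lag_frames(visual_motion, joint_speed, max_lag=10):
    """Estimate how many frames the visual signal trails the joint signal.

    A positive result means visual_motion[i + lag] best matches
    joint_speed[i], i.e. the camera stream is late by `lag` frames.
    Assumes both lists are longer than max_lag.
    """
    def mean_center(x):
        m = sum(x) / len(x)
        return [v - m for v in x]

    a, b = mean_center(visual_motion), mean_center(joint_speed)

    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Pair a[i + lag] with b[i] wherever both indices are valid.
        pairs = [(a[i + lag], b[i])
                 for i in range(len(b)) if 0 <= i + lag < len(a)]
        score = sum(x * y for x, y in pairs) / len(pairs)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Multiplying the result by the frame interval gives the skew in seconds: at 30Hz, a lag of 3 frames corresponds to 100 ms. A consistent non-zero lag across many episodes from the same rig points to unsynchronized clocks rather than operator behavior.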

The cost of not doing QA

If the preceding sections describe what bad data looks like, this section quantifies what it costs. These are the failure modes that compound when quality is not gated at ingest.

Wasted compute. Training a diffusion policy or a VLA on a dataset with 5-10% corrupted episodes does not just degrade performance by 5-10%. The corrupted episodes introduce conflicting gradients that slow convergence across the entire training run. A 48-hour training job on 8 A100s is 384 GPU-hours, which at on-demand cloud rates of a few dollars per GPU-hour costs on the order of $1,500. If 10% of the dataset is bad data that should have been filtered, the effective cost of that bad data is not 10% of the bill -- it is the full cost of the failed run plus the cost of the subsequent debugging.

Hardware failures from impossible trajectories. Policies trained on data that includes physically impossible joint configurations -- positions beyond limits, velocities exceeding motor capabilities -- will occasionally command those configurations on the real robot. At best, the robot's safety controller catches it and triggers an e-stop. At worst, on systems with permissive safety bounds, it causes mechanical damage. Kinematic validation at the data layer is cheaper than replacing a $40,000 robot arm.

Irreproducible experiments. Without dataset versioning and quality tracking, you cannot answer the question "what data did training run #47 use?" When a policy works on version 3 of a dataset but fails on version 4, and version 4 added 500 new episodes with no quality gate, the debugging process is pure guesswork. Was it the new data? Which episodes? The answer requires quality metadata that was never recorded.

The manual review wall. Some teams do perform manual review. It works at 100 episodes. It barely works at 1,000. At 10,000 episodes, it is impossible to maintain consistent review quality. The reviewer fatigues. Standards drift. Problems that would have been caught at episode 50 get waved through at episode 5,000. Automated QA does not fatigue, does not drift, and scales linearly with dataset size.

Compounding data debt. Every unvalidated episode that enters your data lake is a liability. When you eventually discover that a class of episodes is bad -- say, all episodes from a particular teleop session had miscalibrated depth -- you need to retroactively identify and remove them. Without per-episode quality metadata, this requires re-running validation on the entire dataset. With quality metadata recorded at ingest, it is a filter query.

From framework to practice

The framework described here is not theoretical. Every layer maps to concrete, implementable checks. Structural validation is straightforward engineering -- parsing timestamps, checking file headers, validating schemas. Kinematic validation requires a URDF and standard robotics libraries for forward kinematics and collision checking. Semantic validation requires inference with vision and vision-language models, which is increasingly cheap and fast with encoders like SigLIP and DINOv2 and with open-weight VLMs.

The key architectural decision is when to run validation. There are two viable approaches:

At ingest (recommended). Every episode is validated when it enters the data store. Quality metadata is recorded alongside the episode data. Training pipelines can filter by quality score at query time. This front-loads the cost but means your data lake is always annotated.
At materialization. Validation runs when a training dataset is assembled from the data lake. This defers the cost but means you do not know the quality of your raw data until someone tries to use it.

We strongly recommend ingest-time validation with a strict mode option. In strict mode, episodes that fail Layer 1 are rejected outright -- they never enter the data store. In permissive mode, all episodes are stored but their quality scores are recorded, allowing downstream consumers to set their own thresholds. Strict mode is appropriate for production training pipelines. Permissive mode is useful during initial data collection campaigns when you want to understand the quality distribution before setting thresholds.
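The difference between the two modes can be sketched with an in-memory dictionary standing in for the data store. The function names, the metadata fields, and the single `quality_score` value are hypothetical. A real store would persist this metadata next to the episode data.

```python
def ingest(episode_id, layer1_pass, quality_score, store, strict=True):
    """Record an episode's quality metadata; return True if it was stored.

    strict=True:  episodes failing Layer 1 never enter the store.
    strict=False: everything is stored, with its quality recorded.
    """
    if strict and not layer1_pass:
        return False
    store[episode_id] = {
        "layer1_pass": layer1_pass,
        "quality_score": quality_score,
    }
    return True


def select_for_training(store, min_score):
    """Query-time filter: structurally valid episodes above a score."""
    return sorted(
        eid for eid, meta in store.items()
        if meta["layer1_pass"] and meta["quality_score"] >= min_score
    )
```

Because the metadata is written once at ingest, tightening or loosening `min_score` later is a query over stored values and does not require re-running validation.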

The scoring output should be CI-gatable: a data ingestion pipeline should be able to fail a build if more than N% of new episodes score below a threshold. This brings the same quality discipline to robotics data that unit tests bring to software.
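A CI gate of this kind reduces to a percentage check over the scores of newly ingested episodes. In the sketch below, the function names and the default values for `min_score` and `max_fail_pct` are assumptions. A CI job would pass the returned code to `sys.exit` so that a non-zero value fails the build.

```python
def gate(scores, min_score=0.7, max_fail_pct=5.0):
    """Return (ok, pct_below) for a batch of episode quality scores.

    ok is False when more than max_fail_pct percent of the episodes
    score below min_score. An empty batch passes trivially.
    """
    if not scores:
        return True, 0.0
    below = sum(1 for s in scores if s < min_score)
    pct_below = 100.0 * below / len(scores)
    return pct_below <= max_fail_pct, pct_below


def exit_code(scores, min_score=0.7, max_fail_pct=5.0):
    """Process exit code for a CI step: 0 passes, 1 fails the build."""
    ok, _ = gate(scores, min_score, max_fail_pct)
    return 0 if ok else 1
```

Gating on the share of low-scoring episodes rather than on any single failure keeps the build from breaking on one bad teleop take while still catching a session that went wrong as a whole.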

Traceplane runs this framework automatically

Every episode ingested into Traceplane is scored across all three validation layers. Structural checks, kinematic validation against your robot's URDF, and semantic analysis -- computed at ingest, queryable at training time. Set quality thresholds per dataset. Gate your training pipeline on data quality, not just data volume.
