The assumption that costs you GPU hours

If you are training robot policies with imitation learning, you are almost certainly using open-source datasets. Bridge V2, Open X-Embodiment, ALOHA, LeRobot community datasets -- these are the foundations that labs and startups build on. The implicit assumption is that these datasets are clean, well-structured, and ready for training. After all, they come from top research groups. They have papers. They have thousands of GitHub stars.

We decided to test that assumption. We ran automated quality checks across 10 popular open-source robotics datasets, examining structural integrity, metadata correctness, and semantic completeness. The checks are not exotic -- they look for things like consistent schemas, valid camera intrinsics, correct action space descriptions, and decodable video frames. The kind of validation you would expect any dataset to pass before release.

Every single dataset had issues. Some had minor documentation gaps. Others had problems that would silently corrupt your training pipeline -- wrong action space descriptions, black video frames, missing camera views, floating-point bugs that cause frame decode failures. These are not hypothetical risks. They are open GitHub issues, many with dozens of thumbs-up reactions from researchers who hit them during real training runs.

What we checked

Our audit examined three quality dimensions across each dataset: structural integrity, metadata correctness, and semantic completeness.

We cross-referenced our findings with public GitHub issues to confirm that these are known, reproducible problems -- not artifacts of our setup. Every issue cited below links to a real GitHub issue filed by a real user.

The findings

1. Bridge V2 (UC Berkeley)

Bridge V2 is one of the most cited manipulation datasets in the field, used as a foundation for RT-2, Octo, and numerous other projects. It claims 60,000+ trajectories across 24 environments. The reality is more complicated.

2. Open X-Embodiment (Google DeepMind)

Open X-Embodiment is the largest multi-embodiment robotics dataset, aggregating data from over 20 institutions across 22 robot types. Its scale is impressive. Its quality control is not.

3. RealSource-World

RealSource-World is a large-scale humanoid manipulation dataset with over 14 million frames. Its scale is a selling point, but early versions had significant issues that persisted across multiple releases.

4. USC-GVL/humanoid-everyday

The humanoid-everyday dataset from USC covers 260+ tasks with multimodal observations at 30Hz, targeting humanoid manipulation research. It is one of the more comprehensive humanoid datasets available, but usability issues limit its practical value.

5. Apple EgoDex

Apple's EgoDex dataset provides 829 hours of egocentric video with 3D hand pose annotations, targeting pretraining for hand-object interaction understanding. The scale and source are compelling, but the annotations have structural problems.

6. ALOHA (Stanford)

ALOHA introduced a compelling bimanual teleoperation platform and the ACT policy architecture. The hardware design and learning approach were influential. The data release, less so.

7. robomimic

robomimic is both a dataset and a benchmarking framework for offline learning from demonstrations. It is widely used as a baseline for manipulation policy research. The framework itself reveals some uncomfortable truths about data quality in the field.

8. LeRobot ecosystem (cross-cutting)

LeRobot from Hugging Face has become the de facto standard format for sharing robotics datasets. The tooling and format are genuinely useful. But the ecosystem has its own class of issues that affect anyone building on top of it.

The pattern

Across all 10 datasets, the same five failure modes recur. These are not isolated incidents -- they are systemic patterns in how the robotics community produces and shares data.

  1. Metadata lies. Action space descriptions that do not match the actual data format. Camera intrinsics that are empty arrays. Episode counts that are off by a factor of two. The metadata is the first thing your dataloader reads, and it is wrong more often than it is right.
  2. Silent schema changes between versions. Proprioceptive dimensions change from 57 to 71. Image statistics go from constant values to computed values. Version conversion introduces boundary bugs. None of these changes are communicated in a way that prevents downstream breakage.
  3. Timestamp and synchronization issues. Modality streams that are misaligned by tens of milliseconds. Floating-point drift that accumulates over episodes. Tolerance parameters that need to be set 100x higher than defaults. Time is the fundamental axis of trajectory data, and it is consistently the least validated.
  4. No quality validation at any stage of the pipeline. None of the 10 datasets include per-episode quality scores. None have automated validation gates at ingest. None document a QA process. The data flows from collection to release with, at best, spot-check review.
  5. Missing data that should be there. Camera views documented but not present. Sub-datasets listed but not downloadable. Training data referenced in papers but never released. Calibration data requested by dozens of users but never provided. The dataset card promises more than the dataset delivers.
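
The first failure mode, metadata that disagrees with the data, is the easiest to test for. The sketch below is a minimal illustration, not the audit code used for this post. The metadata keys (num_episodes, action_dim, camera_intrinsics) and the episode layout (a dict with an "actions" list) are assumptions made for the example; real datasets name and store these fields differently.

```python
from typing import Any


def check_metadata(meta: dict[str, Any], episodes: list[dict[str, Any]]) -> list[str]:
    """Compare declared metadata against the data it describes.

    The key names used here are hypothetical; adapt them to your dataset.
    """
    issues = []

    # Declared episode count vs. episodes actually present.
    declared = meta.get("num_episodes")
    if declared != len(episodes):
        issues.append(f"episode count: declared {declared}, found {len(episodes)}")

    # Camera intrinsics should be a non-empty 3x3 matrix.
    intrinsics = meta.get("camera_intrinsics")
    if not intrinsics:
        issues.append("camera intrinsics missing or empty")
    elif len(intrinsics) != 3 or any(len(row) != 3 for row in intrinsics):
        issues.append("camera intrinsics are not a 3x3 matrix")

    # Every action vector should match the declared action dimension.
    action_dim = meta.get("action_dim")
    for i, ep in enumerate(episodes):
        widths = {len(a) for a in ep["actions"]}
        if widths != {action_dim}:
            issues.append(
                f"episode {i}: action widths {sorted(widths)}, declared {action_dim}"
            )
    return issues


if __name__ == "__main__":
    meta = {"num_episodes": 2, "action_dim": 7, "camera_intrinsics": []}
    episodes = [{"actions": [[0.0] * 7, [0.1] * 7]}]
    for issue in check_metadata(meta, episodes):
        print(issue)
```

A check like this only confirms that the metadata is internally consistent with the stored arrays. It cannot tell you whether a 7-dimensional action is a joint position or an end-effector delta, which is why the action space description still needs to be verified against the collection code or the authors.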

What this means for your training pipeline

The practical consequence of these findings is straightforward: if you are loading any of these datasets and feeding them directly into a training loop, you are training on data with known defects. The question is not whether these issues affect your policy -- it is how much, and whether you will be able to diagnose the problem when your policy fails.

Consider the robomimic finding that validation loss is a poor predictor of policy performance. This means the standard feedback loop -- train, check validation loss, deploy best checkpoint -- is broken for robot learning. Now layer on the fact that your training data itself may have wrong action space descriptions, misaligned timestamps, or corrupted episode boundaries. You are optimizing a metric that does not predict performance, on data that may not represent what it claims to represent. When the policy fails on the robot, the root cause could be the model architecture, the training hyperparameters, the data quality, or some combination. Without data validation, you have eliminated none of the possibilities.

The "garbage in, garbage out" principle is not new. What is new is the scale at which robotics is now operating. When your dataset has 50 demonstrations, you can manually review each one. When it has 60,000 -- or 14 million frames -- manual review is not a methodology. It is a hope. And hope does not scale.

A path forward

The good news is that most of these issues are detectable automatically. Schema consistency, timestamp alignment, action space validation, camera intrinsics checks, episode boundary verification, zero-variance detection -- none of these require human judgment. They require running a validation pass over the data before it enters your training pipeline.
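
Two of the checks named above, timestamp alignment and zero-variance detection, can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions: timestamps are in seconds, the nominal frame rate is known, and the default tolerance of 1e-4 seconds is a placeholder chosen for the example rather than a recommended value. The function names are hypothetical.

```python
import statistics


def check_timestamps(timestamps: list[float], fps: float, tol_s: float = 1e-4) -> list[int]:
    """Return indices of frames whose spacing deviates from 1/fps by more than tol_s."""
    expected = 1.0 / fps
    bad = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        # A non-positive dt means timestamps are not strictly increasing.
        if dt <= 0 or abs(dt - expected) > tol_s:
            bad.append(i)
    return bad


def zero_variance_dims(rows: list[list[float]], eps: float = 1e-12) -> list[int]:
    """Return indices of dimensions that never change across an episode."""
    if len(rows) < 2:
        return []
    columns = zip(*rows)  # transpose: one tuple per dimension
    return [d for d, col in enumerate(columns) if statistics.pvariance(col) <= eps]
```

For example, check_timestamps(ts, fps=30) flags any frame that arrives late, early, or out of order relative to a 30 Hz stream, and zero_variance_dims(actions) flags action or state dimensions that are constant for the whole episode, which often indicates a disconnected sensor or a padded field. Comparing each interval against the nominal period, rather than comparing absolute timestamps against i / fps, keeps the check from accumulating floating-point drift over long episodes.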

The robotics community needs the equivalent of what every other data-intensive field already has: automated quality gates. Data warehouses have schema validation and constraint checking. ML feature stores have distribution monitoring and drift detection. Software has CI/CD with test suites. Robotics trajectory data has... nothing. The data goes from collection to training with no automated checks in between.

This is a solvable problem. The checks are well-defined. The failure modes are known. What is missing is the tooling that runs these checks automatically, at ingest time, and attaches quality metadata to every episode so that downstream consumers can make informed decisions about what to train on.
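
To make the idea of per-episode quality metadata concrete, here is a generic sketch of a quality gate. It is not any particular product's API: the EpisodeReport class, the gate function, and the scoring rule (fraction of checks passed) are all assumptions made for illustration. A real system would likely weight checks differently and treat some failures as disqualifying.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeReport:
    """Hypothetical container for the results of validation checks on one episode."""

    episode_id: str
    checks: dict[str, bool] = field(default_factory=dict)

    @property
    def score(self) -> float:
        """Fraction of checks passed; an episode with no checks scores 0.0."""
        if not self.checks:
            return 0.0
        return sum(self.checks.values()) / len(self.checks)


def gate(reports: list[EpisodeReport], threshold: float) -> list[str]:
    """Return the IDs of episodes whose score meets the threshold."""
    return [r.episode_id for r in reports if r.score >= threshold]
```

With reports for three episodes, gate(reports, 0.75) would return only those that passed at least three quarters of their checks. Scoring an unchecked episode as 0.0 is a deliberate choice in this sketch: data that was never validated should not pass a gate by default.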

At Traceplane, we are building automated QA into the ingest pipeline so these issues get caught before they reach your training loop. Every episode is validated for structural integrity, metadata correctness, and cross-modal consistency at the moment it enters the system. Quality scores are queryable, filterable, and gatable -- you set the threshold, and only episodes that pass make it into your training dataset. If you are working with trajectory data and want to try it, request early access.

