The assumption that costs you GPU hours
If you are training robot policies with imitation learning, you are almost certainly using open-source datasets. Bridge V2, Open X-Embodiment, ALOHA, LeRobot community datasets -- these are the foundations that labs and startups build on. The implicit assumption is that these datasets are clean, well-structured, and ready for training. After all, they come from top research groups. They have papers. They have thousands of GitHub stars.
We decided to test that assumption. We ran automated quality checks across 10 popular open-source robotics datasets, examining structural integrity, metadata correctness, and semantic completeness. The checks are not exotic -- they look for things like consistent schemas, valid camera intrinsics, correct action space descriptions, and decodable video frames. The kind of validation you would expect any dataset to pass before release.
Every single dataset had issues. Some had minor documentation gaps. Others had problems that would silently corrupt your training pipeline -- wrong action space descriptions, black video frames, missing camera views, floating-point bugs that cause frame decode failures. These are not hypothetical risks. They are open GitHub issues, many with dozens of thumbs-up reactions from researchers who hit them during real training runs.
What we checked
Our audit examined three quality dimensions across each dataset:
- Structural integrity. Missing or malformed fields, schema consistency across episodes, timestamp alignment between modalities, file decodability (can you actually read every Parquet file and decode every video frame?).
- Metadata correctness. Action space descriptions that match the actual data, valid camera intrinsics and extrinsics, correct episode counts, accurate modality specifications.
- Semantic completeness. Task labels present and meaningful, subtask segmentation where claimed, quality metrics or success/failure labels, documentation sufficient to actually use the dataset without reverse-engineering it.
We cross-referenced our findings with public GitHub issues to confirm that these are known, reproducible problems -- not artifacts of our setup. Every issue cited below links to a real GitHub issue filed by a real user.
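As an illustration of the structural-integrity dimension, schema consistency across episodes can be checked with very little code. The following is a minimal sketch, assuming each episode has already been loaded as a dictionary of named fields; the function name and sample episodes are hypothetical and not part of any dataset's tooling.

```python
def schema_mismatches(episodes):
    """Compare each episode's field names against the first episode's.

    Returns {episode_index: {"missing": [...], "extra": [...]}} for
    episodes whose fields differ from the reference; empty if all match.
    """
    if not episodes:
        return {}
    reference = set(episodes[0])
    report = {}
    for i, ep in enumerate(episodes[1:], start=1):
        keys = set(ep)
        if keys != reference:
            report[i] = {
                "missing": sorted(reference - keys),
                "extra": sorted(keys - reference),
            }
    return report


# Hypothetical episodes for illustration only.
episodes = [
    {"action": [], "state": [], "cam_front": []},
    {"action": [], "state": [], "cam_front": []},
    {"action": [], "state": [], "cam_wrist": []},
]
print(schema_mismatches(episodes))
# {2: {'missing': ['cam_front'], 'extra': ['cam_wrist']}}
```

Using the first episode as the reference is a simplification; a released schema file, where one exists, would be a better reference.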
The findings
1. Bridge V2 (UC Berkeley)
Bridge V2 is one of the most cited manipulation datasets in the field, used as a foundation for RT-2, Octo, and numerous other projects. It claims 60,000+ trajectories across 24 environments. The reality is more complicated.
- Episode count mismatch. The dataset claims 60,000+ trajectories, but users attempting to load the full dataset find only approximately 25,000 usable episodes. The discrepancy has been reported in #27 and #44 with no resolution.
- Wrong action space descriptions. The metadata describes the action format as "joint velocities," but the actual data contains Cartesian position deltas. If you build your dataloader trusting the metadata, your actions will be interpreted in the wrong space entirely. Reported in #26.
- Missing camera intrinsics/extrinsics. Despite multiple user requests, camera calibration parameters are not provided. This makes it impossible to do accurate 3D reconstruction or depth projection from the RGB data. See #38, #42, and #46.
- Non-unique episode IDs. Episode identifiers are not unique across the dataset, which breaks any indexing system that assumes they are. Reported in #47.
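The non-unique ID problem in particular is cheap to detect before you build an index on top of it. A minimal sketch, assuming the episode IDs have already been read into a Python list (the function name and sample IDs are illustrative, not part of the Bridge V2 tooling):

```python
from collections import Counter


def find_duplicate_episode_ids(episode_ids):
    """Return a dict mapping each repeated episode ID to its occurrence count."""
    counts = Counter(episode_ids)
    return {eid: n for eid, n in counts.items() if n > 1}


# Hypothetical IDs for illustration only.
ids = ["ep_0001", "ep_0002", "ep_0001", "ep_0003", "ep_0002", "ep_0001"]
print(find_duplicate_episode_ids(ids))
# {'ep_0001': 3, 'ep_0002': 2}
```

An empty result means the IDs are safe to use as keys; anything else should block indexing until the collisions are resolved.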
2. Open X-Embodiment (Google DeepMind)
Open X-Embodiment is the largest multi-embodiment robotics dataset, aggregating data from over 20 institutions across 22 robot types. Its scale is impressive. Its quality control is not.
- Entire sub-dataset of black frames. The utokyo_saytap sub-dataset contains only black frames -- no actual image data. Reported in #85. If your training pipeline samples from this subset, it is learning from nothing.
- Half the datasets unavailable. 28 of 55 listed sub-datasets are unavailable for download. Reported in #104. The dataset card lists them, but the data is simply not there.
- Typo prevents dataset access. One dataset name is misspelled as "manpulation" instead of "manipulation," which prevents programmatic access. Reported in #86.
- Missing camera views. The berkeley_rpt sub-dataset is documented as having 3 camera views, but only 1 is actually present in the data. Reported in #95.
- Irreversible downsampling. All images are downsampled to 224x224 pixels, and there is no way to access the original resolution data. Reported in #78. For any method that benefits from higher resolution input, the original data is simply gone.
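An all-black sub-dataset is the kind of defect a one-pass scan can catch. The sketch below is one way to flag it, assuming decoded 8-bit frames are available as flat sequences of pixel values (for example, raw bytes). The function names and the threshold of 5 are assumptions chosen for illustration. It is written in plain Python so it runs anywhere; with an array library you would vectorize the same test.

```python
def is_black_frame(pixels, max_value_threshold=5):
    """True if no pixel exceeds the threshold (8-bit values, 0 to 255).

    `pixels` is any flat iterable of ints, e.g. the raw bytes of a frame.
    """
    return max(pixels, default=0) <= max_value_threshold


def black_frame_fraction(frames, max_value_threshold=5):
    """Fraction of frames in an episode that are effectively black."""
    frames = list(frames)
    if not frames:
        return 0.0
    black = sum(is_black_frame(f, max_value_threshold) for f in frames)
    return black / len(frames)


# Synthetic 4x4 grayscale frames as raw bytes (illustrative data).
dark = bytes(16)
lit = bytes([0, 10, 200, 35] * 4)
print(black_frame_fraction([dark, dark, lit, dark]))  # 0.75
```

A fraction of 1.0 across an entire sub-dataset is a strong signal to exclude it from sampling.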
3. RealSource-World
RealSource-World is a large-scale humanoid manipulation dataset with over 14 million frames. Its scale is a selling point, but early versions had significant issues that persisted across multiple releases.
- Constant image statistics. In versions prior to v1.3, the image statistics (mean, std) were hardcoded constant values rather than computed from the actual data. Any training pipeline that used these values for normalization was applying incorrect normalization to every single frame.
- Timing synchronization issues. The v1.2 changelog explicitly confirms timing synchronization fixes, meaning prior versions had misaligned modality streams. If you trained on v1.1 data, your observations and actions were not properly aligned in time.
- Silent proprioceptive dimension change. The proprioceptive state vector changed from 57 dimensions to 71 dimensions between versions. No migration guide was provided. Dataloaders built for the earlier version either crash or, worse, load truncated or misaligned state vectors without raising an error.
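A dimension change like this is only dangerous when the loader does not check for it. A minimal sketch of a loud guard, assuming state vectors arrive as Python sequences and that the consumer knows the dimension it was built for (the function names are hypothetical):

```python
def check_state_dims(states, expected_dim):
    """Return indices of state vectors whose length differs from expected_dim."""
    return [i for i, s in enumerate(states) if len(s) != expected_dim]


def assert_state_dim(states, expected_dim):
    """Fail loudly instead of silently truncating or misaligning states."""
    bad = check_state_dims(states, expected_dim)
    if bad:
        first = bad[0]
        raise ValueError(
            f"{len(bad)} state vector(s) do not have {expected_dim} dims "
            f"(index {first} has {len(states[first])})"
        )


# A loader written for 57-dim states receiving 71-dim states.
new_version_states = [[0.0] * 71 for _ in range(3)]
try:
    assert_state_dim(new_version_states, expected_dim=57)
except ValueError as err:
    print(err)
```

Running the guard once per episode at load time turns a silent misalignment into an immediate, explainable failure.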
4. USC-GVL/humanoid-everyday
The humanoid-everyday dataset from USC covers 260+ tasks with multimodal observations at 30Hz, targeting humanoid manipulation research. It is one of the more comprehensive humanoid datasets available, but usability issues limit its practical value.
- Broken dataset viewer. The HuggingFace dataset viewer fails because the Parquet files are zero bytes. Users cannot preview the data before downloading the full dataset.
- Extreme timestamp tolerance required. Loading the dataset requires setting tolerance_s=1e-2 -- a value 100x larger than the default tolerance of 1e-4. This suggests the timestamps across modalities are misaligned by up to 10 milliseconds, which for a 30Hz dataset (33ms frame interval) represents nearly a third of a frame period.
- No quality metrics or action space documentation. The dataset page provides no quality scoring, no description of the action space format, and no documentation of what each dimension in the state/action vectors represents. Users must reverse-engineer the format from the data itself.
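To see why the tolerance value matters, it helps to measure timestamp jitter directly. The sketch below is one simple way to do that, not the loader's actual check. It assumes per-frame timestamps in seconds and a known nominal frame rate, and it reuses the tolerance_s name only for readability.

```python
def max_interval_error(timestamps, fps):
    """Largest absolute deviation of consecutive gaps from 1/fps, in seconds."""
    expected = 1.0 / fps
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return max((abs(g - expected) for g in gaps), default=0.0)


def passes_tolerance(timestamps, fps, tolerance_s=1e-4):
    """True if every frame interval is within tolerance_s of the nominal one."""
    return max_interval_error(timestamps, fps) <= tolerance_s


fps = 30
clean = [i / fps for i in range(5)]
# Synthetic example: the second frame arrives 8 ms late.
jittered = [0.0, 1 / fps + 0.008, 2 / fps, 3 / fps, 4 / fps]

print(passes_tolerance(clean, fps))                       # True
print(passes_tolerance(jittered, fps))                    # False
print(passes_tolerance(jittered, fps, tolerance_s=1e-2))  # True
```

Raising the tolerance makes the data load, but the 8 ms error in this synthetic example is still there; reporting max_interval_error per episode keeps it visible.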
5. Apple EgoDex
Apple's EgoDex dataset provides 829 hours of egocentric video with 3D hand pose annotations, targeting pretraining for hand-object interaction understanding. The scale and source are compelling, but the annotations have structural problems.
- 2D joint reprojections misaligned with video. The 2D hand joint reprojections do not align with the corresponding video frames. This is not a bug -- the authors confirmed in #9 that it is by design, because the reprojections are computed from a different camera model. Even so, the 2D annotations cannot be used directly as ground truth for the provided video data.
- Empty camera intrinsics arrays. Camera intrinsics are stored as arrays with shape (0,3) instead of the expected (3,3). This means the intrinsic matrices are literally empty -- no focal length, no principal point. Reported in #10.
- Undocumented coordinate conventions. The coordinate system conventions used across the HDF5 files are not documented, making it unclear how to correctly transform between the various reference frames in the dataset. Reported in #5.
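Empty intrinsics are trivially detectable once you check shape and basic plausibility. A minimal sketch, assuming a standard pinhole matrix of the form [[fx, s, cx], [0, fy, cy], [0, 0, 1]] passed as a nested sequence; the function name is hypothetical and the checks are deliberately conservative.

```python
def validate_intrinsics(K):
    """Return a list of problems found in a pinhole intrinsics matrix.

    An empty list means no problems were found.
    """
    rows = [list(r) for r in K]
    if len(rows) != 3 or any(len(r) != 3 for r in rows):
        return [f"expected a 3x3 matrix, got {len(rows)} row(s)"]
    problems = []
    fx, fy = rows[0][0], rows[1][1]
    if fx <= 0 or fy <= 0:
        problems.append("focal lengths must be positive")
    if rows[2] != [0, 0, 1]:
        problems.append("last row should be [0, 0, 1]")
    return problems


# An empty matrix, like the (0,3) arrays described above.
print(validate_intrinsics([]))
# ['expected a 3x3 matrix, got 0 row(s)']

# Illustrative values only, not taken from any dataset.
K = [[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]]
print(validate_intrinsics(K))  # []
```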
6. ALOHA (Stanford)
ALOHA introduced a compelling bimanual teleop platform and the ACT policy architecture. The hardware design and learning approach were influential. The data release, less so.
- Silent demonstration discarding. The recording pipeline silently discards demonstrations when the control frequency drops below 42Hz. As reported in #18, this frequency drop happens regularly during normal operation, so an unknown number of demonstrations are lost during a collection session without any warning to the operator.
- Original training data never released. The actual demonstration data used to train the policies in the ALOHA paper was never publicly released, despite multiple requests (see #44 and #5). Users can replicate the hardware and the training code, but cannot reproduce the paper's results because the paper's data is unavailable.
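The discarding itself may be reasonable; the problem is that it is invisible. The sketch below shows one way to make the decision explicit. It is not the ALOHA recording code: it assumes the threshold applies to the mean control frequency over a demonstration, and the function names are hypothetical.

```python
def mean_control_hz(step_times):
    """Mean control frequency from increasing step timestamps (seconds)."""
    if len(step_times) < 2:
        raise ValueError("need at least two timestamps")
    duration = step_times[-1] - step_times[0]
    return (len(step_times) - 1) / duration


def review_demo(step_times, min_hz=42.0):
    """Return (keep, message) so a rejection is always visible to the operator."""
    hz = mean_control_hz(step_times)
    if hz < min_hz:
        return False, f"discarded: mean control frequency {hz:.1f} Hz is below {min_hz} Hz"
    return True, f"kept: mean control frequency {hz:.1f} Hz"


# Synthetic demonstrations: one stepping every 20 ms, one every 25 ms.
fast = [i * 0.020 for i in range(101)]
slow = [i * 0.025 for i in range(101)]

discarded = 0
for demo in (fast, slow):
    keep, message = review_demo(demo)
    discarded += not keep
    print(message)
print(f"{discarded} demonstration(s) discarded this session")
```

Logging the message and keeping a running count gives the operator the number that is currently unknown.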
7. robomimic
robomimic is both a dataset and a benchmarking framework for offline learning from demonstrations. It is widely used as a baseline for manipulation policy research. The framework itself reveals some uncomfortable truths about data quality in the field.
- Broken conversion scripts. The dataset conversion scripts reference files that do not exist in the repository. Reported in #250. Users attempting to convert between formats hit file-not-found errors with no guidance on where to get the missing files.
- Undocumented action format mismatch. There is an action format mismatch between robosuite (the simulation environment) and robomimic (the training framework) that is not documented anywhere. Reported in #260. Users discover this only after training produces unexpectedly poor results.
- Validation loss does not predict policy performance. The robomimic team's own study found that the best validation checkpoint performs 50-100% worse than the best overall policy checkpoint. This means standard ML practice -- selecting the model with the lowest validation loss -- actively selects a worse policy. This is not a dataset bug per se, but it underscores how broken the standard feedback loops are for robot learning: if your validation metric does not predict deployment performance, you have no reliable signal for when your data is the problem.
8. LeRobot ecosystem (cross-cutting)
LeRobot from Hugging Face has become the de facto standard format for sharing robotics datasets. The tooling and format are genuinely useful. But the ecosystem has its own class of issues that affect anyone building on top of it.
- Floating-point drift causes frame decode failures. A floating-point precision issue causes video frame decoding to fail after approximately 45 episodes. Reported in #3177. The timestamps accumulate rounding error, eventually causing the decoder to seek to an invalid position. Your training run works fine for the first few hundred batches, then starts throwing errors.
- Version conversion corrupts episode mapping. Converting from v2.1 to v3.0 format has a metadata boundary bug that silently corrupts the episode-to-frame mapping. Reported in #2401. Frames get assigned to the wrong episodes. The data loads without error, but the episode boundaries are wrong, meaning your policy is learning from jumbled sequences.
- Zero-variance dimensions cause NaN during training. The columbia_cairlab_pusht_real dataset contains dimensions with zero variance, which causes NaN values during normalization. Reported in #3280. Division by zero during standardization propagates NaN through the entire forward pass.
- No way to delete bad episodes. The LeRobot format provides no built-in mechanism for removing individual episodes from a dataset. If you identify a bad episode after ingestion, the workaround involves manually rewriting the entire dataset's Parquet files and re-indexing all episode boundaries.
- No direct v2.0 to v3.0 conversion. Upgrading from v2.0 to v3.0 requires a two-step process through an intermediate version. There is no direct migration path, increasing the surface area for conversion bugs.
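Two of the issues above, zero-variance normalization and accumulated timestamp error, can be illustrated in a few lines. This is a sketch under stated assumptions, not LeRobot code: the function names are hypothetical, and the drift example only shows the general mechanism of summing a frame period repeatedly versus computing each timestamp from its frame index. Note that array libraries typically yield NaN for 0/0, while plain Python raises ZeroDivisionError; the guard is the same either way.

```python
import statistics


def safe_standardize(column, eps=1e-8):
    """Standardize one feature dimension; zero-variance columns map to zeros."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std < eps:
        # Constant dimension: dividing by std would be a division by zero.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


def accumulated_timestamps(n, fps):
    """Timestamps built by repeatedly adding the frame period (error accumulates)."""
    t, out = 0.0, []
    for _ in range(n):
        out.append(t)
        t += 1.0 / fps
    return out


def indexed_timestamps(n, fps):
    """Timestamps computed from the frame index (one rounding per value)."""
    return [i / fps for i in range(n)]


print(safe_standardize([0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0]
print(safe_standardize([1.0, 3.0]))       # [-1.0, 1.0]

acc = accumulated_timestamps(10_000, 30)
idx = indexed_timestamps(10_000, 30)
drift = max(abs(a - b) for a, b in zip(acc, idx))
print(f"max drift over 10,000 frames: {drift:.3e} s")
```

The drift is tiny in absolute terms, which is why it goes unnoticed until a comparison or seek depends on exact equality. Computing timestamps from integer frame indices avoids the accumulation.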
The pattern
Across all 10 datasets, the same five failure modes recur. These are not isolated incidents -- they are systemic patterns in how the robotics community produces and shares data.
- Metadata lies. Action space descriptions that do not match the actual data format. Camera intrinsics that are empty arrays. Episode counts that are off by more than 2x. The metadata is the first thing your dataloader reads, and in these datasets it is frequently wrong.
- Silent schema changes between versions. Proprioceptive dimensions change from 57 to 71. Image statistics go from constant values to computed values. Version conversion introduces boundary bugs. None of these changes are communicated in a way that prevents downstream breakage.
- Timestamp and synchronization issues. Modality streams that are misaligned by up to 10 milliseconds. Floating-point drift that accumulates over episodes. Tolerance parameters that need to be set 100x higher than defaults. Time is the fundamental axis of trajectory data, and it is consistently the least validated.
- No quality validation at any stage of the pipeline. None of the 10 datasets include per-episode quality scores. None have automated validation gates at ingest. None document a QA process. The data flows from collection to release with, at best, spot-check review.
- Missing data that should be there. Camera views documented but not present. Sub-datasets listed but not downloadable. Training data referenced in papers but never released. Calibration data requested by dozens of users but never provided. The dataset card promises more than the dataset delivers.
What this means for your training pipeline
The practical consequence of these findings is straightforward: if you are loading any of these datasets and feeding them directly into a training loop, you are training on data with known defects. The question is not whether these issues affect your policy -- it is how much, and whether you will be able to diagnose the problem when your policy fails.
Consider the robomimic finding that validation loss is a poor predictor of policy performance. This means the standard feedback loop -- train, check validation loss, deploy best checkpoint -- is broken for robot learning. Now layer on the fact that your training data itself may have wrong action space descriptions, misaligned timestamps, or corrupted episode boundaries. You are optimizing a metric that does not predict performance, on data that may not represent what it claims to represent. When the policy fails on the robot, the root cause could be the model architecture, the training hyperparameters, the data quality, or some combination. Without data validation, you have eliminated none of the possibilities.
The "garbage in, garbage out" principle is not new. What is new is the scale at which robotics is now operating. When your dataset has 50 demonstrations, you can manually review each one. When it has 60,000 -- or 14 million frames -- manual review is not a methodology. It is a hope. And hope does not scale.
A path forward
The good news is that most of these issues are detectable automatically. Schema consistency, timestamp alignment, action space validation, camera intrinsic checks, episode boundary verification, zero-variance detection -- none of these require human judgment. They require running a validation pass over the data before it enters your training pipeline.
The robotics community needs the equivalent of what every other data-intensive field already has: automated quality gates. Data warehouses have schema validation and constraint checking. ML feature stores have distribution monitoring and drift detection. Software has CI/CD with test suites. Robotics trajectory data has... nothing. The data goes from collection to training with no automated checks in between.
This is a solvable problem. The checks are well-defined. The failure modes are known. What is missing is the tooling that runs these checks automatically, at ingest time, and attaches quality metadata to every episode so that downstream consumers can make informed decisions about what to train on.
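To make the idea of ingest-time gates concrete, here is a minimal sketch of running named checks per episode, attaching the result as quality metadata, and filtering on a threshold. Everything in it is an assumption for illustration: the EpisodeReport class, the scoring rule (fraction of checks passed), and the two toy checks. It does not describe any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeReport:
    episode_id: str
    failures: list = field(default_factory=list)
    checks_run: int = 0

    @property
    def score(self):
        """Fraction of checks passed, between 0.0 and 1.0."""
        if self.checks_run == 0:
            return 0.0
        return 1.0 - len(self.failures) / self.checks_run


def validate_episode(episode_id, episode, checks):
    """Run every named check; a check returns True when the episode passes."""
    report = EpisodeReport(episode_id)
    for name, check in checks.items():
        report.checks_run += 1
        if not check(episode):
            report.failures.append(name)
    return report


def gate(reports, min_score=1.0):
    """Keep only the IDs of episodes whose score meets the threshold."""
    return [r.episode_id for r in reports if r.score >= min_score]


# Hypothetical checks and episodes, for illustration only.
checks = {
    "has_actions": lambda ep: len(ep.get("actions", [])) > 0,
    "lengths_match": lambda ep: len(ep.get("actions", []))
    == len(ep.get("timestamps", [])),
}
episodes = {
    "ep_0": {"actions": [[0.1], [0.2]], "timestamps": [0.0, 0.033]},
    "ep_1": {"actions": [[0.1]], "timestamps": [0.0, 0.033]},
}
reports = [validate_episode(eid, ep, checks) for eid, ep in episodes.items()]
print(gate(reports))                 # ['ep_0']
print(gate(reports, min_score=0.5))  # ['ep_0', 'ep_1']
```

Keeping the list of failed check names alongside the score matters as much as the score itself: it tells a downstream consumer why an episode was excluded, not just that it was.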
At Traceplane, we are building automated QA into the ingest pipeline so these issues get caught before they reach your training loop. Every episode is validated for structural integrity, metadata correctness, and cross-modal consistency at the moment it enters the system. Quality scores are queryable, filterable, and gatable -- you set the threshold, and only episodes that pass make it into your training dataset. If you are working with trajectory data and want to try it, request early access.
Stop training on unchecked data
Traceplane validates every episode at ingest -- structural integrity, metadata correctness, timestamp alignment, and semantic consistency. Catch the issues documented above before they reach your training loop, not after your policy fails on hardware.