What we ran

traceplane check is a single-command local QA tool for robotics datasets. It validates LeRobot v2 and v3 datasets across roughly 25 distinct checks, each mapping to a failure mode we have seen break real training runs. The audit covers:

The datasets

| Dataset | Format | Episodes | Result | Key finding |
| --- | --- | --- | --- | --- |
| lerobot/pusht | v3.0 | 206 | PASS (2W) | task_index.std=0, next.success.std=0 |
| lerobot/aloha_sim_transfer_cube_scripted | v3.0 | 50 | PASS (1W) | task_index.std=0 |
| lerobot/xarm_lift_medium | v3.0 | 800 | PASS (1W) | task_index.std=0 |
| unitreerobotics/G1_Dex3_GraspSquare_Dataset | v3.0 | 301 | PASS (1W) | task_index.std=0 |
| lerobot/libero | v3.0 | 1,693 | PASS (0) | Clean: multi-task stats are correct |

4 of 5 ship broken normalization stats. The one that does not (lerobot/libero) is multi-task — which is exactly why its task_index.std is meaningful.

The finding: zero-variance normalization

Every LeRobot dataset ships meta/stats.json — a per-feature mean/std/min/max table used by dataloaders to normalize inputs. The standard normalization recipe is:

normalized = (x - mean) / std

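When std = 0, that division is undefined. With IEEE floats, as in NumPy or PyTorch, it yields NaN wherever x equals the mean (0/0) and ±Inf everywhere else. Every time.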

In a single-task dataset, task_index is a constant (always 0). A column of constants has std = 0. The stats file dutifully computes and ships std = [0.0]. The dataloader dutifully reads it. If you loop over all numeric features and normalize, which is what most naive training setups do, you get a non-finite value (NaN or Inf) on the task_index feature.
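A minimal NumPy sketch of the naive recipe applied to a constant column. The array and variable names are illustrative and not taken from LeRobot. It only shows what IEEE float division does: 0/0 gives NaN, and a nonzero numerator over 0 gives Inf.

```python
import numpy as np

# Hypothetical single-task batch: task_index is 0 for every frame.
task_index = np.zeros(4, dtype=np.float32)

# Values as a stats file would report them for this constant column.
mean = np.float32(0.0)
std = np.float32(0.0)

# NumPy does not raise on float division by zero; it emits a
# RuntimeWarning, which errstate silences here.
with np.errstate(divide="ignore", invalid="ignore"):
    constant_col = (task_index - mean) / std        # 0 / 0
    offset_col = (task_index + 1.0 - mean) / std    # nonzero / 0

print(constant_col)  # every entry is NaN
print(offset_col)    # every entry is +Inf
```

Either result poisons any loss or activation it touches, so the distinction matters mainly when debugging: a NaN in task_index points at a constant column, while an Inf points at a value that differs from the recorded mean.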

The reason nobody notices is that most code paths ignore task_index during actual policy input construction, so the non-finite value flows into an unused tensor. But anyone doing:

...gets bit.

The same pattern applies to any constant-valued feature. lerobot/pusht also ships next.success.std = [0.0]: all episodes are successful, so the success flag is a constant. Any model trying to condition on reward or success gets NaN or Inf after naive normalization.
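One way to catch every constant-valued feature up front is to scan the stats table before training. The sketch below assumes the stats file parses to a mapping of feature name to a dict with a "std" entry that is a number or a (possibly nested) list; that layout and the function names are assumptions for illustration, not traceplane's implementation.

```python
import json
from pathlib import Path


def _flatten(value):
    """Yield every number from a scalar or arbitrarily nested list."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from _flatten(item)
    else:
        yield float(value)


def zero_std_features(stats, tol=0.0):
    """Return names of features whose std has any component <= tol.

    Assumes `stats` maps feature name -> {"mean": ..., "std": ..., ...}.
    """
    flagged = []
    for name, entry in stats.items():
        std = entry.get("std") if isinstance(entry, dict) else None
        if std is None:
            continue
        if any(component <= tol for component in _flatten(std)):
            flagged.append(name)
    return flagged


def scan_stats_file(path):
    """Load a stats JSON file and list its zero-variance features."""
    return zero_std_features(json.loads(Path(path).read_text()))


# The std values match those reported above for lerobot/pusht; the means
# and the "action" entry are made up for contrast.
example = {
    "task_index": {"mean": [0.0], "std": [0.0]},
    "next.success": {"mean": [1.0], "std": [0.0]},
    "action": {"mean": [0.5, 0.25], "std": [0.5, 0.25]},
}
print(zero_std_features(example))  # ['task_index', 'next.success']
```

Raising `tol` slightly above zero also catches near-constant columns, whose tiny std would blow values up by orders of magnitude without ever producing a NaN.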

Why this happens

LeRobot's stats computation walks every numeric column and runs mean/std. Constant columns produce std = 0. There is no check, no warning, no fallback. The burden is on the consumer to handle std = 0, but we have not found tutorial or example code that does.

The fix is one line in the dataloader: clamp std to a minimum epsilon (e.g., std = max(std, 1e-6)). Or explicitly skip index/flag features during normalization. Or use (x - mean) / (std + eps).
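The first two options can be sketched in plain Python as follows. The function names, the default skip list, and the dict layout of `stats` are assumptions for illustration; a real dataloader would do the same thing on tensors.

```python
EPS = 1e-6  # floor suggested in the text; tune for your data


def safe_normalize(values, mean, std, eps=EPS):
    """Normalize element-wise, clamping each std component to at least eps."""
    return [(x - m) / max(s, eps) for x, m, s in zip(values, mean, std)]


def normalize_batch(batch, stats, skip=("task_index",), eps=EPS):
    """Normalize each feature in `batch`; features in `skip` pass through."""
    out = {}
    for name, values in batch.items():
        if name in skip or name not in stats:
            out[name] = list(values)
        else:
            entry = stats[name]
            out[name] = safe_normalize(values, entry["mean"], entry["std"], eps)
    return out


# Illustrative stats and batch; only the zero std values come from the audit.
stats = {
    "task_index": {"mean": [0.0], "std": [0.0]},
    "next.success": {"mean": [1.0], "std": [0.0]},
    "action": {"mean": [0.5, 0.25], "std": [0.5, 0.25]},
}
batch = {"task_index": [0.0], "next.success": [1.0], "action": [1.0, 0.5]}

print(normalize_batch(batch, stats))
```

With the clamp, a constant feature normalizes to 0.0 instead of NaN. Skipping is the more conservative choice for index-like features, since a normalized task_index has no meaning as an index anyway.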

Four of the five datasets we audited have this issue, which suggests the field has collectively agreed to ignore it. That means any lab that trains on a mix of these without a custom normalization path has a latent bug in its pipeline.

What we found in our own tool

We ran the audit with traceplane check v0.3.0. Midway through, we discovered our own checker had a false positive on LeRobot v3 datasets: it expected stats.parquet, but v3 supports both stats.parquet and stats.json. Every v3 dataset was producing STATS_MISSING warnings for a file it was never required to ship.

We fixed it. The commit is in the repo. This is how QA tools get real — you point them at real data, they hit real edge cases, you fix them.
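The general shape of such a fix is a lookup that accepts either file before reporting anything missing. This is a hypothetical sketch, not the actual traceplane commit; the post confirms meta/stats.json, while placing stats.parquet under meta/ is an assumption.

```python
import tempfile
from pathlib import Path

# Assumption: either file may live under meta/. Only meta/stats.json is
# confirmed by the text; the stats.parquet location is a guess.
STATS_CANDIDATES = ("stats.json", "stats.parquet")


def find_stats_file(dataset_root):
    """Return the first stats file found under <root>/meta, or None."""
    meta_dir = Path(dataset_root) / "meta"
    for filename in STATS_CANDIDATES:
        candidate = meta_dir / filename
        if candidate.is_file():
            return candidate
    return None


# Usage on a throwaway directory that ships only stats.json.
with tempfile.TemporaryDirectory() as root:
    meta = Path(root) / "meta"
    meta.mkdir()
    before = find_stats_file(root)            # nothing there yet
    (meta / "stats.json").write_text("{}")
    after = find_stats_file(root)

print(before, after.name)
```

A missing-stats warning would then fire only when the lookup returns None, so a dataset that ships either format passes.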

The lesson: if a QA tool has been tested only on synthetic fixtures, assume it has bugs at real-dataset scale. Ours did.

Reproduce it yourself

pip install traceplane==0.3.0

# Download a dataset
python -c "
from huggingface_hub import snapshot_download
snapshot_download('lerobot/pusht', repo_type='dataset', local_dir='./pusht')
"

# Run the check
traceplane check ./pusht

Full command reference:

traceplane check <path>                       # full audit
traceplane check <path> --metadata-only       # quick metadata-only scan
traceplane check <path> --json report.json    # machine-readable output
traceplane check <path> --max-episodes 100    # sample a subset

What's next

We are building Traceplane as the QA + curation + versioning layer for trajectory data — for labs shipping datasets, labs training on public datasets, and labs building their own capture pipelines. traceplane check is the first piece; traceplane curate, traceplane coverage, and traceplane eval are also shipped.

If you are training policies on public data and want an outside audit of what you're feeding your model, we will run traceplane check + traceplane coverage on your dataset mix and send the report back for free. Drop a note at hello@traceplane.ai or the waitlist.

Sources

Dataset pages on HuggingFace:

Citations & acknowledgements

This audit is only possible because of open work from many groups. The framework, data formats, and several of the datasets above come from external research projects, and the credit for them belongs there:

The findings reported here describe these datasets as currently published on HuggingFace at the time of this post. They are not a critique of any of the projects above; the issue is a gap at the dataloader-stats interface, which is what traceplane check is built to surface.

Appendix: USC-GVL/humanoid-everyday (8,949 episodes, ~18 GB) was still downloading at publication. Findings from that dataset will be added as a follow-up post. An earlier internal audit of an older version found a DATA_ACTION_DIM_MISMATCH — action shape declared as -1 in info.json but actual data has proper dimensions. We will confirm whether this persists in the current revision.
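For readers who want to check for this kind of mismatch themselves, a rough sketch follows. It compares a declared shape, such as the one listed for the action feature in info.json, against one action sample loaded separately. The function name and inputs are hypothetical and this is not how traceplane implements DATA_ACTION_DIM_MISMATCH.

```python
def declared_shape_mismatch(declared_shape, sample):
    """Return True if `declared_shape` disagrees with the shape of `sample`.

    `sample` is one action as a (possibly nested) list. Any non-positive
    declared dimension, such as -1, counts as a mismatch.
    """
    actual = []
    node = sample
    while isinstance(node, (list, tuple)):
        actual.append(len(node))
        node = node[0] if node else None
    declared = list(declared_shape)
    return any(d <= 0 for d in declared) or declared != actual


# Illustrative values: a 7-dimensional action declared as -1, then as 7.
action = [0.0] * 7
print(declared_shape_mismatch([-1], action))  # True
print(declared_shape_mismatch([7], action))   # False
```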


Run the audit on your own datasets

traceplane check runs locally, in seconds, and surfaces zero-variance stats, schema drift, action-dimension mismatches, timestamp problems, and ~20 other failure modes before they reach your training loop.
