We Audited 5 Popular LeRobot Datasets. 4 Ship Stats That Produce Inf.

What we ran

traceplane check is a single-command local QA tool for robotics datasets. It validates LeRobot v2 and v3 datasets across roughly 25 distinct checks, each mapping to a failure mode we have seen break real training runs. The audit covers:

Metadata consistency — info.json matches declared episodes, FPS, robot type, features, and action spec
Manifest integrity — episode count matches, no duplicate IDs, no zero-length episodes
Stats sanity — no NaN, no Inf, no zero-variance, schema matches feature list
Data integrity — Parquet is readable, no zero-byte files, schema does not drift across chunks
Video files — count matches declared episodes, no corrupt files
Dimension agreement between metadata and actual data
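
To make one of these categories concrete, here is a minimal sketch of what a manifest-integrity check can look like. It is not traceplane's implementation; the manifest is assumed to be a list of records with `episode_index` and `length` keys, and the function name is hypothetical.

```python
from collections import Counter


def manifest_problems(episodes, declared_total):
    """Return human-readable problems found in an episode manifest.

    `episodes` is assumed to be a list of dicts with `episode_index` and
    `length` keys; that layout is an assumption made for this sketch.
    """
    problems = []

    # Episode count must match what the metadata declares.
    if len(episodes) != declared_total:
        problems.append(
            f"episode count mismatch: manifest has {len(episodes)}, "
            f"metadata declares {declared_total}"
        )

    # No episode ID may appear more than once.
    counts = Counter(ep["episode_index"] for ep in episodes)
    duplicates = sorted(idx for idx, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate episode ids: {duplicates}")

    # Every episode must contain at least one frame.
    empty = sorted({ep["episode_index"] for ep in episodes if ep["length"] <= 0})
    if empty:
        problems.append(f"zero-length episodes: {empty}")

    return problems


# Made-up manifest that trips all three checks.
manifest = [
    {"episode_index": 0, "length": 120},
    {"episode_index": 1, "length": 0},
    {"episode_index": 1, "length": 95},
]
for problem in manifest_problems(manifest, declared_total=4):
    print(problem)
```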

The datasets

Dataset	Format	Episodes	Result (W = warnings)	Key finding
`lerobot/pusht`	v3.0	206	PASS (2W)	`task_index.std=0`, `next.success.std=0`
`lerobot/aloha_sim_transfer_cube_scripted`	v3.0	50	PASS (1W)	`task_index.std=0`
`lerobot/xarm_lift_medium`	v3.0	800	PASS (1W)	`task_index.std=0`
`unitreerobotics/G1_Dex3_GraspSquare_Dataset`	v3.0	301	PASS (1W)	`task_index.std=0`
`lerobot/libero`	v3.0	1,693	PASS (0)	Clean — multi-task stats are correct

4 of 5 ship broken normalization stats. The one that does not (lerobot/libero) is multi-task, so its task_index actually varies across episodes and its std is nonzero.

The finding: zero-variance normalization

Every LeRobot dataset ships meta/stats.json — a per-feature mean/std/min/max table used by dataloaders to normalize inputs. The standard normalization recipe is:

normalized = (x - mean) / std

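When std = 0, that divides by zero. Every time. Under IEEE 754 floating point the result is Inf when x differs from the mean and NaN (0/0) when x equals it; neither is usable.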

In a single-task dataset, task_index is a constant (always 0). A column of constants has std = 0. The stats file dutifully computes and ships std = [0.0]. The dataloader dutifully reads it. If you loop over all numeric features and normalize, which is what most naive training setups do, you get a non-finite value (Inf, or NaN when the sample equals the mean) on the task_index feature.
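
A minimal NumPy sketch of the failure, using made-up values rather than any dataset's actual stats. The function name is hypothetical. The exact non-finite value depends on the numerator: a sample equal to the mean gives 0/0, any other sample gives a nonzero value over 0.

```python
import numpy as np


def naive_normalize(x, mean, std):
    """The textbook recipe, with no guard against std == 0."""
    return (x - mean) / std


# Illustrative values only: a constant column such as task_index in a
# single-task dataset, together with the stats a constant column produces.
task_index = np.zeros(4, dtype=np.float32)
mean = np.float32(0.0)
std = np.float32(0.0)

# Silence NumPy's divide-by-zero warnings so the results can be inspected.
with np.errstate(divide="ignore", invalid="ignore"):
    constant_case = naive_normalize(task_index, mean, std)      # 0 / 0
    offset_case = naive_normalize(task_index + 1.0, mean, std)  # 1 / 0

print("all NaN when x == mean:", np.isnan(constant_case).all())
print("all Inf when x != mean:", np.isinf(offset_case).all())
```

Either way the output fails an `np.isfinite` check, which is the property a training loop cares about.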

The reason nobody notices is that most code paths ignore task_index during actual policy input construction, so the Inf flows into an unused tensor. But anyone doing:

Policy conditioning on task ID (multi-task learning) → silent failure
Dataset mixing across task boundaries → silent bias
Automated stats-driven normalization pipelines → Inf in gradients somewhere

...gets bit.

The same pattern applies to any constant-valued feature. lerobot/pusht also ships next.success.std = [0.0] — all episodes are successful, so the success flag is a constant. Any model trying to condition on reward or success gets Inf.
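
Checking for this before training takes only a few lines. The sketch below assumes stats.json parses to a mapping of feature name to a dict holding a (possibly nested) `std` list, consistent with the per-feature table described above; the function names and the example values are hypothetical.

```python
def _flatten(values):
    """Yield scalars from an arbitrarily nested list."""
    if isinstance(values, (list, tuple)):
        for v in values:
            yield from _flatten(v)
    else:
        yield values


def zero_variance_features(stats, tol=0.0):
    """Return the names of features with any std entry <= tol.

    `stats` is assumed to look like {feature: {"std": [...], ...}}.
    """
    flagged = []
    for name, entry in stats.items():
        std = entry.get("std")
        if std is None:
            continue  # No std recorded for this feature; nothing to check.
        if any(abs(float(s)) <= tol for s in _flatten(std)):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical excerpt shaped like the findings reported above.
example_stats = {
    "action": {"mean": [0.1, -0.2], "std": [0.5, 0.4]},
    "task_index": {"mean": [0.0], "std": [0.0]},
    "next.success": {"mean": [1.0], "std": [0.0]},
}
print(zero_variance_features(example_stats))
```

To run it on a downloaded dataset, load the file with `json.load` and pass the resulting dict in, assuming the stats are stored as JSON at meta/stats.json.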

Why this happens

LeRobot's stats computation walks every numeric column and runs mean/std. Constant columns produce std = 0, and nothing in the stats file flags them: no check, no warning, no fallback. The burden is on the consumer to handle std = 0, and the tutorials and example code we have seen do not.

The fix is one line in the dataloader: clamp std to a minimum epsilon (e.g., std = max(std, 1e-6)). Or explicitly skip index/flag features during normalization. Or use (x - mean) / (std + eps).
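
A minimal sketch of the clamping variant in NumPy. The function name and the default epsilon are illustrative choices, not part of any library.

```python
import numpy as np


def safe_normalize(x, mean, std, eps=1e-6):
    """Normalize with std clamped to at least eps.

    A constant feature (std == 0) maps to 0 instead of NaN or Inf,
    as long as its value equals the recorded mean.
    """
    x = np.asarray(x, dtype=np.float64)
    mean = np.asarray(mean, dtype=np.float64)
    std = np.maximum(np.asarray(std, dtype=np.float64), eps)
    return (x - mean) / std


# Illustrative values: column 0 varies, column 1 is constant (std = 0).
x = np.array([[1.0, 0.0], [3.0, 0.0]])
mean = np.array([2.0, 0.0])
std = np.array([1.0, 0.0])

out = safe_normalize(x, mean, std)
print(out)
print("all finite:", np.isfinite(out).all())
```

Clamping leaves well-scaled features untouched, whereas (std + eps) shifts every feature by a tiny amount; both avoid the division by zero. Note that if a supposedly constant feature takes a different value at inference time, the clamped version scales that difference by 1/eps, so explicitly skipping index and flag features can be the safer choice.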

That four of the five popular datasets we audited ship with this issue suggests the field has collectively learned to ignore it. It also means that any lab training on a mix of these datasets without a custom normalization path has a latent bug in its pipeline.

What we found in our own tool

We ran the audit with traceplane check v0.3.0. Midway through, we discovered our own checker had a false positive on LeRobot v3 datasets: it expected stats.parquet, but v3 supports both stats.parquet and stats.json. Every v3 dataset was producing STATS_MISSING warnings for a file that was not required to exist.
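
The general shape of that kind of fix is to accept either file before reporting anything missing. The sketch below is an illustration, not the actual commit; the helper name, the candidate order, and the meta/stats.parquet location are assumptions.

```python
from pathlib import Path

# Candidate locations, tried in order. The order is arbitrary, and the
# parquet path is an assumption made for this sketch.
STATS_CANDIDATES = ("meta/stats.parquet", "meta/stats.json")


def find_stats_file(dataset_root):
    """Return the first stats file that exists, or None if none is present."""
    root = Path(dataset_root)
    for relative in STATS_CANDIDATES:
        candidate = root / relative
        if candidate.is_file():
            return candidate
    return None


# A checker would warn only when every candidate is absent.
found = find_stats_file("./pusht")
print(found if found is not None else "no stats file found")
```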

We fixed it. The commit is in the repo. This is how QA tools get real — you point them at real data, they hit real edge cases, you fix them.

The lesson: if a QA tool has been tested only on synthetic fixtures, assume it has bugs at real-dataset scale. Ours did.

Reproduce it yourself

pip install traceplane==0.3.0

# Download a dataset
python -c "
from huggingface_hub import snapshot_download
snapshot_download('lerobot/pusht', repo_type='dataset', local_dir='./pusht')
"

# Run the check
traceplane check ./pusht

Full command reference:

traceplane check <path>                       # full audit
traceplane check <path> --metadata-only       # quick metadata-only scan
traceplane check <path> --json report.json    # machine-readable output
traceplane check <path> --max-episodes 100    # sample a subset

What's next

We are building Traceplane as the QA + curation + versioning layer for trajectory data — for labs shipping datasets, labs training on public datasets, and labs building their own capture pipelines. traceplane check is the first piece; traceplane curate, traceplane coverage, and traceplane eval are also shipped.

If you are training policies on public data and want an outside audit of what you're feeding your model, we will run traceplane check + traceplane coverage on your dataset mix and send the report back for free. Drop a note at hello@traceplane.ai or join the waitlist.

Sources

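Dataset pages on HuggingFace:

huggingface.co/datasets/lerobot/pusht
huggingface.co/datasets/lerobot/aloha_sim_transfer_cube_scripted
huggingface.co/datasets/lerobot/xarm_lift_medium
huggingface.co/datasets/unitreerobotics/G1_Dex3_GraspSquare_Dataset
huggingface.co/datasets/lerobot/libero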

Citations & acknowledgements

This audit is only possible because of open work from many groups. The framework, data formats, and several of the datasets above come from external research projects, and the credit for them belongs there:

LeRobot — the framework, dataset format (v2/v3), and several reference datasets used here. Cadene, Alibert, Soare, Gallouédec, Zouitine, Wolf, et al. (HuggingFace, 2024). github.com/huggingface/lerobot
PushT — originally introduced in Diffusion Policy: Visuomotor Policy Learning via Action Diffusion by Chi, Feng, Du, Xu, Cousineau, Burchfiel, Song (RSS 2023).
ALOHA (sim transfer cube) — from Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ALOHA / Action Chunking with Transformers) by Zhao, Kumar, Levine, Finn (RSS 2023).
xarm_lift_medium — from the simulated xArm tasks used in Temporal Difference Learning for Model Predictive Control (TD-MPC) by Hansen, Wang, Su (ICML 2022).
LIBERO — from LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning by Liu, Zhu, Gao, Feng, Liu, Zhu, Stone (NeurIPS 2023).
G1_Dex3_GraspSquare_Dataset — released by Unitree Robotics for the Unitree G1 humanoid + Dex3 hand platform.

The findings reported here describe these datasets as published on HuggingFace at the time of this post. They are not a critique of any of the projects above; the issue is a gap at the dataloader-stats interface, which is what traceplane check is built to surface.

Appendix: USC-GVL/humanoid-everyday (8,949 episodes, ~18 GB) was still downloading at publication. Findings from that dataset will be added as a follow-up post. An earlier internal audit of an older version found a DATA_ACTION_DIM_MISMATCH — action shape declared as -1 in info.json but actual data has proper dimensions. We will confirm whether this persists in the current revision.

Run the audit on your own datasets

traceplane check runs locally, in seconds, and surfaces zero-variance stats, schema drift, action-dimension mismatches, timestamp problems, and ~20 other failure modes before they reach your training loop.
