Which episodes are broken.
Which ones help your policy.
What to collect next.
Check, fix, curate, and evaluate robot-learning datasets.
Score every episode. Salvage good segments from bad demos. Detect coverage gaps.
Export reproducible, training-ready datasets. No custom scripts.
# pip install traceplane
# 1. Find what's broken
$ traceplane check ./my-dataset
Result: FAIL — 3 errors, 5 warnings
[X] STATS_NAN_INF — stats['action']['std'][4] is nan
[X] DATA_ZERO_BYTE — 2 zero-byte Parquet files
# 2. Fix automatically
$ traceplane fix ./my-dataset --all
[*] Deleted 2 zero-byte files
[*] Regenerated stats for 5 columns
# 3. Score episodes, salvage good segments
$ traceplane curate ./my-dataset --html report.html
Keep: 1,847 Review: 412 Drop: 139
# 4. Find coverage gaps
$ traceplane coverage ./dataset-a ./dataset-b
[!!!] Severe task imbalance: 'pick' has 800 but 'pour' has 12
[!!!] Action space coverage: 0.2% — collect more diverse demos
# 5. Publish with QA report
$ traceplane publish ./my-dataset --repo-id my-lab/clean-dataset --push
Your robot data has bugs you don't know about
We audited 10 popular open-source robot datasets. Every single one had quality issues that silently degrade training. Read the full audit.
Metadata lies
Bridge V2 claims 60K trajectories but has 25K. Action-space descriptions in features.json are wrong, so any normalization built on them is silently broken.
NaN in your stats
Zero-variance dimensions divide by zero during normalization, producing NaN (0/0) or Inf. NaN propagates through your policy. Training loss looks fine until eval crashes.
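The failure mode above can be reproduced with a few lines of plain Python. This sketch is illustrative and is not Traceplane's implementation. The `stats` layout (feature, then field, then a per-dimension list) is an assumption modeled on the `stats['action']['std'][4]` path in the check output. The sketch flags non-finite and zero-variance entries, and it normalizes with a guard that only centers constant dimensions instead of dividing by zero.

```python
import math


def find_bad_stats(stats):
    """Return (feature, field, index, problem) for every non-finite value and zero std."""
    problems = []
    for feature, fields in stats.items():
        for field, values in fields.items():
            for i, v in enumerate(values):
                if not math.isfinite(v):  # catches NaN, +Inf, -Inf
                    problems.append((feature, field, i, "non-finite"))
        for i, s in enumerate(fields.get("std", [])):
            if s == 0.0:  # NaN never compares equal to 0.0, so it is not double-counted
                problems.append((feature, "std", i, "zero-variance"))
    return problems


def normalize(x, mean, std, eps=1e-6):
    """Z-score one vector. Dimensions with std <= eps are only centered, not divided."""
    out = []
    for xi, mi, si in zip(x, mean, std):
        if not math.isfinite(si):
            raise ValueError("non-finite std: regenerate stats before normalizing")
        scale = si if si > eps else 1.0
        out.append((xi - mi) / scale)
    return out


stats = {
    "action": {"mean": [0.1, 0.5], "std": [0.2, 0.0]},
    "state": {"mean": [1.0], "std": [float("nan")]},
}
print(find_bad_stats(stats))
# [('action', 'std', 1, 'zero-variance'), ('state', 'std', 0, 'non-finite')]
```

Centering instead of dropping a constant dimension keeps the vector shape stable; whether to drop or keep such dimensions is a modeling choice.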
Corrupt files ship
Zero-byte Parquet files, missing camera views, timestamp drift between sensors. Nobody catches these at release time. Downstream users lose weeks debugging.
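A cheap integrity check for the zero-byte case, sketched in stdlib Python (not Traceplane's code): every valid Parquet file starts and ends with the 4-byte magic `PAR1`, so empty, truncated, or mislabeled files can be caught without a Parquet reader. The name `scan_parquet_files` is hypothetical. Passing this check does not prove that the footer metadata is valid.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes
MIN_SIZE = 12  # lower bound: header magic + 4-byte footer length + trailing magic


def scan_parquet_files(root):
    """Map each suspicious *.parquet path under root to a short problem label."""
    problems = {}
    for path in sorted(Path(root).rglob("*.parquet")):
        size = path.stat().st_size
        if size == 0:
            problems[str(path)] = "zero-byte"
        elif size < MIN_SIZE:
            problems[str(path)] = "truncated"
        else:
            with path.open("rb") as f:
                head = f.read(4)
                f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
                tail = f.read(4)
            if head != PARQUET_MAGIC or tail != PARQUET_MAGIC:
                problems[str(path)] = "bad magic bytes"
    return problems
```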
No standard QA tooling
How do you know a dataset is training-ready? There's no pytest for datasets, no CI pipeline for trajectory data, no way to produce a defensible quality report for a paper.
Dataset CI for robotics
Check
Point traceplane check at any LeRobot, HDF5, or rosbag2 dataset. Get a full QA report covering metadata, Parquet integrity, stats, schema drift, and action dimensions.
Fix
Auto-repair: delete corrupt files, regenerate stats, patch metadata, reindex episodes. Preview changes with --dry-run. A backup is made before every change.
Curate
Score every episode for smoothness, outliers, and redundancy. Salvage good segments from imperfect demos. Get keep/review/drop recommendations with visual reports.
Analyze
Coverage-gap analysis across datasets. "You need mugs in cluttered kitchens, not 500 more tabletop demos." Generalization-aware train/eval splits with leakage detection.
Evaluate
Compare training vs rollout distributions. Per-dimension shift analysis. Failure taxonomy. Know if your data quality is causing your policy failures.
Publish
Export to LeRobot + HuggingFace Hub with auto-generated dataset cards, QA stats, and reproducible split indices. One command: traceplane publish --push.
Everything between raw demos and a training run
Dataset CI
Run traceplane check locally or in CI. Validates metadata, Parquet integrity, stats correctness, schema consistency, action dimensions, and video files. Exits 1 on failure — plug it into your pipeline like a linter.
Episode & segment curation
Score episodes for smoothness, outliers, and redundancy. Salvage good segments from imperfect demos instead of throwing away an entire episode when only part of it is bad. Get keep/review/drop recommendations with visual HTML reports.
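One common way to score smoothness is jerk, the third derivative of position. The following is a minimal sketch, assuming joint positions sampled at a fixed interval `dt`; the thresholds and function names are hypothetical, and Traceplane's actual scoring may differ. It estimates jerk with third-order finite differences and keeps contiguous runs of frames where jerk stays below a limit, which is one way to salvage the good parts of an otherwise bad demo.

```python
def jerk_magnitudes(positions, dt):
    """Euclidean norm of the third finite difference of position, divided by dt**3.
    Entry k covers frames k..k+3."""
    jerks = []
    for k in range(len(positions) - 3):
        a, b, c, d = positions[k : k + 4]
        j = [(x3 - 3 * x2 + 3 * x1 - x0) / dt**3 for x0, x1, x2, x3 in zip(a, b, c, d)]
        jerks.append(sum(v * v for v in j) ** 0.5)
    return jerks


def salvage_segments(positions, dt, max_jerk, min_frames):
    """Return (start, end) frame ranges, end exclusive, where jerk stays <= max_jerk."""
    segments, run_start = [], None
    # An inf sentinel at the end closes any run that is still open.
    for k, j in enumerate(jerk_magnitudes(positions, dt) + [float("inf")]):
        if j <= max_jerk:
            if run_start is None:
                run_start = k
        elif run_start is not None:
            start, end = run_start, k + 3  # good jerks run_start..k-1 cover frames up to k+2
            if end - start >= min_frames:
                segments.append((start, end))
            run_start = None
    return segments


# A smooth ramp with a single glitch at frame 10:
traj = [[0.1 * t] for t in range(20)]
traj[10] = [5.0]
print(salvage_segments(traj, dt=0.1, max_jerk=1.0, min_frames=5))  # [(0, 10), (11, 20)]
```

The glitch touches four consecutive jerk estimates, and the two salvaged segments exclude exactly the glitched frame.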
Coverage-gap analysis
Cross-dataset distribution maps over tasks, objects, scenes, and embodiments. Identifies what's missing: "You have 800 pick episodes but only 12 pour." Recommends what data to collect next.
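The pick/pour example can be expressed as a simple ratio test. This sketch assumes each episode carries a single task label and uses a hypothetical `max_ratio` of 10. It is not Traceplane's coverage algorithm. It only illustrates how a count-based gap and a minimum collection target might be derived.

```python
import math
from collections import Counter


def coverage_gaps(episode_tasks, max_ratio=10.0):
    """Flag tasks with more than max_ratio times fewer episodes than the most common task.
    Returns (most common task, [(task, count, additional episodes needed)])."""
    counts = Counter(episode_tasks)
    top_task, top_n = counts.most_common(1)[0]
    target = math.ceil(top_n / max_ratio)  # smallest count that satisfies the ratio
    gaps = []
    for task, n in sorted(counts.items(), key=lambda kv: kv[1]):
        if top_n / n > max_ratio:
            gaps.append((task, n, target - n))
    return top_task, gaps


tasks = ["pick"] * 800 + ["pour"] * 12 + ["place"] * 300
print(coverage_gaps(tasks))  # ('pick', [('pour', 12, 68)])
```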
Generalization-aware splits
Build train/eval splits that hold out by task, object, scene, or embodiment. Leakage detection catches overlap between sets. Reproducible indices. Show in your paper that your results generalize.
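A common way to build held-out splits without storing an assignment table is to hash the grouping key, so every episode with the same object (or task, scene, or embodiment) lands on the same side. This sketch assumes episodes are dicts with an `id` and a grouping field; the `salt` parameter and function names are hypothetical. Note that `eval_fraction` controls the share of groups, not episodes, so the episode-level ratio will vary.

```python
import hashlib


def split_by_group(episodes, key, eval_fraction=0.2, salt="v1"):
    """Assign each episode to 'train' or 'eval' by hashing its group value.
    Episodes sharing a group always land in the same split; output is deterministic."""
    splits = {"train": [], "eval": []}
    for ep in episodes:
        digest = hashlib.sha256(f"{salt}:{ep[key]}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform-ish value in [0, 1)
        splits["eval" if bucket < eval_fraction else "train"].append(ep["id"])
    return splits


def find_leakage(episodes, splits, key):
    """Return group values that appear in both train and eval."""
    group_of = {ep["id"]: ep[key] for ep in episodes}
    train_groups = {group_of[i] for i in splits["train"]}
    eval_groups = {group_of[i] for i in splits["eval"]}
    return train_groups & eval_groups
```

Changing `salt` reshuffles which groups are held out while keeping each run reproducible.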
Evaluation studio
Compare training data against policy rollouts. Per-dimension distribution shift analysis. Failure taxonomy. Know whether your policy fails because of bad data or bad architecture.
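Many shift measures exist. One of the simplest per-dimension measures is how far the rollout mean moved, in units of the training standard deviation. The sketch below uses that standardized mean difference as an assumption (Traceplane may use a different statistic, such as a distributional distance). It treats a dimension that was constant in training but moved at rollout time as infinitely shifted.

```python
import math
import statistics


def per_dim_shift(train, rollout, threshold=1.0):
    """train, rollout: lists of equal-length vectors (e.g. actions or states).
    Returns [(dim, shift)] where |mean_rollout - mean_train| / std_train > threshold."""
    flagged = []
    for d in range(len(train[0])):
        tr = [row[d] for row in train]
        ro = [row[d] for row in rollout]
        gap = abs(statistics.fmean(ro) - statistics.fmean(tr))
        sd = statistics.pstdev(tr)
        if sd == 0.0:
            shift = 0.0 if gap == 0.0 else math.inf  # constant in training data
        else:
            shift = gap / sd
        if shift > threshold:
            flagged.append((d, shift))
    return flagged


train = [[0.0, 1.0], [2.0, 1.0]]
rollout = [[4.0, 2.0], [4.0, 2.0]]
print(per_dim_shift(train, rollout))  # [(0, 3.0), (1, inf)]
```

A mean-based measure misses shifts that preserve the mean (for example, a change in spread), which is why distributional distances are often used alongside it.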
Dataset recipes
Reproducible dataset compilation from multiple sources. Define sim/real ratios, tier balance, and task weights in a JSON recipe. Same recipe always produces the same training set.
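Determinism in a recipe comes from two things: a fixed random seed, and an input order that does not depend on how files happen to be listed. The recipe fields below (`seed`, `total`, `weights`) are a hypothetical schema, not Traceplane's recipe format, and only the sim/real mix is modeled; task weights could be handled the same way. Because each source's count is rounded independently, the result may differ slightly from `total` for some weight combinations.

```python
import json
import random


def compile_recipe(recipe_json, pools):
    """Sample episode IDs from each source pool in the proportions the recipe gives.
    pools: {source_name: iterable of episode IDs}."""
    recipe = json.loads(recipe_json)
    rng = random.Random(recipe["seed"])  # private RNG, unaffected by global random state
    selected = []
    for source in sorted(recipe["weights"]):  # fixed source order
        n = round(recipe["total"] * recipe["weights"][source])
        pool = sorted(pools[source])  # independent of file-listing order
        if n > len(pool):
            raise ValueError(f"recipe wants {n} episodes from {source!r}, only {len(pool)} exist")
        selected.extend((source, ep) for ep in rng.sample(pool, n))
    return selected


RECIPE = '{"seed": 7, "total": 10, "weights": {"sim": 0.7, "real": 0.3}}'
```

For a given Python version, the same recipe and the same pools yield the same selection regardless of the order in which the pools were listed.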
One-click publish
Export to HuggingFace Hub with auto-generated dataset cards, QA stats, and citation metadata. Run traceplane publish --push and your dataset is streaming-ready on the Hub.
50+ automated QA checks
Structural, kinematic, semantic, and replayability checks. Catches NaN stats, zero-byte files, timestamp skew, joint limit violations, and schema drift. CI-friendly: exits 1 on failure.
Coming soon: search and understand every episode
We're building an ML layer on top of the QA pipeline — so you can find, compare, and cluster episodes by what actually happens in them, not just by filename. Request early access.
Semantic episode search
Find episodes in plain language: "grasps that slipped on cylindrical objects," "demos where the arm approached from the left." Natural-language and image-similarity search across millions of episodes.
Find-similar & deduplicate
Pick any episode and retrieve its nearest neighbors. Surface near-duplicate demos that bloat your dataset and bias training — embedding-based redundancy scoring that goes beyond timestamp and metadata heuristics.
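Embedding-based deduplication reduces to nearest-neighbor search under a similarity measure. This brute-force sketch assumes each episode already has a fixed-length, non-zero embedding vector (how those are computed is out of scope) and uses cosine similarity with a hypothetical threshold of 0.98. It is O(n²), so at the scale mentioned above an approximate nearest-neighbor index would be needed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors (a zero vector raises ZeroDivisionError)."""
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))


def near_duplicates(embeddings, threshold=0.98):
    """embeddings: {episode_id: vector}. Returns (id_a, id_b, similarity) for every pair
    at or above threshold. Brute force: compares all n*(n-1)/2 pairs."""
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1 :]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                pairs.append((a, b, sim))
    return pairs
```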
Failure-mode clusters
Automatically group failed episodes into named failure modes. See the patterns behind your policy's mistakes — "missed the object," "gripper closed early" — instead of scrubbing through rollouts one by one.
How teams use Traceplane
Audit before you train
Run traceplane check on any dataset before it enters your training pipeline. Catch broken metadata, NaN stats, and schema drift. Save weeks of debugging policy failures caused by silent data bugs.
Clean up published datasets
Downloaded a dataset from HuggingFace? Run traceplane fix --all to auto-repair common issues. Regenerate stats, patch metadata, delete corrupt files. Get a dataset that's actually training-ready.
Decide what to collect next
Run traceplane coverage across your datasets. See which tasks, objects, and scenes are underrepresented. Stop recording 500 more demos you don't need. Collect the 20 that fill the gaps.
Publish with confidence
Run traceplane publish --push to export to HuggingFace Hub with an auto-generated dataset card, QA report, and reproducible split indices. Cite the QA report in your paper to show the dataset is training-ready.
For teams that publish datasets or train policies
Academic labs
Free for academic labs with .edu or .ac email addresses. Audit your dataset before publishing. Generate a QA report for your paper. Export benchmark-ready train/eval bundles. Stop losing reviewer confidence to silent data bugs.
VLA companies
Training across 10+ embodiments and millions of episodes? Automated QA on every ingest. SQL queries over trajectory data. Materialize training sets in seconds, not hours.
Dataset authors
Releasing a new dataset? Run traceplane check before you push to HuggingFace. Catch the issues that cost downstream users weeks — before they find them.
Data collectors
Capturing human demos with wearables or teleop rigs? Auto-QA validates multi-sensor sync, hand tracking quality, SLAM stability, and joint limit replayability on every episode.
Frequently asked questions
What data formats does Traceplane support?
Traceplane ingests LeRobot v2/v3, HDF5 (robomimic/ActionNet style), rosbag2 (.mcap/.db3), Zarr, RLDS, and custom formats. All data is normalized to a canonical episode schema and can be exported in any supported format.
How does automated QA work for robot trajectory data?
Every episode is automatically scored across three layers: structural checks (FPS consistency, dropped frames, action dimensions), kinematic validation (joint limits, velocity sanity, collision detection), and semantic checks (object presence, hand-object interaction). Episodes are classified as pass, needs review, or reject.
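The kinematic layer and the three-way classification can be sketched as follows. Joint limits, velocity limits, and the review/reject cut-offs are hypothetical parameters; real values depend on the robot, and Traceplane's scoring rules are not documented here. Velocity is estimated with first-order finite differences at a fixed sample interval `dt`.

```python
def kinematic_violations(positions, dt, lower, upper, max_vel):
    """positions: joint vectors sampled every dt seconds; lower/upper/max_vel are per joint.
    Returns (joint-limit violations, velocity violations) as per-joint sample counts."""
    limit = sum(
        1
        for q in positions
        for qi, lo, hi in zip(q, lower, upper)
        if not lo <= qi <= hi
    )
    velocity = sum(
        1
        for prev, cur in zip(positions, positions[1:])
        for a, b, vmax in zip(prev, cur, max_vel)
        if abs(b - a) / dt > vmax  # first-order finite-difference speed
    )
    return limit, velocity


def classify(limit, velocity, review_at=1, reject_at=5):
    """Map violation counts to pass / needs review / reject (cut-offs are placeholders)."""
    total = limit + velocity
    if total >= reject_at:
        return "reject"
    if total >= review_at:
        return "needs review"
    return "pass"
```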
What is embodiment-aware storage?
Traceplane understands the physical meaning of each data dimension — for example, which joints belong to a 7-DoF left arm vs. a 3-DoF waist. This lets you store data from a 23-DoF humanoid and query just the arm joints, or search across different robot embodiments for similar tasks.
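Embodiment awareness can be as simple as a named map from joint groups to index ranges in the flat state or action vector. The layout below is a hypothetical 23-DoF humanoid (two 7-DoF arms, a 3-DoF waist, and 6 leg joints) chosen only so the counts add up. The real ordering comes from the robot's description, and this is not Traceplane's schema.

```python
# Hypothetical joint layout for a 23-DoF humanoid: (start, end) indices, end exclusive.
LAYOUT = {
    "left_arm": (0, 7),
    "right_arm": (7, 14),
    "waist": (14, 17),
    "legs": (17, 23),
}


def validate_layout(layout, dof):
    """True if the groups tile indices 0..dof-1 exactly, with no gaps or overlaps."""
    cursor = 0
    for start, end in sorted(layout.values()):
        if start != cursor or end <= start:
            return False
        cursor = end
    return cursor == dof


def select(frame, layout, *groups):
    """Return the values of the named joint groups from one flat frame, in request order."""
    out = []
    for group in groups:
        start, end = layout[group]
        out.extend(frame[start:end])
    return out
```

Validating the layout once at ingest means every later query can slice by name without re-checking indices.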
How is Traceplane different from HuggingFace Datasets?
HuggingFace is a general-purpose dataset hub. Traceplane is purpose-built for robotics trajectories: it understands embodiments, runs automated QA on every episode, pre-computes frame embeddings and action tokens, and materializes training-ready datasets sharded for your GPU count — all things a general hub cannot do.
Does Traceplane support human demonstration data from XR/wearables?
Yes. Traceplane validates hand tracking quality, SLAM stability, grasp events, and device consistency — 15 automated checks specific to human demonstration data. It also handles retargeting from human hand poses to robot action spaces, including workspace mapping and calibration. This is the infrastructure layer for the "human data → robot policy" approach used by teams like Generalist AI.
Can I stream data directly into my training loop?
Yes. Traceplane provides streaming dataloaders for PyTorch and JAX that serve Arrow batches directly, so there is no need to download terabytes to local disk first. You can also resample on the fly, for example serving data stored at 30 Hz at 10 Hz.
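Resampling 30 Hz data to 10 Hz can be done with a zero-order hold: for each output tick, take the most recent input sample. The sketch below assumes sorted timestamps in seconds and is not Traceplane's loader; the function name `resample_zoh` is hypothetical. For evenly spaced data at an exact integer ratio it reduces to taking every third frame, and for continuous signals you may prefer interpolation instead.

```python
import bisect


def resample_zoh(timestamps, values, target_hz, eps=1e-9):
    """Zero-order hold: at each output tick, emit the latest input sample at or before it.
    timestamps: sorted, in seconds. Returns (output timestamps, output values)."""
    period = 1.0 / target_hz
    out_t, out_v = [], []
    k = 0
    t = timestamps[0]
    while t <= timestamps[-1] + eps:
        i = bisect.bisect_right(timestamps, t + eps) - 1  # eps absorbs float rounding
        out_t.append(t)
        out_v.append(values[i])
        k += 1
        t = timestamps[0] + k * period  # computed from k, so rounding error does not accumulate
    return out_t, out_v
```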
Who is behind Traceplane?
Traceplane is built by engineers with deep experience in robotics data pipelines and large-scale ML infrastructure. We've worked with rosbag2, LeRobot, HDF5, and custom formats across humanoid manipulation, mobile robotics, and simulation — and built Traceplane to solve the data problems we kept hitting ourselves.
Is my data secure?
Yes. Each organization's data is fully isolated — separate storage buckets, separate compute, no cross-tenant access. Data is encrypted at rest (AES-256) and in transit (TLS 1.3). We never use your data to train models or share it with other customers. For teams with strict compliance requirements, we offer private deployment options.
Pricing
Simple, usage-based pricing
Start free. Scale when your data does.
For university labs and research groups with .edu or .ac email addresses.
- Unlimited users
- 1 TB storage
- All platform features
- Community support
For robotics companies building policies at scale.
- Unlimited users
- Storage: $0.02/GB/mo
- Compute: $0.10/query-hour
- Materialization: $0.05/GB
- Priority support
Private deployment, SLAs, and dedicated support.
- Self-hosted or VPC
- SSO / SAML
- Volume discounts
- Dedicated support
Get a free dataset audit
Send us your dataset (or a link) and we'll run our full QA pipeline on it. You'll get a detailed report of every issue we find — no commitment required.