Now in private beta

Which episodes are broken.
Which ones help your policy.
What to collect next.

Check, fix, curate, and evaluate robot-learning datasets. Score every episode. Salvage good segments from bad demos. Detect coverage gaps.
Export reproducible, training-ready datasets. No custom scripts.

# pip install traceplane

# 1. Find what's broken
$ traceplane check ./my-dataset
  Result: FAIL — 3 errors, 5 warnings
  [X] STATS_NAN_INF — stats['action']['std'][4] is nan
  [X] DATA_ZERO_BYTE — 2 zero-byte Parquet files

# 2. Fix automatically
$ traceplane fix ./my-dataset --all
  [*] Deleted 2 zero-byte files
  [*] Regenerated stats for 5 columns

# 3. Score episodes, salvage good segments
$ traceplane curate ./my-dataset --html report.html
  Keep: 1,847  Review: 412  Drop: 139

# 4. Find coverage gaps
$ traceplane coverage ./dataset-a ./dataset-b
  [!!!] Severe task imbalance: 'pick' has 800 but 'pour' has 12
  [!!!] Action space coverage: 0.2% — collect more diverse demos

# 5. Publish with QA report
$ traceplane publish ./my-dataset --repo-id my-lab/clean-dataset --push

Your robot data has bugs you don't know about

We audited 10 popular open-source robot datasets. Every single one had quality issues that silently degrade training. Read the full audit.

Metadata lies

Bridge V2 claims 60K trajectories but has 25K. Action space descriptions in features.json are wrong. Your normalization is silently broken.

NaN in your stats

Zero-variance dimensions produce Inf during normalization. NaN propagates through your policy. Training loss looks fine until eval crashes.
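
To make the failure concrete, here is a minimal sketch in plain Python. It is not Traceplane's implementation; the function names and the `eps` floor are assumptions chosen for illustration.

```python
import math
import statistics

def column_stats(values):
    """Return (mean, population std) for one action dimension."""
    return statistics.fmean(values), statistics.pstdev(values)

def normalize_naive(values, mean, std):
    """Naive z-score. With std == 0 the result is NaN (0/0) or Inf (x/0)."""
    out = []
    for v in values:
        num = v - mean
        if std == 0.0:
            # Plain Python raises ZeroDivisionError here. Array libraries
            # typically return the IEEE-754 value instead, so emulate that.
            out.append(math.nan if num == 0.0 else math.copysign(math.inf, num))
        else:
            out.append(num / std)
    return out

def normalize_guarded(values, mean, std, eps=1e-8):
    """Z-score with a floor on std, so a constant dimension maps to 0.0."""
    return [(v - mean) / max(std, eps) for v in values]

gripper = [1.0, 1.0, 1.0, 1.0]  # a dimension that never changes in training
mean, std = column_stats(gripper)
naive = normalize_naive(gripper, mean, std)        # every entry is NaN
guarded = normalize_guarded(gripper, mean, std)    # every entry is 0.0
unseen = normalize_naive([1.5], mean, std)         # a new value becomes Inf
```

The guard is one possible design choice. Dropping constant dimensions before training is another.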

Corrupt files ship

Zero-byte Parquet files, missing camera views, timestamp drift between sensors. Nobody catches these at release time. Downstream users lose weeks debugging.
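
The zero-byte case is the easiest one to catch yourself. The sketch below uses only the Python standard library. The helper name and the directory layout in the demo are hypothetical, and this is not how Traceplane itself performs the check.

```python
import tempfile
from pathlib import Path

def find_zero_byte_parquet(root):
    """Return sorted paths of *.parquet files under root whose size is 0."""
    return sorted(
        p for p in Path(root).rglob("*.parquet")
        if p.is_file() and p.stat().st_size == 0
    )

# Demo on a throwaway directory with one non-empty and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    chunk = Path(tmp) / "data" / "chunk-000"
    chunk.mkdir(parents=True)
    # Placeholder bytes only; this is not a valid Parquet file.
    (chunk / "episode_000000.parquet").write_bytes(b"PAR1")
    (chunk / "episode_000001.parquet").write_bytes(b"")
    empty_names = [p.name for p in find_zero_byte_parquet(tmp)]
```

A size check finds empty files only. Truncated or otherwise corrupt files need a real Parquet reader.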

No standard QA tooling

How do you know a dataset is training-ready? There's no pytest for datasets, no CI pipeline for trajectory data, no way to produce a defensible quality report for a paper.

Dataset CI for robotics

01

Check

Point traceplane check at any LeRobot, HDF5, or rosbag2 dataset. Get a full QA report: metadata, Parquet integrity, stats, schema drift, dimensions.

LeRobot v2/v3, HDF5, rosbag2

02

Fix

Auto-repair: delete corrupt files, regenerate stats, patch metadata, reindex episodes. --dry-run to preview. Backups before every change.

Pass, Needs review, Reject

03

Curate

Score every episode on smoothness, outliers, and redundancy. Salvage good segments from imperfect demos. Keep/review/drop recommendations with visual reports.

04

Analyze

Coverage-gap analysis across datasets. "You need mugs in cluttered kitchens, not 500 more tabletop demos." Generalization-aware train/eval splits with leakage detection.

05

Evaluate

Compare training vs rollout distributions. Per-dimension shift analysis. Failure taxonomy. Know if your data quality is causing your policy failures.

06

Publish

Export to LeRobot + HuggingFace Hub with auto-generated dataset cards, QA stats, and reproducible split indices. One command: traceplane publish --push.

Everything between raw demos and a training run

Dataset CI

Run traceplane check locally or in CI. Validates metadata, Parquet integrity, stats correctness, schema consistency, action dimensions, and video files. Exits 1 on failure — plug it into your pipeline like a linter.
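
Because the command signals failure through its exit code, any wrapper that inspects the return code can gate a pipeline. The sketch below shows the idea in Python. The `gate` helper is hypothetical, and stand-in commands are used so the example runs without Traceplane installed.

```python
import subprocess
import sys

def gate(cmd):
    """Run cmd and return True if it exited with code 0, else False."""
    return subprocess.run(cmd).returncode == 0

# In a real pipeline the command would be the one shown on this page:
#   gate(["traceplane", "check", "./my-dataset"])
# Stand-ins that exit 0 and 1, so this sketch is runnable anywhere:
passing = gate([sys.executable, "-c", "raise SystemExit(0)"])
failing = gate([sys.executable, "-c", "raise SystemExit(1)"])
```

Most CI systems already fail a job on a non-zero exit code, so a wrapper like this is only needed when the check runs inside a larger Python script.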

Episode & segment curation

Score episodes on smoothness, outliers, and redundancy. Salvage good segments from imperfect demos instead of discarding an entire episode when only part of it is bad. Keep/review/drop recommendations with visual HTML reports.
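
As a rough illustration of what a smoothness score can look like, the sketch below measures the mean absolute second difference of a one-dimensional action signal. This metric, the function names, and the thresholds are all assumptions for illustration, not Traceplane's actual scoring.

```python
def roughness(actions, dt):
    """Mean absolute discrete acceleration of a 1-D signal. Lower is smoother."""
    if len(actions) < 3:
        return 0.0
    acc = [
        (actions[i + 1] - 2 * actions[i] + actions[i - 1]) / (dt * dt)
        for i in range(1, len(actions) - 1)
    ]
    return sum(abs(a) for a in acc) / len(acc)

def recommend(score, keep_below, drop_above):
    """Map a roughness score to keep / review / drop (hypothetical thresholds)."""
    if score < keep_below:
        return "keep"
    if score > drop_above:
        return "drop"
    return "review"

smooth = [0.0, 1.0, 2.0, 3.0, 4.0]    # constant velocity
jittery = [0.0, 3.0, 0.0, 3.0, 0.0]   # oscillating command

smooth_score = roughness(smooth, dt=0.5)
jittery_score = roughness(jittery, dt=0.5)
smooth_label = recommend(smooth_score, keep_below=1.0, drop_above=10.0)
jittery_label = recommend(jittery_score, keep_below=1.0, drop_above=10.0)
```

Running the same function over sliding windows instead of whole episodes is one way to locate the good segments inside an otherwise rough demo.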

Coverage-gap analysis

Cross-dataset distribution maps over tasks, objects, scenes, and embodiments. Identifies what's missing: "You have 800 pick episodes but only 12 pour." Recommends what data to collect next.
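
The pick/pour example comes down to counting labels and comparing the extremes. Here is a minimal sketch of that idea. The function name and the `max_ratio` threshold are hypothetical and say nothing about how Traceplane weighs imbalance.

```python
from collections import Counter

def task_imbalance(task_labels, max_ratio=10.0):
    """Return (largest_task, smallest_task, ratio) if ratio > max_ratio, else None."""
    counts = Counter(task_labels)
    if not counts:
        return None
    top_task, top_n = counts.most_common(1)[0]
    low_task, low_n = min(counts.items(), key=lambda kv: kv[1])
    ratio = top_n / low_n
    if ratio > max_ratio:
        return top_task, low_task, ratio
    return None

labels = ["pick"] * 800 + ["pour"] * 12
finding = task_imbalance(labels)   # flags 'pick' versus 'pour'
```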

Generalization-aware splits

Build train/eval splits that hold out by task, object, scene, or embodiment. Leakage detection catches overlap between sets. Reproducible indices. Show in your paper that your results generalize.
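
A hold-out split and a leakage check can be sketched in a few lines. The episode dictionaries and helper names below are hypothetical. A real pipeline would read these fields from dataset metadata.

```python
def split_by_holdout(episodes, key, held_out):
    """Episodes whose episode[key] is in held_out go to eval, the rest to train."""
    train = [e for e in episodes if e[key] not in held_out]
    evals = [e for e in episodes if e[key] in held_out]
    return train, evals

def leakage(train, evals, key):
    """Values of key present in both splits. An empty set means no overlap."""
    return {e[key] for e in train} & {e[key] for e in evals}

episodes = [
    {"id": 0, "task": "pick", "object": "mug"},
    {"id": 1, "task": "pick", "object": "bowl"},
    {"id": 2, "task": "pour", "object": "mug"},
    {"id": 3, "task": "pour", "object": "bottle"},
]

train, evals = split_by_holdout(episodes, key="object", held_out={"mug"})
object_leak = leakage(train, evals, "object")   # no shared objects
task_leak = leakage(train, evals, "task")       # tasks are still shared
```

The second check shows why the hold-out axis matters: a split that is clean on objects can still share every task between train and eval.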

Evaluation studio

Compare training data against policy rollouts. Per-dimension distribution shift analysis. Failure taxonomy. Know whether your policy fails because of bad data or bad architecture.
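
One simple way to quantify per-dimension shift is a standardized mean difference: how far the rollout mean sits from the training mean, in units of training standard deviation. The sketch below assumes this metric for illustration. The page does not specify which statistic Traceplane uses.

```python
import statistics

def per_dimension_shift(train_rows, rollout_rows, eps=1e-8):
    """For each dimension, return |mean_rollout - mean_train| / std_train."""
    shifts = []
    # zip(*rows) transposes a list of rows into per-dimension columns.
    for train_col, roll_col in zip(zip(*train_rows), zip(*rollout_rows)):
        mu_train = statistics.fmean(train_col)
        sd_train = statistics.pstdev(train_col)
        mu_roll = statistics.fmean(roll_col)
        shifts.append(abs(mu_roll - mu_train) / max(sd_train, eps))
    return shifts

train = [[0.0, 1.0], [2.0, 1.0], [0.0, 3.0], [2.0, 3.0]]
rollout = [[1.0, 5.0], [1.0, 7.0]]
shift = per_dimension_shift(train, rollout)
# Dimension 0 is unchanged. Dimension 1 sits four training std devs away.
```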

Dataset recipes

Reproducible dataset compilation from multiple sources. Define sim/real ratios, tier balance, and task weights in a JSON recipe. Same recipe always produces the same training set.
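
The reproducibility claim rests on two things: a fixed seed and a stable iteration order. The sketch below shows both with a made-up recipe. The JSON keys (`seed`, `total`, `sources`) are hypothetical and are not Traceplane's recipe schema.

```python
import json
import random

RECIPE = json.loads("""
{
  "seed": 7,
  "total": 10,
  "sources": {"sim": 0.7, "real": 0.3}
}
""")

def compile_recipe(recipe, pools):
    """Sample episode ids from each source pool according to the recipe ratios."""
    rng = random.Random(recipe["seed"])      # local RNG, no global state
    selected = []
    for name in sorted(recipe["sources"]):   # sorted for a stable order
        n = round(recipe["total"] * recipe["sources"][name])
        picked = sorted(rng.sample(pools[name], n))
        selected += [(name, i) for i in picked]
    return selected

pools = {"sim": list(range(100)), "real": list(range(20))}
first = compile_recipe(RECIPE, pools)
second = compile_recipe(RECIPE, pools)       # identical to first
```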

One-click publish

Export to HuggingFace Hub with auto-generated dataset cards, QA stats, and citation metadata. traceplane publish --push and your dataset is streaming-ready on the Hub.

50+ automated QA checks

Structural, kinematic, semantic, and replayability checks. Catches NaN stats, zero-byte files, timestamp skew, joint limit violations, and schema drift. CI-friendly: exits 1 on failure.

Coming soon: search and understand every episode

We're building an ML layer on top of the QA pipeline — so you can find, compare, and cluster episodes by what actually happens in them, not just by filename. Request early access.

Coming soon

Semantic episode search

Find episodes in plain language: "grasps that slipped on cylindrical objects," "demos where the arm approached from the left." Natural-language and image-similarity search across millions of episodes.

Coming soon

Find-similar & deduplicate

Pick any episode and retrieve its nearest neighbors. Surface near-duplicate demos that bloat your dataset and bias training — embedding-based redundancy scoring that goes beyond timestamp and metadata heuristics.
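
For readers unfamiliar with embedding-based redundancy, the general technique is to compare episode embeddings by cosine similarity and flag pairs above a threshold. The sketch below illustrates that technique only. The embeddings, the threshold, and the function names are assumptions, and this feature is not yet released.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def near_duplicates(embeddings, threshold=0.99):
    """Return index pairs (i, j) with i < j and cosine similarity >= threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

emb = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]   # toy 2-D embeddings
dups = near_duplicates(emb)
```

The pairwise loop is quadratic in the number of episodes. At scale, an approximate nearest-neighbor index would replace it.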

Coming soon

Failure-mode clusters

Automatically group failed episodes into named failure modes. See the patterns behind your policy's mistakes — "missed the object," "gripper closed early" — instead of scrubbing through rollouts one by one.

How teams use Traceplane

01

Audit before you train

Run traceplane check on any dataset before it enters your training pipeline. Catch broken metadata, NaN stats, and schema drift. Save weeks of debugging policy failures caused by silent data bugs.

02

Clean up published datasets

Downloaded a dataset from HuggingFace? Run traceplane fix --all to auto-repair common issues. Regenerate stats, patch metadata, delete corrupt files. Get a dataset that's actually training-ready.

03

Decide what to collect next

Run traceplane coverage across your datasets. See which tasks, objects, and scenes are underrepresented. Stop recording 500 more demos you don't need. Collect the 20 that fill the gaps.

04

Publish with confidence

Run traceplane publish --push to export to HuggingFace Hub with an auto-generated dataset card, QA report, and reproducible split indices. Show in your paper that your dataset is ready.

For teams that publish datasets or train policies

Academic labs

Free for .edu. Audit your dataset before publishing. Generate a QA report for your paper. Export benchmark-ready train/eval bundles. Stop losing reviewer confidence to silent data bugs.

VLA companies

Training across 10+ embodiments and millions of episodes? Automated QA on every ingest. SQL queries over trajectory data. Materialize training sets in seconds, not hours.

Dataset authors

Releasing a new dataset? Run traceplane check before you push to HuggingFace. Catch the issues that cost downstream users weeks — before they find them.

Data collectors

Capturing human demos with wearables or teleop rigs? Auto-QA validates multi-sensor sync, hand tracking quality, SLAM stability, and joint limit replayability on every episode.

Frequently asked questions

What data formats does Traceplane support?

Traceplane ingests LeRobot v2/v3, HDF5 (robomimic/ActionNet style), rosbag2 (.mcap/.db3), Zarr, RLDS, and custom formats. All data is normalized to a canonical episode schema and can be exported in any supported format.

How does automated QA work for robot trajectory data?

Every episode is automatically scored across three layers: structural checks (FPS consistency, dropped frames, action dimensions), kinematic validation (joint limits, velocity sanity, collision detection), and semantic checks (object presence, hand-object interaction). Episodes are classified as pass, needs review, or reject.
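
As an example of the structural layer, a dropped-frame check can be written against timestamps alone. This is a minimal sketch under assumed names and an assumed tolerance, not the check Traceplane runs.

```python
def dropped_frames(timestamps, fps, tolerance=0.5):
    """Return indices i where the gap before frame i exceeds
    (1 + tolerance) nominal frame periods."""
    period = 1.0 / fps
    limit = period * (1.0 + tolerance)
    return [
        i for i in range(1, len(timestamps))
        if timestamps[i] - timestamps[i - 1] > limit
    ]

# 10 Hz recording with the frame at t = 0.3 s missing.
ts = [0.0, 0.1, 0.2, 0.4, 0.5]
gaps = dropped_frames(ts, fps=10)   # flags the frame after the gap
```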

What is embodiment-aware storage?

Traceplane understands the physical meaning of each data dimension — for example, which joints belong to a 7-DoF left arm vs. a 3-DoF waist. This lets you store data from a 23-DoF humanoid and query just the arm joints, or search across different robot embodiments for similar tasks.
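
Conceptually, this amounts to attaching named joint groups to index ranges of the state vector. The sketch below uses the 7-DoF left arm and 3-DoF waist from the example above. The remaining groups, their ordering, and the helper name are hypothetical.

```python
# Hypothetical index layout for a 23-DoF humanoid state vector.
JOINT_GROUPS = {
    "left_arm": slice(0, 7),     # 7 DoF
    "right_arm": slice(7, 14),   # assumed 7 DoF
    "waist": slice(14, 17),      # 3 DoF
    "other": slice(17, 23),      # assumed remainder
}

def select(state, group):
    """Return the sub-vector of state that belongs to the named joint group."""
    return state[JOINT_GROUPS[group]]

state = [float(i) for i in range(23)]   # placeholder joint positions
left_arm = select(state, "left_arm")
waist = select(state, "waist")
```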

How is Traceplane different from HuggingFace Datasets?

HuggingFace is a general-purpose dataset hub. Traceplane is purpose-built for robotics trajectories: it understands embodiments, runs automated QA on every episode, pre-computes frame embeddings and action tokens, and materializes training-ready datasets sharded for your GPU count. A general-purpose hub is not designed to do any of this.

Does Traceplane support human demonstration data from XR/wearables?

Yes. Traceplane validates hand tracking quality, SLAM stability, grasp events, and device consistency — 15 automated checks specific to human demonstration data. It also handles retargeting from human hand poses to robot action spaces, including workspace mapping and calibration. This is the infrastructure layer for the "human data → robot policy" approach used by teams like Generalist AI.

Can I stream data directly into my training loop?

Yes. Traceplane provides streaming dataloaders for PyTorch and JAX that serve Arrow batches directly — no need to download terabytes to local disk first. You can also resample on the fly, for example data stored at 30Hz served at 10Hz.
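
For an integer rate ratio such as 30Hz to 10Hz, the simplest form of resampling is to keep every third frame. The sketch below shows that stride-based approach under an assumed function name. It is not Traceplane's dataloader, which may interpolate or filter instead.

```python
def decimate(frames, source_hz, target_hz):
    """Keep every (source_hz // target_hz)-th frame.
    Only integer rate ratios are supported in this sketch."""
    if target_hz <= 0 or source_hz % target_hz != 0:
        raise ValueError("source_hz must be a positive integer multiple of target_hz")
    step = source_hz // target_hz
    return frames[::step]

one_second = list(range(30))                 # frame indices at 30 Hz
served = decimate(one_second, 30, 10)        # 10 frames: 0, 3, 6, ...
```

Plain decimation can alias high-frequency motion, so signals with fast dynamics may need low-pass filtering first.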

Who is behind Traceplane?

Traceplane is built by engineers with deep experience in robotics data pipelines and large-scale ML infrastructure. We've worked with rosbag2, LeRobot, HDF5, and custom formats across humanoid manipulation, mobile robotics, and simulation — and built Traceplane to solve the data problems we kept hitting ourselves.

Is my data secure?

Yes. Each organization's data is fully isolated — separate storage buckets, separate compute, no cross-tenant access. Data is encrypted at rest (AES-256) and in transit (TLS 1.3). We never use your data to train models or share it with other customers. For teams with strict compliance requirements, we offer private deployment options.

Simple, usage-based pricing

Start free. Scale when your data does.

Academic
Free

For university labs and research groups with .edu or .ac email addresses.

  • Unlimited users
  • 1 TB storage
  • All platform features
  • Community support
Request access

Enterprise
Custom

Private deployment, SLAs, and dedicated support.

  • Self-hosted or VPC
  • SSO / SAML
  • Volume discounts
  • Dedicated support
Contact us

Get a free dataset audit

Send us your dataset (or a link) and we'll run our full QA pipeline on it. You'll get a detailed report of every issue we find — no commitment required.