How to Convert rosbag2 Data to LeRobot Format

The problem: your data is in rosbag2, your training code expects LeRobot

If you work in robotics ML, you have almost certainly hit this wall. Your teleoperation system, your real-world data collection fleet, or your simulation bridge writes rosbag2 files. But your training pipeline -- whether it uses diffusion policies, ACT, or a custom VLA architecture -- expects data in HuggingFace's LeRobot format. There is no official converter. The LeRobot GitHub issue tracker has multiple open requests for rosbag2 import support, and the usual answer is "write a custom script."

That custom script inevitably becomes 800+ lines of fragile Python that handles your specific topic names, your specific message types, and your specific camera configuration. When the robot changes, the script breaks. When a colleague records data with a slightly different topic layout, the script breaks. When LeRobot updates its format from v2.0 to v2.1, the script breaks.

This guide walks through the conversion systematically. We will cover what each format actually contains, why the conversion is non-trivial, and a step-by-step approach using the rosbags library (which does not require a ROS 2 installation). By the end, you will understand every moving part well enough to build a robust converter -- or to decide that you would rather not maintain one.

What is rosbag2?

rosbag2 is the recording and playback system for ROS 2. When you call ros2 bag record, it subscribes to a set of topics and serializes every incoming message to disk with a nanosecond-precision timestamp. The result is a directory containing:

A metadata.yaml file describing the storage plugin, serialization format, topic names, message types, and message counts.
One or more data files in the configured storage backend. The two common backends are SQLite3 (producing .db3 files) and MCAP (producing .mcap files). MCAP has become the default in ROS 2 Humble and later.

Messages are serialized using CDR (Common Data Representation), the binary wire format from DDS. To deserialize them, you need the message type definitions -- either from the original ROS 2 packages or from the embedded schema in MCAP files.

A typical manipulation dataset bag contains topics like:

/camera/color/image_raw        sensor_msgs/msg/Image
/camera/depth/image_rect_raw   sensor_msgs/msg/Image
/joint_states                  sensor_msgs/msg/JointState
/ee_pose                       geometry_msgs/msg/PoseStamped
/gripper/command               control_msgs/msg/GripperCommand
/task_label                    std_msgs/msg/String

Crucially, rosbag2 stores raw message streams. There is no concept of "episodes," no distinction between observations and actions, no dataset-level statistics, and no fixed sampling rate across topics. Each topic publishes at its own frequency -- cameras at 30 Hz, joint states at 100 Hz or 500 Hz, gripper commands at irregular intervals.

What is LeRobot format?

LeRobot is HuggingFace's open-source framework for real-world robot learning. Its dataset format has become a de facto standard for sharing manipulation and locomotion demonstration data. A LeRobot dataset is a directory (or HuggingFace repo) with this structure:

dataset/
  data/
    chunk-000/
      episode_000000.parquet
      episode_000001.parquet
      ...
  videos/
    chunk-000/
      observation.images.camera0/
        episode_000000.mp4
        episode_000001.mp4
      ...
  meta/
    info.json
    stats.json
    episodes.jsonl
    tasks.json

The Parquet files store per-timestep tabular data: joint positions, joint velocities, end-effector poses, gripper states, action vectors, and references to the corresponding video frame indices. Each row is one timestep at a fixed frequency (commonly 30 Hz or 50 Hz).

The MP4 files store video streams, one file per episode per camera. LeRobot uses video-level encoding (H.264 or H.265) rather than storing individual JPEG/PNG frames, which gives roughly 10-50x storage savings compared to image-per-frame approaches.

The meta files tie everything together:

info.json -- dataset-level metadata: number of episodes, number of frames, FPS, feature definitions with shapes and dtypes, chunk sizes, encoding details.
episodes.jsonl -- one JSON object per episode with the episode index, task index, and frame count.
tasks.json -- maps task indices to natural language instruction strings.
stats.json -- per-feature statistics (min, max, mean, standard deviation) computed across the entire dataset, used for normalization during training.

The key insight is that LeRobot format is training-oriented. Everything is pre-aligned to a fixed clock, pre-encoded for efficient sequential reads, and pre-annotated with the statistics a training loop needs for normalization. rosbag2, by contrast, is recording-oriented -- optimized for lossless capture of asynchronous message streams.

Why the conversion is harder than it looks

At first glance, this seems like a straightforward ETL job: read the bag, write Parquet and MP4. In practice, there are six distinct challenges that each require careful handling.

1. Message deserialization without ROS

If you want your conversion pipeline to run on a training server or CI machine (not the robot), you cannot depend on a full ROS 2 installation. The rosbags Python library (installable via pip install rosbags) can read both SQLite3 and MCAP backends and deserialize CDR messages using bundled or custom type definitions. However, if your bag uses custom message types not in the standard sensor_msgs/geometry_msgs set, you will need to provide those definitions manually via rosbags' typestore registration.

2. Timestamp alignment across topics

Different topics publish at different rates and are not synchronized. Joint states might arrive at 200 Hz, images at 30 Hz, and gripper commands at irregular intervals. LeRobot expects a single fixed-frequency time grid where every column has exactly one value per timestep. You need to choose a target frequency and resample every topic onto that grid -- typically using nearest-neighbor or zero-order hold for discrete values and linear interpolation for continuous states.

3. Image extraction and video encoding

ROS image messages come in several forms. sensor_msgs/Image carries raw pixel data (BGR8, RGB8, 16UC1 for depth, etc.). sensor_msgs/CompressedImage carries JPEG or PNG bytes. You need to decode each frame, convert to RGB if necessary, then encode the full episode as an MP4 file with a constant frame rate matching your target frequency. The encoding must use pixel format yuv420p for broad compatibility, and the timestamps in the video container must be consistent with the Parquet frame indices.

4. Mapping ROS topics to LeRobot schema fields

rosbag2 has no concept of "observation" vs. "action." A /joint_states topic might contain the current joint positions (observation), while a /joint_commands topic contains the commanded positions (action). A /gripper/status might be observation, while /gripper/command is action. This mapping is entirely dataset-specific -- there is no standard convention. Your converter must accept a configuration that maps topic names to LeRobot feature names and specifies which features are observations and which are actions.

5. Episode boundary detection

A single rosbag2 file might contain multiple episodes (start, execute task, reset, repeat) or one continuous recording. LeRobot requires explicit episode boundaries. If your recording pipeline publishes an episode marker topic, you can split on that. Otherwise, you might need to detect resets via sudden jumps in joint positions, or split on time gaps, or treat each bag file as a single episode.

6. Computing dataset statistics

LeRobot's stats.json requires min, max, mean, and standard deviation for every feature across the entire dataset. This means you cannot write the stats file until you have processed every episode. For large datasets, this requires either a two-pass approach (first pass to compute stats, second to write normalized data) or an online algorithm that accumulates statistics incrementally.

Step-by-step conversion

The following approach uses rosbags for bag reading, numpy for resampling, pyav (or ffmpeg subprocess) for video encoding, and pyarrow for Parquet writing. No ROS 2 installation is required.

Step 1: Read the bag and enumerate topics

Start by opening the bag and inspecting what it contains. This tells you which topics are available, their message types, and their message counts.

from rosbags.rosbag2 import Reader
from rosbags.typesys import get_types_from_msg, Stores
from rosbags.typesys.stores.ros2_humble import FIELDDEFS

def inspect_bag(bag_path):
    with Reader(bag_path) as reader:
        for conn in reader.connections:
            print(f"{conn.topic:<40s} {conn.msgtype:<40s} "
                  f"count={conn.msgcount}")
    return reader.connections

Step 2: Define the topic-to-feature mapping

Create a configuration that maps each rosbag2 topic to a LeRobot feature. This is the part that is inherently dataset-specific.

TOPIC_MAP = {
    "/joint_states": {
        "feature": "observation.state",
        "extract": lambda msg: list(msg.position) + list(msg.velocity),
        "shape": [14],  # 7 positions + 7 velocities
    },
    "/ee_pose": {
        "feature": "observation.ee_pos",
        "extract": lambda msg: [
            msg.pose.position.x, msg.pose.position.y,
            msg.pose.position.z,
            msg.pose.orientation.x, msg.pose.orientation.y,
            msg.pose.orientation.z, msg.pose.orientation.w,
        ],
        "shape": [7],
    },
    "/joint_commands": {
        "feature": "action",
        "extract": lambda msg: list(msg.data),
        "shape": [7],
    },
    "/camera/color/image_raw": {
        "feature": "observation.images.camera0",
        "type": "video",
        "encoding": "bgr8",
        "width": 640,
        "height": 480,
    },
}

Step 3: Extract messages and align to a fixed clock

Read all messages, group by topic, then resample everything onto a common time grid at the target frequency.

import numpy as np
from rosbags.serde import deserialize_cdr

def extract_messages(bag_path, topic_map, target_hz=30):
    raw = {topic: [] for topic in topic_map}

    with Reader(bag_path) as reader:
        for conn, timestamp, rawdata in reader.messages():
            if conn.topic in topic_map:
                msg = deserialize_cdr(rawdata, conn.msgtype)
                raw[conn.topic].append((timestamp, msg))

    # Determine episode time bounds (nanoseconds)
    all_ts = [t for msgs in raw.values() for t, _ in msgs]
    t_start, t_end = min(all_ts), max(all_ts)

    # Build the target time grid
    dt_ns = int(1e9 / target_hz)
    grid = np.arange(t_start, t_end, dt_ns)

    return raw, grid

Step 4: Resample non-image topics with nearest-neighbor

def resample_to_grid(timestamps, values, grid):
    """For each grid point, find the nearest message."""
    ts_arr = np.array(timestamps)
    indices = np.searchsorted(ts_arr, grid, side="right") - 1
    indices = np.clip(indices, 0, len(values) - 1)
    return [values[i] for i in indices]

Step 5: Encode image topics to MP4

Extract each image frame, resample to the target FPS grid, and write an MP4. Using pyav avoids shelling out to ffmpeg.

import av
import numpy as np

def encode_video(frames, output_path, fps, width, height):
    """frames: list of numpy arrays (H, W, 3) in RGB."""
    container = av.open(str(output_path), mode="w")
    stream = container.add_stream("h264", rate=fps)
    stream.width = width
    stream.height = height
    stream.pix_fmt = "yuv420p"
    stream.options = {"crf": "23", "preset": "medium"}

    for i, rgb_frame in enumerate(frames):
        frame = av.VideoFrame.from_ndarray(rgb_frame, format="rgb24")
        frame.pts = i
        for packet in stream.encode(frame):
            container.mux(packet)

    for packet in stream.encode():
        container.mux(packet)
    container.close()

Step 6: Build the Parquet file

Assemble all resampled features into a single table and write as Parquet. Each row corresponds to one timestep on the fixed clock.

import pyarrow as pa
import pyarrow.parquet as pq

def build_parquet(features, grid, episode_idx, output_dir):
    """features: dict of feature_name -> list of values (one per grid point)."""
    n = len(grid)
    columns = {
        "timestamp": pa.array(grid.tolist(), type=pa.int64()),
        "frame_index": pa.array(list(range(n)), type=pa.int64()),
        "episode_index": pa.array([episode_idx] * n, type=pa.int64()),
    }

    for name, values in features.items():
        if isinstance(values[0], list):
            # Multi-dimensional feature: store as list column
            columns[name] = pa.array(values)
        else:
            columns[name] = pa.array(values)

    table = pa.table(columns)
    out_path = output_dir / f"episode_{episode_idx:06d}.parquet"
    pq.write_table(table, out_path)
    return n

Step 7: Generate the meta files

After processing all episodes, compute statistics and write the four required meta files.

import json
from pathlib import Path

def write_meta(dataset_dir, episodes, features_info, task_label, fps):
    meta_dir = dataset_dir / "meta"
    meta_dir.mkdir(exist_ok=True)

    # tasks.json
    tasks = [{"task_index": 0, "task": task_label}]
    (meta_dir / "tasks.json").write_text(json.dumps(tasks, indent=2))

    # episodes.jsonl
    with open(meta_dir / "episodes.jsonl", "w") as f:
        for ep in episodes:
            f.write(json.dumps(ep) + "\n")

    # info.json
    total_frames = sum(ep["length"] for ep in episodes)
    info = {
        "codebase_version": "v2.1",
        "robot_type": "unknown",
        "total_episodes": len(episodes),
        "total_frames": total_frames,
        "fps": fps,
        "features": features_info,
        "chunks_size": 1000,
        "data_path": "data/chunk-{chunk_index:03d}/{episode_id}.parquet",
        "video_path": "videos/chunk-{chunk_index:03d}/{video_key}/episode_{episode_id:06d}.mp4",
    }
    (meta_dir / "info.json").write_text(json.dumps(info, indent=2))

    # stats.json -- computed from all episodes
    # (accumulate running stats during processing)

The stats.json computation deserves special attention. For each numeric feature, you need the global min, max, mean, and standard deviation. Use Welford's online algorithm to compute mean and variance in a single pass without storing all values in memory.

class RunningStats:
    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.M2 = np.zeros(shape)
        self.min_val = np.full(shape, np.inf)
        self.max_val = np.full(shape, -np.inf)

    def update(self, x):
        x = np.asarray(x)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.M2 += delta * delta2
        self.min_val = np.minimum(self.min_val, x)
        self.max_val = np.maximum(self.max_val, x)

    def finalize(self):
        std = np.sqrt(self.M2 / max(self.n, 1))
        return {
            "min": self.min_val.tolist(),
            "max": self.max_val.tolist(),
            "mean": self.mean.tolist(),
            "std": std.tolist(),
        }

Common pitfalls

Even with the above structure in place, there are several failure modes that will silently produce a broken dataset. These are the issues that cost teams days of debugging.

Timestamp drift between sensors

ROS 2 messages have two relevant timestamps: the message header stamp (when the sensor acquired the data) and the bag receive time (when rosbag2 recorded the message). These can differ by tens of milliseconds due to transport latency. Always use msg.header.stamp for alignment when available, not the bag timestamp. If a message type has no header (e.g., std_msgs/Float64MultiArray), you are stuck with the bag timestamp and should expect lower alignment accuracy.

Missing and dropped frames

If a camera drops frames, your video will have fewer frames than expected and your Parquet frame indices will be wrong. After resampling to the target grid, verify that the video frame count exactly matches the Parquet row count. If not, either insert blank frames at the gaps or adjust the Parquet to skip those timesteps.

Compressed vs. raw image handling

If the bag stores sensor_msgs/CompressedImage, the data field contains JPEG or PNG bytes, not raw pixels. You need to decode them (e.g., with cv2.imdecode) before encoding to MP4. If the bag stores raw sensor_msgs/Image in BGR8, you must convert to RGB before passing to the video encoder. Getting the color channel order wrong produces a dataset that "works" but has systematically wrong color information -- a subtle bug that degrades policy performance without obvious symptoms.

Coordinate frame conventions

ROS uses a right-handed coordinate system (X-forward, Y-left, Z-up for the body frame, but X-east, Y-north, Z-up for map frames, and conventions vary across drivers). LeRobot datasets have no enforced convention -- each dataset defines its own. Make sure you document which coordinate frame your converted data uses, and verify that your action and observation spaces are consistent (i.e., the action space uses the same frame as the observation space).

Action delay and temporal offset

In many teleoperation setups, there is a one-or-more-timestep delay between when an action is commanded and when the robot executes it. Some training frameworks (like diffusion policy) expect the action at timestep t to correspond to the observation at timestep t. Others expect action at t to describe the transition from t to t+1. If your rosbag2 data has a known command-to-execution delay, you may need to shift the action stream by that number of timesteps during conversion.

Large datasets and memory

A single rosbag2 file with 30 Hz RGB at 640x480 generates approximately 25 MB/s of raw image data. For a 10-minute recording, that is 15 GB of raw frames in memory. Process one episode at a time, stream frames through the video encoder, and do not accumulate all frames in a list before encoding. The same applies to the statistics computation -- use the streaming approach shown above.

Or skip the scripts entirely

Traceplane handles rosbag2 to LeRobot conversion automatically

Building and maintaining a custom conversion pipeline is real engineering work. Traceplane ingests rosbag2 files (.mcap and .db3) natively, aligns timestamps, extracts and encodes video streams, maps topics to a canonical schema, computes statistics, and outputs LeRobot v2-compatible datasets -- all through a single API call. Every episode is automatically QA-scored for dropped frames, timestamp skew, kinematic violations, and more.

If you are converting a handful of bags for a one-off experiment, the manual approach above will work. If you are building a data pipeline that needs to handle hundreds or thousands of bags reliably, with quality gates and versioning, that is exactly what Traceplane is built for.

Request early access