feat(droid): Add `daft.datasets.droid` by srilman · Pull Request #7089 · Eventual-Inc/Daft

srilman · 2026-06-08T17:54:08Z

Changes Made

Add an API for interacting with the DROID dataset via the daft.datasets.droid module. The current limitations include:

We don't have any APIs to read the numerical data from the trajectory.h5 files from each episode, such as sensor, observation, and state data.
We can't read the curated version that's stored in the RLDS format (https://droid-dataset.github.io/droid/the-droid-dataset.html#-using-the-dataset). We should add a custom DataSource for that.
Some other smaller TODOs sprinkled throughout the file

github-actions · 2026-06-08T17:54:24Z

Rust Dependency Diff

Head: e90642c5c0c113e7d681627c9feedec8ab7d1499 vs Base: 823c32a0446d83377b766c33c145a850a93fb304.

✅ OK: Within budget.

New Crates: 0
Removed Crates: 0

codecov · 2026-06-08T18:13:50Z

Codecov Report

❌ Patch coverage is 70.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.05%. Comparing base (006355e) to head (1a90010).
⚠️ Report is 10 commits behind head on main.

Files with missing lines	Patch %	Lines
daft/datasets/droid.py	66.66%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7089      +/-   ##
==========================================
+ Coverage   76.02%   76.05%   +0.02%     
==========================================
  Files        1164     1165       +1     
  Lines      165448   165605     +157     
==========================================
+ Hits       125789   125952     +163     
+ Misses      39659    39653       -6

Files with missing lines	Coverage Δ
daft/datasets/__init__.py	`100.00% <100.00%> (ø)`
daft/datasets/droid.py	`66.66% <66.66%> (ø)`

... and 18 files with indirect coverage changes


greptile-apps · 2026-06-16T18:03:08Z

Greptile Summary

This PR introduces daft.datasets.droid, a new public API that discovers and lazily loads episodes from the DROID robotics dataset by globbing metadata JSON files from GCS, unnesting episode metadata, and attaching lazy File/VideoFile references for the trajectory HDF5 and MP4 camera recordings.

New module daft/datasets/droid.py: exposes raw(), which globs metadata JSON files, parses a typed struct schema, unnests all metadata fields, and adds trajectory, wrist_video, ext1_video, and ext2_video lazy file columns. Video paths are constructed using camera serial numbers, but the DROID dataset stores MP4 files under recordings/MP4/. These fields should use the wrist_mp4_path / ext*_mp4_path metadata fields already parsed from the JSON instead of deriving paths from {cam_serial}.mp4.
New integration tests in tests/datasets/test_droid.py: cover schema shape and lazy path construction, but only call .file_path() so they do not verify that the generated paths resolve to real files on GCS.
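The episode_dir step in the summary can be illustrated outside Daft with a plain regular expression. This is only a sketch: the pattern, the helper name, and the example bucket path are assumptions inferred from the `{path}/**/metadata_*.json` glob, not the expression used in droid.py.

```python
import re

# Assumed pattern: the metadata file is named metadata_<something>.json and
# sits directly inside the episode directory.
_METADATA_SUFFIX = re.compile(r"/metadata_[^/]*\.json$")


def episode_dir_from_metadata_path(metadata_path: str) -> str:
    """Strip the trailing metadata file name, leaving the episode directory."""
    return _METADATA_SUFFIX.sub("", metadata_path)


# Hypothetical path, used only to show the shape of the transformation.
example = "gs://example-bucket/droid_raw/lab/ep0/metadata_lab+abc.json"
print(episode_dir_from_metadata_path(example))
```

A path that does not end in a metadata file name is returned unchanged, so the helper is safe to apply to already-trimmed directories.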

Confidence Score: 3/5

The new module builds video file references pointing to paths that don't match the actual DROID directory layout; any downstream attempt to read video data will fail silently until the path construction is corrected.

The video file paths are constructed from {episode_dir}/{cam_serial}.mp4, but the DROID dataset stores MP4s under a recordings/MP4/ subdirectory. The metadata struct already parses wrist_mp4_path/ext*_mp4_path fields that contain the correct relative paths, yet they are not used. Because video_file() only stores a lazy reference, no error is raised at construction time — the mistake would only surface when a user actually decodes video frames, making it easy to ship the wrong behavior. The integration tests check .file_path() equality against the same incorrect construction, so they provide no signal about real file reachability.

daft/datasets/droid.py (video path construction logic) and tests/datasets/test_droid.py (path assertions that mirror the bug)

Important Files Changed

Filename	Overview
daft/datasets/droid.py	New `daft.datasets.droid.raw()` API — video file paths are constructed as `{episode_dir}/{cam_serial}.mp4`, omitting the `recordings/MP4/` subdirectory that the official DROID layout requires; the metadata struct already contains correct `*_mp4_path` fields that should be used instead.
tests/datasets/test_droid.py	Integration tests for the new DROID loader — path assertions only validate lazy path construction via `.file_path()` so they pass regardless of whether the constructed paths resolve to real files, masking the path-construction bug in the main module.
daft/datasets/__init__.py	Trivial one-line addition exporting the new `droid` submodule alongside `common_crawl`.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["daft.datasets.droid.raw(path, io_config)"] --> B["from_glob_path\n{path}/**/metadata_*.json"]
    B --> C["download + cast(string)\n+ try_deserialize(json, _METADATA_DTYPE)"]
    C --> D["unnest(metadata)\n→ flat columns per field"]
    D --> E["regexp_replace(path)\n→ episode_dir"]
    E --> F["with_column: trajectory\nfile(episode_dir/trajectory.h5)"]
    F --> G["with_column: wrist_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    G --> H["with_column: ext1_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    H --> I["with_column: ext2_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    I --> J["return DataFrame\n(lazy, one row per episode)"]


Reviews (1): Last reviewed commit: "add droid dataset"

greptile-apps · 2026-06-16T18:03:18Z

+    episodes = (
+        episodes.with_column(
+            "wrist_video",
+            video_file(format("{}/{}.mp4", col("episode_dir"), col("wrist_cam_serial")), io_config=io_config),
+        )
+        .with_column(
+            "ext1_video",
+            video_file(format("{}/{}.mp4", col("episode_dir"), col("ext1_cam_serial")), io_config=io_config),
+        )
+        .with_column(
+            "ext2_video",
+            video_file(format("{}/{}.mp4", col("episode_dir"), col("ext2_cam_serial")), io_config=io_config),
+        )
+    )


Video file paths skip the recordings/MP4/ subdirectory

The paths are constructed as {episode_dir}/{cam_serial}.mp4, but the official DROID dataset layout (also shown verbatim in this file's own docstring) places MP4 files under {episode_dir}/recordings/MP4/{cam_serial}.mp4. Any call that actually reads the video data will fail with a "file not found" error because the path does not exist. The metadata struct already parses wrist_mp4_path, ext1_mp4_path, and ext2_mp4_path fields from each episode's JSON — those fields contain the correct relative paths (e.g., recordings/MP4/<serial>.mp4) and should be used here instead of re-deriving paths from the camera serial.
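The difference between the two constructions can be sketched in plain Python. The helper name and the sample values below are hypothetical, and the sketch assumes, as this comment states, that the `*_mp4_path` metadata value is relative to the episode directory.

```python
def video_path(episode_dir: str, mp4_rel_path: str) -> str:
    """Join an episode directory with a relative MP4 path from the metadata."""
    # Normalise slashes so a trailing or leading "/" cannot produce "//".
    return f"{episode_dir.rstrip('/')}/{mp4_rel_path.lstrip('/')}"


# Hypothetical values, for illustration only.
episode_dir = "gs://example-bucket/droid_raw/lab/ep0"
serial = "12345678"

# Current construction: derived from the camera serial alone.
derived = f"{episode_dir}/{serial}.mp4"

# Suggested construction: reuse the relative path stored in the metadata.
from_metadata = video_path(episode_dir, f"recordings/MP4/{serial}.mp4")

print(derived)
print(from_metadata)
```

In the Daft pipeline, the equivalent change would presumably pass the metadata column to `format` in place of the serial column, but that wiring is not shown in this thread.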

greptile-apps · 2026-06-16T18:03:19Z

+    # Configure IO config with anonymous access to the public GCS bucket
+    if io_config is None:
+        io_config = IOConfig(gcs=GCSConfig(anonymous=True))


Anonymous GCS config applied unconditionally to non-GCS paths

When io_config is None the function creates IOConfig(gcs=GCSConfig(anonymous=True)) and passes it to every IO operation. This is correct for the default GCS path, but if a caller supplies a non-GCS path (e.g., s3:// or a local path) without an explicit io_config, the GCS anonymous config is silently attached to S3 or other operations. S3 calls with a GCS config object will likely error or bypass credential resolution unexpectedly. Consider only defaulting to anonymous GCS when the path starts with gs://, leaving io_config=None for other schemes.
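A minimal sketch of the scheme check suggested here, kept independent of Daft so it runs on its own. The function name is hypothetical; only the `gs://` rule and the `io_config is None` condition come from the comment.

```python
from typing import Optional
from urllib.parse import urlparse


def should_default_to_anonymous_gcs(path: str, io_config: Optional[object]) -> bool:
    """Return True only when no io_config was supplied and the path is on GCS."""
    if io_config is not None:
        # The caller made an explicit choice; never override it.
        return False
    return urlparse(path).scheme == "gs"


# In droid.py this would gate the existing default, roughly:
#   if should_default_to_anonymous_gcs(path, io_config):
#       io_config = IOConfig(gcs=GCSConfig(anonymous=True))
print(should_default_to_anonymous_gcs("gs://example-bucket/droid_raw", None))
print(should_default_to_anonymous_gcs("s3://example-bucket/droid_raw", None))
```

Local paths have an empty scheme under `urlparse`, so they also keep `io_config=None`.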

greptile-apps · 2026-06-16T18:03:20Z

+    episode_dir = result["episode_dir"][0]
+    assert episode_dir.startswith(f"{DROID_RAW_GCS_PREFIX}/")
+
+    trajectory_path = result["trajectory_path"][0]
+    assert trajectory_path == f"{episode_dir}/trajectory.h5"
+
+    assert result["wrist_video_path"][0] == f"{episode_dir}/{result['wrist_cam_serial'][0]}.mp4"
+    assert result["ext1_video_path"][0] == f"{episode_dir}/{result['ext1_cam_serial'][0]}.mp4"
+    assert result["ext2_video_path"][0] == f"{episode_dir}/{result['ext2_cam_serial'][0]}.mp4"


Path assertions shadow the underlying bug and will pass with wrong paths

col(...).file_path() returns the path stored in the lazy File/VideoFile reference — it never touches the remote filesystem. The assertions on lines 63–67 therefore validate the path-construction logic in droid.py, not that the files actually exist. If the paths produced by the implementation are wrong (e.g., missing recordings/MP4/), these tests still pass. Consider adding at least a smoke-test assertion that attempts to read one frame/metadata from the video, or add a fixture that places mock files at the expected paths.
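One way to build the mock-file fixture mentioned above is sketched below, using only the standard library. The helper name, the episode directory name, and the serials are hypothetical; the layout follows the `recordings/MP4/` structure described in this review.

```python
import tempfile
from pathlib import Path
from typing import List


def make_mock_episode(root: Path, serials: List[str]) -> Path:
    """Create an empty episode directory tree with placeholder files."""
    episode_dir = root / "episode_0"
    mp4_dir = episode_dir / "recordings" / "MP4"
    mp4_dir.mkdir(parents=True)
    (episode_dir / "trajectory.h5").touch()
    for serial in serials:
        (mp4_dir / f"{serial}.mp4").touch()
    return episode_dir


with tempfile.TemporaryDirectory() as tmp:
    episode_dir = make_mock_episode(Path(tmp), ["111", "222", "333"])
    # A path built with the recordings/MP4/ prefix resolves to a real file...
    assert (episode_dir / "recordings" / "MP4" / "111.mp4").exists()
    # ...while a path built from the serial alone does not.
    assert not (episode_dir / "111.mp4").exists()
```

A test that checks existence against such a tree would fail for the serial-only construction, which is the signal the current `.file_path()` assertions lack.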



colin-ho · 2026-06-16T18:29:35Z

gently ping @srilman to also add this to our docs whenever. common crawl needs a friend

everettVT

LGTM

everettVT · 2026-06-17T15:11:56Z

+    episode_dir = result["episode_dir"][0]
+    assert episode_dir.startswith(f"{DROID_RAW_GCS_PREFIX}/")
+
+    trajectory_path = result["trajectory_path"][0]
+    assert trajectory_path == f"{episode_dir}/trajectory.h5"
+
+    assert result["wrist_video_path"][0] == f"{episode_dir}/{result['wrist_cam_serial'][0]}.mp4"
+    assert result["ext1_video_path"][0] == f"{episode_dir}/{result['ext1_cam_serial'][0]}.mp4"
+    assert result["ext2_video_path"][0] == f"{episode_dir}/{result['ext2_cam_serial'][0]}.mp4"


considering this is an evergreen dataset and the path can be provided, I'm not as worried. We aren't testing that the data is there, we're testing that we can read the dataset format.

blacksmith-sh · 2026-06-17T18:31:30Z

Found 2 test failures on Blacksmith runners:

Failures

Test	View Logs
`test_to_torch_dataloader_batches`	View Logs
`coverage: platform linux, python 3.10/12-final-0`	View Logs

add droid dataset

18d28ea

srilman requested a review from sgarimel June 8, 2026 17:54

github-actions Bot added the feat label Jun 8, 2026

srilman requested a review from everettVT June 16, 2026 17:58

srilman marked this pull request as ready for review June 16, 2026 17:58

greptile-apps Bot reviewed Jun 16, 2026

View reviewed changes

Merge branch 'main' into slade/droid

1a90010

everettVT approved these changes Jun 17, 2026

View reviewed changes

add docs

c4eccd0

srilman added 4 commits June 17, 2026 12:32

clean up a bit

c06e38a

Merge branch 'main' into slade/droid

74421d7

clean up

6a5587a

Merge branch 'main' into slade/droid

a6ac034

srilman wants to merge 7 commits into main from slade/droid
