Skip to content

feat(droid): Add daft.datasets.droid#7089

Open
srilman wants to merge 7 commits into
mainfrom
slade/droid
Open

feat(droid): Add daft.datasets.droid#7089
srilman wants to merge 7 commits into
mainfrom
slade/droid

Conversation

@srilman

@srilman srilman commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Changes Made

Add a API for interacting with the DROID dataset via the daft.datasets.droid module. The current limitations include:

@srilman srilman requested a review from sgarimel June 8, 2026 17:54
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Rust Dependency Diff

Head: e90642c5c0c113e7d681627c9feedec8ab7d1499 vs Base: 823c32a0446d83377b766c33c145a850a93fb304.

OK: Within budget.

  • New Crates: 0
  • Removed Crates: 0

@github-actions github-actions Bot added the feat label Jun 8, 2026
@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 70.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.05%. Comparing base (006355e) to head (1a90010).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
daft/datasets/droid.py 66.66% 6 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #7089      +/-   ##
==========================================
+ Coverage   76.02%   76.05%   +0.02%     
==========================================
  Files        1164     1165       +1     
  Lines      165448   165605     +157     
==========================================
+ Hits       125789   125952     +163     
+ Misses      39659    39653       -6     
Files with missing lines Coverage Δ
daft/datasets/__init__.py 100.00% <100.00%> (ø)
daft/datasets/droid.py 66.66% <66.66%> (ø)

... and 18 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@srilman srilman requested a review from everettVT June 16, 2026 17:58
@srilman srilman marked this pull request as ready for review June 16, 2026 17:58
@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces daft.datasets.droid, a new public API that discovers and lazily loads episodes from the DROID robotics dataset by globbing metadata JSON files from GCS, unnesting episode metadata, and attaching lazy File/VideoFile references for the trajectory HDF5 and MP4 camera recordings.

  • New module daft/datasets/droid.py: exposes raw(), which globs metadata JSON files, parses a typed struct schema, unnests all metadata fields, and adds trajectory, wrist_video, ext1_video, and ext2_video lazy file columns. Video paths are constructed using camera serial numbers, but the DROID dataset stores MP4 files under recordings/MP4/ \u2014 these fields should use the wrist_mp4_path / ext*_mp4_path metadata fields already parsed from the JSON instead of deriving paths from {cam_serial}.mp4.
  • New integration tests in tests/datasets/test_droid.py: cover schema shape and lazy path construction, but only call .file_path() so they do not verify that the generated paths resolve to real files on GCS.

Confidence Score: 3/5

The new module builds video file references pointing to paths that don't match the actual DROID directory layout; any downstream attempt to read video data will fail silently until the path construction is corrected.

The video file paths are constructed from {episode_dir}/{cam_serial}.mp4, but the DROID dataset stores MP4s under a recordings/MP4/ subdirectory. The metadata struct already parses wrist_mp4_path/ext*_mp4_path fields that contain the correct relative paths, yet they are not used. Because video_file() only stores a lazy reference, no error is raised at construction time — the mistake would only surface when a user actually decodes video frames, making it easy to ship the wrong behavior. The integration tests check .file_path() equality against the same incorrect construction, so they provide no signal about real file reachability.

daft/datasets/droid.py (video path construction logic) and tests/datasets/test_droid.py (path assertions that mirror the bug)

Important Files Changed

Filename Overview
daft/datasets/droid.py New daft.datasets.droid.raw() API — video file paths are constructed as {episode_dir}/{cam_serial}.mp4, omitting the recordings/MP4/ subdirectory that the official DROID layout requires; the metadata struct already contains correct *_mp4_path fields that should be used instead.
tests/datasets/test_droid.py Integration tests for the new DROID loader — path assertions only validate lazy path construction via .file_path() so they pass regardless of whether the constructed paths resolve to real files, masking the path-construction bug in the main module.
daft/datasets/init.py Trivial one-line addition exporting the new droid submodule alongside common_crawl.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["daft.datasets.droid.raw(path, io_config)"] --> B["from_glob_path\n{path}/**/metadata_*.json"]
    B --> C["download + cast(string)\n+ try_deserialize(json, _METADATA_DTYPE)"]
    C --> D["unnest(metadata)\n→ flat columns per field"]
    D --> E["regexp_replace(path)\n→ episode_dir"]
    E --> F["with_column: trajectory\nfile(episode_dir/trajectory.h5)"]
    F --> G["with_column: wrist_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    G --> H["with_column: ext1_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    H --> I["with_column: ext2_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    I --> J["return DataFrame\n(lazy, one row per episode)"]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A["daft.datasets.droid.raw(path, io_config)"] --> B["from_glob_path\n{path}/**/metadata_*.json"]
    B --> C["download + cast(string)\n+ try_deserialize(json, _METADATA_DTYPE)"]
    C --> D["unnest(metadata)\n→ flat columns per field"]
    D --> E["regexp_replace(path)\n→ episode_dir"]
    E --> F["with_column: trajectory\nfile(episode_dir/trajectory.h5)"]
    F --> G["with_column: wrist_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    G --> H["with_column: ext1_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    H --> I["with_column: ext2_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
    I --> J["return DataFrame\n(lazy, one row per episode)"]
Loading

Reviews (1): Last reviewed commit: "add droid dataset" | Re-trigger Greptile

Comment thread daft/datasets/droid.py Outdated
Comment on lines +131 to +144
episodes = (
episodes.with_column(
"wrist_video",
video_file(format("{}/{}.mp4", col("episode_dir"), col("wrist_cam_serial")), io_config=io_config),
)
.with_column(
"ext1_video",
video_file(format("{}/{}.mp4", col("episode_dir"), col("ext1_cam_serial")), io_config=io_config),
)
.with_column(
"ext2_video",
video_file(format("{}/{}.mp4", col("episode_dir"), col("ext2_cam_serial")), io_config=io_config),
)
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Video file paths skip the recordings/MP4/ subdirectory

The paths are constructed as {episode_dir}/{cam_serial}.mp4, but the official DROID dataset layout (also shown verbatim in this file's own docstring) places MP4 files under {episode_dir}/recordings/MP4/{cam_serial}.mp4. Any call that actually reads the video data will fail with a "file not found" error because the path does not exist. The metadata struct already parses wrist_mp4_path, ext1_mp4_path, and ext2_mp4_path fields from each episode's JSON — those fields contain the correct relative paths (e.g., recordings/MP4/<serial>.mp4) and should be used here instead of re-deriving paths from the camera serial.

Comment thread daft/datasets/droid.py
Comment on lines +107 to +109
# Configure IO config with anonymous access to the public GCS bucket
if io_config is None:
io_config = IOConfig(gcs=GCSConfig(anonymous=True))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Anonymous GCS config applied unconditionally to non-GCS paths

When io_config is None the function creates IOConfig(gcs=GCSConfig(anonymous=True)) and passes it to every IO operation. This is correct for the default GCS path, but if a caller supplies a non-GCS path (e.g., s3:// or a local path) without an explicit io_config, the GCS anonymous config is silently attached to S3 or other operations. S3 calls with a GCS config object will likely error or bypass credential resolution unexpectedly. Consider only defaulting to anonymous GCS when the path starts with gs://, leaving io_config=None for other schemes.

Comment on lines +59 to +67
episode_dir = result["episode_dir"][0]
assert episode_dir.startswith(f"{DROID_RAW_GCS_PREFIX}/")

trajectory_path = result["trajectory_path"][0]
assert trajectory_path == f"{episode_dir}/trajectory.h5"

assert result["wrist_video_path"][0] == f"{episode_dir}/{result['wrist_cam_serial'][0]}.mp4"
assert result["ext1_video_path"][0] == f"{episode_dir}/{result['ext1_cam_serial'][0]}.mp4"
assert result["ext2_video_path"][0] == f"{episode_dir}/{result['ext2_cam_serial'][0]}.mp4"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Path assertions shadow the underlying bug and will pass with wrong paths

col(...).file_path() returns the path stored in the lazy File/VideoFile reference — it never touches the remote filesystem. The assertions on lines 63–67 therefore validate the path-construction logic in droid.py, not that the files actually exist. If the paths produced by the implementation are wrong (e.g., missing recordings/MP4/), these tests still pass. Consider adding at least a smoke-test assertion that attempts to read one frame/metadata from the video, or add a fixture that places mock files at the expected paths.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering this is an evergreen dataset and the path can be provided, I'm not as worried. We aren't testing that the data is there, we're testing that we can read the dataset format.

@colin-ho

Copy link
Copy Markdown
Collaborator

gently ping @srilman to also add this to our docs whenever. common crawl needs a friend

@everettVT everettVT left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Image

Comment on lines +59 to +67
episode_dir = result["episode_dir"][0]
assert episode_dir.startswith(f"{DROID_RAW_GCS_PREFIX}/")

trajectory_path = result["trajectory_path"][0]
assert trajectory_path == f"{episode_dir}/trajectory.h5"

assert result["wrist_video_path"][0] == f"{episode_dir}/{result['wrist_cam_serial'][0]}.mp4"
assert result["ext1_video_path"][0] == f"{episode_dir}/{result['ext1_cam_serial'][0]}.mp4"
assert result["ext2_video_path"][0] == f"{episode_dir}/{result['ext2_cam_serial'][0]}.mp4"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering this is an evergreen dataset and the path can be provided, I'm not as worried. We aren't testing that the data is there, we're testing that we can read the dataset format.

@blacksmith-sh

blacksmith-sh Bot commented Jun 17, 2026

Copy link
Copy Markdown

Found 2 test failures on Blacksmith runners:

Failures

Test View Logs
test_to_torch_dataloader_batches View Logs
coverage: platform linux, python 3.10/12-final-0 View Logs

Fix in Cursor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants