feat(droid): Add daft.datasets.droid #7089
Rust Dependency Diff
Head: ✅ OK: Within budget.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #7089 +/- ##
==========================================
+ Coverage 76.02% 76.05% +0.02%
==========================================
Files 1164 1165 +1
Lines 165448 165605 +157
==========================================
+ Hits 125789 125952 +163
+ Misses 39659 39653 -6
Greptile Summary
This PR introduces
Confidence Score: 3/5
The new module builds video file references pointing to paths that don't match the actual DROID directory layout; any downstream attempt to read video data will fail silently until the path construction is corrected.
The video file paths are constructed from
daft/datasets/droid.py (video path construction logic) and tests/datasets/test_droid.py (path assertions that mirror the bug)
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["daft.datasets.droid.raw(path, io_config)"] --> B["from_glob_path\n{path}/**/metadata_*.json"]
B --> C["download + cast(string)\n+ try_deserialize(json, _METADATA_DTYPE)"]
C --> D["unnest(metadata)\n→ flat columns per field"]
D --> E["regexp_replace(path)\n→ episode_dir"]
E --> F["with_column: trajectory\nfile(episode_dir/trajectory.h5)"]
F --> G["with_column: wrist_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
G --> H["with_column: ext1_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
H --> I["with_column: ext2_video\nvideo_file(episode_dir/cam_serial.mp4)\n⚠️ missing recordings/MP4/"]
I --> J["return DataFrame\n(lazy, one row per episode)"]
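The regexp_replace(path) → episode_dir step in the flowchart can be illustrated in plain Python. This is a minimal sketch, not the PR's code: the regular expression and the helper name episode_dir_from_metadata_path are assumptions, chosen only to match the {path}/**/metadata_*.json glob in the flowchart, and the example path is made up.

```python
import re

# Assumed pattern: strip the trailing "/metadata_<anything>.json" component.
_METADATA_SUFFIX = re.compile(r"/metadata_[^/]*\.json$")


def episode_dir_from_metadata_path(metadata_path: str) -> str:
    """Return the directory that contains the given metadata JSON file."""
    return _METADATA_SUFFIX.sub("", metadata_path)


# Hypothetical path, for illustration only.
example = "gs://bucket/lab/success/2023-01-01/episode_0/metadata_episode_0.json"
episode_dir = episode_dir_from_metadata_path(example)
```

A path that does not end in a metadata_*.json component is returned unchanged, which makes the helper safe to apply to every row of a glob result.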
Reviews (1): Last reviewed commit: "add droid dataset"
episodes = (
    episodes.with_column(
        "wrist_video",
        video_file(format("{}/{}.mp4", col("episode_dir"), col("wrist_cam_serial")), io_config=io_config),
    )
    .with_column(
        "ext1_video",
        video_file(format("{}/{}.mp4", col("episode_dir"), col("ext1_cam_serial")), io_config=io_config),
    )
    .with_column(
        "ext2_video",
        video_file(format("{}/{}.mp4", col("episode_dir"), col("ext2_cam_serial")), io_config=io_config),
    )
)
Video file paths skip the recordings/MP4/ subdirectory
The paths are constructed as {episode_dir}/{cam_serial}.mp4, but the official DROID dataset layout (also shown verbatim in this file's own docstring) places MP4 files under {episode_dir}/recordings/MP4/{cam_serial}.mp4. Any call that actually reads the video data will fail with a "file not found" error because the path does not exist. The metadata struct already parses wrist_mp4_path, ext1_mp4_path, and ext2_mp4_path fields from each episode's JSON — those fields contain the correct relative paths (e.g., recordings/MP4/<serial>.mp4) and should be used here instead of re-deriving paths from the camera serial.
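The difference between the two layouts can be sketched in plain Python. This is only an illustration of the path shape the reviewer describes, not the fix that was merged: the helper name video_path and the example bucket and serial are hypothetical, and the recordings/MP4/ segment is taken from the review comment rather than verified against the dataset.

```python
import posixpath


def video_path(episode_dir: str, cam_serial: str) -> str:
    """Build {episode_dir}/recordings/MP4/{cam_serial}.mp4 (layout per the review)."""
    return posixpath.join(episode_dir.rstrip("/"), "recordings", "MP4", f"{cam_serial}.mp4")


# Hypothetical episode directory and camera serial.
episode_dir = "gs://bucket/lab/success/2023-01-01/episode_0"
flat = f"{episode_dir}/12345678.mp4"           # shape produced by the code under review
nested = video_path(episode_dir, "12345678")  # shape the reviewer expects
```

Reusing the wrist_mp4_path, ext1_mp4_path, and ext2_mp4_path metadata fields, as the comment suggests, would avoid hard-coding the subdirectory at all, provided those fields are relative to the episode directory.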
# Configure IO config with anonymous access to the public GCS bucket
if io_config is None:
    io_config = IOConfig(gcs=GCSConfig(anonymous=True))
Anonymous GCS config applied unconditionally to non-GCS paths
When io_config is None the function creates IOConfig(gcs=GCSConfig(anonymous=True)) and passes it to every IO operation. This is correct for the default GCS path, but if a caller supplies a non-GCS path (e.g., s3:// or a local path) without an explicit io_config, the GCS anonymous config is silently attached to S3 or other operations. S3 calls with a GCS config object will likely error or bypass credential resolution unexpectedly. Consider only defaulting to anonymous GCS when the path starts with gs://, leaving io_config=None for other schemes.
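The scheme check suggested here could look roughly like the following. It is a sketch under assumptions: the helper name wants_anonymous_gcs_default is hypothetical, the example paths are made up, and the function only decides whether the anonymous default applies; constructing IOConfig(gcs=GCSConfig(anonymous=True)) would stay in the caller.

```python
from typing import Optional
from urllib.parse import urlparse


def wants_anonymous_gcs_default(path: str, io_config: Optional[object]) -> bool:
    """True only when no io_config was given and the path uses the gs:// scheme."""
    if io_config is not None:
        # An explicit config always wins; never override the caller's choice.
        return False
    return urlparse(path).scheme == "gs"


# Hypothetical paths, for illustration only.
gcs_default = wants_anonymous_gcs_default("gs://bucket/droid_raw", None)
s3_default = wants_anonymous_gcs_default("s3://bucket/droid_raw", None)
local_default = wants_anonymous_gcs_default("/data/droid_raw", None)
explicit = wants_anonymous_gcs_default("gs://bucket/droid_raw", object())
```

With this shape, S3 and local paths keep io_config=None and fall through to the normal credential resolution.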
episode_dir = result["episode_dir"][0]
assert episode_dir.startswith(f"{DROID_RAW_GCS_PREFIX}/")

trajectory_path = result["trajectory_path"][0]
assert trajectory_path == f"{episode_dir}/trajectory.h5"

assert result["wrist_video_path"][0] == f"{episode_dir}/{result['wrist_cam_serial'][0]}.mp4"
assert result["ext1_video_path"][0] == f"{episode_dir}/{result['ext1_cam_serial'][0]}.mp4"
assert result["ext2_video_path"][0] == f"{episode_dir}/{result['ext2_cam_serial'][0]}.mp4"
Path assertions shadow the underlying bug and will pass with wrong paths
col(...).file_path() returns the path stored in the lazy File/VideoFile reference — it never touches the remote filesystem. The assertions on lines 63–67 therefore validate the path-construction logic in droid.py, not that the files actually exist. If the paths produced by the implementation are wrong (e.g., missing recordings/MP4/), these tests still pass. Consider adding at least a smoke-test assertion that attempts to read one frame/metadata from the video, or add a fixture that places mock files at the expected paths.
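The mock-file fixture mentioned above might be sketched as follows, using only the standard library. Everything here is illustrative: make_mock_episode, the directory names, and the serial values are hypothetical, and the recordings/MP4/ layout is the one described in the review rather than verified against the dataset. Only the metadata key names come from the test snippet.

```python
import json
import tempfile
from pathlib import Path


def make_mock_episode(root: Path, serials: dict[str, str]) -> Path:
    """Create a fake episode directory with empty placeholder files."""
    episode_dir = root / "lab" / "success" / "2023-01-01" / "episode_0"
    mp4_dir = episode_dir / "recordings" / "MP4"
    mp4_dir.mkdir(parents=True)
    (episode_dir / "trajectory.h5").touch()
    for serial in serials.values():
        (mp4_dir / f"{serial}.mp4").touch()
    # Minimal metadata: just the camera serial fields used by the path assertions.
    (episode_dir / "metadata_episode_0.json").write_text(json.dumps(serials))
    return episode_dir


with tempfile.TemporaryDirectory() as tmp:
    serials = {"wrist_cam_serial": "111", "ext1_cam_serial": "222", "ext2_cam_serial": "333"}
    episode_dir = make_mock_episode(Path(tmp), serials)
    # Check both candidate layouts against what is actually on disk.
    flat_exists = [(episode_dir / f"{s}.mp4").exists() for s in serials.values()]
    nested_exists = [
        (episode_dir / "recordings" / "MP4" / f"{s}.mp4").exists() for s in serials.values()
    ]
```

A test that checks existence against such a tree would fail for the flat {episode_dir}/{cam_serial}.mp4 paths, which a pure string comparison cannot detect.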
considering this is an evergreen dataset and the path can be provided, I'm not as worried. We aren't testing that the data is there, we're testing that we can read the dataset format.
gently ping @srilman to also add this to our docs whenever. common crawl needs a friend
Changes Made
Add an API for interacting with the DROID dataset via the daft.datasets.droid module. The current limitations include: trajectory.h5 files from each episode, such as sensor, observation, and state data.