Document HF Buckets as s3 alternative for Common Crawl #7103

Draft
lhoestq wants to merge 2 commits into Eventual-Inc:main from lhoestq:document-s3-alternative-for-cc

@lhoestq lhoestq commented Jun 10, 2026

Hi team!

Hugging Face Buckets are now a great alternative to S3 for accessing Common Crawl, especially from regions and providers other than AWS us-east-1.

I wanted to make the warcio community aware of this, so I added some documentation about it to the README.

Let me know if this is the right place and feel free to suggest edits.

Ultimately, I'd like to democratize access to Common Crawl, given its impact on AI / LLM pretraining.
Right now, access is too restricted by AWS constraints and the costs that come with them, IMO.

This requires @everettVT's work on supporting HF Buckets in daft, though (the first draft is at #6731, but it now requires OpenDAL, IIRC).

@lhoestq lhoestq changed the title Document s3 alternative for cc Document HF Buckets as s3 alternative for Common Crawl Jun 10, 2026

codspeed-hq Bot commented Jun 10, 2026

Merging this PR will improve performance by 11.3%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

⚡ 1 improved benchmark
✅ 39 untouched benchmarks
⏩ 10 skipped benchmarks [1]

Performance Changes

Benchmark BASE HEAD Efficiency
test_clickbench_sql[0] 5.1 ms 4.6 ms +11.3%

Comparing lhoestq:document-s3-alternative-for-cc (4ae3600) with main (f04de35)

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.
