GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code #45998
AlenkaF wants to merge 39 commits into
Conversation
Ah, this will not work. If we want to remove cc @pitrou
Well, ARROW_HDFS=ON could imply ARROW_FILESYSTEM=ON. I don't think that's a problem.
Yes, indeed.
OK, I will then move
Are all these declarations actually needed by PyArrow?
No, most of them aren't and are copied from libarrow.pxd. I can remove the unused ones - but am not sure if some external application can actually use them?
Don't we need to link to arrow::hadoop as was done above? cc @kou for advice
Hm, yeah. I will add a link as above as it makes sense.
Ah, it is there already (that explains why nothing failed =) )
arrow/cpp/src/arrow/CMakeLists.txt
Lines 857 to 861 in 53ef438
Not sure if the line with CMAKE_DL_LIBS is also needed here then?
Ok, but we don't want to keep those two unofficial FileSystem and HadoopFileSystem classes which create confusion with the other (public) filesystem classes.
Ideally, those two classes disappear and their implementation code gets folded into the public HadoopFileSystem class.
If that's too annoying, we should at least merge those two classes and give them a less ambiguous name, for example HdfsClient.
Ok, will go with the disappearing =)
IIUC hdfs_io.h will be removed altogether:
- `FileSystem` and `HadoopFileSystem` will go into `hdfs.cc`, folded into the public `HadoopFileSystem`
- `HdfsConnectionConfig` will also go into `hdfs.cc`
- declarations that are left will go into `hdfs_internal.cc`
Most of these declarations should IMHO go into the arrow::filesystem::internal namespace, except for HdfsConnectionConfig which can go into arrow::filesystem.
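As a rough illustration of the namespace split proposed here, the sketch below keeps a user-facing config struct in the outer namespace and a helper in a nested `internal` namespace. The `sketch::fs` namespace, the struct fields, and `MakeUri` are assumptions made up for this example; they are not the actual Arrow declarations.

```cpp
#include <cassert>
#include <string>

namespace sketch {
namespace fs {

// Public: users construct this to describe a connection.
// (Hypothetical fields, chosen only for illustration.)
struct HdfsConnectionConfig {
  std::string host;
  int port = 0;
  std::string user;
};

namespace internal {

// Internal helper: usable by the library's own code, but by convention
// not part of the supported public API. (Hypothetical function.)
inline std::string MakeUri(const HdfsConnectionConfig& config) {
  return "hdfs://" + config.host + ":" + std::to_string(config.port);
}

}  // namespace internal
}  // namespace fs
}  // namespace sketch
```

With this layout, callers only need to know about `sketch::fs::HdfsConnectionConfig`, while anything under `internal` can change without notice.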
Hi @pitrou, could you please take a quick look at the changes when you have a moment? I've done my best to implement the suggested changes, but am sure there's still room for improvement.
The Python and MATLAB test failures are not related.
Hi @AlenkaF
I think you're misreading the output. The test is actually skipped when the driver fails unloading, which is normal: https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3277
The problem is in the other tests, because it seems a destructor crashes: https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3281
Hmm, rather than trying to find the exact explanation, a simple solution would be to change these functions into static methods. For example, this:
```cpp
ARROW_EXPORT Status MakeReadableFile(const std::string& path, int32_t buffer_size,
                                     const io::IOContext& io_context, LibHdfsShim* driver,
                                     hdfsFS fs, hdfsFile file,
                                     std::shared_ptr<HdfsReadableFile>* out);
```
would become:
```cpp
class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile {
 public:
  (...)
  static Result<std::shared_ptr<HdfsReadableFile>> Make(
      const std::string& path, int32_t buffer_size,
      const io::IOContext& io_context, LibHdfsShim* driver,
      hdfsFS fs, hdfsFile file);
```
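To make the suggested shape concrete, here is a minimal, self-contained sketch of a static factory that returns a result object instead of filling an out-parameter. The `Result` template below is a simplified stand-in written for this example (it is not `arrow::Result`), and `ReadableFile` with its validation rule is hypothetical. It assumes C++17 for `std::variant`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <variant>

// Simplified stand-in for a Result<T> type: holds either a value or an error.
template <typename T>
class Result {
 public:
  Result(T value) : storage_(std::move(value)) {}  // implicit on purpose
  static Result Failure(std::string message) {
    return Result(ErrorState{std::move(message)});
  }
  bool ok() const { return std::holds_alternative<T>(storage_); }
  const T& value() const { return std::get<T>(storage_); }
  const std::string& error() const { return std::get<ErrorState>(storage_).message; }

 private:
  struct ErrorState {
    std::string message;
  };
  explicit Result(ErrorState error) : storage_(std::move(error)) {}
  std::variant<T, ErrorState> storage_;
};

// Hypothetical file class: construction is only possible through Make().
class ReadableFile {
 public:
  static Result<std::shared_ptr<ReadableFile>> Make(const std::string& path,
                                                    int buffer_size) {
    if (path.empty()) {
      return Result<std::shared_ptr<ReadableFile>>::Failure("empty path");
    }
    // The constructor is private, so std::make_shared cannot be used here.
    return std::shared_ptr<ReadableFile>(new ReadableFile(path, buffer_size));
  }

  const std::string& path() const { return path_; }
  int buffer_size() const { return buffer_size_; }

 private:
  ReadableFile(std::string path, int buffer_size)
      : path_(std::move(path)), buffer_size_(buffer_size) {}

  std::string path_;
  int buffer_size_;
};
```

A caller checks `ok()` before touching `value()`, so there is no half-initialized object to reason about. Whether this shape also avoids the destructor crash discussed above is not something the sketch demonstrates.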
Aha, I see! Thanks, will look into it.
@pitrou I cleaned up the CI failures (others are not related) and am hoping these changes will not be too bad to review :)
benibus left a comment
Thanks! This looks pretty good to me. Just a few comments.
@pitrou gentle ping. Would I be too optimistic to try to get it into 21.0.0?
Pull request overview
This PR removes long-deprecated HDFS types (ObjectType, FileStatistics) and shifts HDFS integration away from arrow::io toward the arrow::fs FileSystem APIs, consolidating/relocating implementation and updating Python bindings accordingly.
Changes:
- Deletes the deprecated `arrow::io` HDFS header/implementation and stops exporting it from `arrow/io/api.h`.
- Refactors HDFS implementation into `arrow::fs::HadoopFileSystem` (and moves the internal libhdfs shim / stream implementations under `arrow/filesystem`).
- Updates PyArrow bindings to use the filesystem HDFS API and rehomes `have_libhdfs()` to the top-level `pyarrow` module.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| python/pyarrow/io.pxi | Removes the have_libhdfs() helper from the pyarrow.io extension module. |
| python/pyarrow/includes/libarrow.pxd | Drops deprecated ObjectType / FileStatistics and removes arrow::io HDFS bindings. |
| python/pyarrow/includes/libarrow_fs.pxd | Adds filesystem-level HaveLibHdfs() and HdfsConnectionConfig declarations for PyArrow. |
| python/pyarrow/_hdfs.pyx | Adds an internal _have_libhdfs() implemented via the filesystem API. |
| python/pyarrow/__init__.py | Reintroduces have_libhdfs() at the top-level Python API (delegating to _hdfs). |
| cpp/src/arrow/meson.build | Removes io/hdfs*.cc from the IO component sources under Meson. |
| cpp/src/arrow/io/hdfs.h | Deletes the deprecated arrow::io HDFS API (including deprecated structs). |
| cpp/src/arrow/io/hdfs.cc | Deletes the deprecated arrow::io HDFS implementation. |
| cpp/src/arrow/io/CMakeLists.txt | Removes the arrow-io HDFS test registration from the IO test suite. |
| cpp/src/arrow/io/api.h | Stops exporting HDFS via arrow/io/api.h. |
| cpp/src/arrow/filesystem/hdfs.h | Makes filesystem HDFS self-contained (new config struct + extra methods + HaveLibHdfs). |
| cpp/src/arrow/filesystem/hdfs.cc | Refactors implementation to directly use the libhdfs shim and new internal stream types. |
| cpp/src/arrow/filesystem/hdfs_internal.h | Moves/expands internal shim/types/streams into filesystem internals. |
| cpp/src/arrow/filesystem/hdfs_internal.cc | Moves stream implementations and related logic into filesystem internals. |
| cpp/src/arrow/filesystem/hdfs_internal_test.cc | Ports the internal HDFS tests to the filesystem implementation. |
| cpp/src/arrow/filesystem/CMakeLists.txt | Adds hdfs_internal_test to the filesystem test suite and links required libs. |
| cpp/src/arrow/CMakeLists.txt | Removes IO-level HDFS sources and adds filesystem-level hdfs_internal.cc to build when ARROW_HDFS=ON. |
Comments suppressed due to low confidence (2)
cpp/src/arrow/filesystem/hdfs_internal_test.cc:165
- `ConnectsAgain` declares a local `client` but then assigns the new filesystem to `client_` instead, leaving `client` unused. With common warning settings this can break the build (-Wunused-variable).
cpp/src/arrow/filesystem/hdfs_internal_test.cc:74
- `WriteDummyFile()` no longer calls `Close()` on the `HdfsOutputStream`. Relying on the destructor to close means close/flush failures won't fail the test (they're only warned), and it can make test behavior depend on destructor timing.
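The difference between closing explicitly and relying on the destructor can be sketched as follows. `Status`, `OutputStream`, and the `fail_on_close` switch are simplified, hypothetical stand-ins written for this example; they are not the Arrow types or the actual test helper.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a Status type: an empty message means success.
struct Status {
  std::string message;
  bool ok() const { return message.empty(); }
};

// Hypothetical output stream whose final flush can fail.
class OutputStream {
 public:
  explicit OutputStream(bool fail_on_close) : fail_on_close_(fail_on_close) {}

  // A destructor cannot return a Status, so a failure here can at best be logged.
  ~OutputStream() {
    if (!closed_) {
      Status st = Close();
      (void)st;  // the error is dropped; real code might emit a warning
    }
  }

  // Explicit close: the caller sees the outcome and can act on it.
  Status Close() {
    closed_ = true;
    if (fail_on_close_) return Status{"flush failed"};
    return Status{};
  }

 private:
  bool fail_on_close_;
  bool closed_ = false;
};

// Test helper that propagates close errors instead of relying on the destructor.
Status WriteDummyFile(bool fail_on_close) {
  OutputStream stream(fail_on_close);
  // ... write data here ...
  return stream.Close();
}
```

Because the helper returns the result of `Close()`, a test can assert on it directly, and the destructor becomes a no-op for already-closed streams.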
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
cpp/src/arrow/filesystem/hdfs_internal_test.cc:165
- `ConnectsAgain` declares a local `client` but assigns the new filesystem instance to the fixture member `client_` instead. This makes `client` unused (potential -Werror build break) and also mutates the shared fixture state unexpectedly.
cpp/src/arrow/filesystem/hdfs_internal_test.cc:74
- `WriteDummyFile` no longer calls `Close()` on the output stream, so close/flush errors won't be asserted and could be silently ignored (the destructor only warns). It's better for the test helper to explicitly close and propagate any failure.
```cpp
Status DeleteFile(const std::string& path) override;

Status MakeDirectory(const std::string& path);
```
I understand these methods existed on the legacy HDFS filesystem class, but I think we should only implement the standard FileSystem methods.
```cpp
class HdfsReadableFile;
class HdfsOutputStream;

struct HdfsPathInfo;
```
I don't think we should expose these?
Correct 😬 Will move to internal.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
cpp/src/arrow/filesystem/hdfs_internal.cc:39
- This file now uses `std::min` and `std::numeric_limits` in the newly added HDFS file implementations, but it doesn't include `<algorithm>` or `<limits>`. This will fail to compile on toolchains that don't transitively include these headers.
cpp/src/arrow/filesystem/hdfs_internal.cc:570
- `GetPathInfoFailed` is defined inside an anonymous namespace, so it isn't actually `arrow::fs::internal::GetPathInfoFailed`. The later call `internal::GetPathInfoFailed(path_)` therefore won't compile. Define the helper directly in `arrow::fs::internal` (or change the call to match the actual scope).
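One way both suppressed comments could be addressed is sketched below: the headers for `std::min` and `std::numeric_limits` are included explicitly, and the helper lives directly in a named `internal` namespace so that a qualified `internal::` call resolves. The `sketch` namespace, `ClampReadSize`, `DescribeFailure`, and the message text are hypothetical and chosen only for this example. Whether the original call really failed to compile depends on where the anonymous namespace was nested, which the excerpt does not show.

```cpp
#include <algorithm>  // std::min
#include <cassert>
#include <cstdint>
#include <limits>  // std::numeric_limits
#include <string>

namespace sketch {
namespace internal {

// Defined directly in sketch::internal, so the qualified name
// internal::GetPathInfoFailed resolves from anywhere inside namespace sketch.
inline std::string GetPathInfoFailed(const std::string& path) {
  return "Calling GetPathInfo for '" + path + "' failed";
}

// Clamp a 64-bit request to what a 32-bit read API can take in one call.
inline std::int32_t ClampReadSize(std::int64_t nbytes) {
  constexpr std::int64_t kMax = std::numeric_limits<std::int32_t>::max();
  return static_cast<std::int32_t>(std::min(nbytes, kMax));
}

}  // namespace internal

// Caller outside the internal namespace, using the qualified name.
inline std::string DescribeFailure(const std::string& path) {
  return internal::GetPathInfoFailed(path);
}

}  // namespace sketch
```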
```cpp
RETURN_NOT_OK(ConnectLibHdfs(&driver_shim));
RETURN_NOT_OK(io::HadoopFileSystem::Connect(&options_.connection_config, &client_));
const HdfsConnectionConfig* config = &options_.connection_config;
RETURN_NOT_OK(ConnectLibHdfs(&driver_));
```
```cpp
bool Exists(const std::string& path);

Status GetPathInfoStatus(const std::string& path, HdfsPathInfo* info);

Status ListDirectory(const std::string& path, std::vector<HdfsPathInfo>* listing);
```
```python
from queue import Queue, Empty as QueueEmpty

from pyarrow.lib cimport check_status, HaveLibHdfs
from pyarrow.lib cimport check_status
```
Rationale for this change
`ObjectType` and `FileStatistics` in io/hdfs.h have been deprecated for a while and can be removed.
What changes are included in this PR?
The `ObjectType` and `FileStatistics` structs are removed and the FileSystem API in `arrow::fs` is used instead. Together with this change, the HDFS-related code is moved from `cpp/src/arrow/io` to `cpp/src/arrow/filesystem`, merging the `FileSystem` and `HadoopFileSystem` classes from `arrow::io` into the public `HadoopFileSystem` class.
Are these changes tested?
Existing tests should pass.
Are there any user-facing changes?
Deprecated structs are removed and all HDFS-related code is now part of the filesystem module.
Also closes: #22457 (not sure about `io/interfaces.h`?)