feat: integrate S3 with Arrow filesystem #548
MisterRaindrop wants to merge 2 commits into apache:main
Conversation
Thanks for adding this!
Yes, I believe this is worth doing. I suppose we could reuse
There is a related discussion with regard to
I recommend using MinIO. It is relatively stable and suitable for the current project development phase. Once the community reaches a consensus, the cost of replacing MinIO will not be high.
I think it is fine to use MinIO at this moment to unblock us. Let me know what you think of my proposed approach above. We might also need to add a FileIO registry to provide default implementations and enable users to override them with their own implementations of S3 and others. The key in the FileIO registry can be associated with a table property

We may also need to add top-level CMake options like
FYI, there is a PR to replace MinIO with RustFS, apache/iceberg#14928
ArrowFileSystemFileIO is OK. I referenced MakeLocalFileIO and implemented a simple MakeS3FileIO interface using the Arrow filesystem.
You mean this: it's equivalent to setting the io-impl string in the catalog's properties, and then RestCatalog looks up the implementation for the io-impl key in the FileIORegistry. Is that roughly how it works? If so, I can try implementing some simple code to see if it's correct.
Yes, I think it looks reasonable.
Force-pushed 8436b72 to 5197fa9
The current code is only a simple implementation. Could you help me check whether it is OK?

Thanks for the update! I'm busy with some internal stuff these days. Will try to review this as soon as possible.
wgtmac left a comment
I just took a quick pass on the latest commit. Can we simplify the implementation like this:
- Use a CMake option to enable S3.
- Define reserved Iceberg properties for S3 and add functions to convert them to Arrow S3 options.
- To create a concrete S3FileIO, use the Arrow API to create an S3FileSystem and wrap it with ArrowFileSystemFileIO.
- Register the factory that creates the S3 FileIO with the registry before use.
- Add a FileIO utility to create the FileIO instance based on various conditions.
# Work around undefined symbol: arrow::ipc::ReadSchema(arrow::io::InputStream*, arrow::ipc::DictionaryMemo*)
set(ARROW_IPC ON)
set(ARROW_FILESYSTEM ON)
set(ARROW_S3 ON)
Can we add a CMake option ICEBERG_S3 and only toggle ARROW_S3 on when ICEBERG_S3 is on?
#include <arrow/filesystem/filesystem.h>
#include <arrow/filesystem/localfs.h>
#if __has_include(<arrow/filesystem/s3fs.h>)
If we add ICEBERG_S3 option, we don't need to deal with this check.
  impl_name = io_impl->second;
} else {
  // Use default based on warehouse URI scheme
  if (warehouse.rfind("s3://", 0) == 0) {
BTW, shouldn't we use uri instead of warehouse property?
/// This implementation is thread-safe as it creates a new FileSystem instance
/// for each operation. However, it may be less efficient than caching the
/// FileSystem. S3 initialization is done once per process.
class ArrowUriFileIO : public FileIO {
Why do we need this instead of reusing ArrowFileSystemFileIO?
///
/// \param properties The configuration properties map.
/// \return Configured S3Options.
::arrow::fs::S3Options ConfigureS3Options(
I agree this is something that we need.
/// This overload automatically creates an appropriate FileIO based on the "io-impl"
/// property or the warehouse location URI scheme.
///
/// FileIO selection logic:
It is better to add an iceberg/util/file_io_util.h to handle this logic and support reuse. Please note that Arrow Filesystem support is only available in the iceberg-bundle library, so we can only talk to the FileIO registry to create a FileIO instance.
Key changes based on reviewer (wgtmac) feedback:
1. Add ICEBERG_S3 CMake option to conditionally enable ARROW_S3,
replacing the unconditional `set(ARROW_S3 ON)`.
2. Replace `#if __has_include(<arrow/filesystem/s3fs.h>)` with
`#ifdef ICEBERG_HAVE_S3` compile definition controlled by CMake.
3. Remove ArrowUriFileIO class - reuse existing ArrowFileSystemFileIO
by wrapping Arrow S3FileSystem created via Arrow API.
4. Create missing files:
- s3_properties.h: S3 configuration property key constants
- file_io_registry.h/.cc: FileIO factory registry (Register/Load)
- file_io_register.cc: Arrow FileIO factory registration
- file_io_util.h/.cc: Reusable FileIO creation utility
- file_io_registry_test.cc: Unit tests for the registry
5. Extract FileIO creation logic from rest_catalog.cc into
iceberg/util/file_io_util.h for reusability.
6. Fix code issues:
- Implement path-style access (force_virtual_addressing = false)
- Add timeout value validation (non-negative check)
- Replace rfind("s3://", 0) with starts_with("s3://")
- Fix format string bug in rest_catalog_test.cc
7. Update CI workflow and build script to pass ICEBERG_S3=ON.
https://claude.ai/code/session_01GzV7A8VoYyWUN8QdtqgiMq
Do you want to revive this?

Are you referring to this Claude commit? I'm really sorry. I was just curious and tried it out. Actually, I did work on part of it, but I've been really busy lately and haven't been able to spare more time. I think things might ease up next week or the week after. 😭😭😭

No worries. I just want to check if there is any update. Take your time :)
Force-pushed 5197fa9 to 3cdf39a
@wgtmac I updated it, please review, thanks
wgtmac left a comment
Thanks for updating this! I haven't looked at the test yet and have some initial comments.
class ICEBERG_EXPORT FileIORegistry {
 public:
  /// Well-known implementation names
  static constexpr const char* kArrowLocalFileIO = "org.apache.iceberg.arrow.ArrowFileIO";
Let's use std::string_view instead of C-style string.
 public:
  /// Well-known implementation names
  static constexpr const char* kArrowLocalFileIO = "org.apache.iceberg.arrow.ArrowFileIO";
  static constexpr const char* kArrowS3FileIO = "org.apache.iceberg.arrow.ArrowS3FileIO";
It looks odd to use the Java classpath style here. I haven't looked at other impls yet, perhaps it is worth investigating their conventions as well? My initial idea is just to use default keys like "local" and "s3" to locate the FileIO implementations.
  static constexpr const char* kArrowS3FileIO = "org.apache.iceberg.arrow.ArrowS3FileIO";

  /// Factory function type for creating FileIO instances.
  using Factory = std::function<Result<std::shared_ptr<FileIO>>(
It is better to return unique_ptr by default.
  /// Factory function type for creating FileIO instances.
  using Factory = std::function<Result<std::shared_ptr<FileIO>>(
      const std::string& warehouse,
- const std::string& warehouse,
+ const std::string& name,
// Register Arrow local filesystem FileIO
FileIORegistry::Register(
    FileIORegistry::kArrowLocalFileIO,
    [](const std::string& /*warehouse*/,
- [](const std::string& /*warehouse*/,
+ [](const std::string& /*name*/,
Same for below.
option(ICEBERG_BUILD_BUNDLE "Build the battery included library" ON)
option(ICEBERG_BUILD_REST "Build rest catalog client" ON)
option(ICEBERG_BUILD_REST_INTEGRATION_TESTS "Build rest catalog integration tests" OFF)
option(ICEBERG_S3 "Build with S3 support" ON)
It is worth noting that ICEBERG_S3 should be disabled if ICEBERG_BUILD_BUNDLE is OFF.
Should we disable it by default?
Currently, it appears that the entire project will prioritize development around the REST catalog, which primarily interacts with S3. Are we certain we want to disable ICEBERG_S3 by default?
Yes, but it will incur unnecessary burden to developers who do not care about S3. For example, Arrow will need to download and bundle AWS SDK and use docker to run the test.
  }
  return {};
#else
  return NotImplemented("Arrow S3 support is not enabled");
- return NotImplemented("Arrow S3 support is not enabled");
+ return NotSupported("Arrow S3 support is not enabled");
static std::once_flag init_flag;
static ::arrow::Status init_status = ::arrow::Status::OK();
std::call_once(init_flag, []() {
  ::arrow::fs::S3GlobalOptions options;
nit: add a TODO comment to support options supported by ::arrow::fs::S3GlobalOptions.
auto connect_timeout_it = properties.find(S3Properties::kConnectTimeoutMs);
if (connect_timeout_it != properties.end()) {
  try {
    options.connect_timeout = std::stod(connect_timeout_it->second) / 1000.0;
Can we use from_chars just like ParseInteger does?
Result<std::unique_ptr<FileIO>> MakeS3FileIO(
    const std::string& uri,
    const std::unordered_map<std::string, std::string>& properties) {
  if (!uri.starts_with("s3://")) {
Define a constant for the magic s3://. s3a should be supported as well.
s3a, as far as I know, is a URI scheme used by Hadoop. Shouldn't we just support the standard s3?
I have implemented an Arrow FileSystem to access S3, but I'm still not sure whether it meets the requirements.
There are still tasks and questions to complete for the current PR, and it is not ready for merging yet.
Question:
Currently, the object storage options include Azure, AWS, and GCS. Is it OK that I have chosen AWS as the implementation for now?
Task:
I need to deploy MinIO to facilitate testing access to S3, but I'm not sure where it would be best to set it up.