I have been using the latest R arrow package (arrow_2.0.0.20201106), which supports reading from and writing to AWS S3 directly (which is awesome).
I don't seem to have issues when I write and read my own file (see below):
library(arrow)

# Write locally, copy to S3 with the AWS CLI, then read back from S3
write_parquet(iris, "iris.parquet")
system("aws s3 mv iris.parquet s3://myawsbucket/iris.parquet")
df <- read_parquet("s3://myawsbucket/iris.parquet")
But when I try to read one of the sample R arrow files, I get the following error:
df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
Error in parquet___arrow___FileReader__ReadTable1(self) :
IOError: NotImplemented: Support for codec 'snappy' not built
When I check if the codec is available, it looks like it is not:
codec_is_available(type="snappy")
[1] FALSE
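To see whether snappy is the only codec missing, the same check can be looped over the other compression types (codec names here are the ones the arrow docs list; results will depend on your build):

sapply(c("snappy", "gzip", "brotli", "zstd", "lz4"), arrow::codec_is_available)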
Anyone know a way to make the "snappy" codec available?
Thanks, Mike
###########
Thanks to the answer from @Neal below. Here is the code that installed all needed dependencies for me.
Sys.setenv(ARROW_S3="ON")
Sys.setenv(NOT_CRAN="true")
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
I had to run Sys.setenv(ARROW_WITH_SNAPPY = "ON") before running install.packages().
Use Sys.setenv(ARROW_R_DEV = TRUE) for verbose build output.
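Putting it together, a minimal sketch of the full reinstall (same repo and flags as above; ARROW_R_DEV is optional and only controls build verbosity):

# Set build flags BEFORE calling install.packages(), since they are
# read at source-build time
Sys.setenv(
  ARROW_S3 = "ON",
  ARROW_WITH_SNAPPY = "ON",
  NOT_CRAN = "true",
  ARROW_R_DEV = TRUE
)
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")

# After reinstalling, the codec check should flip to TRUE:
arrow::codec_is_available("snappy")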
For reference, see the full list of compile and link options below:
-- Compile and link options:
--
-- ARROW_CXXFLAGS="" [default=""]
-- Compiler flags to append when compiling Arrow
-- ARROW_BUILD_STATIC=ON [default=ON]
-- Build static libraries
-- ARROW_BUILD_SHARED=OFF [default=ON]
-- Build shared libraries
-- ARROW_PACKAGE_KIND="" [default=""]
-- Arbitrary string that identifies the kind of package
-- (for informational purposes)
-- ARROW_GIT_ID="" [default=""]
-- The Arrow git commit id (if any)
-- ARROW_GIT_DESCRIPTION="" [default=""]
-- The Arrow git commit description (if any)
-- ARROW_NO_DEPRECATED_API=OFF [default=OFF]
-- Exclude deprecated APIs from build
-- ARROW_USE_CCACHE=ON [default=ON]
-- Use ccache when compiling (if available)
-- ARROW_USE_LD_GOLD=OFF [default=OFF]
-- Use ld.gold for linking on Linux (if available)
-- ARROW_USE_PRECOMPILED_HEADERS=OFF [default=OFF]
-- Use precompiled headers when compiling
-- ARROW_SIMD_LEVEL=SSE4_2 [default=NONE|SSE4_2|AVX2|AVX512]
-- Compile-time SIMD optimization level
-- ARROW_RUNTIME_SIMD_LEVEL=MAX [default=NONE|SSE4_2|AVX2|AVX512|MAX]
-- Max runtime SIMD optimization level
-- ARROW_ARMV8_ARCH=armv8-a [default=armv8-a|armv8-a+crc+crypto]
-- Arm64 arch and extensions
-- ARROW_ALTIVEC=ON [default=ON]
-- Build with Altivec if compiler has support
-- ARROW_RPATH_ORIGIN=OFF [default=OFF]
--     Build Arrow libraries with RPATH set to $ORIGIN
-- ARROW_INSTALL_NAME_RPATH=ON [default=ON]
-- Build Arrow libraries with install_name set to @rpath
-- ARROW_GGDB_DEBUG=ON [default=ON]
-- Pass -ggdb flag to debug builds
--
-- Test and benchmark options:
--
-- ARROW_BUILD_EXAMPLES=OFF [default=OFF]
-- Build the Arrow examples
-- ARROW_BUILD_TESTS=OFF [default=OFF]
-- Build the Arrow googletest unit tests
-- ARROW_ENABLE_TIMING_TESTS=ON [default=ON]
-- Enable timing-sensitive tests
-- ARROW_BUILD_INTEGRATION=OFF [default=OFF]
-- Build the Arrow integration test executables
-- ARROW_BUILD_BENCHMARKS=OFF [default=OFF]
-- Build the Arrow micro benchmarks
-- ARROW_BUILD_BENCHMARKS_REFERENCE=OFF [default=OFF]
-- Build the Arrow micro reference benchmarks
-- ARROW_TEST_LINKAGE=static [default=shared|static]
--     Linkage of Arrow libraries with unit test executables.
-- ARROW_FUZZING=OFF [default=OFF]
-- Build Arrow Fuzzing executables
-- ARROW_LARGE_MEMORY_TESTS=OFF [default=OFF]
-- Enable unit tests which use large memory
--
-- Lint options:
--
-- ARROW_ONLY_LINT=OFF [default=OFF]
-- Only define the lint and check-format targets
-- ARROW_VERBOSE_LINT=OFF [default=OFF]
-- If off, 'quiet' flags will be passed to linting tools
-- ARROW_GENERATE_COVERAGE=OFF [default=OFF]
-- Build with C++ code coverage enabled
--
-- Checks options:
--
-- ARROW_TEST_MEMCHECK=OFF [default=OFF]
-- Run the test suite using valgrind --tool=memcheck
-- ARROW_USE_ASAN=OFF [default=OFF]
-- Enable Address Sanitizer checks
-- ARROW_USE_TSAN=OFF [default=OFF]
-- Enable Thread Sanitizer checks
-- ARROW_USE_UBSAN=OFF [default=OFF]
-- Enable Undefined Behavior sanitizer checks
--
-- Project component options:
--
-- ARROW_BUILD_UTILITIES=OFF [default=OFF]
-- Build Arrow commandline utilities
-- ARROW_COMPUTE=ON [default=OFF]
-- Build the Arrow Compute Modules
-- ARROW_CSV=ON [default=OFF]
-- Build the Arrow CSV Parser Module
-- ARROW_CUDA=OFF [default=OFF]
-- Build the Arrow CUDA extensions (requires CUDA toolkit)
-- ARROW_DATASET=ON [default=OFF]
-- Build the Arrow Dataset Modules
-- ARROW_FILESYSTEM=ON [default=OFF]
-- Build the Arrow Filesystem Layer
-- ARROW_FLIGHT=OFF [default=OFF]
-- Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
-- ARROW_GANDIVA=OFF [default=OFF]
-- Build the Gandiva libraries
-- ARROW_HDFS=OFF [default=OFF]
-- Build the Arrow HDFS bridge
-- ARROW_HIVESERVER2=OFF [default=OFF]
-- Build the HiveServer2 client and Arrow adapter
-- ARROW_IPC=ON [default=ON]
-- Build the Arrow IPC extensions
-- ARROW_JEMALLOC=ON [default=ON]
-- Build the Arrow jemalloc-based allocator
-- ARROW_JNI=OFF [default=OFF]
-- Build the Arrow JNI lib
-- ARROW_JSON=ON [default=OFF]
-- Build Arrow with JSON support (requires RapidJSON)
-- ARROW_MIMALLOC=ON [default=OFF]
-- Build the Arrow mimalloc-based allocator
-- ARROW_PARQUET=ON [default=OFF]
-- Build the Parquet libraries
-- ARROW_ORC=OFF [default=OFF]
-- Build the Arrow ORC adapter
-- ARROW_PLASMA=OFF [default=OFF]
-- Build the plasma object store along with Arrow
-- ARROW_PLASMA_JAVA_CLIENT=OFF [default=OFF]
-- Build the plasma object store java client
-- ARROW_PYTHON=OFF [default=OFF]
-- Build the Arrow CPython extensions
-- ARROW_S3=ON [default=OFF]
-- Build Arrow with S3 support (requires the AWS SDK for C++)
-- ARROW_TENSORFLOW=OFF [default=OFF]
-- Build Arrow with TensorFlow support enabled
-- ARROW_TESTING=OFF [default=OFF]
-- Build the Arrow testing libraries
--
-- Thirdparty toolchain options:
--
-- ARROW_DEPENDENCY_SOURCE=BUNDLED [default=AUTO|BUNDLED|SYSTEM|CONDA|VCPKG|BREW]
-- Method to use for acquiring arrow's build dependencies
-- ARROW_VERBOSE_THIRDPARTY_BUILD=OFF [default=OFF]
-- Show output from ExternalProjects rather than just logging to files
-- ARROW_DEPENDENCY_USE_SHARED=ON [default=ON]
-- Link to shared libraries
-- ARROW_BOOST_USE_SHARED=OFF [default=ON]
-- Rely on boost shared libraries where relevant
-- ARROW_BROTLI_USE_SHARED=ON [default=ON]
-- Rely on Brotli shared libraries where relevant
-- ARROW_BZ2_USE_SHARED=ON [default=ON]
-- Rely on Bz2 shared libraries where relevant
-- ARROW_GFLAGS_USE_SHARED=ON [default=ON]
-- Rely on GFlags shared libraries where relevant
-- ARROW_GRPC_USE_SHARED=ON [default=ON]
-- Rely on gRPC shared libraries where relevant
-- ARROW_LZ4_USE_SHARED=ON [default=ON]
-- Rely on lz4 shared libraries where relevant
-- ARROW_OPENSSL_USE_SHARED=ON [default=ON]
-- Rely on OpenSSL shared libraries where relevant
-- ARROW_PROTOBUF_USE_SHARED=ON [default=ON]
-- Rely on Protocol Buffers shared libraries where relevant
-- ARROW_THRIFT_USE_SHARED=ON [default=ON]
-- Rely on thrift shared libraries where relevant
-- ARROW_UTF8PROC_USE_SHARED=ON [default=ON]
-- Rely on utf8proc shared libraries where relevant
-- ARROW_SNAPPY_USE_SHARED=ON [default=ON]
-- Rely on snappy shared libraries where relevant
-- ARROW_ZSTD_USE_SHARED=ON [default=ON]
-- Rely on zstd shared libraries where relevant
-- ARROW_USE_GLOG=OFF [default=OFF]
-- Build libraries with glog support for pluggable logging
-- ARROW_WITH_BACKTRACE=ON [default=ON]
-- Build with backtrace support
-- ARROW_WITH_BROTLI=OFF [default=OFF]
-- Build with Brotli compression
-- ARROW_WITH_BZ2=OFF [default=OFF]
-- Build with BZ2 compression
-- ARROW_WITH_LZ4=OFF [default=OFF]
-- Build with lz4 compression
-- ARROW_WITH_SNAPPY=true [default=OFF]
-- Build with Snappy compression
-- ARROW_WITH_ZLIB=ON [default=OFF]
-- Build with zlib compression
-- ARROW_WITH_ZSTD=OFF [default=OFF]
-- Build with zstd compression
-- ARROW_WITH_UTF8PROC=ON [default=ON]
-- Build with support for Unicode properties using the utf8proc library
-- (only used if ARROW_COMPUTE is ON)
-- ARROW_WITH_RE2=ON [default=ON]
-- Build with support for regular expressions using the re2 library
-- (only used if ARROW_COMPUTE or ARROW_GANDIVA is ON)
--
-- Parquet options:
--
-- PARQUET_MINIMAL_DEPENDENCY=OFF [default=OFF]
-- Depend only on Thirdparty headers to build libparquet.
-- Always OFF if building binaries
-- PARQUET_BUILD_EXECUTABLES=OFF [default=OFF]
-- Build the Parquet executable CLI tools. Requires static libraries to be built.
-- PARQUET_BUILD_EXAMPLES=OFF [default=OFF]
-- Build the Parquet examples. Requires static libraries to be built.
-- PARQUET_REQUIRE_ENCRYPTION=OFF [default=OFF]
-- Build support for encryption. Fail if OpenSSL is not found
--
-- Gandiva options:
--
-- ARROW_GANDIVA_JAVA=OFF [default=OFF]
-- Build the Gandiva JNI wrappers
-- ARROW_GANDIVA_STATIC_LIBSTDCPP=OFF [default=OFF]
-- Include -static-libstdc++ -static-libgcc when linking with
-- Gandiva static libraries
-- ARROW_GANDIVA_PC_CXX_FLAGS="" [default=""]
-- Compiler flags to append when pre-compiling Gandiva operations
--
-- Advanced developer options:
--
-- ARROW_EXTRA_ERROR_CONTEXT=OFF [default=OFF]
-- Compile with extra error context (line numbers, code)
-- ARROW_OPTIONAL_INSTALL=OFF [default=OFF]
-- If enabled install ONLY targets that have already been built. Please be
-- advised that if this is enabled 'install' will fail silently on components
-- that have not been built
I'm assuming you're on Linux, since the macOS and Windows binary packages have snappy support, right? Usually if you've installed the Linux package with S3 support, you've also built all of the compression libraries, but it is possible to build S3 without the compression libs. How exactly did you install the package?
https://arrow.apache.org/docs/r/articles/install.html may be a useful reference.
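If it helps to confirm what your installed build actually includes, newer arrow versions (2.0.0 and later, if I remember right) expose a summary function:

# Prints version info plus build capabilities (S3, compression codecs):
arrow::arrow_info()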
Side note: you can just write_parquet(iris, "s3://myawsbucket/iris.parquet"); no need to write to a local file and shell out to copy it to S3.
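For example, the whole round trip from the question collapses to (same bucket path as above):

library(arrow)

# Write straight to S3; no local file or aws-cli copy step needed
write_parquet(iris, "s3://myawsbucket/iris.parquet")
df <- read_parquet("s3://myawsbucket/iris.parquet")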