Apache Arrow 0.17.0 Release
Published
21 Apr 2020
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 0.17.0 release. This covers over 2 months of development work and includes 569 resolved issues from 79 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 0.16.0 release, two committers have joined the Project Management Committee (PMC):
Thank you for all your contributions!
Columnar Format Notes
A C-level Data Interface was designed to ease data sharing inside a single process. It allows different runtimes or libraries to share Arrow data using a well-known binary layout and metadata representation, without any copies. Third-party libraries can use the C interface to import and export the Arrow columnar format in-process without requiring any new code dependencies.
The C++ library now includes an implementation of the C Data Interface, and Python and R have bindings to that implementation.
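As an in-process illustration, the following hedged sketch round-trips an array through the C Data Interface from Python, assuming the low-level `pyarrow.cffi` helpers and the private `_export_to_c`/`_import_from_c` bindings:

```python
import pyarrow as pa
from pyarrow.cffi import ffi  # cffi struct definitions shipped with pyarrow (assumed binding)

# Allocate C-level struct holders defined by the C Data Interface.
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
schema_ptr = int(ffi.cast("uintptr_t", c_schema))
array_ptr = int(ffi.cast("uintptr_t", c_array))

# Export an array and its type through the C interface...
arr = pa.array([1, 2, None, 4])
arr._export_to_c(array_ptr, schema_ptr)

# ...and import it back as a new pyarrow.Array, without copying the buffers.
roundtripped = pa.Array._import_from_c(array_ptr, schema_ptr)
assert roundtripped.equals(arr)
```

In a real deployment the two halves would live in different libraries or runtimes; only the two struct pointers cross the boundary.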
Arrow Flight RPC notes
- A new DoExchange bi-directional data RPC was adopted.
- ListFlights supports being passed a Criteria argument in Java/C++/Python. This allows applications to search for flights satisfying a given query (see the sketch after this list).
- Custom metadata can be attached to errors that the server sends to the client, which can be used to encode richer application-specific information.
- A number of minor bugs were fixed, including proper handling of empty null arrays in Java and round-tripping of certain Arrow status codes in C++/Python.
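A hedged sketch of the new Criteria support from Python follows; the server address and the criteria byte string are placeholders, and the server defines how the criteria is interpreted:

```python
import pyarrow.flight as flight

# Connect to a (hypothetical) Flight server.
client = flight.connect("grpc://localhost:8815")

# Criteria is an opaque, application-defined byte string that the
# server can use to filter the flights it returns.
for info in client.list_flights(criteria=b"application-specific query"):
    print(info.descriptor, info.total_records)
```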
C++ notes
Feather V2
The “Feather V2” format based on the Arrow IPC file format was developed. Feather V2 features full support for all Arrow data types, and resolves the 2GB per-column limitation for large amounts of string data that the original Feather implementation had. Feather V2 also introduces experimental IPC message compression using the LZ4 frame format or ZSTD. This will be formalized later in the Arrow format.
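For example, writing and reading a compressed Feather V2 file from Python might look like the following sketch, assuming the `version` and `compression` options exposed by `pyarrow.feather`:

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"strings": ["a", "b", None], "ints": [1, 2, 3]})

# Feather V2 supports all Arrow types; compression here uses the
# experimental ZSTD option ("lz4" selects the LZ4 frame format).
feather.write_feather(table, "data.feather", version=2, compression="zstd")

restored = feather.read_table("data.feather")
assert restored.equals(table)
```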
C++ Datasets
- Improve speed on high-latency file systems by relaxing discovery validation
- Better performance with Arrow IPC files using column projection
- Add the ability to list files in FileSystemDataset
- Add support for Parquet file reader options
- Support dictionary columns in partition expression
- Fix various crashes and other issues
C++ Parquet notes
- Support for writing nested types to the Parquet format was completed. The legacy code path can still be accessed through a C++ Parquet writer option and an environment variable in Python. Read support will come in a future release.
- The BYTE_STREAM_SPLIT encoding was implemented for floating-point types. It can improve compression efficiency for high-entropy floating-point data (see the sketch after this list).
- Expose Parquet schema field_id as Arrow field metadata
- Support for DataPageV2 data page format
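As a sketch, assuming the encoding is exposed in Python through a `use_byte_stream_split` option on `pyarrow.parquet.write_table`, enabling it for a floating-point column might look like:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"readings": np.random.randn(1_000_000)})

# BYTE_STREAM_SPLIT does not compress by itself: it rearranges the bytes
# of each value so the subsequent codec (ZSTD here) compresses better.
pq.write_table(
    table,
    "readings.parquet",
    compression="zstd",
    use_byte_stream_split=["readings"],  # columns to encode (assumed option)
)
```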
C++ build notes
- We continued to make the core C++ library simpler and faster to build. Among the improvements is the removal of the build-time dependency on the Thrift IDL compiler; while Parquet still requires the Thrift runtime C++ library, its dependencies are much lighter. We also further reduced the number of build configurations that require Boost, and when Boost does need to be built from source, we now download only the components we need, reducing the size of the Boost bundle by 90%.
- Improved support for building on ARM platforms
- Upgraded LLVM version from 7 to 8
- Simplified SIMD build configuration with the ARROW_SIMD_LEVEL option, which selects among no SIMD, SSE4.2, AVX2, or AVX512.
- Fixed a number of bugs affecting compilation on aarch64 platforms
Other C++ notes
- Many crashes on invalid input detected by OSS-Fuzz in the IPC reader and in Parquet-Arrow reading were fixed. See our recent blog post for more details.
- A “Device” abstraction was added to simplify buffer management and movement across heterogeneous hardware configurations, e.g. CPUs and GPUs.
- A streaming CSV reader was implemented, yielding individual RecordBatches and helping limit overall memory occupation.
- Array casting from Decimal128 to integer types and to Decimal128 with a different scale/precision was added (see the sketch after this list).
- Sparse CSF tensors are now supported.
- When creating an Array, the null bitmap is no longer kept if the null count is known to be zero.
- Compressor support for the LZ4 frame format (LZ4_FRAME) was added.
- An event-driven interface for reading IPC streams was added.
- Further core APIs that required passing an explicit out-parameter were migrated to `Result<T>`.
- New analytics kernels were added for match, sort indices / argsort, and top-k.
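The new decimal casts are reachable from Python through `Array.cast`; a minimal sketch, with illustrative values:

```python
from decimal import Decimal
import pyarrow as pa

arr = pa.array([Decimal("1.23"), Decimal("4.56")], type=pa.decimal128(5, 2))

# Re-cast to a Decimal128 type with a different precision/scale...
rescaled = arr.cast(pa.decimal128(7, 3))

# ...or cast whole-valued decimals to an integer type.
whole = pa.array([Decimal("1"), Decimal("2")], type=pa.decimal128(5, 0))
ints = whole.cast(pa.int64())
```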
Java notes
- Netty dependencies were removed from the BufferAllocator and ReferenceManager classes. In the future, we plan to move Netty-related classes to a separate module.
- New features were provided to support efficiently appending vector/vector schema root values in batch.
- Support was added for comparing a range of values in dense union vectors.
- The quicksort algorithm was improved to avoid degenerating to worst-case behavior.
Python notes
Datasets
- Updated the `pyarrow.dataset` module following the changes in the C++ Datasets project. This release also adds richer documentation for the datasets module.
- Support for the improved dataset functionality in `pyarrow.parquet.read_table`/`ParquetDataset`. To enable it, pass `use_legacy_dataset=False` (see the sketch after this list). Among other things, this allows specifying filters on all columns, not only the partition keys (using row group statistics), and enables different partitioning schemes. See the “note” in the `ParquetDataset` documentation.
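A hedged sketch of the new code path, with a hypothetical dataset path and column name:

```python
import pyarrow.parquet as pq

# With use_legacy_dataset=False, filters may reference any column, not just
# partition keys; row group statistics are used to skip non-matching data.
table = pq.read_table(
    "dataset_root/",                # hypothetical partitioned dataset directory
    use_legacy_dataset=False,
    filters=[("value", ">", 100)],  # "value" is a regular (non-partition) column
)
```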
Packaging
- Wheels for Python 3.8 are now available
- Support for Python 2.7 has been dropped as Python 2.x reached end-of-life in January 2020.
- Nightly wheels and conda packages are now available for testing or other development purposes. See the installation guide.
Other improvements
- Conversion to NumPy/pandas for the FixedSizeList, LargeString, and LargeBinary types was added.
- Support for sparse CSC matrices and sparse CSF tensors was added (ARROW-7419, ARROW-7427).
R notes
Highlights include support for the Feather V2 format and the C Data Interface, both described above. Along with low-level bindings for the C interface, this release adds tooling to work with Arrow data in Python using `reticulate`. See `vignette("python", package = "arrow")` for a guide to getting started.
Installation on Linux now builds the C++ library from source by default. For a faster, richer build, set the environment variable `NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details and more options.
For more on what’s in the 0.17 R package, see the R changelog.
Ruby and C GLib notes
Ruby
- Support Ruby 2.3 again
C GLib
- Add GArrowRecordBatchIterator
- Add support for GArrowFilterOptions
- Add support for Peek() to GIOInputStream
- Add some metadata bindings to GArrowSchema
- Add LocalFileSystem support
- Add support for Parquet writer properties
- Add support for MapArray
- Add support for BooleanNode
Rust notes
- DictionaryArray support.
- Various improvements to code safety.
- Filter kernel now supports temporal types.
Rust Parquet notes
- Array reader now supports temporal types.
- Parquet writer now supports custom meta-data key/value pairs.
Rust DataFusion notes
- Logical plans can now reference columns by name (as well as by index) using the new `UnresolvedColumn` expression. There is a new optimizer rule to resolve these into column indices.
- Scalar UDFs can now be registered with the execution context and used from logical query plans as well as from SQL. A number of math scalar functions have been implemented using this feature (sqrt, cos, sin, tan, asin, acos, atan, floor, ceil, round, trunc, abs, signum, exp, log, log2, log10).
- Various SQL improvements, including support for `SELECT *` and `SELECT COUNT(*)`, and improvements to parsing of aggregate queries.
- Flight examples are provided, with a client that sends a SQL statement to a Flight server and receives the results.
- The interactive SQL command-line tool now has improved documentation and better formatting of query results.
Project Operations
We’ve continued our migration of general automation toward GitHub Actions. The majority of our commit-by-commit continuous integration (CI) now runs on GitHub Actions. We are working on different solutions for using dedicated hardware as part of our CI: the Buildkite self-hosted CI/CD platform is now supported on Apache repositories, and GitHub Actions also supports self-hosted runners.