Apache Arrow 1.0.0 Release
Published 24 Jul 2020
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 1.0.0 release. This covers over 3 months of development work and includes 810 resolved issues from 100 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
Despite a “1.0.0” version, this is the 18th major release of Apache Arrow and marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
1.0.0 Columnar Format and Stability Guarantees
The 1.0.0 release indicates that the Arrow columnar format is declared stable, with forward and backward compatibility guarantees.
The Arrow columnar format received several recent changes and additions, leading to the 1.0.0 format version:
- The metadata version was bumped to a new version V5, indicating an incompatible change in the buffer layout of Union types. All other types keep the same layout as in V4. V5 also includes format additions to assist with forward compatibility (detecting unsupported changes sent by future library versions). Libraries remain backward compatible with data generated by all libraries back to 0.8.0 (December 2017), and the Java and C++ libraries are capable of generating V4-compatible messages (for sending data to applications using 0.8.0 to 0.17.1).
- Dictionary indices are now allowed to be unsigned integers rather than only signed integers. Using UInt64 is still discouraged because of poor Java support.
- A “Feature” enum has been added to announce the use of specific optional features in an IPC stream, such as buffer compression. This new field is not used by any implementation yet.
- Optional buffer compression using LZ4 or ZStandard was added to the IPC format.
- Decimal types now have an optional “bitWidth” field, defaulting to 128. This will allow for future support of other decimal widths such as 32- and 64-bit.
- The validity bitmap buffer has been removed from Union types. The nullity of a slot in a Union array is determined exclusively by the constituent arrays forming the union.
Integration testing has been expanded to test for extension types and nested dictionaries. See the implementation matrix for details.
Community
Since the last release, we have added two new committers:
- Liya Fan
- Ji Liu
Thank you for all your contributions!
Arrow Flight RPC notes
Flight now offers DoExchange, a fully bidirectional data endpoint, in addition to DoGet and DoPut, in C++, Java, and Python. Middleware in all languages now exposes binary-valued headers. Additionally, servers and clients can set Arrow IPC read/write options in all languages, making it easier to maintain compatibility with earlier versions of Arrow Flight.
In C++ and Python, Flight now exposes more options from gRPC, including the address of the client (on the server) and the ability to set low-level gRPC client options. Flight also supports mutual TLS authentication and the ability for a client to control the size of a data message on the wire.
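As a rough illustration, here is a minimal Python sketch of a DoExchange round trip; the server location and the command in the descriptor are hypothetical, and a Flight server implementing the exchange is assumed to be running:

```python
import pyarrow as pa
import pyarrow.flight as flight

# Hypothetical server location; a Flight server implementing DoExchange
# for the "echo" command is assumed to be running there.
client = flight.connect("grpc://localhost:8815")
descriptor = flight.FlightDescriptor.for_command(b"echo")

table = pa.table({"x": [1, 2, 3]})

# DoExchange returns a writer and a reader sharing one bidirectional stream:
# we upload data and read the server's response on the same call.
writer, reader = client.do_exchange(descriptor)
with writer:
    writer.begin(table.schema)    # announce the schema we are about to send
    writer.write_table(table)     # stream data to the server
    writer.done_writing()         # close the upload half of the stream
    response = reader.read_all()  # collect whatever the server sends back
print(response)
```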
C++ notes
- Support for static linking with Arrow has been vastly improved, including the introduction of a libarrow_bundled_dependencies.a library bundling all external dependencies that are built from source by Arrow’s build system rather than installed by an external package manager. This makes it significantly easier to create dependency-free applications with all libraries statically linked.
- Following the Arrow format changes, Union arrays cannot have a top-level validity bitmap anymore.
- A number of improvements were made to reduce the overall generated binary size of the Arrow library.
- A convenience API GetBuildInfo allows querying the characteristics of the Arrow library. We encourage you to suggest any desired additions to the returned information.
- We added an optional dependency on the utf8proc library, used in several compute functions (see below).
- Instead of sharing the same concrete classes, sparse and dense unions now have separate classes (SparseUnionType and DenseUnionType, as well as SparseUnionArray, DenseUnionArray, SparseUnionScalar, DenseUnionScalar).
- Arrow can now be built for iOS using the right set of CMake options, though we don’t officially support it. See this writeup for details.
Compute functions
The compute kernel layer was extensively reworked. It now offers a generic function lookup, dispatch, and execution mechanism. Furthermore, new internal scaffolding makes it vastly easier to write new function kernels, with many common details such as type checking and dispatch on type combinations handled by the framework rather than implemented manually by the function developer.
Around 30 new array compute functions have been added. For example, Unicode-compliant predicates and transforms, such as lowercase and uppercase transforms, are now available.
The available compute functions are listed exhaustively in the Sphinx-generated documentation.
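For illustration, here is a minimal Python sketch of the function-registry mechanism; it assumes the Unicode uppercase transform is registered under the name utf8_upper, as in this release:

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["élan", "Über", None])

# Compute functions are registered by name; call_function looks one up
# and dispatches on the input types. utf8_upper is the Unicode-aware
# uppercase transform backed by utf8proc.
upper = pc.call_function("utf8_upper", [arr])
print(upper)  # -> ["ÉLAN", "ÜBER", null]
```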
Datasets
Datasets can now be read from CSV files.
Datasets can be expanded to their component fragments, enabling fine grained interoperability with other consumers of data files. Where applicable, metadata is available as a property of the fragment, including partition information and (for the parquet format) per-column statistics.
Datasets of parquet files can now be assembled from a single _metadata file, such as those created by systems like Dask and Spark. _metadata contains the metadata of all fragments, allowing construction of a statistics-aware dataset with a single IO call.
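A hedged Python sketch of these capabilities (the paths are hypothetical):

```python
import pyarrow.dataset as ds

# A dataset backed by a directory of CSV files (path is hypothetical).
csv_dataset = ds.dataset("data/events_csv", format="csv")

# Expand the dataset into its fragments and inspect per-fragment metadata.
for fragment in csv_dataset.get_fragments():
    print(fragment.path, fragment.partition_expression)

# Build a statistics-aware Parquet dataset from a single _metadata file,
# such as one written by Dask or Spark (path is hypothetical).
pq_dataset = ds.parquet_dataset("data/sales_parquet/_metadata")
table = pq_dataset.to_table()
```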
Feather
The Feather format is now available in version 2, which is simply the Arrow IPC file format with another name.
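For example, in Python the format version can be selected explicitly when writing (the file name is hypothetical):

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# version=2 selects the new format (the Arrow IPC file format);
# version=1 keeps writing the legacy Feather format.
feather.write_feather(table, "example.feather", version=2)
roundtripped = feather.read_table("example.feather")
```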
IPC
By default, we now write IPC streams with metadata V5. However, metadata V4 can be requested by setting the appropriate member in IpcWriteOptions. Both V4 and V5 metadata IPC streams can be read properly, with one exception: reading a V4 metadata stream containing Union arrays with top-level null values will be refused.
As noted above, there are no changes between V4 and V5 that break backwards compatibility. For forward compatibility scenarios (where you need to generate data to be read by an older Arrow library), you can set the V4 compatibility mode.
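A hedged Python sketch of requesting V4 metadata; it assumes the option is exposed as metadata_version on IpcWriteOptions and the version enum as pa.ipc.MetadataVersion:

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})
sink = pa.BufferOutputStream()

# Ask the writer for V4-compatible metadata so that readers on
# Arrow 0.8.0 through 0.17.1 can consume the stream.
options = pa.ipc.IpcWriteOptions(metadata_version=pa.ipc.MetadataVersion.V4)
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)
buf = sink.getvalue()  # V4-compatible IPC stream bytes
```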
Support for dictionary replacement and dictionary delta was implemented.
Parquet
Writing files with the LZ4 codec is disabled because it produces files incompatible with the widely-used Hadoop Parquet implementation. Support will be reenabled once we align the LZ4 implementation with the special buffer encoding expected by Hadoop.
Java notes
The Java package introduces a number of low-level changes in this release. Most notable are the work in support of allocating large Arrow buffers and the removal of Netty from the public API. Users will have to update their dependencies to use one of the two supported allocators: Netty (arrow-memory-netty) or Unsafe, an internal Java API for direct memory (arrow-memory-unsafe).
The Java Vector implementation has improved its interoperability: LargeVarChar, LargeBinary, LargeList, Union, and Extension types, as well as duplicate field names in Structs, have been verified to be binary compatible with C++ and the specification.
Python notes
The size of wheel packages has been reduced significantly, by up to 75%. One side effect is that these wheels do not enable Gandiva anymore (which requires the LLVM runtime to be statically linked). We are interested in providing Gandiva as an add-on package, distributed as a separate Python wheel, in the future.
The Scalar class hierarchy was reworked to more closely follow its C++ counterpart.
TLS CA certificates are looked up more reliably when using the S3 filesystem, especially with manylinux wheels.
The encoding of CSV files can now be specified explicitly, defaulting to UTF8. Custom timestamp parsers can now be used for CSV files.
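For example, roughly as follows (file name, encoding, and format string are hypothetical):

```python
import pyarrow.csv as csv

# Read a Latin-1 encoded CSV file and parse timestamps with a custom format.
read_opts = csv.ReadOptions(encoding="latin-1")
convert_opts = csv.ConvertOptions(timestamp_parsers=["%d/%m/%Y %H:%M"])
table = csv.read_csv("events.csv",
                     read_options=read_opts,
                     convert_options=convert_opts)
```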
Filesystems can now be implemented in pure Python. As a result, fsspec-based filesystems can now be used in datasets.
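A minimal sketch of the fsspec integration, assuming the adapter classes are exposed as pyarrow.fs.PyFileSystem and pyarrow.fs.FSSpecHandler (the path is hypothetical):

```python
import fsspec
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

# Wrap any fsspec filesystem (here an in-memory one) as an Arrow filesystem.
fs = PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))

# The wrapped filesystem can then be handed to the dataset API.
dataset = ds.dataset("bucket/table", format="parquet", filesystem=fs)
```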
parquet.read_table is now backed by the dataset API by default, enabling filters on any column and more flexible partitioning.
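For example (file and column names are hypothetical):

```python
import pyarrow.parquet as pq

# With the dataset backend, filters can reference any column,
# not only partition keys.
table = pq.read_table(
    "sales.parquet",
    columns=["region", "amount"],
    filters=[("amount", ">", 1000)],
)
```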
R notes
The R package added support for converting to and from many additional Arrow
types. Tables showing how R types are mapped to Arrow types and vice versa have
been added to the introductory vignette, and nearly all types are handled.
In addition, R attributes like custom classes and metadata are now preserved when converting a data.frame to an Arrow Table and are restored when loading them back into R.
For more on what’s in the 1.0.0 R package, see the R changelog.
Ruby and C GLib notes
The Ruby and C GLib packages added support for the new compute function framework, allowing users to look up a compute function dynamically and call it. Users don’t need to wait for a C GLib binding for new compute functions: if the C++ package provides a new compute function, users can use it from Ruby and C GLib without additional binding code.
The Ruby and C GLib packages added support for Apache Arrow Dataset. The Ruby package provides a new gem for Apache Arrow Dataset, red-arrow-dataset. The C GLib package provides a new module for Apache Arrow Dataset, arrow-dataset-glib. They only provide a few features for now, but we will add more in future releases.
The Ruby and C GLib packages added support for reading only the specified row group in an Apache Parquet file.
Ruby
The Ruby package added support for column level compression in writing Apache Parquet files.
The Ruby package changed the Arrow::DictionaryArray#[] behavior: it now returns the dictionary value instead of the dictionary index. This is a backwards-incompatible change.
Rust notes
- A new integration test crate has been added, allowing the Rust implementation to participate in integration testing.
- A new benchmark crate has been added for benchmarking performance against popular data sets. The initial examples run SQL queries against the NYC Taxi data set using DataFusion. This is useful for comparing performance against other Arrow implementations.
- The Rust toolchain has been upgraded to nightly 1.44.
Arrow Core
- Binary, string, and list arrays now support i64 offsets, enabling large lists.
- A new sort kernel has been added.
- There have been various improvements to dictionary array support.
- CSV reader enhancements include a new CsvReadOptions struct and support for schema inference from multiple CSV files.
- There are significant (10x - 40x) performance improvements to SIMD comparison kernels.
DataFusion
- There are numerous UX improvements to LogicalPlan and LogicalPlanBuilder, including support for named columns.
- General improvements to the code base, such as removing many uses of Arc and using slices instead of &Vec as function arguments.
- ParquetScanExec performance was improved (almost 2x).
- ExecutionContext can now be shared between threads.
- Rust closures can now be used as Scalar UDFs.
- Sort support has been added to SQL and LogicalPlan.