Apache Arrow 10.0.0 Release
Published
31 Oct 2022
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 10.0.0 release. This covers over 3 months of development work and includes 473 resolved issues from 100 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 9.0.0 release, Yanghong Zhong, Remzi Yang, Dan Harris and Bogumił Kamińsk have been invited to be committers. L.C. Hsieh, Weston Pace, Raphael Taylor-Davies, Nicola Crane and Jacob Quinn have joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!
Columnar Format Notes
Arrow Flight RPC notes
A JDBC driver based on Arrow Flight SQL is now available, courtesy of a code donation from Dremio (ARROW-7744). For more details, see “Expanding Arrow’s Reach with a JDBC Driver for Arrow Flight SQL”. Flight SQL is now supported in Go (ARROW-17326). Protocol definitions for transactions and Substrait plans were added to Flight SQL and are implemented in C++ and Java (ARROW-17688). General “best practices” documentation was added for C++ (ARROW-17407). The C++ implementation now has basic OpenTelemetry integration (ARROW-14958).
C++ notes
C++11 is no longer supported
The Arrow C++ codebase has moved to C++17 as its language compatibility standard (ARROW-17545). This means Arrow C++, including its header files, now requires a C++17-compliant compiler and standard library to be used. Such compilers are widely available on most platforms.
Compatibility backports of C++14 and C++17 features, such as std::string_view
or std::variant
, have been removed in favor of the standard library version
of these APIs (ARROW-17546). This will also make integration of Arrow C++
with other codebases easier.
It is expected that the Arrow C++ codebase will be gradually modernized to use C++17 features in subsequent releases, when the need arises.
Plasma is deprecated
The Plasma module is deprecated and will be removed in a future release.
Compute / Acero
Extension types are now supported in hash joins (ARROW-16695).
Datasets
The fragments of a dataset can now be iterated in an asynchronous fashion,
using Dataset::GetFragmentsAsync
(ARROW-17318).
Filesystems
It is now possible to configure a timeout policy for S3 (ARROW-16521).
Error messages for S3 have been improved to give more context about the error (ARROW-17079).
GetFileInfoGenerator
has been optimized for local filesystems, with
dedicated options to tune chunking and readahead (ARROW-17306).
JSON
Previously, the JSON reader could only read Decimal fields from JSON strings (i.e. quoted). Now it can also read Decimal fields from JSON numbers as well (ARROW-17847).
Parquet
Before Arrow 3.0.0, data pages version 2 were incorrectly written out, making them unreadable with spec-compliant readers. A compatibility fix has been introduced so that they can still be read with contemporary versions of Arrow (ARROW-17100).
Substrait
The Substrait consumer, which allows Substrait plans to be executed by the Acero execution engine, has received some improvements:
-
Aggregations are now supported (ARROW-15591).
-
Conversion options have been added so that the level of compliance and rountrippability can be chosen when converting between Substrait and Acero representations of a plan (ARROW-16988).
-
Support for many standard Substrait functions has been added (ARROW-15582, ARROW-17523)
Some work has also been done in the reverse direction, to allow Acero execution plans to be serialized as Substrait plans (ARROW-16855).
Other changes
Our CMake package files have been overhauled (ARROW-12175). As a result,
namespaced targets are now exported, such as Arrow::arrow_shared
.
Legacy (non-namespaced) names are still available, for example arrow_shared
.
Compiling in release mode now uses -O2
, not -O3
, by default (ARROW-17436).
The RISC-V architecture is now recongnized at build time (ARROW-17440).
The PyArrow-specific C++ code was moved into the PyArrow source tree
(see below, ARROW-16340). The ARROW_PYTHON
CMake variable has been deprecated
and will be removed in a later release; you should instead enable the necessary
components explicitly (ARROW-17868).
Some classes with a Equals
method now also support operator==
(ARROW-6772).
It was decided to only do this when equality is computationally cheap (i.e.
not on data collections such as Array, RecordBatch…).
C# notes
Bug Fixes
- DecimalArray incorrectly appends values very large and very small values. (ARROW-17223)
Gandiva notes
Gandiva has been migrated to use LLVM opaque pointer types, as typed pointers had been deprecated (ARROW-17790).
Go notes
- A new CI job has been added to run all of the tests with the
-asan
option using go1.18 (ARROW-17324) - Go now passes all integration tests on data types and IPC handling.
- The Go Arrow and Parquet packages now require go1.17+ (ARROW-17646)
Compute
The compute package that was importable via github.com/apache/arrow/go/v9/arrow/compute
is now a separate module which requires go1.18+ (only the compute module, the rest of the packages still work fine under go1.17). (ARROW-17456).
Scalar and Vector kernel infrastructure has been implemented for performing compute operations providing the following functionality:
- casting Arrow Arrays from one type to another (ARROW-17454)
- Using Filter and Take functions on an Arrow Array, Record Batch, or Table (ARROW-17669)
Arrow
- Sparse and Dense Union Arrays have been implemented along with appropriate builders and data type support including in IPC reading and writing. (ARROW-3678, ARROW-17276). This includes scalar types for Dense and Sparse union (ARROW-17390)
- LargeBinary, LargeString and LargeList arrays have been implemented for handling arrays with 64-bit offsets. This also included fixing a bug so that binary builders are correctly resettable. (ARROW-8226, ARROW-17275)
- Support for Decimal256 arrays has been implemented (ARROW-10600)
- Automatic Endianness Conversion for non-native endianness is now an option for IPC streams (ARROW-17219)
- CSV Writer now supports Timestamp, Date32 and Date64 types (ARROW-17273)
- CSV Writer now supports custom formatting for boolean values (ARROW-17277)
- The Go Arrow Library now provides a FlightSQL client and server implementation (ARROW-17326). An example server implementation is provided for a FlightSQL server using SQLite (ARROW-17359)
- CSV Reader now supports schema type inference via NewInferringReader, along with options for specifying the type of some columns and skipping columns (ARROW-17778)
Parquet
- RowGroupReader.Column(index int) no longer panics if provided an invalid column index. Instead the signature has been changed to now return (PageReader, error) similar to other methods in the codebase. (ARROW-17274)
- Bitpacking and other internal required implementations for ppc64le have been added for the Parquet library (ARROW-17372)
- A bug has been fixed that caused inconsistent row information data from a table written by Athena (ARROW-17453)
- Fixed a bug that caused panics when writing a Nullable List of Structs (ARROW-17169)
- Key Value metadata in an Arrow Schema will be propagated to the Parquet file when using pqarrow even when not using StoreSchema (ARROW-17627)
- A memory leak when using statistics on ByteArray columns has been fixed (ARROW-17573)
Java notes
Many important features, enhancements, and bug fixes are included in this release, as are documentation enhancements, and a large number of improvements to build processes and project infrastructure. Selected highlights can be found below.
New Features and Enhancements
- JDBC Driver for Arrow Flight SQL (13800)
- Initial implementation of immutable Table API (14316)
- Substrait, transaction, cancellation for Flight SQL (13492)
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory (13811, 13973, 14182)
- Add utility to bind Arrow data to JDBC parameters (13589)
Build enhancements
- Add Windows build script that produces DLLs (14203)
- C Data Interface and Dataset libraries can now be built with mvn commands (13881, 13889)
Java notes:
- Java Plasma JNI bindings have been deprecated (14262)
JavaScript notes
- No major changes.
Python notes
Compatibility notes:
- Some deprecated APIs (deprecated at least since pyarrow 1.0.0) have been removed: the
RecordBatchReader.get_next_batch
method,DataType.num_children
attribute, etc (ARROW-17649). - When writing to Arrow IPC file format with
pyarrow.dataset.write_dataset
usingformat="ipc"
orformat="arrow"
, the default extension for the resulting files is changed to.arrow
instead of.feather
. You can still useformat="feather"
to write identical files but using the.feather
extension (ARROW-17089).
New features and improvements:
- Filters in
pq.read_table()
can be passed as an expression in addition to the legacy list of tuples. For example,filters=pc.field("col") < 4
is equivalent tofilters=[("col", "<", 4)]
(ARROW-17483). - The
batch_readahead
andfragment_readahead
arguments for scanning Datasets are exposed in Python (ARROW-17299). - ExtensionArrays can now be created from a storage array through the
pa.array(..)
constructor (ARROW-17834). - Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array (ARROW-17813).
- The
pyarrow.substrait.run_query()
function gained atable_provider
keyword to run the query against in-memory tables (ARROW-17521). - The
StructType
class gained afield()
method to retrieve a child field (ARROW-17131). - Casting Tables to a new schema now honors the nullability flag in the target schema (ARROW-16651).
- Parquet files are now explicitly closed after reading (ARROW-13763).
Further, the Python bindings benefit from improvements in the C++ library (e.g. new compute functions); see the C++ notes above for additional details.
Build notes
The PyArrow-specific C++ code, previously part of Arrow C++ codebase, is now integrated into PyArrow. The tests are run automatically as part of the PyArrow test suite. See: ARROW-16340, ARROW-17122 and PyArrow C++ API notes).
The build process is generally not affected by the change, but the ARROW_PYTHON
CMake
variable has been deprecated and will be removed in a later release; you should instead
enable the necessary components explicitly
(ARROW-17868).
R notes
Many improvements to Arrow dplyr queries are added in this version, including:
dplyr::across()
can be used to apply the same computation across multiple columns;- long-running queries can now be cancelled;
- the data source file name can be added as a column when reading multi-file datasets with
add_filename()
; - joins now support extension arrays;
- and all supported Arrow dplyr functions are now documented on the R documentation site.
For more on what’s in the 10.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
- Plasma binding has been deprecated (ARROW-17864)
C GLib
- Plasma binding has been deprecated (ARROW-17862)
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.