February 2022 Rust Apache Arrow and Parquet Highlights
Published 13 Feb 2022
By The Apache Arrow PMC (pmc)
The Rust implementation of Apache Arrow has just released version 9.0.2.
While a major version number this high may shock some in the Rust community, to whom it implies a slow-moving, 20-year-old piece of software, nothing could be further from the truth!
With regular and predictable bi-weekly releases, the library continues
to evolve rapidly, and 9.0.2 is no exception. Some recent highlights:
parquet: async, performance, safety and nested types
The parquet 9.0.2 release includes an async reader, a long-requested
feature. Using the async reader, it is now possible to read only the
relevant parts of a parquet file from a networked source such as object
storage. Previously the entire file had to be buffered locally. We are
hoping to add an async writer in a future release and would love some
help.
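To illustrate why ranged reads help: a Parquet file ends with its footer metadata, followed by a 4-byte little-endian metadata length and the magic bytes PAR1, so a reader can locate the footer after fetching only the last 8 bytes. The following is a minimal standard-library sketch of that first step, not the parquet crate's async API. The names locate_footer and FooterRange are hypothetical, and a synchronous Read + Seek source stands in for a ranged request against object storage.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Magic bytes that terminate every Parquet file.
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";

/// Byte range `[start, end)` of the footer metadata within the file.
/// Hypothetical helper type for this sketch, not part of the `parquet` crate.
#[derive(Debug, PartialEq)]
pub struct FooterRange {
    pub start: u64,
    pub end: u64,
}

/// Locate the footer metadata by reading only the final 8 bytes of the source.
pub fn locate_footer<R: Read + Seek>(source: &mut R) -> io::Result<FooterRange> {
    let file_len = source.seek(SeekFrom::End(0))?;
    if file_len < 8 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too small"));
    }

    // Fetch the trailing 4-byte length and 4-byte magic, nothing else.
    source.seek(SeekFrom::Start(file_len - 8))?;
    let mut tail = [0u8; 8];
    source.read_exact(&mut tail)?;

    if tail[4..] != PARQUET_MAGIC[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }

    let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;
    if metadata_len + 8 > file_len {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "footer length out of range"));
    }

    // The metadata sits immediately before the 8-byte trailer.
    let end = file_len - 8;
    Ok(FooterRange { start: end - metadata_len, end })
}
```

With the footer range in hand, a reader can fetch the metadata and then request only the column chunks a query needs, rather than buffering the whole file.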
It is also significantly faster to read parquet data (up to 60x in some
cases) than with previous versions of the parquet crate. Kudos to
tustvold and yordan-pavlov for their contributions in these areas.
With 8.0.0 and later, the code that reads and writes RecordBatches
to and from Parquet now supports all types, including deeply nested
structs and lists. Thanks helgikrs for
cleaning up the last corner cases!
Other notable recent additions to parquet include UTF-8 validation of
string data for improved security against malicious inputs.
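As a rough illustration of what such validation involves, the sketch below checks a column of raw byte strings using only the Rust standard library. It is not the parquet crate's implementation, and validate_strings is a hypothetical name. The point is that bytes from an untrusted file are checked before they are ever exposed as &str.

```rust
use std::str::Utf8Error;

/// Validate every raw value and return borrowed `&str`s, or the first UTF-8 error.
/// Hypothetical helper for this sketch, not part of the `parquet` crate.
pub fn validate_strings<'a>(raw_values: &[&'a [u8]]) -> Result<Vec<&'a str>, Utf8Error> {
    raw_values
        .iter()
        // `from_utf8` borrows the bytes and rejects any invalid sequence.
        .map(|&bytes| std::str::from_utf8(bytes))
        // Collecting into `Result` stops at the first invalid value.
        .collect()
}
```

Skipping this check and treating arbitrary bytes as a Rust string would be undefined behavior, which is why validating untrusted input matters here.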
Planned upcoming work includes pushing more filtering directly
into the parquet scan as well as an async writer.
arrow: performance, dyn kernels, and DecimalArray
The compute kernels have been improved significantly in arrow 9.0.2.
Some filter benchmarks are twice as fast and the SIMD kernels are also
significantly faster. Many thanks to tustvold and jhorstmann.
Additional substantial improvements are likely to land in arrow 10.0.0.
We are working on a new set of “dynamic” dyn_ kernels (for example,
eq_dyn) that make it easier to invoke the heavily optimized kernels
provided by the arrow crate. Work is underway to expand the breadth of
types supported by these new kernels to make them even more useful.
Thanks to matthewmturner and viirya for their help in this effort.
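The idea behind a dynamic kernel can be sketched with a toy example: the caller passes type-erased arrays, and the kernel inspects the concrete type at runtime before dispatching to a statically typed implementation. Everything below (ToyArray, ToyInt32, ToyUtf8, toy_eq_dyn) is a hypothetical stand-in written against the standard library only. It is not the arrow crate's API and makes no attempt to match its performance.

```rust
use std::any::Any;

/// Toy stand-in for a type-erased array (hypothetical, not the arrow trait).
pub trait ToyArray {
    fn as_any(&self) -> &dyn Any;
    fn len(&self) -> usize;
}

pub struct ToyInt32(pub Vec<i32>);
pub struct ToyUtf8(pub Vec<String>);

impl ToyArray for ToyInt32 {
    fn as_any(&self) -> &dyn Any { self }
    fn len(&self) -> usize { self.0.len() }
}

impl ToyArray for ToyUtf8 {
    fn as_any(&self) -> &dyn Any { self }
    fn len(&self) -> usize { self.0.len() }
}

/// Statically typed kernel: element-wise equality over two slices.
fn eq_typed<T: PartialEq>(left: &[T], right: &[T]) -> Vec<bool> {
    left.iter().zip(right).map(|(l, r)| l == r).collect()
}

/// Dynamic entry point: pick the typed kernel from the runtime types.
pub fn toy_eq_dyn(left: &dyn ToyArray, right: &dyn ToyArray) -> Result<Vec<bool>, String> {
    if left.len() != right.len() {
        return Err("arrays must have the same length".to_string());
    }
    let (l_any, r_any) = (left.as_any(), right.as_any());

    if let (Some(l), Some(r)) = (l_any.downcast_ref::<ToyInt32>(), r_any.downcast_ref::<ToyInt32>()) {
        return Ok(eq_typed(&l.0, &r.0));
    }
    if let (Some(l), Some(r)) = (l_any.downcast_ref::<ToyUtf8>(), r_any.downcast_ref::<ToyUtf8>()) {
        return Ok(eq_typed(&l.0, &r.0));
    }
    Err("unsupported or mismatched array types".to_string())
}
```

The benefit to callers is that they no longer have to downcast each array themselves before calling a kernel. Supporting a new type means adding one more dispatch arm.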
While arrow has had basic support for DecimalArray since version
3.0.0, support for the Decimal type has been expanded in calculation
kernels such as sort, take and filter thanks to some great
contributions from liukun4515. There is ongoing work to
improve the API ergonomics and performance of DecimalArray as well.
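For readers unfamiliar with these kernels, the sketch below shows what sort, take and filter mean for decimal data, assuming the usual Arrow representation of a decimal as an unscaled 128-bit integer plus a scale. ToyDecimalColumn and its methods are hypothetical and use only the standard library. They ignore nulls and precision checks, and are not the arrow crate's DecimalArray.

```rust
/// Toy decimal column: unscaled `i128` values sharing one scale.
/// Hypothetical type for this sketch; nulls and precision are not modeled.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDecimalColumn {
    pub values: Vec<i128>,
    pub scale: u32,
}

impl ToyDecimalColumn {
    fn with_values(&self, values: Vec<i128>) -> Self {
        ToyDecimalColumn { values, scale: self.scale }
    }

    /// Sort ascending. With a shared scale, integer order equals decimal order.
    pub fn sort(&self) -> Self {
        let mut values = self.values.clone();
        values.sort_unstable();
        self.with_values(values)
    }

    /// Gather the values at `indices`. Panics if an index is out of bounds.
    pub fn take(&self, indices: &[usize]) -> Self {
        self.with_values(indices.iter().map(|&i| self.values[i]).collect())
    }

    /// Keep the values whose mask entry is `true`.
    pub fn filter(&self, mask: &[bool]) -> Self {
        let kept = self
            .values
            .iter()
            .zip(mask)
            .filter(|(_, &keep)| keep)
            .map(|(&v, _)| v)
            .collect();
        self.with_values(kept)
    }

    /// Render one value as decimal text by applying the scale.
    pub fn format(&self, index: usize) -> String {
        let value = self.values[index];
        let sign = if value < 0 { "-" } else { "" };
        let abs = value.unsigned_abs();
        if self.scale == 0 {
            return format!("{}{}", sign, abs);
        }
        let divisor = 10u128.pow(self.scale);
        let width = self.scale as usize;
        format!("{}{}.{:0width$}", sign, abs / divisor, abs % divisor, width = width)
    }
}
```

Because every value in a column shares the same scale, these operations reduce to plain integer operations on the unscaled values.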
Security
The 6.4.0 release resolved the last outstanding RUSTSEC advisory on the
arrow crate and the 8.0.0 release resolved the last outstanding
known security issues. While these security issues were mostly limited
to misuse of the low level “power user” APIs, which most users do not
(and should not) use, it was good to tighten up that area.
Now that arrow-rs is releasing major versions every other week, we
are also able to update dependencies at the same pace, helping to
ensure that security fixes upstream can flow more quickly to
downstream projects.
Final shoutout
It takes a community to build great software, and we would like to
thank everyone who has contributed to the arrow-rs repository since
the 7.0.0 release:
git shortlog -sn 7.0.0..9.0.0
22 Raphael Taylor-Davies
18 Andrew Lamb
6 Helgi Kristvin Sigurbjarnarson
6 Remzi Yang
5 Jörn Horstmann
4 Liang-Chi Hsieh
3 Jiayu Liu
2 dependabot[bot]
2 Yijie Shen
1 Matthew Turner
1 Kun Liu
1 Yang
1 Edd Robinson
1 Patrick More
How to Get Involved
If you are interested in contributing to the Rust subproject in Apache Arrow, the arrow-rs repository's issue tracker has a list of open issues suitable for beginners as well as the full list of open issues.
Other ways to get involved include trying out Arrow on some of your data and filing bug reports, and helping to improve the documentation.