Apache Arrow 0.4.0 Release
Published 23 May 2017 by Wes McKinney (wesm)
The Apache Arrow team is pleased to announce the 0.4.0 release of the project. Although only 17 days have passed since the 0.3.0 release, it includes 77 resolved JIRAs, with some important new features and bug fixes.
See the Install Page to learn how to get the libraries for your platform.
Expanded JavaScript Implementation
The TypeScript Arrow implementation has undergone some work since 0.3.0 and can now read a substantial portion of the Arrow streaming binary format. As this implementation develops, we will eventually want to include JS in the integration test suite along with Java and C++ to ensure wire cross-compatibility.
Python Support for Apache Parquet on Windows
With the 1.1.0 C++ release of Apache Parquet, we have enabled the pyarrow.parquet extension on Windows for Python 3.5 and 3.6. This should appear in conda-forge packages and PyPI in the near future. Developers can follow the source build instructions.
Generalizing Arrow Streams
In the 0.2.0 release, we defined the first version of the Arrow streaming binary format for low-cost messaging with columnar data. These streams presume that the message components are written as a continuous byte stream over a socket or file.
We would like to be able to support other transport protocols, like gRPC, for the message components of Arrow streams. To that end, in C++ we defined an abstract stream reader interface, for which the current contiguous streaming format is one implementation:
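class RecordBatchReader {
 public:
  // Schema shared by the record batches in the stream
  virtual std::shared_ptr<Schema> schema() const = 0;
  // Read the next record batch from the stream into *batch
  virtual Status GetNextRecordBatch(std::shared_ptr<RecordBatch>* batch) = 0;
};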
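To illustrate how a transport that delivers discrete messages might plug into this interface, here is a minimal, self-contained sketch. It is not Arrow code: Status, Schema, and RecordBatch are simplified stand-ins for the Arrow classes, QueuedBatchReader, Push, and CountRows are hypothetical names, the virtual destructor is an addition for the sketch, and signalling end of stream with an OK status and a null batch is an assumption rather than something stated in this post.

```cpp
#include <cstdint>
#include <deque>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-ins for arrow::Status, arrow::Schema, and arrow::RecordBatch so the
// sketch compiles without Arrow; real code would use the Arrow headers.
struct Status {
  bool ok_flag;
  std::string message;
  static Status OK() { return Status{true, ""}; }
  bool ok() const { return ok_flag; }
};

struct Schema {
  std::vector<std::string> field_names;
};

struct RecordBatch {
  std::shared_ptr<Schema> schema;
  int64_t num_rows;
};

// The abstract interface from the post, plus a virtual destructor (added
// here) so implementations can be destroyed through a base-class pointer.
class RecordBatchReader {
 public:
  virtual ~RecordBatchReader() = default;
  virtual std::shared_ptr<Schema> schema() const = 0;
  virtual Status GetNextRecordBatch(std::shared_ptr<RecordBatch>* batch) = 0;
};

// Hypothetical implementation: batches arrive as separate messages (as they
// might from an RPC framework) and are queued until the consumer asks.
class QueuedBatchReader : public RecordBatchReader {
 public:
  explicit QueuedBatchReader(std::shared_ptr<Schema> schema)
      : schema_(std::move(schema)) {}

  // Called by the (hypothetical) transport layer when a message arrives.
  void Push(int64_t num_rows) {
    pending_.push_back(
        std::make_shared<RecordBatch>(RecordBatch{schema_, num_rows}));
  }

  std::shared_ptr<Schema> schema() const override { return schema_; }

  // Assumption: end of stream is reported as an OK status and a null batch.
  Status GetNextRecordBatch(std::shared_ptr<RecordBatch>* batch) override {
    if (pending_.empty()) {
      batch->reset();
      return Status::OK();
    }
    *batch = pending_.front();
    pending_.pop_front();
    return Status::OK();
  }

 private:
  std::shared_ptr<Schema> schema_;
  std::deque<std::shared_ptr<RecordBatch>> pending_;
};

// Consumer code depends only on the abstract interface.
int64_t CountRows(RecordBatchReader* reader) {
  int64_t total = 0;
  std::shared_ptr<RecordBatch> batch;
  while (reader->GetNextRecordBatch(&batch).ok() && batch != nullptr) {
    total += batch->num_rows;
  }
  return total;
}
```

Because CountRows only sees a RecordBatchReader pointer, the same consumer would work unchanged with a reader backed by a socket, a file, or a message-oriented transport.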
It would also be good to define abstract stream reader and writer interfaces in the Java implementation.
In an upcoming blog post, we will explain in more depth how Arrow streams work, but you can learn more about them by reading the IPC specification.
C++ and Cython API for Python Extensions
As other Python libraries with C or C++ extensions use Apache Arrow, they will need to be able to return Python objects wrapping the underlying C++ objects. In this release, we have implemented a prototype C++ API which enables Python wrapper objects to be constructed from C++ extension code:
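#include "arrow/python/pyarrow.h"
// Initialize the pyarrow C++ API; import_pyarrow() returns 0 on success
if (arrow::py::import_pyarrow() != 0) {
  // Error
}
// Wrap the C++ record batch in a Python object
std::shared_ptr<arrow::RecordBatch> cpp_batch = GetData(...);
PyObject* py_batch = arrow::py::wrap_batch(cpp_batch);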
This API is intended to be usable from Cython code as well:
cimport pyarrow
pyarrow.import_pyarrow()
Python Wheel Installers on macOS
With this release, pip install pyarrow works on macOS (OS X) as well as Linux. We are also working on providing binary wheel installers for Windows.