Is there a size limitation/problem when writing the Apache Arrow format with the Java API?

My Arrow writer, which reads data from CSV files, works fine for data smaller than 1 GB, but gets stuck at about that limit (the writing code seems to block). The process has enough memory (-Xmx12g) and the data size is about 1.2 GB. A similarly structured file with fewer rows and approx. 0.4 GB works fine with the same code.

I am just interested to know whether Apache Arrow currently has limitations on the size of the vectors created or on the number of rows.

asked by ronstein2000

1 Answer

It would be good to clarify how exactly it is failing (I assume you are seeing an exception), but to address the question:

Currently, each buffer that comprises a vector has a 2 GB limit. In addition, all Arrow vectors currently use an int index, so there is a row limit of 2^31-1. Due to how the default allocation process works (doubling of buffer sizes), you might be hitting that limit earlier than expected if you don't preallocate (see the sketch below).
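As a minimal sketch (not your code, and using a hypothetical column name and row count), this is what preallocating a vector looks like with the org.apache.arrow.vector API, so the allocator reserves the buffer up front instead of repeatedly doubling it as rows arrive:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;

public class PreallocateExample {
    public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             BigIntVector vector = new BigIntVector("value", allocator)) {
            int rows = 10_000_000;           // hypothetical row count; must stay below 2^31-1
            vector.setInitialCapacity(rows); // reserve capacity up front instead of letting
            vector.allocateNew();            // setSafe() double the buffer repeatedly
            for (int i = 0; i < rows; i++) {
                vector.setSafe(i, i);        // setSafe grows the buffer (by doubling) if needed
            }
            vector.setValueCount(rows);
            // Note: each underlying buffer is still limited to 2 GB, so one huge vector can
            // fail even with a large JVM heap; Arrow allocates its memory off-heap.
        }
    }
}
```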

The best practice for Arrow in general, and Java in particular, is to create small batches (e.g. read N rows, convert them to a batch, and write them out, instead of trying to convert the whole file at once); a sketch of this pattern is shown below. An example of this approach is viewable in the recently refactored JDBC adapter.
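As a rough illustration of that batching pattern (the schema, batch size, output path, and the stand-in for CSV parsing are all placeholders), one way to do it with ArrowStreamWriter is to fill the vectors of a single VectorSchemaRoot, write the batch, and then reuse the same root for the next chunk:

```java
import java.io.FileOutputStream;
import java.nio.channels.Channels;
import java.util.Collections;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class BatchWriteExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema(Collections.singletonList(
                Field.nullable("id", new ArrowType.Int(32, true))));
        int batchSize = 100_000;   // hypothetical batch size; keep batches small
        int totalRows = 1_000_000; // stand-in for "rows in the CSV file"
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
             FileOutputStream out = new FileOutputStream("data.arrow");
             ArrowStreamWriter writer = new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
            writer.start();
            IntVector ids = (IntVector) root.getVector("id");
            for (int offset = 0; offset < totalRows; offset += batchSize) {
                int rows = Math.min(batchSize, totalRows - offset);
                ids.allocateNew(rows);                // fresh buffers sized for this batch
                for (int i = 0; i < rows; i++) {
                    ids.setSafe(i, offset + i);       // in practice: values parsed from the CSV row
                }
                root.setRowCount(rows);
                writer.writeBatch();                  // flush this batch; the root is reused next round
            }
            writer.end();
        }
    }
}
```

This keeps each batch's buffers far below the 2 GB per-buffer limit regardless of how large the input file is.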

There has been recent discussion on the developer mailing list about changing the API to support 64-bit indexing/sizes.

answered by Micah Kornfield
