Is there a size limitation/problem when writing the Apache Arrow format with the Java API?

My Arrow writer, which reads data from CSV files, works fine for data smaller than 1 GB, but gets stuck at about that limit (the writing code seems to block). The process has enough memory (-Xmx12g) and the data size is about 1.2 GB. A similarly structured file with fewer rows and approx. 0.4 GB works fine with the same code.

I am just interested to know whether Apache Arrow currently has limitations on the size of the vectors created or on the number of rows.

asked by ronstein2000

1 Answer

It would be good to clarify how exactly it is failing (I assume you are seeing an exception), but to address the question:

Currently, each buffer that comprises a vector has a 2 GB limit. In addition, all Arrow vectors currently use an int index, so there is a row limit of 2^31-1. Due to how the default allocation process works (doubling of buffer sizes), you might be hitting that limit earlier than expected if you don't preallocate (see the sketch below).
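As a minimal sketch (not your code, and using a hypothetical column name and row count), this is what preallocating a vector looks like with the org.apache.arrow.vector API, so the allocator reserves the buffer up front instead of repeatedly doubling it as rows arrive:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;

public class PreallocateExample {
    public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             BigIntVector vector = new BigIntVector("value", allocator)) {
            int rows = 10_000_000;           // hypothetical row count; must stay below 2^31-1
            vector.setInitialCapacity(rows); // reserve capacity up front instead of letting
            vector.allocateNew();            // setSafe() double the buffer repeatedly
            for (int i = 0; i < rows; i++) {
                vector.setSafe(i, i);        // setSafe grows the buffer (by doubling) if needed
            }
            vector.setValueCount(rows);
            // Note: each underlying buffer is still limited to 2 GB, so one huge vector can
            // fail even with a large JVM heap; Arrow allocates its memory off-heap.
        }
    }
}
```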

The best practice for Arrow in general, and Java in particular, is to create small batches (e.g. read N rows, convert them to a batch, and write them out, instead of trying to convert the whole file at once); a sketch of this pattern is shown below. An example of this approach is viewable in the recently refactored JDBC adapter.
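As a rough illustration of that batching pattern (the schema, batch size, output path, and the stand-in for CSV parsing are all placeholders), one way to do it with ArrowStreamWriter is to fill the vectors of a single VectorSchemaRoot, write the batch, and then reuse the same root for the next chunk:

```java
import java.io.FileOutputStream;
import java.nio.channels.Channels;
import java.util.Collections;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class BatchWriteExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema(Collections.singletonList(
                Field.nullable("id", new ArrowType.Int(32, true))));
        int batchSize = 100_000;   // hypothetical batch size; keep batches small
        int totalRows = 1_000_000; // stand-in for "rows in the CSV file"
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
             FileOutputStream out = new FileOutputStream("data.arrow");
             ArrowStreamWriter writer = new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
            writer.start();
            IntVector ids = (IntVector) root.getVector("id");
            for (int offset = 0; offset < totalRows; offset += batchSize) {
                int rows = Math.min(batchSize, totalRows - offset);
                ids.allocateNew(rows);                // fresh buffers sized for this batch
                for (int i = 0; i < rows; i++) {
                    ids.setSafe(i, offset + i);       // in practice: values parsed from the CSV row
                }
                root.setRowCount(rows);
                writer.writeBatch();                  // flush this batch; the root is reused next round
            }
            writer.end();
        }
    }
}
```

This keeps each batch's buffers far below the 2 GB per-buffer limit regardless of how large the input file is.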

There has been recent discussion on the developer mailing list about changing the API to support 64-bit indexing/sizes.

answered by Micah Kornfield
