My Arrow writer that reads data from CSV files works fine for data smaller than about 1 GB, but gets stuck at roughly this limit (the writing code seems to block). The process has enough memory (-Xmx12g) and the data size is about 1.2 GB. A similarly structured file with fewer rows (approx. 0.4 GB) works fine with the same code.
I am just interested to know whether Apache Arrow currently has limitations on the size of the vectors it creates or on the number of rows.
It would be good to clarify how exactly it is failing (I assume you are seeing an exception). But to address the question:
Currently, each buffer that makes up a vector has a 2 GB limit. In addition, all Arrow vectors currently use an int index, so there is a row limit of 2^31 - 1. Due to how the default allocation process works (doubling of buffer sizes), you might be hitting the actual limit sooner than expected if you do not preallocate.
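As a minimal sketch of what preallocation looks like (assuming the Arrow Java API, with a made-up expectedRows value known ahead of time), reserving capacity up front avoids the repeated doubling that can push a single buffer toward the 2 GB limit:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class PreallocateExample {
  public static void main(String[] args) {
    int expectedRows = 50_000_000;  // hypothetical row count known in advance

    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector ids = new IntVector("id", allocator)) {
      // Reserve capacity up front instead of relying on repeated doubling.
      ids.setInitialCapacity(expectedRows);
      ids.allocateNew();

      for (int i = 0; i < expectedRows; i++) {
        ids.setSafe(i, i);  // setSafe still grows the buffer if needed
      }
      ids.setValueCount(expectedRows);
    }
  }
}
```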
The best practice for Arrow in general, and in Java in particular, is to create small batches (e.g. read N rows, convert them to a batch, and write them out, instead of trying to read the whole file at once); see the sketch below. An example of this approach can be seen in the recently refactored JDBC adapter.
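A rough sketch of that batching pattern, assuming the Arrow Java streaming API and a hypothetical single-column CSV (the column name, batch size, and file paths are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class CsvToArrowBatches {
  private static final int BATCH_SIZE = 100_000;  // rows per record batch (illustrative)

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema(
        Collections.singletonList(Field.nullable("value", new ArrowType.Utf8())));

    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
         BufferedReader csv = new BufferedReader(new FileReader("input.csv"));
         FileOutputStream out = new FileOutputStream("output.arrow");
         ArrowStreamWriter writer =
             new ArrowStreamWriter(root, null, Channels.newChannel(out))) {

      VarCharVector values = (VarCharVector) root.getVector("value");
      writer.start();

      String line;
      int row = 0;
      while ((line = csv.readLine()) != null) {
        values.setSafe(row, line.getBytes(StandardCharsets.UTF_8));
        row++;
        if (row == BATCH_SIZE) {    // flush a full batch, then reuse the same root
          root.setRowCount(row);
          writer.writeBatch();
          root.allocateNew();       // reset buffers for the next batch
          row = 0;
        }
      }
      if (row > 0) {                // flush the final partial batch
        root.setRowCount(row);
        writer.writeBatch();
      }
      writer.end();
    }
  }
}
```

Because each batch is bounded, no single buffer grows toward the 2 GB limit no matter how large the input file is.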
There has been recent discussion on the developer mailing list about changing the API to support 64-bit indexing/sizes.