How do I read only part of a column from a Parquet file using Parquet.net?

Tags: c#, dataframe, datatables, parquet, bigdata

I am using Parquet.Net to read Parquet files, but the only option I can find for reading data from the file is this:

//get the first group
Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0);

//gets the first column
Parquet.Data.DataColumn col1 = rowGroup.ReadColumn(myParquet.Schema.GetDataFields()[0]);

This lets me get the first column from the first row group. The problem is that the first row group can hold something like 4 million rows, and ReadColumn will read all 4 million values.

How do I tell ReadColumn that I only want it to read, say, the first 100 rows? Reading all 4 million rows wastes memory and file read time.

I actually got an out-of-memory error until I changed my code to resize that 4-million-value array down to my 100 values after reading each column.
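For reference, the trimming workaround described above might look roughly like the sketch below. `ColumnTrim` and `TakeFirst` are hypothetical names, and the element type is assumed to be known to the caller. Note that this only frees memory after the fact: the full column has still been read from the file by the time it runs.

```csharp
using System;

static class ColumnTrim
{
    // Copies at most `count` leading elements into a new, smaller array so the
    // original full-size array can be garbage collected.
    public static T[] TakeFirst<T>(T[] values, int count)
    {
        int n = Math.Min(Math.Max(count, 0), values.Length);
        var result = new T[n];
        Array.Copy(values, result, n);
        return result;
    }
}

// Assumed usage with the column from the snippet above, if it holds int values:
//   int[] first100 = ColumnTrim.TakeFirst((int[])col1.Data, 100);
```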

I don't necessarily need row-based access. I can work with columns; I just don't need a whole row group's worth of values in each column. Is this possible? If row-based access is better, how does one use it? The Parquet.Net project site doesn't give any examples and just talks about tables.

asked Jul 21 '20 01:07

Ranald Fong

2 Answers

According to the source code this capability exists in DataColumnReader but this is an internal class and thus not directly usable.

ParquetRowGroupReader uses it inside its ReadColumn method, but exposes no such options.

What can be done in practice is to copy the whole DataColumnReader class into your own code and use it directly, but this could cause compatibility issues with future versions of the library.

If the problem can wait for some time, I'd recommend copying the class and then opening an issue + pull request to the library with the enhanced class, so the copied class can eventually be removed.

answered Sep 24 '22 02:09

henry700

The parquet-dotnet documentation recommends against writing more than 5000 records into one row group for performance reasons, although the bottom of the same page says row groups are designed to hold 50000 rows on average:

It's not recommended to have more than 5'000 rows in a single row group for performance reasons

My team works with 100000 rows per row group. The right size may depend on what you are storing, but 4000000 records in a single row group does sound like too much.

So, to answer your question: to read only part of a column, make your row groups smaller and then read only as many row groups as you need. If you only want 100 records, read the first row group and take the first 100 values from it; reasonably sized row groups are very fast to read.
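As a rough sketch of the reading side, the helper below walks the row groups in order and stops as soon as it has enough values, so later row groups are never touched. `PartialColumnRead`, `ReadFirst`, and the `readGroup` callback are hypothetical names made up for illustration, not part of Parquet.Net. The commented wiring assumes the synchronous reader API shown in the question and a column of int values.

```csharp
using System;
using System.Collections.Generic;

static class PartialColumnRead
{
    // Returns the first `count` values of a column, reading row groups in order
    // and stopping once enough values have been collected.
    // `readGroup` is a hypothetical callback: given a row group index, it returns
    // that row group's values for the column of interest.
    public static T[] ReadFirst<T>(int rowGroupCount, Func<int, T[]> readGroup, int count)
    {
        var result = new List<T>();
        for (int i = 0; i < rowGroupCount && result.Count < count; i++)
        {
            T[] values = readGroup(i);
            int take = Math.Min(values.Length, count - result.Count);
            for (int j = 0; j < take; j++)
            {
                result.Add(values[j]);
            }
        }
        return result.ToArray();
    }
}

// Possible wiring against the synchronous Parquet.Net reader used in the question
// (assumes the first column holds int values):
//
//   using (var reader = new ParquetReader(stream))
//   {
//       DataField field = reader.Schema.GetDataFields()[0];
//       int[] first100 = PartialColumnRead.ReadFirst(
//           reader.RowGroupCount,
//           i =>
//           {
//               using (ParquetRowGroupReader group = reader.OpenRowGroupReader(i))
//               {
//                   return (int[])group.ReadColumn(field).Data;
//               }
//           },
//           100);
//   }
```

Each row group that is opened is still read in full for that column, so the saving comes entirely from keeping row groups small when the file is written.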

answered Sep 23 '22 02:09

Martina
