Index in Parquet

I would like to be able to run a fast range query against a Parquet table. The amount of data to be returned is very small compared to the total size, but because a full column scan has to be performed, this is too slow for my use case.

Using an index would solve this problem, and I read that indexes were to be added in Parquet 2.0. However, I cannot find any further information on this, so I am guessing they were not. I do not think there are any fundamental obstacles to adding (multi-column) indexes, provided the data is sorted, which in my case it is.

My question is: when will indexes be added to Parquet, and what would be the high level design for doing so? I think I would already be happy with an index that points out the correct partition.

Kind regards,

Sjoerd.

asked Nov 13 '14 by Sjoerd van Hagen


2 Answers

Update Dec/2018:

Parquet Format version 2.5 added column indexes.

https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250

See https://issues.apache.org/jira/browse/PARQUET-1201 for the list of sub-tasks for this new feature.

Note that this feature has only just been merged into the Parquet format itself; it will take some time for the various backends (Spark, Hive, Impala, etc.) to start supporting it.

This new feature is called Column Indexes. Parquet adds two new structures to the file layout: the ColumnIndex and the OffsetIndex.

Below is a more detailed technical explanation of what this solves and how.

Problem Statement

In the current format, Statistics are stored for ColumnChunks in ColumnMetaData and for individual pages inside DataPageHeader structs. When reading pages, a reader has to process the page header in order to determine whether the page can be skipped based on the statistics. This means the reader has to access all pages in a column, thus likely reading most of the column data from disk.
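
To see the statistics that already exist in practice, here is a minimal sketch using pyarrow (pyarrow, the file name, and the column are my own illustration, not something from the design document):

    # Inspect the per-row-group statistics Parquet already stores.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": list(range(100_000))})
    pq.write_table(table, "example.parquet", row_group_size=10_000)

    pf = pq.ParquetFile("example.parquet")
    for rg in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(rg).column(0).statistics  # column 0 = "id"
        print(f"row group {rg}: min={stats.min}, max={stats.max}")
    # A reader can skip a whole row group when the predicate range does not
    # overlap [min, max]; before column indexes, getting the same effect at
    # page granularity required reading every page header first.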

Goals

Make both range scans and point lookups I/O efficient by allowing direct access to pages based on their min and max values. In particular:

  1. A single-row lookup in a rowgroup based on the sort column of that rowgroup will only read one data page per retrieved column. Range scans on the sort column will only need to read the exact data pages that contain relevant data.
  2. Make other selective scans I/O efficient: if we have a very selective predicate on a non-sorting column, for the other retrieved columns we should only need to access data pages that contain matching rows.
  3. No additional decoding effort for scans without selective predicates, e.g., full-row group scans. If a reader determines that it does not need to read the index data, it does not incur any overhead.
  4. Index pages for sorted columns use minimal storage by storing only the boundary elements between pages.

Non-Goals

Support for the equivalent of secondary indices, i.e., an index structure sorted on key values over non-sorted data.

Technical Approach

We add two new per-column structures to the row group metadata:

  1. ColumnIndex: allows navigation to the pages of a column based on column values, and is used to locate data pages that contain matching values for a scan predicate.
  2. OffsetIndex: allows navigation by row index, and is used to retrieve the values of rows identified as matches via the ColumnIndex. Once rows of a column are skipped, the corresponding rows in the other columns have to be skipped too; hence the OffsetIndexes for each column in a RowGroup are stored together.

The new index structures are stored separately from the RowGroup, near the footer, so that a reader does not have to pay the I/O and deserialization cost of reading them if it is not doing selective scans. The index structures' location and length are stored in ColumnChunk and RowGroup.
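
To make the two structures concrete, here is a small hand-rolled simulation in plain Python (this is not the actual Parquet API; all values are made up) of how a reader picks pages for a range predicate:

    # Simulated ColumnIndex: per-page min/max values of a sorted column.
    column_index = [
        {"min": 0,  "max": 19},   # page 0
        {"min": 20, "max": 39},   # page 1
        {"min": 40, "max": 59},   # page 2
        {"min": 60, "max": 79},   # page 3
    ]
    # Simulated OffsetIndex: per-page file offset and first row index.
    offset_index = [
        {"offset": 4096,  "first_row": 0},
        {"offset": 8192,  "first_row": 1000},
        {"offset": 12288, "first_row": 2000},
        {"offset": 16384, "first_row": 3000},
    ]

    lo, hi = 40, 55
    # Step 1: use the ColumnIndex to keep only pages whose [min, max]
    # overlaps the predicate; the rest are never read from disk.
    matching = [i for i, page in enumerate(column_index)
                if page["max"] >= lo and page["min"] <= hi]
    # Step 2: use the OffsetIndex to locate those pages, and to find the
    # matching row ranges in the row group's other columns.
    for i in matching:
        print(f"read page {i} at offset {offset_index[i]['offset']}, "
              f"rows from {offset_index[i]['first_row']}")
    # Only page 2 is touched for this query.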

Cloudera's Impala team has run some tests on this new feature (not yet available as part of the Apache Impala core product). Here are their performance improvements:

[Chart: HDFS I/O in bytes]

[Chart: Scanner CPU time in ms]

As you can see, some of the queries saw a huge improvement in both CPU time and the amount of data they had to read from disk.

Original answer from 2016:

struct IndexPageHeader {
  /** TODO: **/
}

https://github.com/apache/parquet-format/blob/6e5b78d6d23b9730e19b78dceb9aac6166d528b8/src/main/thrift/parquet.thrift#L505

The Index Page Header is not implemented yet, as you can see in the Parquet format source code above. I don't see it even in Parquet 2.0 currently.

But yes - see Ryan Blue's excellent answer below: Parquet does have pseudo-indexing capabilities (min/max statistics and bloom filters).

If you're interested in more details, I recommend this great document on how Parquet bloom filters and predicate push-down work, https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide, and a more technical, implementation-specific document, https://homepages.cwi.nl/~boncz/msc/2018-BoudewijnBraams.pdf
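
For completeness, here is what predicate push-down looks like from pyarrow (my own example; the file and column names are hypothetical). Row groups whose min/max statistics cannot match the filter are skipped without their data pages being read:

    import pyarrow.parquet as pq

    # Only row groups whose statistics overlap [42000, 42100) are scanned.
    table = pq.read_table(
        "example.parquet",
        filters=[("id", ">=", 42_000), ("id", "<", 42_100)],
    )
    print(table.num_rows)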

answered Oct 06 '22 by Tagar

Parquet currently keeps min/max statistics for each data page. A data page is a group of ~1MB of values (after encoding) for a single column; multiple pages are what make up Parquet's column chunks.

Those min/max values are used to filter both column chunks and the pages that make up a chunk. So you should be able to improve your query time by sorting records by the columns you want to filter on, then writing the data into Parquet. That way, you get the most out of the stats filtering.

You can also get more granular filtering with this technique by decreasing the page and row group sizes, though you're then trading off encoding efficiency against I/O efficiency.
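
As a rough sketch of that advice using pyarrow (the column names and sizes here are illustrative, and the tiny row group size is only to make the effect visible):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"ts": [5, 1, 9, 3, 7, 2, 8, 4, 6, 0],
                      "payload": list("abcdefghij")})
    # Sort on the filter column first so the min/max ranges per page and
    # per row group are narrow and non-overlapping.
    sorted_table = table.sort_by([("ts", "ascending")])

    pq.write_table(
        sorted_table,
        "sorted.parquet",
        row_group_size=2,          # unrealistically small, for illustration
        data_page_size=64 * 1024,  # bytes per encoded data page
    )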

answered Oct 06 '22 by blue