Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes. Vertica is co-located on cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.

Right now what I did is using HDFS Connector, basically create an external table in Vertica, then link it to HDFS:

CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM "hdfs://hostname/...../data" PARQUET;

Since the data size is big. This method will not achieve good performance.

I have done some research, Vertica Hadoop Integration

I have tried HCatalog but there's some configuration error on my Hadoop so that's not working.

My use case is to not change data format on HDFS(Parquet), while query it using Vertica. Any ideas on how to do that?

EDIT: The only reason Vertica got slow performance is because it cant use Parquet's partitions. With higher version Vertica(8+), it can utlize hive's metadata now. So no HCatalog needed.

like image 234
Jesse Avatar asked May 05 '17 15:05

Jesse


1 Answers

Terminology note: you're not using the HDFS Connector. Which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)

Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.

You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).

Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:

CREATE EXTERNAL TABLE t (id int, name varchar(50), 
                        created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*' 
PARQUET(hive_partition_cols='created,region');
like image 155
Monica Cellio Avatar answered Nov 15 '22 05:11

Monica Cellio