I'm using the Cloudera QuickStart VM 5.1.0-1.
I'm trying to load my 3 GB CSV into Hadoop via Hue. What I have tried so far:

- Load the CSV into HDFS, specifically into a folder called datasets located at /user/hive/datasets (see the shell sketch below)
- Use the Metastore Manager to load it into the default database
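For the HDFS upload step, the shell equivalent would be something like this (mydata.csv is a placeholder for the actual file name):

# create the target directory and copy the CSV from the local filesystem
hdfs dfs -mkdir -p /user/hive/datasets
hdfs dfs -put mydata.csv /user/hive/datasets/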
Everything appears to work fine, in the sense that the table is created with the right columns. The problem shows up when I query the table in Impala with:
show table stats new_table
I see that the size is only 64 MB, while the CSV is actually about 3 GB. Likewise, a count(*) in Impala returns only 70,000 rows against the actual 7 million.
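For completeness, that row count comes from a plain count query against the same table:

SELECT COUNT(*) FROM new_table;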
Any help would be deeply appreciated.
Thanks in advance.
So... if your download_row_limit attribute is set to 100000 in hue.ini, the result of your query will be truncated to 100,000 rows, and that is how many lines you can download. You can change the attribute in hue.ini.
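As a sketch, the change in hue.ini would look roughly like this (in the Hue versions I have seen, the setting sits under the [beeswax] section; check your version's config for the exact location, and the value below is only an example):

[beeswax]
  # Maximum number of rows a user can download from a query result.
  # Raise this if large results are being truncated on download.
  download_row_limit=10000000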
Hadoop User Experience (Hue) is an open-source, web-based interface that makes Apache Hadoop easier to use. It includes a job designer for MapReduce, a file browser for HDFS, an Oozie application for building workflows and coordinators, an Impala UI, a shell, a Hive UI, and a set of Hadoop APIs.
I've had the exact same problem. It is an issue with how Hue imports the file through the web interface, which has a 64 MB limit.
I've been importing large datasets by using the Hive CLI with the -f flag pointed at a text file containing the DDL.
Example:
hive -f beer_data_loader.hql
beer_data_loader.hql:
-- Create the database if it does not already exist
CREATE DATABASE IF NOT EXISTS beer
COMMENT "Beer Advocate Database";

-- Final table, stored as Parquet (the field delimiter is irrelevant
-- for Parquet storage, but is kept here as written)
CREATE TABLE IF NOT EXISTS beer.beeradvocate_raw (
    beer_name          STRING,
    beer_ID            BIGINT,
    beer_brewerID      INT,
    beer_ABV           FLOAT,
    beer_style         STRING,
    review_appearance  FLOAT,
    review_aroma       FLOAT,
    review_palate      FLOAT,
    review_taste       FLOAT,
    review_overall     FLOAT,
    review_time        BIGINT,
    review_profileName STRING,
    review_text        STRING
)
COMMENT "Beer Advocate Data Raw"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS PARQUET;

-- Staging table: an external table pointing at the pipe-delimited
-- data already uploaded to HDFS (LOCATION is an HDFS path)
CREATE EXTERNAL TABLE IF NOT EXISTS beer.beeradvocate_temp (
    beer_name          STRING,
    beer_ID            BIGINT,
    beer_brewerID      INT,
    beer_ABV           FLOAT,
    beer_style         STRING,
    review_appearance  FLOAT,
    review_aroma       FLOAT,
    review_palate      FLOAT,
    review_taste       FLOAT,
    review_overall     FLOAT,
    review_time        BIGINT,
    review_profileName STRING,
    review_text        STRING
)
COMMENT "Beer Advocate External Loading Table"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION '/user/name/beeradvocate.data';

-- Copy everything from the staging table into the Parquet table,
-- then drop the staging table
INSERT OVERWRITE TABLE beer.beeradvocate_raw SELECT * FROM beer.beeradvocate_temp;
DROP TABLE beer.beeradvocate_temp;
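One caveat when going back to Impala afterwards: tables created through the Hive CLI are not visible to Impala until its metadata cache is refreshed, so the result can be checked with something like:

-- in impala-shell, after the Hive script has run
INVALIDATE METADATA beer.beeradvocate_raw;
SHOW TABLE STATS beer.beeradvocate_raw;
SELECT COUNT(*) FROM beer.beeradvocate_raw;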
This seems to be a bug in Hue, but I found a workaround. The file gets truncated if you select the "Import data from file" checkbox when you create the table. Leave that unchecked so an empty table is created, then select the newly created table in the Metastore Manager and use the "Import Data" option in the Actions menu to populate it. This should load all of the rows.