
Loading a large CSV into Hadoop via Hue only stores a 64 MB block

I'm using the Cloudera QuickStart VM 5.1.0-1.

I'm trying to load my 3 GB CSV into Hadoop via Hue. What I have tried so far:
- Load the CSV into HDFS, specifically into a folder called datasets located at /user/hive/datasets
- Use the Metastore Manager to load it into the default db

Everything appears to work, in the sense that the table is created with the right columns. The main problem shows up when I query the table with Impala using the following statement:

show table stats new_table

The reported size is only 64 MB instead of the actual size of the CSV, which should be 3 GB.

Also, if I run a count(*) via Impala, the number of rows is only 70,000 against the actual 7 million.

Any help would be deeply appreciated.

Thanks in advance.

bobo32 asked Oct 16 '14 21:10



2 Answers

I've had the exact same problem. This is an issue with how Hue imports the file via the web interface, which has a 64 MB limit.

I've been importing large datasets by running the Hive CLI with the -f flag against a text file containing the HiveQL statements.

Example:

hive -f beer_data_loader.hql



beer_data_loader.hql:

-- Create the target database if it does not exist yet.
CREATE DATABASE IF NOT EXISTS beer
  COMMENT "Beer Advocate Database";

-- Final table, stored as Parquet.
CREATE TABLE IF NOT EXISTS beer.beeradvocate_raw (
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
)
  COMMENT "Beer Advocate Data Raw"
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
  STORED AS parquet;

-- External staging table over the pipe-delimited text data already in HDFS.
-- LOCATION must point to an HDFS directory that contains the data file(s).
CREATE EXTERNAL TABLE IF NOT EXISTS beer.beeradvocate_temp (
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
)
  COMMENT "Beer Advocate External Loading Table"
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
  LOCATION '/user/name/beeradvocate.data';

-- Copy the staged rows into the Parquet table, then drop the staging table.
-- Dropping an external table removes only its metadata and keeps the files under LOCATION.
INSERT OVERWRITE TABLE beer.beeradvocate_raw SELECT * FROM beer.beeradvocate_temp;
DROP TABLE beer.beeradvocate_temp;
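
For completeness, here is a rough sketch of the steps around that script: getting the data file into the HDFS directory used as LOCATION, running the script, and then checking the row count from Impala (which normally needs INVALIDATE METADATA before it sees tables created through Hive). The function names, the *_CMD variables and the local file name beeradvocate.txt are placeholders of mine, not anything from the original setup, so adjust them to your environment.

```shell
# Command names are kept in variables so they can be swapped out (for example in a dry run).
HDFS_CMD="${HDFS_CMD:-hdfs}"
HIVE_CMD="${HIVE_CMD:-hive}"
IMPALA_CMD="${IMPALA_CMD:-impala-shell}"

# Upload a local data file into the HDFS directory used as the external table LOCATION.
stage_data() {
  local local_file="$1" hdfs_dir="$2"
  $HDFS_CMD dfs -mkdir -p "$hdfs_dir" &&
  $HDFS_CMD dfs -put "$local_file" "$hdfs_dir/"
}

# Run the HiveQL script, make Impala reload its metadata, then count the rows.
load_and_check() {
  local hql_file="$1" table="$2"
  $HIVE_CMD -f "$hql_file" &&
  $IMPALA_CMD -q "INVALIDATE METADATA" &&
  $IMPALA_CMD -q "SELECT COUNT(*) FROM ${table}"
}
```

Usage would look something like `stage_data beeradvocate.txt /user/name/beeradvocate.data && load_and_check beer_data_loader.hql beer.beeradvocate_raw`. If the count matches the number of lines in the source file, the whole file made it in.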
JamCon answered Oct 01 '22 18:10


This seems to be a bug in Hue, but I found a workaround. The file gets truncated if you select the "Import data from file" checkbox when you create the table. Leave that unchecked to create an empty table. Then select the newly created table in the Metastore Manager and use the "Import Data" option in the Actions menu to populate it. This should load all the rows.
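
If you prefer the command line, the second step (loading a file into an already created, empty table) can probably be done with Hive's LOAD DATA INPATH statement. This is only a sketch: I am assuming Hue's "Import Data" action does roughly the same thing, and the function name, the HIVE_CMD variable and the file name data.csv are placeholders of mine.

```shell
# Command name kept in a variable so it can be swapped out (for example in a dry run).
HIVE_CMD="${HIVE_CMD:-hive}"

# Load a file that is already in HDFS into an existing table.
# Note: LOAD DATA INPATH moves the file into the table directory rather than copying it.
load_into_table() {
  local hdfs_file="$1" table="$2"
  $HIVE_CMD -e "LOAD DATA INPATH '${hdfs_file}' INTO TABLE ${table}"
}
```

For example, `load_into_table /user/hive/datasets/data.csv default.new_table`. Since the data changes outside Impala, running `REFRESH new_table` in Impala afterwards should make the new rows visible there.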

Peter G answered Oct 01 '22 19:10