Loading a large CSV into Hadoop via Hue stores only a 64 MB block

I'm using the Cloudera QuickStart VM 5.1.0-1.

I'm trying to load my 3 GB CSV into Hadoop via Hue. What I have tried so far:

- Upload the CSV to HDFS, into a folder called datasets located at /user/hive/datasets
- Use the Metastore Manager to load it into the default database

The import appears to work, in that the table is created with the right columns. The problem shows up when I query the table from Impala with the following statement:

show table stats new_table

The reported size is only 64 MB, while the CSV is actually 3 GB.

Also, a count(*) via Impala returns only 70,000 rows, against the actual 7 million.

Any help would be deeply appreciated.

Thanks in advance.

asked Oct 16 '14 21:10

bobo32

2 Answers

I've had exactly the same problem. It comes from the way Hue imports the file through the web interface, which has a 64 MB limit.

I've been importing large datasets with the Hive CLI instead, passing the -f flag a text file that contains the DDL statements.

Example:

hive -f beer_data_loader.hql

beer_data_loader.hql:

-- Create the target database if it does not already exist.
CREATE DATABASE IF NOT EXISTS beer
  COMMENT "Beer Advocate Database";

-- Final table, stored as Parquet.
CREATE TABLE IF NOT EXISTS beer.beeradvocate_raw (
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
)
  COMMENT "Beer Advocate Data Raw"
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
  STORED AS parquet;

-- External staging table over the pipe-delimited source data.
-- LOCATION must be an HDFS directory that contains the data file(s).
CREATE EXTERNAL TABLE IF NOT EXISTS beer.beeradvocate_temp (
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
)
  COMMENT "Beer Advocate External Loading Table"
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
  LOCATION '/user/name/beeradvocate.data';

-- Copy the rows into the Parquet table, then drop the staging table.
-- Dropping an external table removes only its metadata; the files stay in HDFS.
INSERT OVERWRITE TABLE beer.beeradvocate_raw SELECT * FROM beer.beeradvocate_temp;
DROP TABLE beer.beeradvocate_temp;
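
The script above assumes the source data is already in HDFS under the external table's LOCATION. A minimal sketch of that staging step from the command line, which also avoids uploading through Hue: the local file name beeradvocate.data is an assumption, and the run helper and DRY_RUN variable are invented for this sketch so the commands can be previewed before they are executed.

```shell
#!/bin/sh
# Stage the pipe-delimited file in HDFS, then run the loader script.
# HDFS_DIR mirrors the LOCATION used in beer_data_loader.hql.
# LOCAL_FILE is a hypothetical local file name, adjust it to your data.
HDFS_DIR=/user/name/beeradvocate.data
LOCAL_FILE=beeradvocate.data

run() {
  # Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

stage_and_load() {
  # Create the directory, copy the file into it, then execute the HiveQL script.
  run hdfs dfs -mkdir -p "$HDFS_DIR" &&
  run hdfs dfs -put "$LOCAL_FILE" "$HDFS_DIR/" &&
  run hive -f beer_data_loader.hql
}
```

Setting DRY_RUN=1 before calling stage_and_load prints the three commands without touching the cluster; calling it without DRY_RUN runs them in order and stops at the first failure.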

answered Oct 01 '22 18:10

JamCon

Seems like a bug in Hue. Found a workaround. The file gets truncated if you select the "Import data from file" checkbox when you create the table. Leave that unchecked to create an empty table. Then select the newly created table in the Metastore Manager and use the "Import Data" option in the Actions menu to populate it. This should populate all the rows.
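
For anyone who prefers the command line, the same two-step idea (create an empty table first, then load the file) can be sketched with Hive's LOAD DATA statement. This is an assumption about an equivalent route, not a description of what Hue runs internally. The file name my_file.csv is hypothetical; the directory and table name are taken from the question.

```shell
#!/bin/sh
# CSV_PATH uses a hypothetical file name inside the directory from the question.
CSV_PATH=/user/hive/datasets/my_file.csv
TABLE=new_table

load_statement() {
  # Build the HiveQL that moves a file already in HDFS into the table's directory.
  printf "LOAD DATA INPATH '%s' INTO TABLE %s;\n" "$1" "$2"
}

load_and_count() {
  # Load through Hive, then make Impala pick up the change and count the rows.
  hive -e "$(load_statement "$CSV_PATH" "$TABLE")" &&
  impala-shell -q "INVALIDATE METADATA $TABLE" &&
  impala-shell -q "SELECT COUNT(*) FROM $TABLE"
}
```

Calling load_and_count on the VM should report the full row count if the whole file was loaded. Note that LOAD DATA INPATH moves the file out of its original HDFS location rather than copying it.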

answered Oct 01 '22 19:10

Peter G

Tags: hadoop, hive, cloudera, impala, hue
