I'm trying to create a table in Impala from a CSV that I've uploaded into an HDFS directory. The CSV contains values with commas enclosed inside quotes.
Example:
1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.0.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.128.0/18,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.192.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
The Impala documentation says that this can be solved with an ESCAPED BY clause. Here's my current code:
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
I've also tried using the ESCAPED BY '"' clause. In both cases, Impala treats the comma within the quotes as a delimiter, splitting the value into two columns.
Any ideas on how to fix the code so that this won't happen?
EDIT (6/9/2015)
So, I've gone through the following variations, based on suggestions from @K S Nidhin and @JTUP. However, each variation returned the same result as the queries written without the SERDEPROPERTIES clause, with the commas still causing values to appear in the wrong columns:
Variation 1
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES ( "quoteChar" = "'", "escapeChar" = "\\" )
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variation 2
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES ( 'quoteChar' = '"', 'escapeChar' = '\\' )
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variation 3
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Any other ideas, or further variations of the SERDEPROPERTIES clause to try?
EDIT (6/10/2016)
I was able to get a different variation of the query, using the SERDE and SERDEPROPERTIES clauses, to work in Hive (based on code provided in the Hive documentation), with the proper table being created:
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4(network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Since the SERDE clause isn't available in Impala, this solution wouldn't work there. I'm fine creating the tables in Hive, but it's still irksome that I can't find a viable solution in Impala.
What I did was first convert the delimiter from a comma to another character, e.g. a pipe ('|'). You can use csvformat (part of csvkit) on Linux:
csvformat -D \| input_filename.csv > input_filename-pipe.csv
Afterwards, set the delimiter to '|' in the Impala query:
TERMINATED BY '|'
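If csvkit isn't installed, the same conversion can be sketched with Python's standard csv module. This is only a rough equivalent, not what csvformat does internally: the function name csv_to_pipe is made up for illustration, and the sketch assumes no field contains a pipe or a line break (it raises an error rather than quoting such fields, since Impala's delimited text format would not understand the quotes anyway).

```python
import csv
import io


def csv_to_pipe(src, dst, delimiter="|"):
    """Rewrite quoted, comma-separated rows as unquoted, pipe-delimited rows.

    src and dst are text file objects. csv_to_pipe is a hypothetical helper
    name used only for this example.
    """
    for row in csv.reader(src):  # csv.reader honours "..." quoting
        for field in row:
            # Assumption: the data never contains the new delimiter or an
            # embedded line break; fail loudly if that assumption is wrong.
            if delimiter in field or "\n" in field or "\r" in field:
                raise ValueError(f"field cannot be written unquoted: {field!r}")
        dst.write(delimiter.join(row) + "\n")


# Small demonstration using one row from the question.
sample = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."\n'
converted = io.StringIO()
csv_to_pipe(io.StringIO(sample), converted)
print(converted.getvalue(), end="")
# prints: 1.66.96.0/19|NTT Docomo,INC.|Ntt Docomo|9605|NTT DOCOMO, INC.
```

For real files, open the input with open("input_filename.csv", newline="") and the output with open("input_filename-pipe.csv", "w", newline=""), then pass both to the function. The commas inside the former quoted values survive as ordinary characters, because the comma is no longer the field delimiter.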