I am trying to create a table from a CSV file saved in HDFS. The problem is that the CSV contains line breaks inside quoted fields. Example of a record in the CSV:
ID,PR_ID,SUMMARY
2063,1184,"This is problem field because consists line break
This is not new record but it is part of text of third column
"
I created the Hive table:
CREATE TEMPORARY EXTERNAL TABLE hive_database.hive_table
(
ID STRING,
PR_ID STRING,
SUMMARY STRING
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\""
)
stored as textfile
LOCATION '/path/to/hdfs/dir/csv'
tblproperties('skip.header.line.count'='1');
Then I try to count the rows (the correct result should be 1):
Select count(*) from hive_database.hive_table;
But the result is 4, which is incorrect. Do you have any idea how to solve this? Thanks all.
There is currently no way to handle multiline CSV files in Hive directly. However, there are some workarounds:
Produce a CSV with \n or \r\n replaced by your own newline marker, such as <br>. You will then be able to load it into Hive. Afterwards, transform the resulting text by replacing the marker with the original newline.
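A minimal sketch of that pre-processing step, assuming small local files named input.csv and flattened.csv and a <br> marker (all placeholder names); once loaded, the column can be restored in Hive with regexp_replace(summary, '<br>', '\n'):
import scala.io.Source
import java.io.PrintWriter

object FlattenCsv {
  def main(args: Array[String]): Unit = {
    // Read the whole file; fine for files small enough to pre-process locally.
    val raw = Source.fromFile("input.csv").mkString
    val sb = new StringBuilder
    var insideQuotes = false
    for (c <- raw) c match {
      case '"'                  => insideQuotes = !insideQuotes; sb.append(c)
      case '\n' if insideQuotes => sb.append("<br>") // embedded line break -> marker
      case '\r' if insideQuotes => ()                // drop the CR of a CRLF inside quotes
      case other                => sb.append(other)
    }
    val out = new PrintWriter("flattened.csv")
    out.write(sb.toString)
    out.close()
  }
}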
Use Spark: it has a multiline CSV reader. This works out of the box, although the CSV is then not read in a distributed way.
val df = spark.read
  .option("multiLine", true)   // allow quoted fields to contain line breaks (Spark 2.2+)
  .option("header", true)
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("test.csv")

df.write.format("orc")
  .saveAsTable("myschma.myTable")
Use another format such as Parquet, Avro, ORC, or SequenceFile instead of CSV. For example, you could use Sqoop to produce the files from a JDBC database, or you could write your own program in Java or Python.
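As one concrete illustration of that last option, here is a hedged Spark sketch that pulls a table over JDBC and stores it as Parquet, so Hive never has to parse CSV at all; the connection URL, credentials, source table and target table names are placeholders, and the matching JDBC driver jar must be on the Spark classpath:
// Read the source table over JDBC into a DataFrame.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "public.source_table")
  .option("user", "dbuser")
  .option("password", "dbpassword")
  .load()

// Persist it as a Parquet-backed table that Hive can query directly.
jdbcDf.write
  .format("parquet")
  .saveAsTable("hive_database.hive_table_parquet")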