
Hive UTF-8 encoding: number of characters supported?

Tags:

utf-8

hadoop

hive

The data I want to insert into a Hive table contains Latin words and is in UTF-8 encoded format, but Hive still does not display it properly.

Actual data: [image: Actual Data]

Data as inserted in Hive: [image: Hive Data]

I also changed the encoding of the table to UTF-8, but the issue remains. The Hive DDL and commands are below:

CREATE TABLE IF NOT EXISTS test6
(
CONTACT_RECORD_ID    string,
ACCOUNT    string,
CUST    string,
NUMBER    string,
NUMBER1    string,
NUMBER2    string,
NUMBER3    string,
NUMBER4    string,
NUMBER5    string,
NUMBER6    string,
NUMBER7    string,
LIST    string
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '|';
ALTER TABLE test6 SET serdeproperties ('serialization.encoding'='UTF-8');

Does Hive support only the first 128 characters of UTF-8? Any suggestions are welcome.

Chetan Pulate asked Mar 29 '16 11:03





2 Answers

This may not be an ideal solution, but it works. Hive somehow doesn't seem to treat the data as UTF-8. Try creating the table with the following parameters:

CREATE TABLE testjoins.yt_sample_mapping_1(
   `col1` string,
   `col2` string,
   `col3` string)
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
   WITH SERDEPROPERTIES ( "separatorChar" = ",", 
    "quoteChar" = "\"", 
    "escapeChar" = "\\", 
    "serialization.encoding"='ISO-8859-1') 
    TBLPROPERTIES ( 'store.charset'='ISO-8859-1', 
    'retrieve.charset'='ISO-8859-1');
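
A possible reason this helps, sketched in Python rather than Hive: if the file is really ISO-8859-1 and not UTF-8, reading its bytes as UTF-8 garbles every accented character while plain ASCII survives, which can look as if only the first 128 characters are supported. The sample string below is a made-up example, not the asker's data.

```python
# Hypothetical sample value, chosen only for illustration.
text = "São Paulo"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

# UTF-8 can represent the character; it just needs two bytes for "ã".
print(len(latin1_bytes), len(utf8_bytes))  # 9 10

# ISO-8859-1 bytes read as UTF-8: the lone 0xE3 byte is invalid UTF-8
# and is replaced by U+FFFD, the replacement character.
garbled = latin1_bytes.decode("utf-8", errors="replace")

# The same bytes read with the matching encoding come back intact.
restored = latin1_bytes.decode("iso-8859-1")

print(repr(garbled))
print(restored)
```

So UTF-8 itself is not limited to 128 characters; the mismatch between the file's real encoding and the declared one would be enough to produce this kind of output.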
BalaramRaju answered Sep 23 '22 16:09



For me, adding the following line worked.

TBLPROPERTIES('serialization.encoding'='windows-1252')

Example code:

CREATE EXTERNAL TABLE IF NOT EXISTS test.tbl
(
    name string,
    gender string,
    age string,
    address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION 'adl://<Data-Lake-Store>.azuredatalakestore.net/<Folder-Name>/'
TBLPROPERTIES('serialization.encoding'='windows-1252');
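
To decide which value to put in 'serialization.encoding', it may help to check the source file's bytes before loading it. Here is a rough Python sketch; the function name guess_encoding is made up for this example, and the fallback order is an assumption, not something Hive does itself.

```python
def guess_encoding(raw: bytes) -> str:
    """Return a plausible encoding name for raw bytes (heuristic only)."""
    try:
        # Valid UTF-8 (including pure ASCII) decodes without error.
        raw.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    try:
        # Python's cp1252 codec rejects the few bytes the code page leaves undefined.
        raw.decode("cp1252")
        return "windows-1252"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value, so it never fails to decode.
        return "ISO-8859-1"


# Hypothetical byte samples, not taken from the question's data.
print(guess_encoding("café".encode("utf-8")))
print(guess_encoding(b"caf\xe9"))
```

In practice you would pass in a sample read from the data file in binary mode. A result of UTF-8 suggests the problem is elsewhere; anything else suggests the file was never UTF-8 to begin with.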
Tokci answered Sep 21 '22 16:09
