I used the following code to export an HBase table and save the output to HDFS:
hbase org.apache.hadoop.hbase.mapreduce.Export \
MyHbaseTable1 hdfs://nameservice1/user/ken/data/exportTable1
The output files are binary. If I use pyspark to read the folder:
test1 = sc.textFile('hdfs://nameservice1/user/ken/data/exportTable1')
test1.take(5)
It shows:
u'SEQ\x061org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result\x00\x00\x00\x00\x00\x00\ufffd-\x10A\ufffd~lUE\u025bt\ufffd\ufffd\ufffd&\x00\x00\x04\ufffd\x00\x00\x00'
u'\x00\x00\x00\x067-2010\ufffd\t'
u'|'
u'\x067-2010\x12\x01r\x1a\x08clo-0101 \ufffd\ufffd\ufffd*(\x042\\6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0'
u'u'
I can tell that some of the data is there (the row key 7-2010, the column family r, the column qualifier clo-0101, and the pipe-separated values), but I don't know where the 3rd and 5th lines came from. It seems HBase Export follows its own rules when generating the file, and if I decode it my own way the data might get corrupted.
Question:
How can I convert this file back to a readable format? For example:
7-2010, r, clo-0101, 6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0
I have tried:
test1 = sc.sequenceFile('/user/youyang/data/hbaseSnapshot1/', keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
test1.take(5)
and
test1 = sc.sequenceFile('hdfs://nameservice1/user/ken/data/exportTable1'
, keyClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
, valueClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
, keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
, valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
, minSplits=None
, batchSize=100)
No luck; the code did not work and produced this error:
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
Any suggestions? Thank you!
I had this problem recently myself. I solved it by moving away from sc.sequenceFile and using sc.newAPIHadoopFile instead (or just hadoopFile if you're on the old API). The Spark SequenceFile reader appears to handle only keys and values that are Writable types (this is stated in the docs).
If you use newAPIHadoopFile, Hadoop's deserialization logic is used, and you can specify which serialization types you need in the config dictionary you pass it:
hadoop_conf = {"io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization"}
hbase_rdd = sc.newAPIHadoopFile(
<input_path>,
'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
valueClass='org.apache.hadoop.hbase.client.Result',
keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
conf=hadoop_conf)
Note that the value in hadoop_conf for "io.serializations" is a comma-separated list that includes "org.apache.hadoop.hbase.mapreduce.ResultSerialization". That is the key configuration you need in order to deserialize the Result. The WritableSerialization is also needed so that the ImmutableBytesWritable key can be deserialized.
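Once the RDD is created (I've assigned it to hbase_rdd above purely for illustration), a quick sanity check might look like this minimal sketch; the exact layout of the converted strings depends on the converter classes in your Spark examples jar:
# Peek at a few records; each element should be a (key, value) pair of strings
# produced by the two converters specified above.
for key, value in hbase_rdd.take(5):
    print(key, value)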
You can also use sc.newAPIHadoopRDD, but then you also need to set a value for "mapreduce.input.fileinputformat.inputdir" in the config dictionary.
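For example, a rough equivalent using newAPIHadoopRDD might look like the sketch below; I have reused the export path from the question as the input directory, so adjust it to your own path:
hadoop_conf = {
    "io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization",
    # Tell the input format where to find the exported SequenceFiles.
    "mapreduce.input.fileinputformat.inputdir": "hdfs://nameservice1/user/ken/data/exportTable1"
}
hbase_rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
    keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    valueClass='org.apache.hadoop.hbase.client.Result',
    keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
    valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
    conf=hadoop_conf)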