
Read a bytes column in Spark

I have a data set which contains an ID field that is in an unknown (and not friendly) encoding. I can read the single column using plain python and verify that the values are distinct and consistent across multiple data sets (i.e. it can be used as a primary key for joining).

When loading the file using spark.read.csv, it seems that Spark converts the column to UTF-8. However, some of the multibyte sequences are converted to the Unicode character U+FFFD REPLACEMENT CHARACTER (EF BF BD in hex).
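(A quick sanity check in plain Python confirms that EF BF BD is the UTF-8 encoding of U+FFFD:)

print('\ufffd'.encode('utf-8').hex())  # efbfbd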

Is there a way to force Spark to read the column as bytes and not as a string?

Here is some code that can be used to recreate my issue (let column a be the ID field):

Create a File with Sample Data

data = [
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0', b'1', b'a'),
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb1', b'2', b'b'),
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb2', b'3', b'c')
]

# Write everything as raw bytes so the ID column is not re-encoded
with open('sample.csv', 'wb') as f:
    header = [b"a", b"b", b"c"]
    f.write(b",".join(header) + b"\n")
    for row in data:
        f.write(b",".join(row) + b"\n")

Read using Pandas

import pandas as pd
# encode('hex') is Python 2 only; in Python 3, read with a lossless
# single-byte encoding and hex-encode the round-tripped bytes instead
df = pd.read_csv(
    "sample.csv",
    encoding="latin-1",
    converters={"a": lambda x: x.encode("latin-1").hex()}
)
print(df)
#                  a  b  c
#0  baed858e91d4c7b0  1  a
#1  baed858e91d4c7b1  2  b
#2  baed858e91d4c7b2  3  c

Try reading the same file using Spark

spark_df = spark.read.csv("sample.csv", header=True)
spark_df.show()
#+-----+---+---+
#|a    |b  |c  |
#+-----+---+---+
#|�텎��ǰ|1  |a  |
#|�텎��DZ|2  |b  |
#|�텎��Dz|3  |c  |
#+-----+---+---+

Yikes! OK, so how about converting to hex?

import pyspark.sql.functions as f
spark_df.withColumn("a", f.hex("a")).show(truncate=False)
#+----------------------------+---+---+
#|a                           |b  |c  |
#+----------------------------+---+---+
#|EFBFBDED858EEFBFBDEFBFBDC7B0|1  |a  |
#|EFBFBDED858EEFBFBDEFBFBDC7B1|2  |b  |
#|EFBFBDED858EEFBFBDEFBFBDC7B2|3  |c  |
#+----------------------------+---+---+

(In this example the values happen to be distinct, but that's not true in my larger file.)

As you can see, the values are close, but some of the bytes have been replaced by EFBFBD.

Is there any way to read the file in Spark (maybe using an RDD?) so that my output looks like the pandas version:

#+----------------+---+---+
#|a               |b  |c  |
#+----------------+---+---+
#|baed858e91d4c7b0|1  |a  |
#|baed858e91d4c7b1|2  |b  |
#|baed858e91d4c7b2|3  |c  |
#+----------------+---+---+

I've tried casting to byte and specifying the schema so that this column is ByteType(), but that didn't work.
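Roughly what I tried (a sketch from memory; note that ByteType() is a single signed 8-bit integer, not a byte array, so it probably can't represent these IDs in the first place):

from pyspark.sql.types import StructType, StructField, ByteType, StringType

# Reconstruction of the failed attempt: the CSV reader has already decoded
# the text as UTF-8 by the time the schema is applied
schema = StructType([
    StructField("a", ByteType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True)
])
spark_df = spark.read.csv("sample.csv", header=True, schema=schema)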

Edit

I am using Spark v2.1.

asked Aug 21 '18 by pault



1 Answer

The problem is rooted in the fact that delimited files are poorly suited to binary data.

If there is a known, consistent encoding for the text, use the charset option. See https://github.com/databricks/spark-csv#features (I don't know of a good place in the 2.x docs where delimited reading options are described, so I still go back to the 1.x docs). I would recommend experimenting with a single-byte encoding, e.g., ISO-8859-1; US-ASCII won't help here, since it is 7-bit and can't represent bytes above 0x7F.
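A minimal sketch of that route against the sample file in the question, assuming the encoding option of the PySpark 2.x CSV reader plus the built-in encode and hex functions. ISO-8859-1 decodes each of the 256 byte values to exactly one character, so encoding the string back recovers the original bytes:

import pyspark.sql.functions as f

# Decode losslessly with a single-byte charset, then turn the column back
# into its original bytes and hex-encode them
spark_df = spark.read.csv("sample.csv", header=True, encoding="ISO-8859-1")
spark_df.withColumn(
    "a", f.lower(f.hex(f.encode("a", "ISO-8859-1")))
).show(truncate=False)

If the reader honors the option, this should reproduce the lowercase hex values from the pandas output above.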

If there is no such encoding, you would need to either transform the input to a different format, e.g., base64 encoding the first column, or manipulate the read data to get it back to what you need.
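For the second route, here is one hedged sketch (mine, not from the Spark docs): bypass the CSV reader entirely, read the raw bytes with binaryFiles, and split the records yourself. This assumes each file fits in memory and, crucially, that the delimiter and newline bytes never occur inside the binary ID, which is exactly the fragility that makes delimited files a poor fit for binary data:

# Parse the raw bytes of the file manually and hex-encode column "a"
def parse_bytes(content):
    rows = []
    for line in content.split(b"\n")[1:]:  # skip the header line
        if not line:
            continue
        a, b, c = line.split(b",")
        rows.append((a.hex(), b.decode("ascii"), c.decode("ascii")))
    return rows

rdd = spark.sparkContext.binaryFiles("sample.csv") \
    .flatMap(lambda kv: parse_bytes(kv[1]))
rdd.toDF(["a", "b", "c"]).show(truncate=False)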

answered Sep 18 '22 by Sim