I have a data set that contains an ID field in an unknown (and unfriendly) encoding. I can read the single column using plain Python and verify that the values are distinct and consistent across multiple data sets (i.e., the field can be used as a primary key for joining).
When loading the file using spark.read.csv, it seems that Spark is decoding the column as UTF-8. However, some of the multibyte sequences are invalid and get converted to the Unicode character U+FFFD REPLACEMENT CHARACTER (EF BF BD in hex).
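The mangling can be reproduced in plain Python (a sketch of what a UTF-8 reader does, not Spark itself): decoding these bytes as UTF-8 turns each invalid sequence into U+FFFD, which re-encodes as EF BF BD.

```python
# Decode one of the raw ID values the way a lenient UTF-8 reader would.
raw = b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0'
mangled = raw.decode('utf-8', errors='replace')

# Every invalid byte/sequence became U+FFFD (EF BF BD in UTF-8),
# so the original bytes can no longer be recovered.
print(mangled.encode('utf-8').hex())  # efbfbded858eefbfbdefbfbdc7b0
```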
Is there a way to force Spark to read the column as bytes and not as a string?
Here is some code that can be used to recreate my issue (let column a be the ID field):
Create a File with Sample Data
data = [
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0', '1', 'a'),
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb1', '2', 'b'),
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb2', '3', 'c')
]
# Note: this writer runs on Python 2, where str and bytes are the same type;
# on Python 3 the mixed bytes/str join below would raise a TypeError.
with open('sample.csv', 'wb') as f:
    header = ["a", "b", "c"]
    f.write(",".join(header) + "\n")
    for d in data:
        f.write(",".join(d) + "\n")
Read using Pandas
import pandas as pd
# Python 2: the cells arrive as byte strings, and str.encode('hex') hexifies them
df = pd.read_csv("sample.csv", converters={"a": lambda x: x.encode('hex')})
print(df)
# a b c
#0 baed858e91d4c7b0 1 a
#1 baed858e91d4c7b1 2 b
#2 baed858e91d4c7b2 3 c
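(Aside: str.encode('hex') exists only on Python 2. A sketch of the equivalent read on Python 3, assuming ISO-8859-1 so that every byte survives decoding:)

```python
import pandas as pd

# Recreate sample.csv with the same raw bytes so this snippet stands alone.
rows = [b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0,1,a',
        b'\xba\xed\x85\x8e\x91\xd4\xc7\xb1,2,b',
        b'\xba\xed\x85\x8e\x91\xd4\xc7\xb2,3,c']
with open('sample.csv', 'wb') as fh:
    fh.write(b'a,b,c\n' + b'\n'.join(rows) + b'\n')

# ISO-8859-1 decodes every byte value losslessly, so re-encoding the cell
# recovers the original bytes before hexifying them.
df = pd.read_csv('sample.csv', encoding='iso-8859-1',
                 converters={'a': lambda s: s.encode('iso-8859-1').hex()})
print(df)
```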
Try reading the same file using Spark
spark_df = spark.read.csv("sample.csv", header=True)
spark_df.show()
#+-----+---+---+
#|a |b |c |
#+-----+---+---+
#|�텎��ǰ|1 |a |
#|�텎��DZ|2 |b |
#|�텎��Dz|3 |c |
#+-----+---+---+
Yikes! OK, so how about converting to hex?
import pyspark.sql.functions as f
spark_df.withColumn("a", f.hex("a")).show(truncate=False)
#+----------------------------+---+---+
#|a |b |c |
#+----------------------------+---+---+
#|EFBFBDED858EEFBFBDEFBFBDC7B0|1 |a |
#|EFBFBDED858EEFBFBDEFBFBDC7B1|2 |b |
#|EFBFBDED858EEFBFBDEFBFBDC7B2|3 |c |
#+----------------------------+---+---+
(In this example the values are distinct, but that's not true in my larger file)
As you can see, the values are close, but some of the bytes have been replaced by EFBFBD.
Is there any way to read the file in Spark (maybe using rdd?) so that my output looks like the pandas version:
#+----------------+---+---+
#|a |b |c |
#+----------------+---+---+
#|baed858e91d4c7b0|1 |a |
#|baed858e91d4c7b1|2 |b |
#|baed858e91d4c7b2|3 |c |
#+----------------+---+---+
I've tried casting to byte and specifying the schema so that this column is ByteType(), but that didn't work.
Edit
I am using Spark v 2.1.
The problem is rooted in the fact that delimited files are poorly suited to binary data.
If there is a known, consistent encoding for the text, use the charset option. See https://github.com/databricks/spark-csv#features (I don't know of a good place in the 2.x docs where the delimited-file reading options are described, so I still go back to the 1.x docs). I would recommend experimenting with a single-byte ("8-bit ASCII") encoding such as ISO-8859-1, which maps every byte value to a distinct character; note that US-ASCII is a 7-bit encoding, so it would still replace bytes above 0x7F.
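The reason a single-byte charset helps can be sketched in plain Python (the same decoding is what the reader applies when you pass .option("charset", "ISO-8859-1") to spark.read):

```python
raw = b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0'

# ISO-8859-1 maps each of the 256 byte values to exactly one character,
# so decode/encode is a lossless round trip...
decoded = raw.decode('iso-8859-1')
assert decoded.encode('iso-8859-1') == raw

# ...whereas UTF-8 decoding with replacement destroys the invalid sequences.
assert raw.decode('utf-8', 'replace').encode('utf-8') != raw
```

After reading with such a charset, the original bytes can be recovered on the Spark side by re-encoding the string column with the same codec (e.g., in a UDF).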
If there is no such encoding, you would need to either transform the input to a different format, e.g., base64 encoding the first column, or manipulate the read data to get it back to what you need.
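A minimal sketch of the base64 route (the preprocessing step is hypothetical and happens outside Spark; pyspark.sql.functions does provide unbase64 for decoding the column afterwards):

```python
import base64

raw = b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0'

# Base64 turns arbitrary bytes into a plain-ASCII token that is safe to
# put in a delimited file and read with any text encoding.
token = base64.b64encode(raw).decode('ascii')
print(token)  # uu2FjpHUx7A=

# The round trip is exact, so the column keeps working as a join key.
assert base64.b64decode(token) == raw
```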