
How can I read in a binary file from hdfs into a Spark dataframe?

I am trying to port some code from pandas to (py)Spark. Unfortunately, I am already failing at the input step, where I want to read in binary data and put it into a Spark DataFrame.

So far I am using fromfile from numpy:

import numpy as np
import pandas as pd

dt = np.dtype([('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4'), ('val4', 'f8')])
data = np.fromfile('binary_file.bin', dtype=dt)
data = data[1:]                                         # throw away the header record
df_bin = pd.DataFrame(data, columns=data.dtype.names)

But for Spark I couldn't find out how to do it. My workaround so far has been to use CSV files instead of the binary files, but that is not an ideal solution. I am aware that I shouldn't use numpy's fromfile with Spark. How can I read in a binary file that is already loaded into HDFS?

I tried something like

fileRDD = sc.parallelize(['hdfs:///user/bin_file1.bin', 'hdfs:///user/bin_file2.bin'])
fileRDD.map(lambda x: ???)

But it is giving me a No such file or directory error.

I have seen this question: spark in python: creating an rdd by loading binary data with numpy.fromfile, but that only works if I have the files stored in the home directory of the driver node.

Asked May 24 '16 by WilliamEllisWebb



3 Answers

So, for anyone who, like me, is just starting with Spark and stumbles upon binary files: here is how I solved it:

import numpy as np
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# big-endian record layout of the binary files
dt = np.dtype([('idx_metric', '>i4'), ('idx_resource', '>i4'), ('date', '>i4'),
               ('value', '>f8'), ('pollID', '>i2')])
schema = StructType([StructField('idx_metric', IntegerType(), False),
                     StructField('idx_resource', IntegerType(), False),
                     StructField('date', IntegerType(), False),
                     StructField('value', DoubleType(), False),
                     StructField('pollID', IntegerType(), False)])

filenameRdd = sc.binaryFiles('hdfs://nameservice1:8020/user/*.binary')

def read_array(rdd):
    # rdd is a (path, content) pair; rdd[1] holds the whole file as bytes
    # output = zlib.decompress(bytes(rdd[1]), 15 + 32)  # in case the files are also zipped
    array = np.frombuffer(bytes(rdd[1])[20:], dtype=dt)  # strip the 20-byte header
    array = array.newbyteorder().byteswap()              # convert big-endian to native order
    return array.tolist()

unzipped = filenameRdd.flatMap(read_array)
bin_df = sqlContext.createDataFrame(unzipped, schema)

And now you can do whatever fancy stuff you want with your DataFrame in Spark.

Answered by WilliamEllisWebb


Edit: Please review the use of sc.binaryFiles as mentioned here: https://stackoverflow.com/a/28753276/5088142


Try using a fully qualified HDFS URI:

hdfs://machine_host_name:8020/user/bin_file1.bin

You can find the host name in the fs.defaultFS property in core-site.xml.
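
For example, a minimal sketch (the host name and port here are placeholders; substitute the authority from your own fs.defaultFS):

# host name and port are placeholders -- take them from fs.defaultFS in core-site.xml
fileRDD = sc.binaryFiles('hdfs://machine_host_name:8020/user/bin_file1.bin')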

Answered by Yaron


Since Spark 3.0, Spark supports a binary file data source, which reads binary files and converts each file into a single record that contains the raw content and metadata of the file.

https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html
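
A minimal sketch of using that data source from PySpark (the directory path and glob pattern are placeholders; per the linked docs, the resulting DataFrame has path, modificationTime, length and content columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pathGlobFilter restricts the load to matching file names
df = (spark.read.format("binaryFile")
      .option("pathGlobFilter", "*.bin")
      .load("hdfs:///user/"))             # placeholder directory

# each row holds one file's path, modificationTime, length and raw bytes
df.select("path", "length").show(truncate=False)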

Answered by Aydin K.