My data are available as sets of Python 3 pickled files. Most of them are serialized Pandas DataFrames.
I'd like to start using Spark because I need more memory and CPU than one computer can have. Also, I'll use HDFS for distributed storage.
As a beginner, I haven't found relevant information explaining how to use pickle files as input files.
Does it exist? If not, is there a workaround?
Thanks a lot
A word of caution before anything else: the pickle module is not secure, so only unpickle data you trust. It is possible to construct malicious pickle data that executes arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
A lot depends on the data itself. Generally speaking, Spark doesn't perform particularly well when it has to read large, non-splittable files. Nevertheless, you can try the binaryFiles method and combine it with the standard Python tools. Let's start with some dummy data:
import tempfile
import pandas as pd
import numpy as np

# Write five small pickled DataFrames to a temporary directory
outdir = tempfile.mkdtemp()

for i in range(5):
    pd.DataFrame(
        np.random.randn(10, 2), columns=['foo', 'bar']
    ).to_pickle(tempfile.mkstemp(dir=outdir)[1])
Next we can read it using the binaryFiles method:
rdd = sc.binaryFiles(outdir)  # RDD of (path, content-as-bytes) pairs
and deserialize individual objects:
import pickle
from io import BytesIO

# Drop the paths (keys) and unpickle each file's raw bytes
dfs = rdd.values().map(lambda p: pickle.load(BytesIO(p)))
dfs.first()[:3]
##         foo       bar
## 0 -0.162584 -2.179106
## 1  0.269399 -0.433037
## 2 -0.295244  0.119195
One important note: this approach typically requires significantly more memory than simpler methods like textFile.
Another approach is to parallelize only the paths and use a library that can read directly from the distributed file system, such as hdfs3. This typically means lower memory requirements, at the price of significantly worse data locality.
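A minimal sketch of that idea, assuming an HDFS namenode reachable at namenode:8020; the host, port, and file paths below are placeholders, not values from the question:

import pickle
from hdfs3 import HDFileSystem

# Hypothetical HDFS paths to the pickled DataFrames
paths = ['/data/df_0.pkl', '/data/df_1.pkl']

def load_pickle(path):
    # Each task connects to HDFS and reads a single file, so only one
    # DataFrame needs to fit in memory per task at a time
    hdfs = HDFileSystem(host='namenode', port=8020)  # assumed cluster address
    with hdfs.open(path, 'rb') as f:
        return pickle.loads(f.read())

dfs = sc.parallelize(paths).map(load_pickle)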
Considering these two facts, it is typically better to serialize your data in a format which can be loaded with a higher granularity.
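For example, you could use the binaryFiles pipeline above as a one-off conversion job and write out a splittable format such as Parquet, which later jobs can read column by column and partition by partition. A sketch, assuming a SQLContext is available and reusing dfs and outdir from above:

import os
from pyspark.sql import Row

# Flatten each pandas DataFrame into Spark SQL Rows...
rows = dfs.flatMap(lambda pdf: (Row(**r) for r in pdf.to_dict('records')))

# ...and write them out once as Parquet
rows.toDF().write.parquet(os.path.join(outdir, 'parquet'))

# Subsequent jobs can then load the data with a higher granularity:
# sqlContext.read.parquet(os.path.join(outdir, 'parquet'))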
Note: SparkContext provides a pickleFile method, but the name can be misleading. It reads SequenceFiles containing pickled objects, not plain Python pickles.
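To make the distinction concrete, pickleFile is the reader for what saveAsPickleFile writes (the path below is a placeholder):

# saveAsPickleFile stores an RDD as a SequenceFile of pickled objects...
sc.parallelize(range(10)).saveAsPickleFile('/tmp/pickled-rdd')

# ...and pickleFile reads it back; it cannot open a plain .pkl file
sc.pickleFile('/tmp/pickled-rdd').collect()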