My data are available as sets of Python 3 pickled files. Most of them are serialized Pandas DataFrames.
I'd like to start using Spark because I need more memory and CPU than one computer can have. Also, I'll use HDFS for distributed storage.
As a beginner, I haven't found relevant information explaining how to use pickle files as input files.
Does it exist? If not, is there a workaround?
Thanks a lot
A word of caution before anything else: the pickle module is not secure, so only unpickle data you trust. It is possible to construct malicious pickle data that executes arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
A lot depends on the data itself. Generally speaking, Spark doesn't perform particularly well when it has to read large, non-splittable files. Nevertheless, you can try the binaryFiles method and combine it with the standard Python tools. Let's start with some dummy data:
import tempfile
import pandas as pd
import numpy as np

# Write five small pickled DataFrames to a temporary directory
outdir = tempfile.mkdtemp()

for i in range(5):
    pd.DataFrame(
        np.random.randn(10, 2), columns=['foo', 'bar']
    ).to_pickle(tempfile.mkstemp(dir=outdir)[1])
Next we can read it using the binaryFiles method:
rdd = sc.binaryFiles(outdir)  # RDD of (path, content-as-bytes) pairs
and deserialize individual objects:
import pickle
from io import BytesIO

# Drop the paths (keys) and unpickle each file's raw bytes
dfs = rdd.values().map(lambda p: pickle.load(BytesIO(p)))
dfs.first()[:3]
##         foo       bar
## 0 -0.162584 -2.179106
## 1  0.269399 -0.433037
## 2 -0.295244  0.119195
One important note: this approach typically requires significantly more memory than simpler methods like textFile.
Another approach is to parallelize only the paths and use a library that can read directly from the distributed file system, such as hdfs3. This typically means lower memory requirements, at the price of significantly worse data locality.
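A minimal sketch of that idea, assuming an HDFS namenode reachable at namenode:8020; the host, port, and file paths below are placeholders, not values from the question:

import pickle
from hdfs3 import HDFileSystem

# Hypothetical HDFS paths to the pickled DataFrames
paths = ['/data/df_0.pkl', '/data/df_1.pkl']

def load_pickle(path):
    # Each task connects to HDFS and reads a single file, so only one
    # DataFrame needs to fit in memory per task at a time
    hdfs = HDFileSystem(host='namenode', port=8020)  # assumed cluster address
    with hdfs.open(path, 'rb') as f:
        return pickle.loads(f.read())

dfs = sc.parallelize(paths).map(load_pickle)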
Considering these two facts, it is typically better to serialize your data in a format which can be loaded with a higher granularity.
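For example, you could use the binaryFiles pipeline above as a one-off conversion job and write out a splittable format such as Parquet, which later jobs can read column by column and partition by partition. A sketch, assuming a SQLContext is available and reusing dfs and outdir from above:

import os
from pyspark.sql import Row

# Flatten each pandas DataFrame into Spark SQL Rows...
rows = dfs.flatMap(lambda pdf: (Row(**r) for r in pdf.to_dict('records')))

# ...and write them out once as Parquet
rows.toDF().write.parquet(os.path.join(outdir, 'parquet'))

# Subsequent jobs can then load the data with a higher granularity:
# sqlContext.read.parquet(os.path.join(outdir, 'parquet'))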
Note: SparkContext provides a pickleFile method, but the name can be misleading. It reads SequenceFiles containing pickled objects, not plain Python pickles.
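To make the distinction concrete, pickleFile is the reader for what saveAsPickleFile writes (the path below is a placeholder):

# saveAsPickleFile stores an RDD as a SequenceFile of pickled objects...
sc.parallelize(range(10)).saveAsPickleFile('/tmp/pickled-rdd')

# ...and pickleFile reads it back; it cannot open a plain .pkl file
sc.pickleFile('/tmp/pickled-rdd').collect()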