I want to create a predictive model on several hundred GBs of data. The data needs some not-intensive preprocessing that I can do in pyspark but not in tensorflow. In my situation, it would be much more convenient to directly pass the result of the pre-processing to TF, ideally treating the pyspark data frame as a virtual input file to TF, instead of saving the pre-processed data to disk. However, I haven't the faintest idea how to do that and I couldn't find anywhere on the internet.
After some thought, it seems to me that I actually need an iterator (like as defined by tf.data.Iterator
) over spark's data. However, I found comments online that hint to the fact that the distributed structure of spark makes it very hard, if not impossible. Why so? Imagine that I don't care about the order of the lines, why should it be impossible to iterate over the spark data?
TensorFlow's Basic Programming Elements TensorFlow allows us to assign data to three kinds of data elements: constants, variables, and placeholders. Let's take a closer look at what each of these data components represents.
The tf. data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
It sounds like you simply want to use tf.data.Dataset.from_generator()
you define a python generator which reads samples out of spark. Although I don't know spark very well, I'm certain you can do a reduce to the server that will be running the tensorflow model. Better yet, if you're distributing your training you can reduce to the set of servers who need some shard of your final dataset.
The import data programmers guide covers the Dataset
input pipeline in more detail. The tensorflow Dataset
will provide you with an iterator that's accessed directly by the graph so there's no need for tf.placeholders
or marshaling data outside of the tf.data.Dataset.from_generator()
code you write.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With