Import TensorFlow data from pyspark

I want to create a predictive model on several hundred GBs of data. The data needs some light preprocessing that I can do in pyspark but not in tensorflow. In my situation, it would be much more convenient to pass the result of the preprocessing directly to TF, ideally treating the pyspark data frame as a virtual input file to TF, instead of saving the preprocessed data to disk. However, I haven't the faintest idea how to do that, and I couldn't find an answer anywhere on the internet.


After some thought, it seems to me that what I actually need is an iterator (like the one defined by tf.data.Iterator) over Spark's data. However, I found comments online hinting that Spark's distributed structure makes this very hard, if not impossible. Why is that? If I don't care about the order of the lines, why should it be impossible to iterate over the Spark data?

asked Apr 30 '18 13:04 by Gianluca Micchi


People also ask

What are the three main methods of getting data into a TensorFlow program?

TensorFlow allows us to assign data to three kinds of data elements: constants, variables, and placeholders. Each of these represents a different role data can play in a program.
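A minimal TF 1.x sketch (matching the era of this question) showing the three element kinds side by side:

    import tensorflow as tf  # TF 1.x API

    c = tf.constant(3.0)                      # constant: value fixed when the graph is built
    v = tf.Variable(1.0)                      # variable: mutable state, e.g. model weights
    p = tf.placeholder(tf.float32, shape=())  # placeholder: value fed in at run time

    total = c + v + p

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(total, feed_dict={p: 2.0}))  # prints 6.0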

What is the TensorFlow data API?

The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
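For illustration, a minimal sketch of such an image pipeline in TF 1.x; the "images/*.png" glob and the 64x64 size are made-up examples:

    import tensorflow as tf

    # Hypothetical image pipeline mirroring the description above:
    # read files, randomly perturb each image, then batch for training.
    filenames = tf.data.Dataset.list_files("images/*.png")  # made-up path

    def load_and_perturb(path):
        image = tf.image.decode_png(tf.read_file(path), channels=3)
        image = tf.image.random_flip_left_right(image)  # random perturbation
        return tf.image.resize_images(image, [64, 64])

    dataset = (filenames
               .map(load_and_perturb)
               .shuffle(buffer_size=1000)  # merge randomly selected images
               .batch(32))                 # into batches for training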


1 Answer

It sounds like you simply want to use tf.data.Dataset.from_generator(): you define a Python generator that reads samples out of Spark. Although I don't know Spark very well, I'm certain you can do a reduce to the server that will be running the TensorFlow model. Better yet, if you're distributing your training, you can reduce to the set of servers that each need some shard of your final dataset.
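A minimal sketch of that idea, assuming df is the preprocessed pyspark DataFrame with numeric feature columns plus a column named "label" (both the variable and the column name are assumptions):

    import tensorflow as tf

    # Assumes `df` is the preprocessed pyspark DataFrame, with numeric
    # feature columns and a numeric column named "label".
    feature_cols = [c for c in df.columns if c != "label"]

    def spark_generator():
        # toLocalIterator() streams Spark partitions to the driver one at
        # a time, so the full DataFrame never has to fit in memory at once.
        for row in df.toLocalIterator():
            yield [row[c] for c in feature_cols], row["label"]

    dataset = tf.data.Dataset.from_generator(
        spark_generator,
        output_types=(tf.float32, tf.float32),
        output_shapes=([len(feature_cols)], []))
    dataset = dataset.batch(128).prefetch(1)

Note that toLocalIterator() funnels everything through the Spark driver, which is the single-server "reduce" mentioned above; for distributed training you would instead shard the DataFrame and run one such generator per worker.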

The Importing Data programmer's guide covers the Dataset input pipeline in more detail. The TensorFlow Dataset will provide you with an iterator that the graph accesses directly, so there's no need for tf.placeholder ops or for marshaling data outside of the tf.data.Dataset.from_generator() code you write.
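Continuing the sketch above in TF 1.x graph mode: the iterator's get_next() op is wired straight into the graph, and my_model() here is a hypothetical model-building function:

    import tensorflow as tf

    # `dataset` comes from the from_generator() sketch above.
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()  # graph ops, not placeholders

    loss = my_model(features, labels)  # hypothetical model-building function
    train_op = tf.train.AdamOptimizer().minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        while True:
            try:
                sess.run(train_op)  # each step pulls the next batch from Spark
            except tf.errors.OutOfRangeError:
                break  # the generator is exhausted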

answered Sep 23 '22 02:09 by David Parks