What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API.

I use this code to load a tab-separated CSV file into a Spark DataFrame:

    from pyspark.sql.types import StructType, StructField, StringType

    lines = sc.textFile('tail5.csv')
    parts = lines.map(lambda l: l.strip().split('\t'))
    fnames = *some name list*
    schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
    ddf = sqlContext.createDataFrame(parts, schemaData)

Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():

Does it store the pandas object in local memory?
Is pandas' low-level computation all handled by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes.)
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?

asked Mar 24 '15 06:03

Napitupulu Jon

1 Answer

Using Spark to read a CSV file into pandas is quite a roundabout way of achieving the end goal of reading a CSV file into memory.

It seems like you might be misunderstanding the use cases of the technologies in play here.

Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.

In your example, the sc.textFile method will simply give you a Spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference is performed, so if you want to sum a column of numbers in your CSV file, you won't be able to, because the values are still strings as far as Spark is concerned.
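
To illustrate the point about strings, here is a minimal sketch in plain Python (no Spark involved) that applies the same strip-and-split logic as the lambda in the question. The sample lines are made up for illustration; they are not from the asker's file.

```python
# Hypothetical tab-separated lines standing in for the contents of the file.
lines = ["1\t2.5\tfoo", "3\t4.0\tbar"]

# Same per-line transformation as the lambda passed to lines.map(...).
parts = [l.strip().split('\t') for l in lines]

# Every field is still a str, including the "numeric" ones.
first_col = [row[0] for row in parts]

# Summing the raw strings fails because int + str is not defined.
try:
    sum(first_col)
    summed_ok = True
except TypeError:
    summed_ok = False

# An explicit cast is needed before any arithmetic.
total = sum(int(v) for v in first_col)
```

The explicit cast in the last line is the step that type inference would otherwise do for you.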

Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. The sc.textFile approach above doesn't do this.
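
A minimal sketch of the pandas route, assuming the file is tab-separated with a header row. The column names and values are invented for illustration, and an in-memory buffer stands in for 'tail5.csv' so the snippet is self-contained.

```python
import io

import pandas as pd

# Hypothetical tab-separated content; in practice you would pass the file path.
tsv_text = "name\tcount\tprice\nfoo\t1\t2.5\nbar\t3\t4.0\n"

# sep="\t" tells read_csv that the fields are tab-separated.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Column types are inferred, so numeric columns can be summed directly.
print(df.dtypes)
total = df["count"].sum()
```

With a real file, the call would look like pd.read_csv('tail5.csv', sep='\t').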

Now to answer your questions:

Does it store the pandas object in local memory?

Yes. toPandas() collects every row of the Spark DataFrame to the driver and builds a pandas DataFrame there, so the whole dataset has to fit in the driver's memory.

Is pandas' low-level computation all handled by Spark?

No. Pandas runs its own computations. There's no interplay between Spark and pandas; the two APIs are simply similar in places.

Does it expose all pandas DataFrame functionality?

No. For example, Series objects have an interpolate method which isn't available on PySpark Column objects. There are many methods and functions in the pandas API that are not in the PySpark API.
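
For reference, this is what the pandas-only method mentioned above looks like in use. The sample values are made up; the default interpolation method is linear.

```python
import numpy as np
import pandas as pd

# A Series with one missing value between two known points.
s = pd.Series([1.0, np.nan, 3.0])

# Linear interpolation (the default) fills the gap from its neighbours.
filled = s.interpolate()
```

Once the data is in a pandas DataFrame (for example after toPandas()), methods like this become available on its columns.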

Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?

Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.

Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.

answered Oct 08 '22 13:10

Phillip Cloud

Tags: python, pandas, apache-spark, pyspark
