I'm using Spark 1.3.0 and Python. I have a dataframe and I wish to add an additional column which is derived from other columns. Like this,
>>> old_df.columns
[col_1, col_2, ..., col_m]
>>> new_df.columns
[col_1, col_2, ..., col_m, col_n]
where
col_n = col_3 - col_4
How do I do this in PySpark?
In PySpark, to add a new column to a DataFrame, use the lit() function, imported with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column type; if you want to add a NULL / None value, use lit(None).
A new column can be added to an existing DataFrame using the withColumn() method. withColumn() accepts two arguments, the name of the column to add and a Column expression, and returns a new DataFrame. The syntax of withColumn() is shown below.
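A sketch of the PySpark signature, where colName is the new column's name and col is a Column expression:

DataFrame.withColumn(colName, col)
# returns a new DataFrame with colName added (or replaced if it already exists)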
Let's create a new column with a constant (literal) value using the lit() SQL function, as in the code below.
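A minimal sketch, assuming an existing DataFrame df; the column names 'status' and 'score' are illustrative:

from pyspark.sql.functions import lit

# Add a constant string column.
df2 = df.withColumn('status', lit('active'))
# Add a NULL column; the cast gives the untyped null literal a concrete type.
df2 = df2.withColumn('score', lit(None).cast('double'))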
Using the concat() function to concatenate DataFrame columns: Spark SQL provides concat() to concatenate two or more DataFrame columns into a single column. It can also take columns of different data types and concatenate them into a single column; for example, it supports String, Int, and Boolean, and also arrays.
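For example, a small sketch assuming a DataFrame people with string columns 'fname' and 'lname' (concat() is imported from pyspark.sql.functions):

from pyspark.sql.functions import concat, lit

# Join first and last name with a space separator into one new column.
people = people.withColumn('full_name', concat(people.fname, lit(' '), people.lname))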
One way to achieve that is to use the withColumn method:
old_df = sqlContext.createDataFrame(
    sc.parallelize([(0, 1), (1, 3), (2, 5)]),
    ('col_1', 'col_2'))
new_df = old_df.withColumn('col_n', old_df.col_1 - old_df.col_2)
Alternatively, you can use SQL on a registered table:
old_df.registerTempTable('old_df')
new_df = sqlContext.sql('SELECT *, col_1 - col_2 AS col_n FROM old_df')
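The same derived column can also be written with the col() helper from pyspark.sql.functions instead of attribute access on the DataFrame; a sketch of the equivalent expression:

from pyspark.sql.functions import col

# Equivalent to the withColumn example above, referencing columns by name.
new_df = old_df.withColumn('col_n', col('col_1') - col('col_2'))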