
Add one column containing values from 1 to n to a DataFrame

Tags:

pyspark

I am creating a DataFrame with PySpark, like this:

+----+------+
|   k|     v|
+----+------+
|key1|value1|
|key1|value1|
|key1|value1|
|key2|value1|
|key2|value1|
|key2|value1|
+----+------+
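
For reference, this DataFrame can be recreated as follows (a sketch, assuming an active SparkSession named spark):

my_df = spark.createDataFrame(
    [('key1', 'value1'), ('key1', 'value1'), ('key1', 'value1'),
     ('key2', 'value1'), ('key2', 'value1'), ('key2', 'value1')],
    ['k', 'v'])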

I want to add a 'rowNum' column using the 'withColumn' method, so that the DataFrame becomes:

+----+------+------+
|   k|     v|rowNum|
+----+------+------+
|key1|value1|     1|
|key1|value1|     2|
|key1|value1|     3|
|key2|value1|     4|
|key2|value1|     5|
|key2|value1|     6|
+----+------+------+

The range of rowNum is from 1 to n, where n is the number of rows. I modified my code like this:

from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy("v").orderBy('k')
my_df= my_df.withColumn("rowNum", F.rowNumber().over(w))

But, I got error message:

'module' object has no attribute 'rowNumber' 

I replaced the rowNumber() method with row_number:
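
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window().partitionBy("v").orderBy("k")
my_df = my_df.withColumn("rowNum", F.row_number().over(w))

The modified code ran without the attribute error. But when I called: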

my_df.show()

I got another error:

Py4JJavaError: An error occurred while calling o898.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number()
    at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
    at org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate.doGenCode(interfaces.scala:342)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
    at scala.Option.getOrElse(Option.scala:121)
asked Mar 09 '17 by Ivan Lee


3 Answers

Solution in Spark 2.2:

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# row_number() requires an ordered window; ordering by a constant
# literal gives an arbitrary but valid total order over all rows
w = Window().orderBy(lit('A'))
df = df.withColumn("rowNum", row_number().over(w))
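
Note that because this Window has no partitionBy, Spark moves all rows into a single partition to compute row_number() and logs a performance warning to that effect; this is fine for small DataFrames but will not scale well to very large ones.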
answered Oct 23 '22 by cph_sto

If you require a sequential rowNum value from 1 to n, rather than a monotonically_increasing_id, you can use zipWithIndex().

Recreating your example data as follows:

rdd = sc.parallelize([('key1','value1'),
                      ('key1','value1'),
                      ('key1','value1'),
                      ('key1','value1'),
                      ('key1','value1'),
                      ('key1','value1')])

You can then use zipWithIndex() to add an index to each row. The map is used to reformat the data and to add 1 to the index so it starts at 1.

# zipWithIndex() pairs each record with its 0-based index; add 1 so numbering starts at 1
rdd_indexed = rdd.zipWithIndex().map(lambda x: (x[0][0], x[0][1], x[1] + 1))
df = rdd_indexed.toDF(['id', 'score', 'rowNum'])
df.show()


+----+------+------+
|  id| score|rowNum|
+----+------+------+
|key1|value1|     1|
|key1|value1|     2|
|key1|value1|     3|
|key1|value1|     4|
|key1|value1|     5|
|key1|value1|     6|
+----+------+------+
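
If you are starting from a DataFrame rather than building an RDD by hand, the same trick works on the DataFrame's underlying RDD. A minimal sketch, assuming the question's my_df with columns k and v and an active SparkSession:

# operate on the DataFrame's underlying RDD of Rows; the index is 0-based, so add 1
rows = my_df.rdd.zipWithIndex().map(lambda pair: (pair[0]['k'], pair[0]['v'], pair[1] + 1))
df_with_rownum = rows.toDF(['k', 'v', 'rowNum'])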
answered Oct 23 '22 by Alex

You can do this with window functions:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, lit  # rowNumber() was renamed to row_number() in Spark 1.6

# row_number() requires an ordered window; a constant literal provides an arbitrary order
w = Window().orderBy(lit(1))
your_df = your_df.withColumn("rowNum", row_number().over(w))

Here your_df is the DataFrame to which you want to add the column.

answered Oct 23 '22 by Rakesh Kumar