 

Spark DataFrame: How to add an index column (aka a distributed data index)

I read data from a CSV file, but it doesn't have an index.

I want to add a column that goes from 1 to the number of rows.

What should I do? Thanks. (Scala)

Asked by Liangpi on Apr 14 '17

People also ask

How do you add a column in PySpark DataFrame at a specific position?

In PySpark, to add a new column to a DataFrame you can use the lit() function, imported with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column type; if you want to add a NULL/None value, use lit(None).
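The equivalent lit() function also exists in the Scala API. A minimal sketch, assuming a hypothetical DataFrame df with columns "a" and "b":

import org.apache.spark.sql.functions.lit

// add a constant column and a null column (the column names are just for illustration)
val withFlag = df.withColumn("flag", lit(1))
val withNote = df.withColumn("note", lit(null).cast("string"))

// withColumn always appends the new column at the end; use select
// to place it at a specific position
val reordered = withFlag.select("a", "flag", "b")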

How do I add a column in Spark Dataset?

A new column can be added to an existing Dataset using the Dataset.withColumn() method. withColumn() accepts two arguments, the name of the column to be added and a Column expression, and returns a new Dataset&lt;Row&gt;. The general form of withColumn() is shown below.
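A minimal Scala sketch, assuming a hypothetical Dataset ds with a numeric price column:

import org.apache.spark.sql.functions.col

// withColumn(name, expression) returns a new Dataset with the extra column;
// the original Dataset is left unchanged
val withTax = ds.withColumn("priceWithTax", col("price") * 1.2)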

Does Spark support indexing?

Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient.

How do I use indexing in PySpark?

Indexing and accessing in a PySpark DataFrame: there is an alternative way to do that in PySpark, by creating a new column "index". Then we can use the .filter() function on our "index" column to access a row by its index.
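The same idea in Scala, assuming a hypothetical DataFrame df:

import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// add an "index" column, then filter on it to pull out a specific row
// (with a single partition the generated ids are 0, 1, 2, ...;
// for a strictly consecutive index see the row_number() answer below)
val indexed = df.withColumn("index", monotonically_increasing_id())
val thirdRow = indexed.filter(col("index") === 2)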


2 Answers

With Scala you can use:

import org.apache.spark.sql.functions._

// monotonicallyIncreasingId is deprecated since Spark 2.0; use monotonically_increasing_id()
df.withColumn("id", monotonically_increasing_id())

You can refer to this example and the Scala docs.

With PySpark you can use:

from pyspark.sql.functions import monotonically_increasing_id

df_index = df.select("*").withColumn("id", monotonically_increasing_id())
Answered by Omar14 on Sep 22 '22

monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
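A quick sketch of why the values are not consecutive, assuming an active SparkSession named spark:

import org.apache.spark.sql.functions.monotonically_increasing_id

// spark.range(6) already has a column "id", so the generated column gets a different name
val demo = spark.range(6).repartition(3)
demo.withColumn("gen_id", monotonically_increasing_id()).show()
// The implementation puts the partition ID in the upper 31 bits and the record
// number within each partition in the lower 33 bits, so each partition's IDs
// start at partitionId << 33 rather than continuing a single 1..N sequence.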

"I want to add a column from 1 to row's number."

Let's say we have the following DataFrame:

+--------+-------------+-------+
| userId | productCode | count |
+--------+-------------+-------+
|     25 |        6001 |     2 |
|     11 |        5001 |     8 |
|     23 |         123 |     5 |
+--------+-------------+-------+

To generate the IDs starting from 1:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("count")
val result = df.withColumn("index", row_number().over(w))

This would add an index column ordered by increasing value of count.

+--------+-------------+-------+-------+
| userId | productCode | count | index |
+--------+-------------+-------+-------+
|     25 |        6001 |     2 |     1 |
|     23 |         123 |     5 |     2 |
|     11 |        5001 |     8 |     3 |
+--------+-------------+-------+-------+
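For reference, a minimal self-contained version of this approach, assuming an active SparkSession named spark and the sample data above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

// recreate the sample DataFrame from the answer
val df = Seq(
  (25, 6001, 2),
  (11, 5001, 8),
  (23, 123, 5)
).toDF("userId", "productCode", "count")

// consecutive index starting at 1, ordered by count
val w = Window.orderBy("count")
df.withColumn("index", row_number().over(w)).show()

Note that a window with no partitionBy pulls all rows into a single partition to establish the global ordering, which is fine here but can become a bottleneck on large data.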
Answered by anshu kumar on Sep 22 '22