Spark DataFrame equivalent to Pandas Dataframe `.iloc()` method?

Is there a way to reference Spark DataFrame columns by position using an integer?

Analogous Pandas DataFrame operation:

df.iloc[:, 0]  # give me all the rows at column position 0
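
For reference, a minimal pandas sketch of the behaviour being asked about (the sample data here is made up):

import pandas as pd

df = pd.DataFrame({"age": [25, 30], "name": ["foo", "bar"]})
df.iloc[:, 0]  # all rows, column at position 0 -> the `age` Series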
asked May 27 '16 by conner.xyz

2 Answers

The closest PySpark equivalent of pandas df.iloc is collect(): it returns the DataFrame's rows to the driver as a plain Python list, which can then be indexed by position or by column name.

PySpark examples:

X = df.collect()[0]['age']  # row 0, column 'age' by name

or

X = df.collect()[0][1]  # row 0, column at position 1
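
Keep in mind that collect() materializes the entire DataFrame on the driver, so it is only practical for small results. When just one row is needed, a cheaper pattern (a minimal sketch, assuming the same df) is to limit the result first:

# Fetch a single row, then index it by position or by name.
X = df.limit(1).collect()[0][1]
# Equivalently, first() returns only the first Row:
X = df.first()['age']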
answered Sep 23 '22 by Chadee Fouad


Not really, but you can try something like this:

Python:

df = sc.parallelize([(1, "foo", 2.0)]).toDF()
df.select(*df.columns[:1])  # I assume [:1] is what you really want
## DataFrame[_1: bigint]

or

df.select(df.columns[1:3])
## DataFrame[_2: string, _3: double]
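
If you need positional selection often, a small helper keeps the intent clear. A sketch; the name select_by_position is made up for illustration, not a Spark API:

from pyspark.sql import DataFrame

def select_by_position(df: DataFrame, start: int, stop: int) -> DataFrame:
    # Select the columns at positions [start, stop), like df.iloc[:, start:stop].
    return df.select(*df.columns[start:stop])

select_by_position(df, 1, 3)
## DataFrame[_2: string, _3: double]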

Scala:

import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq((1, "foo", 2.0))).toDF()
df.select(df.columns.slice(0, 1).map(col(_)): _*)

Note:

Spark SQL doesn't support row indexing, and it is unlikely to ever support it, so it is not possible to index along the row dimension.
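
If you really need to address rows by position, a common workaround is to attach an explicit index column and filter on it. A minimal sketch, assuming a SparkSession named spark; note that row_number() requires an ordering, and an un-partitioned window pulls all rows into a single partition:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "foo", 2.0), (2, "bar", 3.0)], ["_1", "_2", "_3"])

# Deterministic 1-based row index, ordered by column _1.
w = Window.orderBy("_1")
indexed = df.withColumn("row_idx", F.row_number().over(w))
indexed.filter(F.col("row_idx") == 1).show()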

answered Sep 24 '22 by zero323