Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:

val name = df.select("name")
val name1 = name.collect()

But neither of these returns the value of the "name" column.

Spark version: 2.2.0, Scala version: 2.11.11

Ayan Biswas asked Sep 21 '17 16:09


2 Answers

There are a couple of things here. If you want to see all the data, collect is the way to go. However, if the data is too large, collecting it can cause the driver to run out of memory and fail.

So the alternative is to check a few items from the DataFrame. What I generally do is:

df.limit(10).select("name").as[String].collect()

This returns an array of up to 10 elements, but the output doesn't look good.

So the second alternative is:

df.select("name").show(10)

This prints the first 10 elements. If the column values are long, show truncates them and puts "..." in place of the rest of the value, which is annoying.

Hence there is a third option:

df.select("name").take(10).foreach(println)

This takes 10 elements and prints them.
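
On the truncation issue with show mentioned above: show also has an overload that takes a truncate flag, which avoids the "..." without leaving show. A minimal sketch, assuming a local SparkSession and a made-up two-row DataFrame (the object name, the session setup and the sample data are illustrative and not from the question):

```scala
import org.apache.spark.sql.SparkSession

object ShowWithoutTruncation {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell a `spark` value already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("show-without-truncation")
      .getOrCreate()
    import spark.implicits._ // needed for .toDF on a local Seq

    // Hypothetical data: one long value, one short value.
    val df = Seq("a fairly long name that exceeds twenty characters", "Bob").toDF("name")

    // Default behaviour: strings longer than 20 characters are cut and end in "..."
    df.select("name").show(10)

    // Second argument is the truncate flag; false prints the full cell contents.
    df.select("name").show(10, false)

    spark.stop()
  }
}
```

The untruncated table can get very wide, so this is mostly useful for a handful of rows.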

Note that in all of these cases you won't get a fair sample of the data, since the first 10 rows are always picked. To truly pick rows at random from the DataFrame you can use:

df.select("name").sample(true, 0.2).show(10)
or
df.select("name").sample(true, 0.2).take(10).foreach(println)

You can check the documentation for the sample function on DataFrame.
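
As a rough illustration of sample, here is a sketch that assumes a local SparkSession and 100 generated rows (both invented for the example). In the Scala API for Spark 2.2 the arguments are withReplacement first, then fraction, with an optional seed for repeatable runs. The fraction is only approximate, so the number of sampled rows is not guaranteed to be exactly 20% of the input:

```scala
import org.apache.spark.sql.SparkSession

object SampleNames {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell a `spark` value already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-names")
      .getOrCreate()
    import spark.implicits._ // needed for .toDF on a local Seq

    // Hypothetical data: 100 rows in a single "name" column.
    val df = (1 to 100).map(i => s"name-$i").toDF("name")

    // withReplacement = false, fraction = 0.2, seed = 42 (the seed value is arbitrary).
    val sampled = df.select("name").sample(false, 0.2, 42L)

    // take(10) returns at most 10 Row objects to the driver; each prints as [value].
    sampled.take(10).foreach(println)

    spark.stop()
  }
}
```

Sampling without replacement is used here so that no row can appear twice in the output; pass true as the first argument to allow repeats.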

Avishek Bhattacharya answered Oct 04 '22 02:10


The first will do :)

val name = df.select("name") returns another DataFrame, not the values themselves. You can, for example, call name.show() to display the contents of that DataFrame. You can also call collect to materialize the results on the driver (collectAsMap is available on pair RDDs rather than on DataFrames), but be aware that the amount of data should not be too big for the driver.

You can also do:

val names = df.select("name").as[String].collect()

This returns an array of the names in this DataFrame.
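
To tie this back to storing the value in a variable, here is a minimal sketch. It assumes the "name" column holds strings and uses a made-up two-row DataFrame; the object and helper names (NameColumnValues, allNames, firstName) are hypothetical. Note that as[String] needs the implicits of the active SparkSession in scope:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NameColumnValues {
  // All values of the "name" column as a local array. This pulls the whole
  // column to the driver, so it is only suitable for small results.
  def allNames(spark: SparkSession, df: DataFrame): Array[String] = {
    import spark.implicits._ // provides the Encoder[String] that .as[String] needs
    df.select("name").as[String].collect()
  }

  // Only the first row's value. first() throws if the DataFrame is empty.
  def firstName(df: DataFrame): String =
    df.select("name").first().getString(0)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell a `spark` value already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("name-column-values")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for the question's df.
    val df = Seq("Alice", "Bob").toDF("name")

    val names: Array[String] = allNames(spark, df)
    val one: String = firstName(df)

    println(names.mkString(", "))
    println(one)

    spark.stop()
  }
}
```

If only a single value is needed, the first() variant avoids collecting the whole column.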

T. Gawęda answered Oct 04 '22 04:10