
How to print only a certain column of DataFrame in PySpark?

Can one use the actions collect or take to print only a given column of a DataFrame?

This

df.col.collect() 

gives the error

TypeError: 'Column' object is not callable

and this:

df[df.col].take(2) 

gives

pyspark.sql.utils.AnalysisException: u"filter expression 'col' of type string is not a boolean.;"

mar tin asked Mar 10 '16 10:03



1 Answer

select and show:

df.select("col").show() 

or select, flatMap, collect:

df.select("col").rdd.flatMap(list).collect() 

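Bracket notation with a Column argument (df[df.col]) is used only for logical slicing, that is, filtering rows by a boolean expression, which is why a string column is rejected there. A column by itself (df.col) is not a distributed data structure but a SQL expression, so it cannot be collected.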

zero323 answered Sep 19 '22 17:09