 

Count number of columns in pyspark Dataframe?

I have a dataframe with 15 columns (4 categorical and the rest numeric).

I have created dummy variables for every categorical variable. Now I want to find the number of variables in my new dataframe.

I tried calculating the length of printSchema(), but it is a NoneType:

print type(df.printSchema())

asked Mar 15 '17 by Sushant Bharti


People also ask

How do I count multiple columns in PySpark?

If you want a distinct count over several selected columns, use the PySpark SQL function countDistinct(). This function returns the number of distinct elements in a group.
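For example, a short sketch (assuming the sample DataFrame df built in the answer below, with columns ID, TYPE and CODE):

from pyspark.sql.functions import countDistinct

# pass several columns to countDistinct to count distinct (TYPE, CODE) pairs
distinct_pairs = df.select(countDistinct("TYPE", "CODE")).collect()[0][0]
print(distinct_pairs)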

How do you count the DataFrame in PySpark?

To get the number of rows and the number of columns of a PySpark DataFrame, use count() and len() with columns respectively: df.count() returns the number of rows, and len(df.columns) returns the number of columns.
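A minimal sketch, again assuming the sample DataFrame df from the answer below (8 rows, 3 columns):

num_rows = df.count()        # number of rows; this is an action and triggers a Spark job
num_cols = len(df.columns)   # number of columns; df.columns is a local Python list
print(num_rows, num_cols)    # 8 3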

How do you count the number of columns in a data frame?

Count the number of rows and columns of a DataFrame using the len() function: len() on a (pandas) DataFrame returns the number of rows, and applying len() to df.columns gives the count of columns.

How do you display count in PySpark?

In PySpark, there are two ways to get the count of distinct values: chain the DataFrame functions distinct() and count(), or use the SQL function countDistinct(), which returns the distinct value count over the selected columns.
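A short sketch of both approaches, assuming the sample DataFrame df from the answer below:

from pyspark.sql.functions import countDistinct

# 1) DataFrame API: drop duplicate rows, then count them
n1 = df.select("TYPE").distinct().count()

# 2) SQL-style aggregate function
n2 = df.select(countDistinct("TYPE")).collect()[0][0]

print(n1, n2)   # both print 3 (distinct TYPE values: A, B, C)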


1 Answer

You are going about it the wrong way. Here is a sample example, along with a note about printSchema:

df = sqlContext.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["ID", "TYPE", "CODE"])


# Python 2:
print len(df.columns) #3
# Python 3
print(len(df.columns)) #3

columns provides a list of all column names, so you can take its len. printSchema, on the other hand, prints the schema of df (column names and their data types) and returns nothing, for example:

root
 |-- ID: long (nullable = true)
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
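To tie this back to the question: printSchema() only prints the schema and returns None, which is why len() cannot be applied to it. A short sketch with the same df:

print(type(df.printSchema()))   # prints the schema, then <class 'NoneType'>

print(df.columns)               # ['ID', 'TYPE', 'CODE']
print(len(df.columns))          # 3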
answered Sep 23 '22 by Rakesh Kumar