
Selecting only numeric/string column names from a Spark DF in PySpark

I have a Spark DataFrame in PySpark (2.1.0) and I want to get the names of only the numeric columns or only the string columns.

For example, this is the schema of my DF:

root
 |-- Gender: string (nullable = true)
 |-- SeniorCitizen: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: double (nullable = true)
 |-- Churn: string (nullable = true)

This is what I need:

num_cols = ['MonthlyCharges', 'TotalCharges']
str_cols = ['Gender', 'SeniorCitizen', 'Churn']

How can I do this?

Mara asked May 19 '17 09:05



1 Answer

PySpark provides a rich API for working with schema types. As @DanieldePaula mentioned, you can access each field's metadata through df.schema.fields.

Here is a different approach that checks each field's data type with isinstance:

from pyspark.sql.types import StringType, DoubleType

df = spark.createDataFrame([
  [1, 2.3, "t1"],
  [2, 5.3, "t2"],
  [3, 2.1, "t3"],
  [4, 1.5, "t4"]
], ["cola", "colb", "colc"])

# names of the string columns
str_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
# ['colc']

# names of the double columns
dbl_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
# ['colb']
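
Note that the DoubleType check above matches only double columns, so integer, float, or decimal columns would be missed. Below is a minimal sketch of one way to cover all numeric types, working from the (name, type) pairs that df.dtypes returns. The helper split_columns and the hand-written dtypes list are hypothetical stand-ins for illustration, not part of the PySpark API; the list mirrors the schema from the question so the snippet runs without a Spark session.

```python
# Type names as they appear in df.dtypes (Spark's short type names).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(dtype):
    """True for integral, floating-point, and decimal type names."""
    # Decimal columns show up as e.g. 'decimal(10,2)', so match by prefix.
    return dtype in NUMERIC_TYPES or dtype.startswith("decimal")


def split_columns(dtypes):
    """Split (name, type) pairs into numeric and string column names."""
    num_cols = [name for name, dtype in dtypes if is_numeric(dtype)]
    str_cols = [name for name, dtype in dtypes if dtype == "string"]
    return num_cols, str_cols


# Hand-written stand-in for df.dtypes on the schema from the question.
dtypes = [
    ("Gender", "string"),
    ("SeniorCitizen", "string"),
    ("MonthlyCharges", "double"),
    ("TotalCharges", "double"),
    ("Churn", "string"),
]

num_cols, str_cols = split_columns(dtypes)
print(num_cols)  # ['MonthlyCharges', 'TotalCharges']
print(str_cols)  # ['Gender', 'SeniorCitizen', 'Churn']
```

With a live DataFrame you would pass df.dtypes directly, e.g. split_columns(df.dtypes). An isinstance check against NumericType from pyspark.sql.types on df.schema.fields should give the same numeric split while staying in the style of the example above.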
abiratsis answered Oct 08 '22 04:10
