I have a Spark DataFrame in PySpark (2.1.0) and I am looking to get the names of the numeric columns only or the string columns only.
For example, this is the Schema of my DF:
root
|-- Gender: string (nullable = true)
|-- SeniorCitizen: string (nullable = true)
|-- MonthlyCharges: double (nullable = true)
|-- TotalCharges: double (nullable = true)
|-- Churn: string (nullable = true)
This is what I need:
num_cols = ['MonthlyCharges', 'TotalCharges']
str_cols = ['Gender', 'SeniorCitizen', 'Churn']
How can I do this?
PySpark provides a rich API for working with schema types. As @DanieldePaula mentioned, you can access each field's metadata through df.schema.fields.
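For a quick look, df.dtypes gives you (column name, type string) pairs you can filter directly; here is a small sketch of that idea using your schema:
# df.dtypes is a list of (name, type-string) tuples, e.g.
# [('Gender', 'string'), ('MonthlyCharges', 'double'), ...]
str_cols = [name for name, dtype in df.dtypes if dtype == 'string']
num_cols = [name for name, dtype in df.dtypes if dtype == 'double']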
Here is a different approach, based on checking each field's dataType against the type classes:
from pyspark.sql.types import StringType, DoubleType

df = spark.createDataFrame([
    [1, 2.3, "t1"],
    [2, 5.3, "t2"],
    [3, 2.1, "t3"],
    [4, 1.5, "t4"]
], ["cola", "colb", "colc"])

# string columns
str_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
# ['colc']

# double columns
dbl_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
# ['colb']
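Since you asked for numeric columns in general, not just doubles, a variation worth noting: PySpark's numeric types all subclass NumericType, so you can match against that base class instead. A sketch using the same df as above:
from pyspark.sql.types import NumericType

# NumericType is the common base class of IntegerType, LongType,
# DoubleType, DecimalType, etc., so this picks up every numeric column
num_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
# ['cola', 'colb']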