I have a CSV input file, which I read using the following:
val rawdata = spark.
read.
format("csv").
option("header", true).
option("inferSchema", true).
load(filename)
This neatly reads the data and builds the schema.
The next step is to split the columns into String and Integer columns. How?
If the following is the schema of my dataset...
scala> rawdata.printSchema
root
|-- ID: integer (nullable = true)
|-- First Name: string (nullable = true)
|-- Last Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- DailyRate: integer (nullable = true)
|-- Dept: string (nullable = true)
|-- DistanceFromHome: integer (nullable = true)
I'd like to split this into two variables (StringCols, IntCols), where StringCols holds the names of the string columns and IntCols the names of the integer columns.
This is what I have tried:
val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)
Now I would like to loop over types, find every StringType, and look up the corresponding column name in names; similarly for IntegerType.
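The two arrays you already have can be combined in one pass: zip each name with its type, then partition the pairs by type. A sketch along those lines (assuming the rawdata schema shown above):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType}

val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(_.dataType)

// Pair each column name with its type, then split on StringType.
val (stringPairs, otherPairs) = names.zip(types).partition { case (_, t) => t == StringType }

val StringCols = stringPairs.map(_._1)
// otherPairs may contain non-integer types too, so filter again.
val IntCols = otherPairs.filter(_._2 == IntegerType).map(_._1)
```

This keeps your existing names/types variables, though filtering the schema directly (as in the answer below) is more concise.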
You can filter your columns by type using the underlying schema and its dataType:
import org.apache.spark.sql.types.{IntegerType, StringType}

// df is your DataFrame (rawdata in the question)
val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

val dfOfString = df.select(stringCols.head, stringCols.tail: _*)
val dfOfInt = df.select(intCols.head, intCols.tail: _*)
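One caveat: `select(cols.head, cols.tail: _*)` throws if the list of matching columns is empty. Mapping the names to `Column` objects avoids that edge case; a sketch, assuming the same stringCols/intCols lists as above:

```scala
import org.apache.spark.sql.functions.col

// An empty Seq simply selects zero columns instead of crashing on .head
val dfOfString = df.select(stringCols.map(col): _*)
val dfOfInt = df.select(intCols.map(col): _*)
```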