
How to split columns into two sets per type?

I have a CSV input file, which I read using the following:

val rawdata = spark.
  read.
  format("csv").
  option("header", true).
  option("inferSchema", true).
  load(filename)

This neatly reads the data and builds the schema.

The next step is to split the columns into String and Integer columns. How?

If the following is the schema of my dataset...

scala> rawdata.printSchema
root
 |-- ID: integer (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Dept: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)

I'd like to split this into two variables (StringCols, IntCols) where:

  • StringCols should have "First Name","Last Name","Dept"
  • IntCols should have "ID","Age","DailyRate","DistanceFromHome"

This is what I have tried:

val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)

Now I would like to loop over types, find all the StringType entries, and look up the corresponding column names in names; similarly for IntegerType.
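The two arrays can be combined directly by zipping them, with no explicit loop. Here is a minimal, Spark-free sketch of that idea, using plain type-name strings (mirroring the printSchema output above) in place of Spark's DataType values:

```scala
// Names and type strings mirroring the schema printed above.
val names = Seq("ID", "First Name", "Last Name", "Age",
                "DailyRate", "Dept", "DistanceFromHome")
val types = Seq("integer", "string", "string", "integer",
                "integer", "string", "integer")

// Zip name with type, then keep only the names whose type matches.
val stringCols = names.zip(types).collect { case (n, "string")  => n }
val intCols    = names.zip(types).collect { case (n, "integer") => n }
```

With the real schema you would match on StringType / IntegerType instead of the strings used here.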

Asked Mar 02 '26 by Balaji Krishnan

1 Answer

Here you go: you can filter the columns by type using the underlying schema and its dataType field.

import org.apache.spark.sql.types.{IntegerType, StringType}

// df is the DataFrame read above (rawdata in the question)
val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols    = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

// select takes one column name plus varargs, hence head / tail : _*
val dfOfString = df.select(stringCols.head, stringCols.tail : _*)
val dfOfInt    = df.select(intCols.head, intCols.tail : _*)
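If only string and integer columns exist, as in this schema, the split can also be done in a single pass with partition instead of two filter traversals. A sketch of that variant, using a small stand-in for Spark's StructField so it runs without Spark:

```scala
// Stand-in for org.apache.spark.sql.types.StructField.
case class Field(name: String, dataType: String)

val schema = Seq(
  Field("ID", "integer"), Field("First Name", "string"),
  Field("Last Name", "string"), Field("Age", "integer"),
  Field("DailyRate", "integer"), Field("Dept", "string"),
  Field("DistanceFromHome", "integer")
)

// partition traverses once and returns both groups at the same time.
val (strFields, intFields) = schema.partition(_.dataType == "string")
val stringCols = strFields.map(_.name)
val intCols    = intFields.map(_.name)
```

On a real Spark schema the predicate would be `_.dataType == StringType`; note that partition puts every non-string column in the second group, so this only matches the filter version when no other types are present.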
Answered Mar 03 '26 by eliasah

