
create substring column in spark dataframe

I want to take a JSON file and map it so that one of the columns is a substring of another. For example, to take the left table and produce the right table:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  hello  |

I can do this using Spark SQL syntax, but how can it be done using the built-in functions?
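(For reference, one way the Spark SQL route mentioned above could look is sketched below; the view name t and the df/spark handles are assumptions for illustration.)

// Rough sketch of the Spark SQL approach, assuming df holds the JSON data
// and spark is the active SparkSession
df.createOrReplaceTempView("t")
spark.sql("SELECT a, substring_index(a, ',', 1) AS b FROM t").show()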

asked Mar 15 '17 by J Smith


2 Answers

The following statement can be used:

import org.apache.spark.sql.functions._

// substring_index keeps everything before the first occurrence of the delimiter
dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))
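Applied to data like that in the question, this keeps everything before the first comma. A quick check (the sample DataFrame below is just for illustration, and import spark.implicits._ is assumed):

val df = Seq("hello, world").toDF("a")
df.select(col("a"), substring_index(col("a"), ",", 1).as("b")).show()

+------------+-----+
|           a|    b|
+------------+-----+
|hello, world|hello|
+------------+-----+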

answered Sep 20 '22 by pasha701


Suppose you have the following DataFrame:

import spark.implicits._
import org.apache.spark.sql.functions._

var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")

+------+---+
|     a|  b|
+------+---+
|foobar|foo|
+------+---+

You can derive a new column from the first column as follows:

df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))

+------+---+---+
|     a|  b|  c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+
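Note that substring uses 1-based positions, and a length that runs past the end of the string is simply truncated, which is why (4, 6) on "foobar" still yields "bar". Under the same assumptions, an equivalent way to add the derived column is withColumn:

df = df.withColumn("c", substring(col("a"), 4, 3))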
answered Sep 21 '22 by Balázs Fehér