Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Spark: how to transform Data Frame column with regex to another Data Frame?

I have Spark Data Frame 1 of several columns: (user_uuid, url, date_visit)

I want to transform this DF1 to Data Frame 2 with form : (user_uuid, domain, date_visit)

What I wanted to use is regular expression to detect domain and apply it to DF1 val regexpr = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)""".r

Could you please help me composing code to transform Data Frames in Scala? I am completely new to Spark and Scala and syntax is hard. Thanks!

like image 407
snowindy Avatar asked Aug 20 '15 15:08

snowindy


1 Answers

Spark >= 1.5:

You can use regexp_extract function:

import org.apache.spark.sql.functions.regexp_extract

val patter: String = ??? 
val groupIdx: Int = ???

df.withColumn("domain", regexp_extract(url, pattern, groupIdx))

Spark < 1.5.0

Define an UDF

val pattern: scala.util.matching.Regex = ???

def getFirst(pattern: scala.util.matching.Regex) = udf(
  (url: String) => pattern.findFirstIn(url) match { 
    case Some(domain) => domain
    case None => "unknown"
  }
)

Use defined UDF:

df.select(
  $"user_uuid",
  getFirst(pattern)($"url").alias("domain"),
  $"date_visit"
)

or register temp table:

df.registerTempTable("df")

sqlContext.sql(s"""
  SELECT user_uuid, regexp_extract(url, '$pattern', $group_idx) AS domain, date_visit FROM df""")

Replace pattern with a valid Java regexp and group_id with an index of the group.

like image 189
zero323 Avatar answered Sep 20 '22 11:09

zero323