
Extract words from a string column in spark dataframe

I have a column in a Spark DataFrame that contains text.

I want to extract all the words that start with the special character '@', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '@', it returns only the first one.

I am looking for a way to extract all the words that match my pattern in Spark.

data_frame.withColumn("Names", regexp_extract($"text", "(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z]+[A-Za-z0-9_]+)", 1)).show

Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking

Sample output: @always_nidhi,@YouTube
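
The behaviour described above can be reproduced outside Spark. The following is a minimal sketch in plain Scala, assuming the simplified pattern `@\w+` rather than the longer one in the question; the object and method names are hypothetical and chosen only for illustration. A single `find()` call stops at the first match, which matches the one-result-per-row behaviour seen with regexp_extract, while calling `find()` in a loop collects every match.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Hypothetical names, for illustration only.
object FirstVsAllMatches {
  val sample = "@always_nidhi @YouTube no i dnt understand bt i loved the music"

  // Simplified pattern (an assumption): '@' followed by word characters.
  private val pattern = Pattern.compile("@\\w+")

  // One find() call: only the first match is returned.
  def firstMatch(text: String): Option[String] = {
    val m = pattern.matcher(text)
    if (m.find()) Some(m.group(0)) else None
  }

  // Repeated find() calls: every match is collected in order.
  def allMatches(text: String): List[String] = {
    val m = pattern.matcher(text)
    val found = ListBuffer.empty[String]
    while (m.find()) {
      found += m.group(0)
    }
    found.toList
  }
}
```

On the sample input, `firstMatch` yields only `@always_nidhi`, whereas `allMatches` yields both `@always_nidhi` and `@YouTube`.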

Sree51 asked Dec 26 '17 17:12



1 Answer

You can create a UDF in Spark as below:

import java.util.regex.Pattern
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, udf}

// Returns every match of `exp` found in `job`, joined with commas.
// groupIdx selects the capture group (0 = the whole match).
val regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  val m = Pattern.compile(exp).matcher(job)
  var result = Seq[String]()
  while (m.find()) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})

And then call the udf as below:

data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("@\\w+"), lit(0))).show()

This gives the following output:

+--------------------+
|               Names|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+

I chose the regex based on the output you posted in the question. You can modify it to suit your needs.
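
One thing to watch for is rows where the text column is null, since calling the matcher on a null string would throw. The following is a minimal sketch of a null-safe variant of the extraction logic in plain Scala, with no Spark dependency; `MentionExtractor` and `extractAll` are hypothetical names, and the default pattern is the `@\w+` used above.

```scala
import scala.util.matching.Regex

// Hypothetical helper, for illustration only.
object MentionExtractor {
  // Same pattern as in the answer: '@' followed by word characters.
  val mentionPattern: Regex = "@\\w+".r

  // Returns every match in order of appearance.
  // A null input (for example a SQL NULL) yields an empty Seq instead of throwing.
  def extractAll(text: String, pattern: Regex = mentionPattern): Seq[String] =
    Option(text).map(t => pattern.findAllIn(t).toList).getOrElse(Nil)
}
```

In Spark this could be wrapped as `udf((s: String) => MentionExtractor.extractAll(s))`, which would produce an array column rather than a comma-joined string; append `.mkString(",")` inside the lambda if the joined form is preferred. Newer Spark releases (3.1 and later, as far as I know) also ship a built-in SQL function `regexp_extract_all`, which may make a UDF unnecessary.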

Amit Kumar answered Sep 22 '22 22:09
