Extract words from a string column in spark dataframe

Question

I have a column in a Spark DataFrame that contains text.

I want to extract all the words that start with the special character '@', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '@', it returns only the first one.

I am looking for a way to extract every word that matches my pattern in Spark.

data_frame.withColumn("Names", regexp_extract($"text", """(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)""", 1)).show

Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking

Sample output: @always_nidhi,@YouTube

Amit Kumar · Accepted Answer

You can create a UDF in Spark as below:

import java.util.regex.Pattern
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.lit

// Returns every match of `exp` in `job`, joined with commas.
val regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  if (job == null) null // pass null rows through instead of throwing
  else {
    val m = Pattern.compile(exp).matcher(job)
    var result = Seq[String]()
    while (m.find) {
      result = result :+ m.group(groupIdx)
    }
    result.mkString(",")
  }
})

And then call the udf as below:

data_frame.withColumn("Names", regexp_extractAll(data_frame("text"), lit("""@\w+"""), lit(0))).show()

This gives the following output:

+--------------------+
|               Names|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+
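To check the pattern outside Spark first, the same matching loop can be run in plain Scala. This is only a sketch: the object name MentionCheck and the helper extractAll are made up for illustration, and the sample string is a shortened copy of the input from the question.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper object, not part of Spark.
object MentionCheck {
  // Collects every match of `exp` in `text`; the same loop as the UDF, minus Spark.
  def extractAll(text: String, exp: String, groupIdx: Int): Seq[String] = {
    val m = Pattern.compile(exp).matcher(text)
    val found = ArrayBuffer.empty[String]
    while (m.find()) found += m.group(groupIdx)
    found.toSeq
  }

  def main(args: Array[String]): Unit = {
    val sample = "@always_nidhi @YouTube no i dnt understand bt i loved the music"
    // prints: @always_nidhi,@YouTube
    println(extractAll(sample, "@\\w+", 0).mkString(","))
  }
}
```

Group index 0 returns the whole match, including the '@'. A pattern with a capturing group, such as "@(\\w+)" with index 1, would return the names without it.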

I chose the regex to match the output you posted in the question. You can modify it to suit your needs.
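If you are on Spark 3.1 or later, a UDF may not be needed, since the SQL function regexp_extract_all is available there and returns an array of all matches. The sketch below assumes that version and calls the function through expr. The object name BuiltInExtract, the helper withNames, and the local session are illustrative only. It uses the character class [A-Za-z0-9_] instead of \w so that no backslash has to be escaped inside the SQL string literal.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical wrapper object; assumes Spark 3.1+ for regexp_extract_all.
object BuiltInExtract {
  // Adds a "Names" column with all '@' words from the "text" column, comma-joined.
  def withNames(df: DataFrame): DataFrame =
    df.withColumn(
      "Names",
      expr("array_join(regexp_extract_all(text, '@[A-Za-z0-9_]+', 0), ',')")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("extract-mentions")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("@always_nidhi @YouTube no i dnt understand").toDF("text")
    withNames(df).show(false)
    spark.stop()
  }
}
```

Dropping the array_join call keeps the result as an array column, which is usually easier to explode or filter than a comma-separated string.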


Tags: regex, scala, apache-spark, apache-spark-sql

Sree51
