I have a column in spark dataframe which has text.
I want to extract all the words which start with a special character '@'
and I am using regexp_extract
from each row in that text column. If the text contains multiple words starting with '@'
it just returns the first one.
I am looking for extracting multiple words which match my pattern in Spark.
data_frame.withColumn("Names", regexp_extract($"text","(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)",1).show
Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking
Sample output: @always_nidhi,@YouTube
You can create a udf function in spark as below:
import java.util.regex.Pattern
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.lit
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
println("the column value is" + job.toString())
val pattern = Pattern.compile(exp.toString)
val m = pattern.matcher(job.toString)
var result = Seq[String]()
while (m.find) {
val temp =
result =result:+m.group(groupIdx)
}
result.mkString(",")
})
And then call the udf as below:
data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("@\\w+"), lit(0))).show()
Above you give you output as below:
+--------------------+
| Names|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+
I have used regex, as per the output you have posted in the question. You can modify it to suite your needs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With