
PySpark - String matching to create new column

I have a dataframe like:

ID            Notes
2345          Checked by John
2398          Verified by Stacy
3983          Double Checked on 2/23/17 by Marsha

Let's say, for example, that there are only 3 employees to check for: John, Stacy, or Marsha. I'd like to make a new column like so:

ID            Notes                                  Employee
2345          Checked by John                        John
2398          Verified by Stacy                      Stacy
3983          Double Checked on 2/23/17 by Marsha    Marsha

Is regex or grep better here? What kind of function should I try? Thanks!

EDIT: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create a column for each employee, with a binary value? I.e.:

ID            Notes                                  John    Stacy    Marsha
2345          Checked by John                        1       0        0
2398          Verified by Stacy                      0       1        0
3983          Double Checked on 2/23/17 by Marsha    0       0        1
Ashley O asked Sep 25 '17 17:09


2 Answers

In short:

regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name that follows the word by and one or more spaces, wherever that occurs in the text column (col('Notes')).


In Detail:

Create a sample dataframe

data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+

Do the needed imports

from pyspark.sql.functions import regexp_extract, col 

Extract the employee name from the Notes column of df using regexp_extract(column_name, regex, group_number).

Here the regex '(.)(by)(\s+)(\w+)' means:

  • (.) - Any single character (except newline)
  • (by) - The literal word by in the text
  • (\s+) - One or more whitespace characters
  • (\w+) - One or more alphanumeric or underscore characters

group_number is 4 because (\w+) is the fourth capture group in the expression.
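To see how the four groups line up, the same pattern can be tried locally with Python's built-in re module. This is only a sketch: it assumes that Python's re and the Java regex engine used by Spark behave the same way for this simple pattern, and the variable names are illustrative.

```python
import re

# Same pattern as in the answer; a raw string keeps the backslashes intact.
pattern = re.compile(r'(.)(by)(\s+)(\w+)')

match = pattern.search('Double Checked on 2/23/17 by Marsha')

# Groups 1-4: the character before "by", the word "by",
# the whitespace after it, and the name.
groups = match.groups()
employee = match.group(4)

print(groups)
print(employee)
```

Only group 4 holds the name, which is why 4 is passed as the last argument to regexp_extract.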

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+


Note:

regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) is a cleaner version of the same expression, since only the name needs a capture group.
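One caveat worth checking: .by needs some character in front of by, and it does not require by to be a whole word. A variant with a word boundary, \bby\s+(\w+), should avoid both problems. The sketch below compares the two with Python's re module rather than Spark; the strict pattern, the extract helper, and the sample notes are assumptions made up for illustration and are not part of the question's data.

```python
import re

loose = re.compile(r'.by\s+(\w+)')     # the cleaner pattern from the note
strict = re.compile(r'\bby\s+(\w+)')   # assumed variant with a word boundary

def extract(pattern, note):
    """Return the captured name, or '' when nothing matches.

    regexp_extract also yields an empty string when there is no match.
    """
    m = pattern.search(note)
    return m.group(1) if m else ''

# Hypothetical notes, not taken from the question.
notes = ['Checked by John', 'by Stacy', 'Abby asked, then verified by Marsha']

loose_names = [extract(loose, n) for n in notes]
strict_names = [extract(strict, n) for n in notes]

print(loose_names)
print(strict_names)
```

With these notes the loose pattern misses a note that starts with by and picks up the word after Abby, while the word-boundary variant returns the name in all three cases.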

mrsrinivas answered Oct 09 '22 16:10


Brief

In its simplest form, and going by the example provided, this answer should suffice. The OP should post more samples if there are cases where the name is preceded by a word other than by.


Code

Regex

^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$ 

Replacement

\1\t\2\t\3 

Results

Input

2345          Checked by John
2398          Verified by Stacy
3983          Double Checked on 2/23/17 by Marsha

Output

2345	Checked by John	John
2398	Verified by Stacy	Stacy
3983	Double Checked on 2/23/17 by Marsha	Marsha

Note: The columns in the output above are separated by the tab character \t, so they may not look aligned to the naked eye. Pasting the output into an online regex tester and matching on \t should show where each column begins and ends.
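To reproduce the result locally, the substitution can be sketched with Python's re.sub. This assumes Python's regex flavour, where the MULTILINE flag lets ^ and $ match at every line; the variable names are illustrative.

```python
import re

text = (
    "2345          Checked by John\n"
    "2398          Verified by Stacy\n"
    "3983          Double Checked on 2/23/17 by Marsha"
)

# MULTILINE makes ^ and $ anchor at each line instead of the whole string.
pattern = re.compile(r'^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$', re.MULTILINE)

# \1, \2 and \3 insert the ID, the note and the name; \t inserts a tab.
result = pattern.sub(r'\1\t\2\t\3', text)

# Split on the tab to make the three columns visible.
rows = [line.split('\t') for line in result.splitlines()]
for row in rows:
    print(row)
```

Splitting each line on the tab shows the three columns explicitly, which is easier to read than the raw tab-separated output.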


Explanation

Regex

  • ^ Assert position at the beginning of the line
  • (\w+) Capture one or more word characters (a-zA-Z0-9_) into group 1
  • [ \t]* Match any number of spaces or tab characters ([ \t] can be replaced with \h in some regex flavours such as PCRE)
  • (.*\bby[ \t]+(\w+)[ \t]*.*) Capture the following into group 2
    • .* Match any character any number of times (except newline unless the s modifier is used)
    • \bby Match a word boundary \b, followed by by literally
    • [ \t]+ Match one or more spaces or tab characters
    • (\w+) Capture one or more word characters (a-zA-Z0-9_) into group 3
    • [ \t]* Match any number of spaces or tab characters
    • .* Match any character any number of times
  • $ Assert position at the end of the line

Replacement

  • \1 Inserts the text most recently matched by the 1st capturing group
  • \t Tab character
  • \2 Inserts the text most recently matched by the 2nd capturing group
  • \t Tab character
  • \3 Inserts the text most recently matched by the 3rd capturing group
ctwheels answered Oct 09 '22 15:10