
Pyspark Split Columns

Tags:

pyspark

from pyspark.sql import Row, functions as F

row = Row("UK_1", "UK_2", "Date", "Cat", "Combined")
agg = 'Cat'
tdf = sc.parallelize([
    row(1, 1, '12/10/2016', 'A', 'Water^World'),
    row(1, 2, None, 'A', 'Sea^Born'),
    row(2, 1, '14/10/2016', 'B', 'Germ^Any'),
    row(3, 3, '!~2016/2/276', 'B', 'Fin^Land'),
    row(None, 1, '26/09/2016', 'A', 'South^Korea'),
    row(1, 1, '12/10/2016', 'A', 'North^America'),
    row(1, 2, None, 'A', 'South^America'),
    row(2, 1, '14/10/2016', 'B', 'New^Zealand'),
    row(None, None, '!~2016/2/276', 'B', 'South^Africa'),
    row(None, 1, '26/09/2016', 'A', 'Saudi^Arabia'),
]).toDF()
cols = F.split(tdf['Combined'], '^')
tdf = tdf.withColumn('column1', cols.getItem(0))
tdf = tdf.withColumn('column2', cols.getItem(1))
tdf.show(truncate=False)

Above is my sample code.

For some reason it is not splitting the Combined column on the ^ character.

Any advice?

Tronald Dump asked Oct 19 '17 17:10



1 Answer

The pattern argument of split is a regular expression (see the documentation for split), and in a regex ^ is an anchor that matches the beginning of the string rather than the character itself. To match a literal ^, you need to escape it:

cols = F.split(tdf['Combined'], r'\^')
tdf = tdf.withColumn('column1', cols.getItem(0))
tdf = tdf.withColumn('column2', cols.getItem(1))
tdf.show(truncate=False)

+----+----+------------+---+-------------+-------+-------+
|UK_1|UK_2|Date        |Cat|Combined     |column1|column2|
+----+----+------------+---+-------------+-------+-------+
|1   |1   |12/10/2016  |A  |Water^World  |Water  |World  |
|1   |2   |null        |A  |Sea^Born     |Sea    |Born   |
|2   |1   |14/10/2016  |B  |Germ^Any     |Germ   |Any    |
|3   |3   |!~2016/2/276|B  |Fin^Land     |Fin    |Land   |
|null|1   |26/09/2016  |A  |South^Korea  |South  |Korea  |
|1   |1   |12/10/2016  |A  |North^America|North  |America|
|1   |2   |null        |A  |South^America|South  |America|
|2   |1   |14/10/2016  |B  |New^Zealand  |New    |Zealand|
|null|null|!~2016/2/276|B  |South^Africa |South  |Africa |
|null|1   |26/09/2016  |A  |Saudi^Arabia |Saudi  |Arabia |
+----+----+------------+---+-------------+-------+-------+
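The escaping rule can also be checked without a Spark session. The following is a minimal sketch that uses Python's built-in re module, not Spark itself. Spark's split uses Java regular expressions, so treat this only as an illustration of how the caret is interpreted; the two engines handle empty matches differently, so the output of an unescaped split is deliberately not compared here. The sample values are taken from the Combined column above, and the variable names are illustrative only.

```python
import re

# A few values from the Combined column in the question.
combined = ["Water^World", "Sea^Born", "South^Korea"]

# Unescaped, '^' is an anchor: it matches the empty position at the
# start of the string, not the caret character in the middle.
anchor_match = re.search("^", "Water^World")
print(anchor_match.span())  # (0, 0)

# Escaped caret: matches the literal '^' character.
escaped = r"\^"
# A character class containing an escaped caret is an equivalent spelling.
char_class = r"[\^]"
# re.escape builds an escaped pattern from an arbitrary delimiter string.
generated = re.escape("^")

# Split every sample value on the literal caret.
split_rows = [re.split(escaped, value) for value in combined]
print(split_rows)
```

The same escaped pattern, r'\^', is what gets passed to F.split in the answer above. The character-class spelling is a common alternative when backslash escaping becomes hard to read.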
Psidom answered Oct 21 '22 14:10