
PySpark: split a column into multiple columns without pandas

My question is how to split a column into multiple columns. I don't know why df.toPandas() does not work.

For example, I would like to change 'df_test' to 'df_test2'. I saw many examples using the pandas module. Is there another way? Thank you in advance.

df_test = sqlContext.createDataFrame([
(1, '14-Jul-15'),
(2, '14-Jun-15'),
(3, '11-Oct-15'),
], ('id', 'date'))

df_test2

id     day    month    year
1       14     Jul      15
2       14     Jun      15
3       11     Oct      15
nathanlim45 asked Dec 18 '15 19:12


2 Answers

Spark >= 2.2

You can skip unix_timestamp and the cast, and pass the format string directly to to_date or to_timestamp:

from pyspark.sql.functions import to_date, to_timestamp

df_test.withColumn("date", to_date("date", "dd-MMM-yy")).show()
## +---+----------+
## | id|      date|
## +---+----------+
## |  1|2015-07-14|
## |  2|2015-06-14|
## |  3|2015-10-11|
## +---+----------+


df_test.withColumn("date", to_timestamp("date", "dd-MMM-yy")).show()
## +---+-------------------+
## | id|               date|
## +---+-------------------+
## |  1|2015-07-14 00:00:00|
## |  2|2015-06-14 00:00:00|
## |  3|2015-10-11 00:00:00|
## +---+-------------------+

Then apply the other datetime functions shown below (dayofmonth, date_format, year).
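For instance, the parsed date can be fed straight into dayofmonth and date_format to get the three columns from the question. This is only a minimal sketch, assuming Spark >= 2.2: the helper name split_date_columns and the constant DATE_FORMAT are hypothetical, and pyspark is imported inside the function only so that the snippet can be loaded without a Spark installation. It uses date_format with "yy" because the question asks for a two-digit year; year() would return 2015 instead.

```python
# Format of the input strings, e.g. '14-Jul-15' (same pattern as above).
DATE_FORMAT = "dd-MMM-yy"


def split_date_columns(df, column="date"):
    """Hypothetical helper: replace `column` with day, month and year columns."""
    from pyspark.sql.functions import col, date_format, dayofmonth, to_date

    parsed = to_date(col(column), DATE_FORMAT)
    return (df
        .withColumn("day", dayofmonth(parsed))            # integer day of month
        .withColumn("month", date_format(parsed, "MMM"))  # abbreviated month name
        .withColumn("year", date_format(parsed, "yy"))    # two-digit year as string
        .drop(column))
```

It would be called as df_test2 = split_date_columns(df_test).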

Spark < 2.2

It is not possible to derive multiple top-level columns in a single access. You can, however, return a struct or a collection type from a UDF like this:

from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import udf, col

schema = StructType([
  StructField("day", StringType(), True),
  StructField("month", StringType(), True),
  StructField("year", StringType(), True)
])

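def split_date_(s):
    try:
        # Expect exactly three dash-separated parts: day, month, year
        d, m, y = s.split("-")
        return d, m, y
    except (AttributeError, ValueError):
        # None input or an unexpected number of parts: yield a null struct
        return None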

split_date = udf(split_date_, schema)

transformed = df_test.withColumn("date", split_date(col("date")))
transformed.printSchema()

## root
##  |-- id: long (nullable = true)
##  |-- date: struct (nullable = true)
##  |    |-- day: string (nullable = true)
##  |    |-- month: string (nullable = true)
##  |    |-- year: string (nullable = true)

This approach is not only quite verbose in PySpark but also expensive, because every row has to pass through the Python interpreter.
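The struct column produced above can still be expanded into top-level columns with a follow-up select. The following is a minimal sketch, assuming the transformed DataFrame and the field names from the schema above; flatten_date_struct and DATE_FIELDS are hypothetical names, and pyspark is imported inside the function only so the snippet loads without Spark.

```python
# Field names of the struct returned by the UDF (taken from the schema above).
DATE_FIELDS = ("day", "month", "year")


def flatten_date_struct(df, struct_column="date"):
    """Hypothetical helper: promote each struct field to a top-level column."""
    from pyspark.sql.functions import col

    # Keep every column except the struct itself, then add one column per field.
    other_columns = [col(c) for c in df.columns if c != struct_column]
    fields = [col(struct_column + "." + f).alias(f) for f in DATE_FIELDS]
    return df.select(other_columns + fields)
```

Calling flatten_date_struct(transformed) should give the id, day, month and year layout, but the UDF cost remains.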

For date-based transformations you can simply use built-in functions:

from pyspark.sql.functions import col, unix_timestamp, dayofmonth, year, date_format

transformed = (df_test
    .withColumn("ts",
        unix_timestamp(col("date"), "dd-MMM-yy").cast("timestamp"))
    .withColumn("day", dayofmonth(col("ts")).cast("string"))
    .withColumn("month", date_format(col("ts"), "MMM"))
    .withColumn("year", year(col("ts")).cast("string"))
    .drop("ts"))

Similarly, you could use regexp_extract to split the date string.
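A regexp_extract version could look roughly like the sketch below. The pattern, the helper name split_date_with_regex and the constant DATE_PATTERN are assumptions made for illustration, not part of the original answer; pyspark is imported inside the function only so that the pattern itself can be checked without Spark.

```python
import re

# One capture group per part of a string such as '14-Jul-15'.
# This syntax is valid for both Java regex (used by Spark) and Python's re.
DATE_PATTERN = r"^(\d+)-(\w+)-(\d+)$"


def split_date_with_regex(df, column="date"):
    """Hypothetical helper: extract day, month and year with regexp_extract."""
    from pyspark.sql.functions import col, regexp_extract

    return (df
        .withColumn("day", regexp_extract(col(column), DATE_PATTERN, 1))
        .withColumn("month", regexp_extract(col(column), DATE_PATTERN, 2))
        .withColumn("year", regexp_extract(col(column), DATE_PATTERN, 3))
        .drop(column))


# Quick check of the pattern itself, without Spark:
parts = re.match(DATE_PATTERN, "14-Jul-15").groups()
```

Unlike the UDF, regexp_extract returns an empty string rather than null when a row does not match the pattern, so malformed dates may need separate handling.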

See also: Derive multiple columns from a single column in a Spark DataFrame

Note:

If you use a version that is not patched against SPARK-11724, this will require a correction after unix_timestamp(...) and before cast("timestamp").

zero323 answered Sep 22 '22 00:09


The solution here is to use the pyspark.sql.functions.split() function.

df = sqlContext.createDataFrame([
(1, '14-Jul-15'),
(2, '14-Jun-15'),
(3, '11-Oct-15'),
], ('id', 'date'))

from pyspark.sql.functions import split

# split() returns an array column; getItem(i) selects its i-th element
split_col = split(df['date'], '-')
df = df.withColumn('day', split_col.getItem(0))
df = df.withColumn('month', split_col.getItem(1))
df = df.withColumn('year', split_col.getItem(2))
df = df.drop("date")
Drashti Dobariya answered Sep 20 '22 00:09