pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

I'm facing a very strange issue with pyspark on macOS Sierra. My goal is to parse dates in ddMMMyyyy format (e.g. 31Dec1989), but I get errors. I'm running Spark 2.0.1, Python 2.7.10 and Java 1.8.0_101. I also tried Anaconda 4.2.0 (it ships with Python 2.7.12), but I get the same errors.

The same code, with the same Java version and Python 2.7.9, works without any error on Ubuntu Server 15.04.

The official documentation about spark.read.load() states:

dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.

The official Java documentation describes MMM as the right pattern to parse month names like Jan, Dec, etc., but when I use it I get a lot of errors starting with java.lang.IllegalArgumentException. The documentation states that LLL can be used too, but pyspark doesn't recognize it and throws pyspark.sql.utils.IllegalArgumentException: u'Illegal pattern component: LLL'.

I know of alternatives to dateFormat, but it is the fastest way to parse the data and the simplest to code. What am I missing here?

In order to run the following examples you simply have to place test.csv and test.py in the same directory, then run <spark-bin-directory>/spark-submit <working-directory>/test.py.

My test case using ddMMMyyyy format

I have a plain-text file named test.csv containing the following two lines:

col1
31Dec1989

and the code is the following:

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([StructField("column", DateType())])
df = spark.read.load("test.csv",
                     schema=struct,
                     format="csv",
                     sep=",",
                     header="true",
                     dateFormat="ddMMMyyyy",
                     mode="FAILFAST")
df.show()

I get errors. I also tried moving the month name before or after the day and year (e.g. 1989Dec31 with yyyyMMMdd), without success.

A working example using ddMMyyyy format

This example is identical to the previous one except for the date format. test.csv now contains:

col1
31121989

The following code prints the content of test.csv:

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([StructField("column", DateType())])
df = spark.read.load("test.csv",
                     schema=struct,
                     format="csv",
                     sep=",",
                     header="true",
                     dateFormat="ddMMyyyy",
                     mode="FAILFAST")
df.show()

The output is the following (omitting the various verbose lines):

+----------+
|    column|
+----------+
|1989-12-31|
+----------+

UPDATE1

I made a simple Java class that uses java.text.SimpleDateFormat:

import java.text.*;
import java.util.Date;

class testSimpleDateFormat 
{
    public static void main(String[] args) 
    {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd");
        String dateString = "1989Dec31";

        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        }
        catch(ParseException pe) {
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }       
    }
}

This code doesn't work in my environment and throws this error:

java.text.ParseException: Unparseable date: "1989Dec31"

but it works perfectly on another system (Ubuntu 15.04). This seems to be a Java issue, but I don't know how to solve it. I installed the latest available version of Java, and all of my software is up to date.
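
As a quick sanity check (a small sketch of my own, unrelated to Spark itself), printing the JVM's default locale on both machines should reveal whether the Mac defaults to a non-English locale, which would explain why month-name parsing fails only there:

import java.util.Locale;

class printDefaultLocale
{
    public static void main(String[] args)
    {
        // A SimpleDateFormat built without an explicit Locale uses this default,
        // so MMM (month-name) parsing depends on it.
        System.out.println(Locale.getDefault());
    }
}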

Any ideas?


UPDATE2

I've found out how to make it work in pure Java by specifying Locale.US:

import java.text.*;
import java.util.Date;
import java.util.*;

class HelloWorldApp 
{
    public static void main(String[] args) 
    {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd", Locale.US);
        String dateString = "1989Dec31";

        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        }
        catch(ParseException pe) {
            System.out.println(pe);
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }       
    }
}

Now the question becomes: how do I specify Java's Locale in pyspark?

asked Oct 12 '16 by pietrop


2 Answers

Probably worth noting that this was resolved on the Spark mailing list on 24 Oct 2016. Per the original poster:

This worked without setting other options: spark/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Duser.language=en" test.py

The issue was reported as SPARK-18076 (Fix default Locale used in DateFormat, NumberFormat to Locale.US) against Spark 2.0.1 and was resolved in Spark 2.1.0.

Additionally, while the above workaround (passing in --conf "spark.driver.extraJavaOptions=-Duser.language=en") is no longer needed on Spark 2.1.0 for the specific issue the submitter raised, a notable side effect is that Spark 2.1.0 users can no longer pass in something like --conf "spark.driver.extraJavaOptions=-Duser.language=fr" to parse a non-English date, e.g. "31mai1989".

In fact, as of Spark 2.1.0, when using spark.read to load a CSV, I think it is no longer possible to use the dateFormat option to parse a date such as "31mai1989", even if your default locale is French. I went as far as changing the default region and language in my OS to French and passing in just about every locale-setting permutation I could think of, i.e.

JAVA_OPTS="-Duser.language=fr -Duser.country=FR -Duser.region=FR" \
JAVA_ARGS="-Duser.language=fr -Duser.country=FR -Duser.region=FR" \
LC_ALL=fr_FR.UTF-8 \
spark-submit \
--conf "spark.driver.extraJavaOptions=-Duser.country=FR -Duser.language=fr -Duser.region=FR" \
--conf "spark.executor.extraJavaOptions=-Duser.country=FR -Duser.language=fr -Duser.region=FR" \
test.py

to no avail, resulting in

java.lang.IllegalArgumentException
    at java.sql.Date.valueOf(Date.java:143)
    at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)

But again, this only affects parsing non-English dates in Spark 2.1.0.
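
If you do need non-English month names under Spark 2.1.0, one possible workaround (a hedged sketch of mine, not something from the Spark docs) is to skip dateFormat, read the column as a string and parse it with a Python UDF, where the locale can be set explicitly. It assumes a SparkSession named spark, the question's test.csv, and that the fr_FR.UTF-8 locale is installed on the driver and workers:

import locale
from datetime import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

def parse_fr(s):
    # Assumes fr_FR.UTF-8 is available; %b then matches French month names like "mai".
    locale.setlocale(locale.LC_TIME, "fr_FR.UTF-8")
    return datetime.strptime(s, "%d%b%Y").date()

parse_fr_udf = udf(parse_fr, DateType())

df = spark.read.csv("test.csv", header=True)          # col1 stays a plain string
df.select(parse_fr_udf("col1").alias("column")).show()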

answered by eddies


You have already identified the issue as one of locale in Spark's JVM. You can check the default country and language settings used by your Spark JVM by going to http://localhost:4040/environment/ after launching the Spark shell. Search for "user.language" and "user.country" under the System Properties section. They should be US and en.
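
If you prefer to check this from code rather than the web UI, here is a small sketch; it relies on the internal py4j gateway that pyspark exposes via sparkContext._jvm, so treat it as a debugging aid only:

# Print the JVM's default locale from a pyspark session named `spark`.
jvm = spark.sparkContext._jvm
print(jvm.java.util.Locale.getDefault().toString())   # expect something like en_US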

You can change them like this, if needed.

Option 1: Edit the spark-defaults.conf file in the {SPARK_HOME}/conf folder. Add the following settings:

spark.executor.extraJavaOptions  -Duser.country=US -Duser.language=en
spark.driver.extraJavaOptions -Duser.country=US -Duser.language=en

Option 2: Pass the options to pyspark as command-line options:

  $ pyspark --conf "spark.driver.extraJavaOptions=-Duser.country=US -Duser.language=en" \
            --conf "spark.executor.extraJavaOptions=-Duser.country=US -Duser.language=en"

Option 3: Change the language and region in macOS. See, for example, the question "What settings in Mac OS X affect the `Locale` and `Calendar` inside Java?"

P.S. - I have only verified that Option 1 works. I have not tried out the other two. More details about Spark configuration are here: http://spark.apache.org/docs/latest/configuration.html#runtime-environment

answered by Shankar P S