I'm facing a very strange issue with pyspark on macOS Sierra. My goal is to parse dates in ddMMMyyyy format (e.g. 31Dec1989), but I get errors. I run Spark 2.0.1, Python 2.7.10 and Java 1.8.0_101. I also tried Anaconda 4.2.0 (it ships with Python 2.7.12), but I get errors there too.
The same code on Ubuntu Server 15.04 with the same Java version and Python 2.7.9 works without any error.
The official documentation about spark.read.load() states:
dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.
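For reference, that default means a value already written as yyyy-MM-dd loads with no dateFormat option at all. A minimal sketch of that case, assuming a hypothetical one-column file default.csv whose single data row is 1989-12-31:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DateType

spark = SparkSession.builder.appName("Default date format").getOrCreate()

# default.csv (hypothetical) contains:
# col1
# 1989-12-31
struct = StructType([StructField("column", DateType())])
df = spark.read.load("default.csv", format="csv", header="true", schema=struct)
df.show()  # prints a single row: 1989-12-31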
The official Java documentation describes MMM as the right pattern for parsing month names like Jan, Dec, etc., but with it Spark throws a lot of errors starting with java.lang.IllegalArgumentException.
The documentation states that LLL can be used too, but pyspark doesn't recognize it and throws pyspark.sql.utils.IllegalArgumentException: u'Illegal pattern component: LLL'.
I know of other solutions besides dateFormat, but this is the fastest way to parse the data and the simplest to code. What am I missing here?
To run the following examples, simply place test.csv and test.py in the same directory, then run <spark-bin-directory>/spark-submit <working-directory>/test.py.
ddMMMyyyy format
I have a plain-text file named test.csv containing the following two lines:
col1
31Dec1989
and the code is the following:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName("My app") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
struct = StructType([StructField("column", DateType())])
df = spark.read.load( "test.csv", \
schema=struct, \
format="csv", \
sep=",", \
header="true", \
dateFormat="ddMMMyyyy", \
mode="FAILFAST")
df.show()
I get errors. I also tried moving the month name before or after the day and year (e.g. 1989Dec31 with yyyyMMMdd) without success.
ddMMyyyy format
This example is identical to the previous one except for the date format. test.csv now contains:
col1
31121989
The following code prints the content of test.csv:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName("My app") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
struct = StructType([StructField("column", DateType())])
df = spark.read.load( "test.csv", \
schema=struct, \
format="csv", \
sep=",", \
header="true", \
dateFormat="ddMMyyyy", \
mode="FAILFAST")
df.show()
The output is the following (I omit the various verbose lines):
+----------+
| column|
+----------+
|1989-12-31|
+----------+
UPDATE1
I made a simple Java class that uses java.text.SimpleDateFormat:
import java.text.*;
import java.util.Date;
class testSimpleDateFormat
{
public static void main(String[] args)
{
SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd");
String dateString = "1989Dec31";
try {
Date parsed = format.parse(dateString);
System.out.println(parsed.toString());
}
catch(ParseException pe) {
System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
}
}
}
This code doesn't work in my environment and throws this error:
java.text.ParseException: Unparseable date: "1989Dec31"
but works perfectly on another system (Ubuntu 15.04). This seems to be a Java issue, but I don't know how to solve it. I installed the latest available version of Java and all of my software is up to date.
Any ideas?
UPDATE2
I've found how to make it work in pure Java by specifying Locale.US:
import java.text.*;
import java.util.Date;
import java.util.*;
class HelloWorldApp
{
public static void main(String[] args)
{
SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd", Locale.US);
String dateString = "1989Dec31";
try {
Date parsed = format.parse(dateString);
System.out.println(parsed.toString());
}
catch(ParseException pe) {
System.out.println(pe);
System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
}
}
}
Now, the question becomes: how do I specify Java's Locale in pyspark?
Probably worth noting that this was resolved on the Spark mailing list on 24 Oct 2016. Per the original poster:
This worked without setting other options:
spark/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Duser.language=en" test.py
The issue was reported as SPARK-18076 (Fix default Locale used in DateFormat, NumberFormat to Locale.US) against Spark 2.0.1 and was resolved in Spark 2.1.0.
Additionally, while the above workaround (passing in --conf "spark.driver.extraJavaOptions=-Duser.language=en") for the specific issue the submitter raised is no longer needed if you are using Spark 2.1.0, a notable side effect is that Spark 2.1.0 users can no longer pass in something like --conf "spark.driver.extraJavaOptions=-Duser.language=fr" to parse a non-English date, e.g. "31mai1989".
In fact, as of Spark 2.1.0, when using spark.read() to load a CSV, I think it's no longer possible to use the dateFormat option to parse a date such as "31mai1989", even if your default locale is French. I went as far as changing the default region and language in my OS to French and passing in just about every locale setting permutation I could think of, i.e.
JAVA_OPTS="-Duser.language=fr -Duser.country=FR -Duser.region=FR" \
JAVA_ARGS="-Duser.language=fr -Duser.country=FR -Duser.region=FR" \
LC_ALL=fr_FR.UTF-8 \
spark-submit \
--conf "spark.driver.extraJavaOptions=-Duser.country=FR -Duser.language=fr -Duser.region=FR" \
--conf "spark.executor.extraJavaOptions=-Duser.country=FR -Duser.language=fr -Duser.region=FR" \
test.py
to no avail, resulting in
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
But again, this only affects parsing non-English dates in Spark 2.1.0.
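If you do need to read such non-English dates on Spark 2.1.0, one possible workaround (not from the thread, so treat it as a sketch) is to load the column as a plain string and convert it with a Python UDF, sidestepping the JVM locale entirely. The month lookup table and the column name col1 below are assumptions for illustration:
from datetime import date
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DateType

# Hypothetical lookup for the month spellings expected in the data; adjust as needed.
FR_MONTHS = {"janv": 1, "fevr": 2, "mars": 3, "avr": 4, "mai": 5, "juin": 6,
             "juil": 7, "aout": 8, "sept": 9, "oct": 10, "nov": 11, "dec": 12}

def parse_fr(s):
    # "31mai1989" -> day = 31, month name = "mai", year = 1989
    if s is None:
        return None
    day, year = int(s[:2]), int(s[-4:])
    return date(year, FR_MONTHS[s[2:-4].lower()], day)

parse_fr_udf = udf(parse_fr, DateType())
df = spark.read.csv("test.csv", header=True)           # read col1 as a plain string
df = df.withColumn("col1", parse_fr_udf(col("col1")))  # convert to DateType in Python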
You have already identified the issue as one of locale in Spark's JVM. You can check the default country and language settings used by your Spark JVM by going to http://localhost:4040/environment/ after launching the Spark shell. Search for "user.language" and "user.country" under the System Properties section. They should be en and US respectively.
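If you prefer not to open the web UI, the same two properties can be read from the pyspark shell itself. This is a convenience sketch only; it goes through the internal _jvm py4j gateway, which is not a public API:
jvm = spark.sparkContext._jvm
print(jvm.java.lang.System.getProperty("user.language"))  # should print "en"
print(jvm.java.lang.System.getProperty("user.country"))   # should print "US"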
You can change them like this, if needed.
Option 1: Edit the spark-defaults.conf file in the {SPARK_HOME}/conf folder and add the following settings:
spark.executor.extraJavaOptions -Duser.country=US -Duser.language=en
spark.driver.extraJavaOptions -Duser.country=US -Duser.language=en
Option 2: Pass the options to pyspark as command-line options:
$ pyspark --conf "spark.driver.extraJavaOptions=-Duser.country=US -Duser.language=en" --conf "spark.executor.extraJavaOptions=-Duser.country=US -Duser.language=en"
Option 3: Change the language and region in your Mac OS. For example, see What settings in Mac OS X affect the `Locale` and `Calendar` inside Java?
P.S. - I have only verified that Option 1 works; I have not tried the other two. More details about Spark configuration are here: http://spark.apache.org/docs/latest/configuration.html#runtime-environment
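As a quick sanity check after applying Option 1 (and restarting the shell), the ddMMMyyyy example from the question should parse correctly. A sketch, assuming the question's test.csv is in the working directory:
from pyspark.sql.types import StructType, StructField, DateType

struct = StructType([StructField("column", DateType())])
df = (spark.read
      .format("csv")
      .schema(struct)
      .option("header", "true")
      .option("dateFormat", "ddMMMyyyy")
      .load("test.csv"))
df.show()  # expect a single row: 1989-12-31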