I am using datetime in some Python UDFs that I use in my Pig script. So far so good. I use Pig 0.12.0 on Cloudera 5.5.
However, I also need to use the pytz or dateutil packages, and they don't seem to be part of a vanilla Python install.
Can I use them in my Pig UDFs in some way? If so, how? I think dateutil is installed on my nodes (I am not an admin, so how can I actually check that this is the case?), but when I type:
import sys
#I append the path to dateutil on my local windows machine. Is that correct?
sys.path.append('C:/Users/me/AppData/Local/Continuum/Anaconda2/lib/site-packages')
from dateutil import tz
in my udfs.py script, I get:
2016-08-30 09:56:06,572 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
File "udfs.py", line 23, in <module>
from dateutil import tz
ImportError: No module named dateutil
when I run my Pig script.
All my other Python UDFs (using datetime, for instance) work just fine. Any idea how to fix that?
Many thanks!
UPDATE
After playing a bit with the Python path, I am now able to
import dateutil
(at least Pig does not crash). But if I try:
from dateutil import tz
I get an error.
from dateutil import tz
File "/opt/python/lib/python2.7/site-packages/dateutil/tz.py", line 16, in <module>
from six import string_types, PY3
File "/opt/python/lib/python2.7/site-packages/six.py", line 604, in <module>
viewkeys = operator.methodcaller("viewkeys")
AttributeError: type object 'org.python.modules.operator' has no attribute 'methodcaller'
How can I overcome that? I use tz in the following manner:
to_zone = dateutil.tz.gettz('US/Eastern')
from_zone = dateutil.tz.gettz('UTC')
and then I change the timezone of my timestamps. Can I just import dateutil to do that? What is the proper syntax?
UPDATE 2
Following yakuza's suggestion, I am able to
import sys
sys.path.append('/opt/python/lib/python2.7/site-packages')
sys.path.append('/opt/python/lib/python2.7/site-packages/pytz/zoneinfo')
import pytz
but now I get an error again:
Caused by: Traceback (most recent call last):
File "udfs.py", line 158, in to_date_local
File "__pyclasspath__/pytz/__init__.py", line 180, in timezone
pytz.exceptions.UnknownTimeZoneError: 'America/New_York'
when I define
to_zone = pytz.timezone('America/New_York')
from_zone = pytz.timezone('UTC')
I found some hints here: "UnknownTimezoneError Exception Raised with Python Application Compiled with Py2Exe".
What to do? Awww, I just want to convert timezones in Pig :(
Well, as you probably know, Python UDFs are not executed by the regular Python interpreter but by the Jython that is distributed with Pig. By default, in 0.12.0 it should be Jython 2.5.3. Unfortunately, the six package only supports Python 2.6 and later, and it is required by dateutil. However, pytz does not seem to have such a dependency and should support Python versions starting from 2.4.
So to achieve your goal, you should distribute a pytz package compatible with Python 2.5 to all your nodes and add its path to sys.path in your Pig UDF. If you complete the same steps you did for dateutil, everything should work as you expect. We are using the very same approach with pygeoip and it works like a charm.
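To confirm which interpreter actually runs a UDF file, a small check like the one below can help. It is only a sketch: the function name interpreter_info is made up for illustration, and it relies on the fact that Jython reports a sys.platform value starting with 'java'.

```python
import sys


def interpreter_info():
    """Return (implementation, 'major.minor') for the running interpreter."""
    # Jython reports a platform string that starts with 'java'.
    if sys.platform.startswith('java'):
        implementation = 'Jython'
    else:
        implementation = 'other'
    version = '%d.%d' % (sys.version_info[0], sys.version_info[1])
    return implementation, version


if __name__ == '__main__':
    print('%s %s' % interpreter_info())
```

If this reports Jython 2.5 from inside the UDF, any package you import there has to work on Python 2.5, whatever Python is installed on the node itself.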
When you run a Pig script that references a Python UDF (more precisely, a Jython UDF), your script gets compiled to a map/reduce job, all REGISTERed files are included in the JAR file, and they are distributed to the nodes where the code is actually executed. When your code is executed, the Jython interpreter is started and run from Java code. So when the Python code is executed on each node taking part in the computation, all Python imports are resolved locally on that node. Imports from the standard library are taken from the Jython implementation, but all other "packages" have to be installed separately, as there is no pip for Jython. So to make external packages available to a Python UDF, you have to install the required packages manually, using another Python's pip or installing from source, but remember to download a package compatible with Python 2.5! Then, in every single UDF file, you have to append to sys.path the site-packages directory where you installed the packages on each node (it's important to use the same directory on each node). For example:
import sys
sys.path.append('/path/to/site-packages')
# Imports of non-stdlib packages
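If you prefer a guarded version of that append, something along these lines should work. This is a sketch, not part of Pig or Jython: add_site_packages is a hypothetical helper, and the search_path parameter exists only so it can be tried out without touching the real sys.path.

```python
import os
import sys


def add_site_packages(path, search_path=None):
    """Append path to the import search path once; return True if it was added."""
    if search_path is None:
        search_path = sys.path
    if not os.path.isdir(path):
        # The directory is missing on this node, so the import would fail anyway.
        return False
    if path in search_path:
        # Already present, nothing to do.
        return False
    search_path.append(path)
    return True
```

In a UDF file you would call add_site_packages('/path/to/site-packages') before the non-stdlib imports. A False return on a node is a hint that the directory is not where you expect it.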
Let's assume we have the following files:
/opt/pytz_test/test_pytz.pig:
REGISTER '/opt/pytz_test/test_pytz_udf.py' using jython as test;
A = LOAD '/opt/pytz_test/test_pytz_data.csv' AS (timestamp:int);
B = FOREACH A GENERATE
test.to_date_local(timestamp);
STORE B INTO '/tmp/test_pytz_output.csv' using PigStorage(',');
/opt/pytz_test/test_pytz_udf.py:
from datetime import datetime
import sys

# Make the node-local pytz install visible to Jython before importing it.
sys.path.append('/usr/lib/python2.6/site-packages/')
import pytz


@outputSchema('date:chararray')
def to_date_local(unix_timestamp):
    """
    Converts a unix timestamp to a rounded date.
    """
    to_zone = pytz.timezone('America/New_York')
    from_zone = pytz.timezone('UTC')
    try:
        # Parentheses let the chained calls span several lines.
        as_datetime = (datetime.utcfromtimestamp(unix_timestamp)
                       .replace(tzinfo=from_zone).astimezone(to_zone)
                       .date().strftime('%Y-%m-%d'))
    except Exception:
        # Fall back to the raw value if the conversion fails.
        as_datetime = unix_timestamp
    return as_datetime
/opt/pytz_test/test_pytz_data.csv:
1294778181
1294778182
1294778183
1294778184
Now let's install pytz on our node. The pytz release you install has to be compatible with Python 2.5, so use a Python version in the 2.5-2.7 range to install it; in my case I'll use Python 2.6:
sudo pip2.6 install pytz
Please make sure that the file /opt/pytz_test/test_pytz_udf.py adds to sys.path the site-packages directory where pytz is installed.
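If you are not sure which directory that is (for example because you are not an admin on the nodes), a small probe like the following may help. It is only a sketch: find_package_dirs is a hypothetical helper, the candidate directories are whatever you choose to pass in, and it only detects packages installed as plain directories or single .py files, not eggs or zip archives.

```python
import os


def find_package_dirs(package, candidates):
    """Return the candidate directories that contain `package`."""
    found = []
    for directory in candidates:
        # A package is a directory with __init__.py; a module is a single .py file.
        as_package = os.path.join(directory, package, '__init__.py')
        as_module = os.path.join(directory, package + '.py')
        if os.path.isfile(as_package) or os.path.isfile(as_module):
            found.append(directory)
    return found
```

For instance, find_package_dirs('pytz', ['/usr/lib/python2.6/site-packages', '/opt/python/lib/python2.7/site-packages']) returns whichever of those two directories actually holds pytz on the machine where it runs.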
Now once we run Pig with our test script:
pig -x local /opt/pytz_test/test_pytz.pig
We should be able to read the output of our job, which should list:
2011-01-11
2011-01-11
2011-01-11
2011-01-11
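As a sanity check on those values, the conversion can be reproduced with the standard library alone. This sketch assumes a fixed UTC-5 offset (Eastern Standard Time), which holds for these January timestamps but ignores daylight saving time, so it is not a replacement for pytz. The name to_est_date is made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: fixed UTC-5 offset (EST); not valid during daylight saving time.
EST_OFFSET = timedelta(hours=-5)
EPOCH = datetime(1970, 1, 1)


def to_est_date(unix_timestamp):
    """Convert a unix timestamp to a YYYY-MM-DD string at a fixed UTC-5 offset."""
    utc_time = EPOCH + timedelta(seconds=unix_timestamp)
    return (utc_time + EST_OFFSET).strftime('%Y-%m-%d')


if __name__ == '__main__':
    for ts in (1294778181, 1294778182, 1294778183, 1294778184):
        print(to_est_date(ts))
```

The first timestamp is 2011-01-11 20:36:21 UTC, which is 15:36:21 in New York on the same day, so all four rows come out as 2011-01-11.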