
Pig: is it possible to use pytz or dateutil in Python UDFs?

I am using datetime in some Python UDFs that I use in my Pig script. So far so good. I use Pig 0.12.0 on Cloudera 5.5.

However, I also need to use the pytz or dateutil packages, and they don't seem to be part of a vanilla Python install.

Can I use them in my Pig UDFs in some way? If so, how? I think dateutil is installed on my nodes (I am not an admin, so how can I actually check that this is the case?), but when I type:

import sys
#I append the path to dateutil on my local windows machine. Is that correct?
sys.path.append('C:/Users/me/AppData/Local/Continuum/Anaconda2/lib/site-packages')

from dateutil import tz

in my udfs.py script, I get:

2016-08-30 09:56:06,572 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
  File "udfs.py", line 23, in <module>
    from dateutil import tz
ImportError: No module named dateutil

when I run my pig script.

All my other python udfs (using datetime for instance) work just fine. Any idea how to fix that?

Many thanks!

UPDATE

After playing a bit with the Python path, I am now able to

import dateutil 

(at least Pig does not crash). But if I try:

from dateutil import tz

I get an error.

  from dateutil import tz 
  File "/opt/python/lib/python2.7/site-packages/dateutil/tz.py", line 16, in <module>
    from six import string_types, PY3
  File "/opt/python/lib/python2.7/site-packages/six.py", line 604, in <module>
    viewkeys = operator.methodcaller("viewkeys")
AttributeError: type object 'org.python.modules.operator' has no attribute 'methodcaller'

How to overcome that? I use tz in the following manner

to_zone = dateutil.tz.gettz('US/Eastern')
from_zone = dateutil.tz.gettz('UTC')

and then I change the timezone of my timestamps. Can I just import dateutil to do that? What is the proper syntax?

UPDATE 2

Following yakuza's suggestion, I am able to

import sys
sys.path.append('/opt/python/lib/python2.7/site-packages')
sys.path.append('/opt/python/lib/python2.7/site-packages/pytz/zoneinfo')

import pytz

but now I get an error again

Caused by: Traceback (most recent call last):
  File "udfs.py", line 158, in to_date_local
  File "__pyclasspath__/pytz/__init__.py", line 180, in timezone
pytz.exceptions.UnknownTimeZoneError: 'America/New_York'

when I define

to_zone = pytz.timezone('America/New_York')
from_zone = pytz.timezone('UTC')

Found some hints here: "UnknownTimezoneError Exception Raised with Python Application Compiled with Py2Exe"

What to do? Awww, I just want to convert timezones in Pig :(

asked Aug 26 '16 22:08 by ℕʘʘḆḽḘ


1 Answer

As you probably know, Python UDFs are not executed by the regular Python interpreter but by Jython, which is distributed with Pig. By default in 0.12.0 it should be Jython 2.5.3. Unfortunately, the six package, which dateutil requires, only supports Python 2.6 and later. pytz, however, does not seem to have such a dependency and should support Python 2.4 and later.

So to achieve your goal, you should distribute a pytz package compatible with Python 2.5 to all your nodes and add its path to sys.path in your Pig UDF. If you repeat the same steps you did for dateutil, everything should work as you expect. We use the very same approach with pygeoip and it works like a charm.
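To confirm which interpreter actually runs a UDF file, a minimal sketch like the following may help. The function name interpreter_info is hypothetical. The check relies on operator.methodcaller having been added in Python 2.6, which is why the six import in the question fails under Jython 2.5.

```python
import operator
import sys


def interpreter_info():
    """Return a short description of the running interpreter."""
    version = "%d.%d.%d" % tuple(sys.version_info[:3])
    # operator.methodcaller was added in Python 2.6, so it is missing on Jython 2.5.
    has_methodcaller = hasattr(operator, "methodcaller")
    return "python=%s platform=%s methodcaller=%s" % (
        version, sys.platform, has_methodcaller)


print(interpreter_info())
```

Under Jython, sys.platform starts with "java", which distinguishes it from a regular Python install on the same node.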

How does it work

When you run a Pig script that references a Python UDF (more precisely, a Jython UDF), your script gets compiled to a map/reduce job, all REGISTERed files are included in the JAR file, and they are distributed to the nodes where the code is actually executed. When your code runs, a Jython interpreter is started and driven from Java code. The Python code therefore executes on each node taking part in the computation, and all Python imports are resolved locally on that node. Imports from the standard library are served by the Jython implementation, but all other packages have to be installed separately, as there is no pip for it.

So to make external packages available to a Python UDF, you have to install the required packages manually, either with another Python's pip or from source, and remember to pick a package version compatible with Python 2.5! Then, in every single UDF file, you have to append to sys.path the site-packages directory where you installed the packages (it's important to use the same directory on each node). For example:

import sys
sys.path.append('/path/to/site-packages')
# Imports of non-stdlib packages
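The pattern above can be made a little more defensive, which also helps when you cannot log in to the nodes to check what is installed. This is only a sketch: the helper name try_import is hypothetical, and '/path/to/site-packages' is the same placeholder used in the example above.

```python
import sys


def try_import(module_name, site_packages):
    """Append site_packages to sys.path (once) and try to import module_name.

    Returns the module on success, or None if it cannot be imported.
    """
    if site_packages not in sys.path:
        sys.path.append(site_packages)
    try:
        return __import__(module_name)
    except ImportError:
        return None


# '/path/to/site-packages' is a placeholder, not a real location.
pytz = try_import('pytz', '/path/to/site-packages')
if pytz is None:
    # Report the search path so the task logs show where Jython looked.
    sys.stderr.write("pytz not importable; sys.path is: %r\n" % (sys.path,))
```

Note that this only catches ImportError. A package that is found but incompatible with the interpreter, like six in the question, can still fail with a different exception such as AttributeError.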

Proof of concept

Let's assume we have the following files:

/opt/pytz_test/test_pytz.pig:

REGISTER '/opt/pytz_test/test_pytz_udf.py' using jython as test;

A = LOAD '/opt/pytz_test/test_pytz_data.csv' AS (timestamp:int);
B = FOREACH A GENERATE
    test.to_date_local(timestamp);

STORE B INTO '/tmp/test_pytz_output.csv' using PigStorage(',');

/opt/pytz_test/test_pytz_udf.py:

from datetime import datetime
import sys

sys.path.append('/usr/lib/python2.6/site-packages/')

import pytz

@outputSchema('date:chararray')
def to_date_local(unix_timestamp):
    """
    converts unix timestamp to a rounded date
    """
    to_zone = pytz.timezone('America/New_York')
    from_zone = pytz.timezone('UTC')

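    try:
        # Parentheses allow the chained calls to span several lines
        as_datetime = (datetime.utcfromtimestamp(unix_timestamp)
                       .replace(tzinfo=from_zone).astimezone(to_zone)
                       .date().strftime('%Y-%m-%d'))
    except:
        # Fall back to the raw input if the conversion fails
        as_datetime = unix_timestamp
    return as_datetime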

/opt/pytz_test/test_pytz_data.csv:

1294778181
1294778182
1294778183
1294778184
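As far as I know, Pig makes the outputSchema decorator available when it loads a Jython UDF file, so running the UDF file above with a plain Python interpreter fails with a NameError. One way to try a UDF outside Pig is a no-op stand-in, sketched here. The stand-in and the double_it function are hypothetical illustrations, not part of Pig or of the files above.

```python
try:
    outputSchema  # normally provided by Pig when the file is REGISTERed
except NameError:
    def outputSchema(schema):
        """No-op stand-in used only when running outside Pig."""
        def decorator(func):
            # Keep the schema string around so it can still be inspected.
            func.outputSchema = schema
            return func
        return decorator


@outputSchema('doubled:long')
def double_it(value):
    return value * 2


print(double_it(21))
```

With the stand-in at the top of a UDF file, the functions can be called directly from a local Python session before the script is submitted to the cluster.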

Now let's install pytz on our node. It has to be installed with a Python version (2.5-2.7) whose pytz package is also compatible with Python 2.5; in my case I'll use Python 2.6:

sudo pip2.6 install pytz

Please make sure that /opt/pytz_test/test_pytz_udf.py adds to sys.path the site-packages directory where pytz is installed.

Now once we run Pig with our test script:

pig -x local /opt/pytz_test/test_pytz.pig

We should be able to read the output of our job, which should list:

2011-01-11
2011-01-11
2011-01-11
2011-01-11
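The expected dates can be sanity-checked without pytz. This sketch uses a fixed UTC-5 offset, an assumption that holds only because all sample timestamps fall in January, when New York is on Eastern Standard Time. It is not a substitute for a DST-aware conversion, and the name to_est_date is hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
EST_OFFSET = timedelta(hours=-5)  # assumption: standard time, no DST handling


def to_est_date(unix_timestamp):
    """Convert a Unix timestamp to a YYYY-MM-DD date at a fixed UTC-5 offset."""
    utc = EPOCH + timedelta(seconds=unix_timestamp)
    return (utc + EST_OFFSET).strftime('%Y-%m-%d')


for ts in (1294778181, 1294778182, 1294778183, 1294778184):
    print(to_est_date(ts))
```

All four sample timestamps fall on 2011-01-11 at about 20:36 UTC, which is 15:36 in New York, so each line prints 2011-01-11.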
answered Oct 13 '22 08:10 by Yakuza