Splitting input into substrings in PIG (Hadoop)

Assume I have the following input in Pig:

some

And I would like to convert that into:

s
so
som
some

I have not (yet) found a way to iterate over a chararray in Pig Latin. I did find the TOKENIZE function, but that splits on word boundaries. Can Pig Latin do this, or does it require a Java class?

Niels Basjes asked Sep 09 '09 14:09


2 Answers

Niels, TOKENIZE takes a delimiter argument, so you can make it split each letter; however I can't think of a way to make it produce overlapping tokens.

It's pretty straightforward to write a UDF in Pig, though. You just implement a simple interface called EvalFunc (details here: http://wiki.apache.org/pig/UDFManual ). Pig was built around the idea of users writing their own functions to process most anything, and writing your own UDF is therefore a common and natural thing to do.
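
As a rough sketch of the UDF route: Pig releases newer than the one this answer was written against also accept UDFs written in Python (run through Jython), which avoids the Java boilerplate. The function name `prefixes`, the file name `prefix_udf.py`, and the schema string below are illustrative choices, not anything from the question. The fallback decorator exists only so the file can also be run outside Pig.

```python
# prefix_udf.py -- hypothetical file name, for illustration only.

# Pig makes an outputSchema decorator available when it loads a Jython UDF.
# Outside Pig that name is missing, so define a no-op stand-in; this
# fallback is an assumption made purely to keep the sketch runnable locally.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("prefixes:bag{t:tuple(prefix:chararray)}")
def prefixes(word):
    """Return every leading substring of word as a list of 1-tuples (a bag)."""
    if word is None:
        return None
    return [(word[:i],) for i in range(1, len(word) + 1)]
```

Called locally, `prefixes("some")` returns `[('s',), ('so',), ('som',), ('some',)]`. Inside Pig this would presumably be registered with something like `REGISTER 'prefix_udf.py' USING jython AS myfuncs;` and applied with `FLATTEN(myfuncs.prefixes(word))` to get one row per prefix; check the exact syntax against the documentation for your Pig version.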

An even easier option, although not as efficient, is to use Pig streaming to pass your data through a script (I find whipping up a quick Perl or Python script to be faster than implementing Java classes for one-off jobs). There is an example of this here: http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/ -- it demonstrates the use of a pre-existing library, a Perl script, a UDF, and even an on-the-fly awk script.

SquareCog answered Sep 21 '22 05:09


Here is how you might do it with Pig streaming and Python, without writing a custom UDF:

Suppose your data is just one column of words. The Python script (let's call it wordSeq.py) to process it would be:

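#!/usr/bin/python
### wordSeq.py ### [don't forget to chmod u+x wordSeq.py !]
import sys

# Read one word per line and emit every prefix of it on its own line.
for word in sys.stdin:
  word = word.rstrip()
  # range (rather than xrange) works on both Python 2 and Python 3
  for i in range(len(word)):
    sys.stdout.write(word[:i+1] + '\n')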
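
Before shipping the script to the cluster, the same logic can be checked locally. The sketch below wraps it in a function (the name `write_prefixes` is made up for this example) and feeds it an in-memory stream instead of stdin, assuming Python 3.

```python
import io


def write_prefixes(lines, out):
    """Write every prefix of each input line to out, one prefix per line."""
    for line in lines:
        word = line.rstrip()
        for i in range(len(word)):
            out.write(word[:i + 1] + "\n")


# Simulate what Pig streaming would pipe into the script.
fake_stdin = io.StringIO("some\npig\n")
fake_stdout = io.StringIO()
write_prefixes(fake_stdin, fake_stdout)
result = fake_stdout.getvalue()
print(result, end="")
```

This prints s, so, som, some, p, pi, pig, one per line. The real script can be checked the same way from a shell with `echo some | ./wordSeq.py`.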

Then, in your Pig script, define the streaming command and tell Pig to ship the script to the nodes that need it:

-- wordSplitter.pig ---
DEFINE CMD `wordSeq.py` ship('wordSeq.py');
W0 = LOAD 'words';
W = STREAM W0 THROUGH CMD as (word: chararray);

eytan answered Sep 22 '22 05:09