Assume I have the following input in Pig:
some
And I would like to convert that into:
s
so
som
some
I have not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but that splits on word boundaries. Can Pig Latin do this, or does it require a Java class?
Niels, TOKENIZE takes a delimiter argument, so you can make it split each letter; however I can't think of a way to make it produce overlapping tokens.
It's pretty straightforward to write a UDF in Pig, though. You just implement a simple interface called EvalFunc (details here: http://wiki.apache.org/pig/UDFManual ). Pig was built around the idea of users writing their own functions to process most anything, and writing your own UDF is therefore a common and natural thing to do.
An even easier option, although not as efficient, is to use Pig streaming to pass your data through a script (I find whipping up a quick Perl or Python script to be faster than implementing Java classes for one-off jobs). There is an example of this here: http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/ -- it demonstrates the use of a pre-existing library, a Perl script, a UDF, and even an on-the-fly awk script.
Here is how you might do it with Pig streaming and Python, without writing a custom UDF:
Suppose your data is just one column of words. The Python script (let's call it wordSeq.py) to process it would be:
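#!/usr/bin/python
### wordSeq.py ### [don't forget to chmod u+x wordSeq.py !]
import sys

# Emit every prefix of each input word, one prefix per line.
for word in sys.stdin:
    word = word.rstrip()
    # range() works under both Python 2 and Python 3.
    sys.stdout.write('\n'.join([word[:i+1] for i in range(len(word))]) + '\n')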
Then, in your Pig script, you tell Pig that you are streaming through the above script and that the script should be shipped to the nodes that need it:
-- wordSplitter.pig ---
DEFINE CMD `wordSeq.py` ship('wordSeq.py');
W0 = LOAD 'words';
W = STREAM W0 THROUGH CMD as (word: chararray);
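Before running this on a cluster, it can help to check the prefix logic locally. The following is only a rough sketch under my own assumptions: the helper names prefixes and stream are hypothetical (they are not part of Pig or of the script above), and it targets Python 3. It feeds the word from the question through the same per-line logic and prints the result.

```python
import io


def prefixes(word):
    """Return every leading substring of word, shortest first."""
    return [word[:i + 1] for i in range(len(word))]


def stream(lines, out):
    """Mimic the streaming script: write one prefix per output line.

    Unlike wordSeq.py, a blank input line produces no output here.
    """
    for line in lines:
        word = line.rstrip()
        for prefix in prefixes(word):
            out.write(prefix + '\n')


if __name__ == '__main__':
    buf = io.StringIO()
    stream(['some\n'], buf)   # the sample input from the question
    print(buf.getvalue(), end='')
```

Run on its own, this should print s, so, som and some on separate lines, which is the output asked for in the question.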