 

Using Pig and Python

Apologies if this question is poorly worded: I am embarking on a large-scale machine learning project and I don't like programming in Java; I love writing programs in Python. I have heard good things about Pig. I was wondering if someone could clarify how usable Pig is in combination with Python for mathematically related work. Also, if I am to write "streaming Python code", does Jython come into the picture? Is it more efficient if it does?

Thanks

P.S.: For several reasons I would prefer not to use Mahout's code as is, though I might want to use a few of its data structures; it would be useful to know whether that is possible.

asked Jul 08 '11 by dvk

People also ask

Is Apache Pig still used?

We mainly use Apache Pig for its capabilities that allow us to easily create data pipelines. It also comes with its native language, Pig Latin, which makes it easy to manage code execution. It brings in important features of database systems such as Hive and Spark SQL.

Is Pig a data visualization tool?

Pig is a high-level platform for processing large datasets. It provides a high level of abstraction over MapReduce, along with a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.

What is Piglet in Python?

Piglet is a templating engine that compiles templates to fast Python bytecode.

Is Pig a data flow language?

Pig is a data-flow language for expressing Map/Reduce programs that analyze large datasets distributed across HDFS. It provides relational (SQL-style) operators such as JOIN and GROUP BY, and it also makes it easy to plug in Java functions.


2 Answers

Another option for using Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or using streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script where the data processing pipeline is defined. Jython is used as the Python interpreter, and the MapReduce framework for the stream operations is Cascading. The joins, groupings, etc. work similarly to Pig in spirit, so there are no surprises if you already know Pig.

A word counting example looks like this:

# Assumes PyCascading's helper imports, as in the project's examples
from pycascading.helpers import *

@map(produces=['word'])
def split_words(tuple):
    # This is called for each line of text
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    output = flow.tsv_sink('output')

    # This is the processing pipeline
    input | split_words | GroupBy('word') | Count() | output

    flow.run()
answered Oct 28 '22 by Gabor Szabo

When you use streaming in Pig, it doesn't matter what language you use: all Pig does is execute a command in a shell (e.g. via bash). You can use Python, just as you could use grep or a C program.
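For example, a streamed script simply reads tuples on stdin and writes tuples to stdout. Here is a minimal sketch; the tab-delimited line format matches Pig's default for STREAM, but the transformation itself (uppercasing the first field) is just a hypothetical example:

```python
#!/usr/bin/env python
# Minimal sketch of a script Pig's STREAM operator could invoke.
# Pig pipes each tuple to stdin as a tab-delimited line and reads
# tab-delimited lines back from stdout.
import sys

def transform(line):
    # Hypothetical transformation: uppercase the first field
    fields = line.rstrip('\n').split('\t')
    fields[0] = fields[0].upper()
    return '\t'.join(fields)

if __name__ == '__main__':
    for line in sys.stdin:
        print(transform(line))
```

In a Pig script this would be wired up with something like DEFINE cmd `script.py` SHIP('script.py'); followed by STREAM data THROUGH cmd;.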

You can now define Pig UDFs in Python natively. These UDFs will be called via Jython when they are being executed.
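A sketch of such a UDF follows (the file name and function are hypothetical). Pig's Jython runtime supplies the real outputSchema decorator when it loads the script, so a no-op stub is defined here only to keep the sketch runnable as plain Python:

```python
# Hypothetical udfs.py. When Pig loads this file via Jython it injects
# the real outputSchema decorator; this stub merely records the schema
# string so the sketch also runs standalone.
def outputSchema(schema):
    def decorator(func):
        func.output_schema = schema
        return func
    return decorator

@outputSchema("word:chararray")
def normalize(word):
    # Trim and lowercase a word, e.g. before grouping in the Pig script
    if word is None:
        return None
    return word.strip().lower()
```

The Pig script would register it with REGISTER 'udfs.py' USING jython AS udfs; and then call udfs.normalize(word) like any other function.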

answered Oct 28 '22 by Donald Miner