How to: Python UDF dictionary return schema in Pig

Tags: python, dictionary, schema, apache-pig, user-defined-functions

What output schema should I use to return a dictionary from a Python UDF in Apache Pig?

I have a dictionary of dictionaries, something like this:

data = {'x': {'a': 1, 'b': 2, 'c': 3}, 'y': {'d': 1, 'e': 3, 'f': 9}}

and my output schema looks like

@outputSchema("m:map[im:map[X:float,Y:float]]")

(I used square brackets because Pig uses [] for maps, which is what this dictionary is converted to.)

asked Nov 12 '12 19:11

user1620334

1 Answer

If you are using the standard Jython UDFs, and not another distribution such as the streaming_python provided by Mortar Data, all you need is:

@outputSchema('m:map[]')

The keys will be the same as the ones you set in Python. If you have other dictionaries nested within your dict, you don't need to worry about them: Pig will understand them and display them using the following syntax:

([first#{third=inner_dict},first#outter_dict])
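A minimal sketch of a UDF that returns a nested dict under the untyped map schema. The function name `word_info` and its keys are hypothetical, and the fallback definition of `outputSchema` is only an assumption for running the file with plain Python outside Pig (inside Pig the decorator should already be available to Jython scripts):

```python
# Fallback so the file also runs outside Pig (assumption for local testing only).
# When Pig registers the script with "using jython", outputSchema already exists.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema  # keep the schema string for inspection
            return func
        return wrap


@outputSchema('m:map[]')
def word_info(word):
    # Hypothetical example: one plain value plus one nested dict.
    vowels = sum(1 for ch in word if ch in 'aeiou')
    return {
        'word': word,
        'stats': {'length': len(word), 'vowels': vowels},
    }
```

From Pig, the script would be registered with something like `REGISTER 'udfs.py' USING jython AS myudfs;`, where the file name and alias are placeholders.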

There is one big disadvantage to passing a dict back to Pig from a Jython UDF: you can only set one datatype for all the values in the dict. If you don't set any datatype, Pig will use bytearray, which can be a problem when working with dates or complex structures. To declare a single value type, for example:

@outputSchema('m:map[chararray]')
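Since one type has to cover every value, one option is to convert all values to strings inside the UDF so that they match the declared chararray. This is only a sketch: the function `describe` and its arguments are hypothetical, and the `outputSchema` fallback is an assumption for running the file outside Pig.

```python
# Fallback so the file also runs outside Pig (assumption for local testing only).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema('m:map[chararray]')
def describe(name, count):
    # Every value is converted to a string to match map[chararray].
    return {'name': str(name), 'count': str(count)}
```

The consumer on the Pig side would then need to cast `count` back to a number if it wants to do arithmetic with it.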

Tuples and Bags:

When you want to return a tuple or a bag to Pig from a Jython UDF, it is useful to remember that Python lists convert to bags and Python tuples convert to tuples. For example:

Lists:

@outputSchema('m:bag{chararray}')

Remember that Pig bags are filled with tuples, so if you want a well-defined structure for your bag, you can declare a tuple within the bag and set the datatype of every field you will be passing. Example:

@outputSchema('map_reduce:bag{t:(key:chararray,value:int,start_date:datetime,end_date:datetime)}')
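As a rough illustration of the list-to-bag conversion, the sketch below returns a list of tuples under a bag schema. The function `count_chars` is hypothetical, the schema is a simplified two-field version of the one above (the date fields are left out to keep the sketch small), and the `outputSchema` fallback is an assumption for running the file outside Pig.

```python
# Fallback so the file also runs outside Pig (assumption for local testing only).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema('pairs:bag{t:(key:chararray,value:int)}')
def count_chars(word):
    # Count how often each character occurs in the word.
    counts = {}
    for ch in word:
        counts[ch] = counts.get(ch, 0) + 1
    # The list becomes the bag; each inner tuple becomes one tuple in the bag.
    return [(ch, n) for ch, n in sorted(counts.items())]
```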

Finally, tuples should be fairly intuitive; they are the easiest structure to use with Jython. Within a tuple you can declare as many fields and as many nesting levels as you want, as long as you follow the examples above. You could declare a tuple within a tuple, a tuple that contains a bag and other values, and so on.
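A small sketch of such a nested structure: a tuple that holds a plain field, an inner tuple, and a bag. The function `summarize` and all field names in the schema are made up for illustration, and the `outputSchema` fallback is an assumption for running the file outside Pig.

```python
# Fallback so the file also runs outside Pig (assumption for local testing only).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema('t:(word:chararray,'
              'counts:(total:int,distinct_count:int),'
              'letters:bag{l:(letter:chararray)})')
def summarize(word):
    letters = sorted(set(word))
    return (
        word,                          # plain chararray field
        (len(word), len(letters)),     # tuple nested inside the tuple
        [(ch,) for ch in letters],     # list of 1-tuples, i.e. a bag
    )
```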

I strongly recommend using Java UDFs when performing complex operations or working with complex data types such as JSON structures, arrays and lists. The learning curve is a little steeper, but once you are past it, your development will be much faster, and so will the throughput of your program.

answered Sep 19 '22 23:09

Sabaspro
