What is the output schema to return a dictionary from a Python UDF when using Apache Pig?
I have a dictionary of dictionaries, something like this:
dict = {x:{a:1,b:2,c:3}, y:{d:1,e:3,f:9}}
and my output schema looks like
@outputSchema("m:map[im:map[X:float,Y:float]]")
** Square brackets because Pig uses [] for maps, which is what this dictionary is converted to.
Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in six languages: Java, Jython, Python, JavaScript, Ruby and Groovy. The most extensive support is provided for Java functions.
Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times. ILLUSTRATE is based on an example generator (see Generating Example Data for Dataflow Programs).
Why do we need Apache Pig? Programmers who are not comfortable with Java often struggle with Hadoop, especially when writing MapReduce jobs. Apache Pig helps such programmers: using Pig Latin, they can perform MapReduce tasks without having to write complex Java code.
If you are using the standard Jython UDFs and not another distribution such as the streaming_python provided by Mortar Data, all you need to do is:
@outputSchema('m:map[]')
The keys will be the same as the ones you set in Python. If you have other dictionaries within your dict, you don't need to worry about it; Pig will understand them and use the following syntax:
([first#{third=inner_dict},first#outter_dict])
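As a minimal sketch of the untyped-map approach above: the function name `nested_scores` and its arguments are hypothetical, and the `try`/`except` fallback is only a stand-in so the file also runs outside Pig, on the assumption that Pig supplies `outputSchema` itself when the script is registered as a Jython UDF.

```python
# Pig is assumed to inject `outputSchema` when the script is registered
# with `using jython`. The fallback is a stand-in for plain Python runs.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema  # record the schema string on the function
            return func
        return decorator


@outputSchema('m:map[]')
def nested_scores(x_label, y_label):
    """Return a dict of dicts; with 'map[]' no value type is declared."""
    return {
        x_label: {'a': 1, 'b': 2, 'c': 3},
        y_label: {'d': 1, 'e': 3, 'f': 9},
    }
```

Because the schema declares no value type, the inner dictionaries are passed through without any per-field type declaration.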
There is one big disadvantage to passing a dict back to Pig from a Jython UDF: you can only set one data type for all the values in the dict. If you don't set any data type, Pig will use bytearray, which can be a problem when working with dates or complex structures. For example:
@outputSchema('m:map[chararray]')
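One way to live with the single value type is to convert every value to a string before returning, so the dict matches `map[chararray]`. This is only a sketch: `stringify_values` is a hypothetical name, and the fallback decorator is a stand-in for the one Pig is assumed to provide.

```python
# Stand-in for the decorator Pig is assumed to provide to Jython UDFs.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema('m:map[chararray]')
def stringify_values(record):
    """Convert every key and value to str so all values share one type."""
    return dict((str(key), str(value)) for key, value in record.items())
```

For instance, `stringify_values({'a': 1, 'b': 2.5})` gives `{'a': '1', 'b': '2.5'}`; the Pig script would then cast the values it needs back to numeric types.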
Tuples and Bags:
When you want to return a tuple or a bag to Pig from a Jython UDF, remember that Python lists convert to bags and Python tuples convert to tuples. For example:
Lists:
@outputSchema('m:bag{chararray}')
Remember that Pig bags are filled with tuples, so if you want a well-defined structure for your bag, you can declare a tuple within the bag and set the data type of every field you pass there. Example:
@outputSchema('map_reduce:bag{t:(key:chararray,value:int,start_date:datetime,end_date:datetime)}')
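A reduced sketch of the bag-of-tuples pattern: the UDF returns a Python list (the bag) whose items are Python tuples matching the declared fields. The name `word_counts` is hypothetical, the datetime fields from the schema above are left out to keep the example small, and the fallback decorator is a stand-in for the one Pig is assumed to provide.

```python
# Stand-in for the decorator Pig is assumed to provide to Jython UDFs.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema('counts:bag{t:(key:chararray,value:int)}')
def word_counts(line):
    """Return a list of (word, count) tuples, sorted by word."""
    if line is None:
        return []  # empty list, i.e. an empty bag
    counts = {}
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1
    # list -> bag, each (str, int) tuple -> t:(key, value)
    return [(word, counts[word]) for word in sorted(counts)]
```

Each tuple has exactly the fields declared inside `t:(...)`, in the same order, which is what lets Pig assign a type to every field.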
Finally, tuples should be fairly intuitive; they are the easiest structure to use with Jython. Within a tuple you can set as many fields and as many levels as you want, as long as you follow the examples above. You can declare a tuple within a tuple, a tuple that contains a bag and other values, and so on.
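The nesting described above could look like the following sketch, where a tuple holds a plain field, an inner tuple, and a bag. The schema string, the name `summarize`, and the assumption that `tags` arrives as a plain list of strings are all illustrative; the fallback decorator is again a stand-in for the one Pig is assumed to provide.

```python
# Stand-in for the decorator Pig is assumed to provide to Jython UDFs.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema('summary:(name:chararray,point:(x:int,y:int),tags:bag{t:(tag:chararray)})')
def summarize(name, x, y, tags):
    """Return (name, (x, y), [(tag,), ...]): a tuple with a tuple and a bag."""
    point = (x, y)                    # inner tuple
    bag = [(tag,) for tag in tags]    # list of one-field tuples -> bag
    return (name, point, bag)
```

Note the one-element tuples `(tag,)` inside the list: since bags hold tuples, each bag item is wrapped even when it has a single field.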
I strongly recommend using Java UDFs when performing complex operations or working with complex data types such as JSON structures, arrays and lists. The learning curve is a little steeper, but once you are past it, both your development speed and the throughput of your program will be much better.