
How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?

First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:

1,2
3,4
5,6

And I try to load it as a map:

m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;

I get three empty tuples:

()
()
()

So I try to load tuples and then generate the map:

m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...

Many variations on the syntax also fail (e.g., generate [$0#$1]).

OK, so I munge my map into Pig's map literal format as map.pig:

[1#2]
[3#4]
[5#6]

And load it up:

m = load 'map.pig' as (M: []);

Now let's load up some keys and try lookups:

k = load 'keys.csv' as (key);
dump k;
3
5
1

c = foreach k generate m#key;  /* Or m[key], or... what? */
ERROR 1000: Error during parsing.  Invalid alias: m in {M: map[ ]}

Hrm, OK, maybe since there are two relations involved, we need a join:

c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}

Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).

Finally, I'd just like to be able to find all the keys in my map:

d = foreach m generate ...oh, forget it.

Is Pig's map type half-baked? What am I missing?

asked Nov 01 '10 14:11 by 1frustratedpiggy


1 Answer

Currently, Pig map lookups need the key to be a chararray (string) constant that you supply, not a variable that contains a string. So in map#key, the key has to be a constant string that you type in (e.g., map#'keyvalue').
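As a minimal sketch of a constant-key lookup, the snippet below reuses the map.pig file and the field name M from the question. The alias v is only illustrative, and the key '3' is picked from the sample data.

```
-- map.pig holds one map literal per line, e.g. [3#4]
m = load 'map.pig' as (M: []);

-- Dereference the map field M (not the relation m).
-- The key after # must be a quoted string constant, not a field name.
v = foreach m generate M#'3';
dump v;
```

Rows whose map does not contain the key yield null for that field. Note that the lookup goes through the field M inside a foreach; referring to the relation alias m instead is what produces the "Invalid alias" error in the question.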

The typical use case is to load a complex data structure in which one of the elements is a map of key-value pairs; later, in a foreach statement, you can refer to a particular value by the key you are interested in.
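A sketch of that use case, under stated assumptions: users.txt, its fields name and props, and the key 'city' are all hypothetical and do not come from the question. The file is assumed to be tab-separated with a map literal in the second column.

```
-- Hypothetical input, one record per line (tab-separated):
--   alice    [city#Paris,lang#fr]
--   bob      [city#Oslo]
u = load 'users.txt' as (name:chararray, props:map[]);

-- Pull one known key out of each record's map with a constant key.
c = foreach u generate name, props#'city' as city;
dump c;
```

This only works when the keys of interest are known when the script is written, which is the limitation described above.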

http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
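For the lookup the question is after, where the key comes from another relation, one common alternative (not part of this answer, offered only as a sketch) is to skip the map type and join two ordinary relations. It assumes the map.csv and keys.csv files from the question; the aliases kv, j, and r are illustrative.

```
-- Load the pairs as a plain two-column relation instead of a map.
kv = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
k  = load 'keys.csv' as (key:chararray);

-- The join plays the role of the per-row lookup.
j = join k by key, kv by key;
r = foreach j generate k::key, kv::val;
dump r;
```

With this layout, listing all keys is just a projection of the key column, e.g. foreach kv generate key.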

answered Oct 31 '22 16:10 by jayadev