
Load only particular field in PIG?

This is my file:

Col1, Col2, Col3, Col4, Col5

I need only Col2 and Col3.

Currently I'm doing this:

a = load 'input' as (Col1:chararray, 
                     Col2:chararray, 
                     Col3:chararray, 
                     Col4:chararray);
b = foreach a generate Col2, Col3;

Is there a way to directly load only Col2 and Col3, instead of loading the whole input and then generating the required columns?

ComputerFellow asked Dec 31 '13 14:12




1 Answer

Generating only the columns you want, as you already do, is an effective way to get what you ask for. Remember that your data is stored on HDFS, and it is not all loaded into memory when your script starts. Those bytes still have to be read off the disk even if you never use them in your processing, so there is no performance advantage to be had from not loading them. The advantage comes from never sending the unused columns to a reducer, which your method already accomplishes.

In cases where Pig can tell that a column won't be used, it will "prune" it immediately, essentially doing for you what you did with b = foreach a generate Col2, Col3; yourself. This won't happen, however, if you are using a UDF that might access other fields, because Pig doesn't look inside the UDF to see whether they are used. For example, suppose Col3 is an int. If you have

b = group a by Col2;
c = foreach b generate group, SUM(a.Col3);

then Pig will automatically prune the 1st and 4th columns for you, since it can see they're never used. However, if you instead did

b = group a by Col2;
c = foreach b generate group, COUNT(a);

then Pig can't prune, because it doesn't see inside the COUNT UDF and doesn't know that the other fields won't be used. When in doubt about whether Pig will do this pruning, you can use the foreach/generate method you already have. Pig should also print a diagnostic message when you start your script, listing all the columns it was able to prune.
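
As a rough sketch of that fallback, projecting right after the load keeps the unused columns out of the group even when a later step hands the whole bag to COUNT. The alias a2 is made up for this example, and Col3 is assumed to be an int as above:

```pig
a = load 'input' as (Col1:chararray, Col2:chararray, Col3:int, Col4:chararray);
-- keep only the columns needed downstream, before any grouping
a2 = foreach a generate Col2, Col3;
b = group a2 by Col2;
-- COUNT still receives the whole bag, but its tuples now hold only Col2 and Col3
c = foreach b generate group, COUNT(a2);
```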

If instead your problem is that you don't want to provide a full schema when you're interested in just a few columns, you can skip the schema in the LOAD entirely and put the names and types in the GENERATE:

a = load 'input';
b = foreach a generate (chararray) $1 as Col2, (chararray) $2 as Col3;
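
If the input is actually comma-separated, as the sample row in the question suggests (an assumption, since the question's load relies on the default tab delimiter), the same schema-free approach might look like this with the built-in PigStorage loader:

```pig
-- assumption: fields are separated by commas rather than tabs
a = load 'input' using PigStorage(',');
-- positions are zero-based, so $1 is the second field and $2 the third
b = foreach a generate (chararray) $1 as Col2, (chararray) $2 as Col3;
```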
reo katoa answered Oct 05 '22 18:10