Load only particular fields in Pig?

This is my file:

Col1, Col2, Col3, Col4, Col5

I need only Col2 and Col3.

Currently I'm doing this:

a = load 'input' as (Col1:chararray, 
                     Col2:chararray, 
                     Col3:chararray, 
                     Col4:chararray);
b = foreach a generate Col2, Col3;

Is there a way to load only Col2 and Col3 directly, instead of loading the whole input and then generating the required columns?

asked Dec 31 '13 14:12

ComputerFellow

1 Answer

Your method of only GENERATEing the columns you want is an effective way to do just what you ask. Remember that all of your data is stored on HDFS, and you're not loading it all into memory when you start your script. You still will have to read those bytes off the disk even if you are not keeping them around for use in your processing, so there is no performance advantage to never loading that data. The advantage comes in never having to send it to a reducer, which you have accomplished with your method.

In cases where Pig can tell that a column won't be used, it will "prune" it immediately, essentially doing for you what you did with the statement b = foreach a generate Col2, Col3; in your script. This won't happen, however, if you are using a UDF that might access other fields, because Pig doesn't look inside the UDF to see whether they get used. For example, suppose Col3 is an int. If you have

b = group a by Col2;
c = foreach b generate group, SUM(a.Col3);

then Pig will automatically prune the 1st and 4th columns for you, since it can see they're never used. However, if you instead did

b = group a by Col2;
c = foreach b generate group, COUNT(a);

then Pig can't prune, because it doesn't see inside the COUNT UDF and doesn't know that the other fields won't be used. When in doubt about whether Pig will do this pruning, you can use the foreach/generate method you already have. Pig should also print a diagnostic message when you start your script, listing all the columns it was able to prune.
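
As a minimal sketch of that fallback, the COUNT example can be written with an explicit projection ahead of the GROUP, so the unused columns are dropped whether or not Pig prunes them on its own. The alias a2 is a name made up for this illustration, and Col3 is declared as an int only because the example above supposes it is one:

```pig
-- Schema follows the question; Col3 as int is the assumption made in the answer above.
a = load 'input' as (Col1:chararray, Col2:chararray, Col3:int, Col4:chararray);

-- Project early: only Col2 and Col3 flow into the GROUP (a2 is a hypothetical alias).
a2 = foreach a generate Col2, Col3;

b = group a2 by Col2;
-- COUNT now sees bags that hold only the two projected fields.
c = foreach b generate group, COUNT(a2);
```

Running explain c; afterwards prints the plans Pig built for the script, which is one way to check what each stage actually carries.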

If instead your problem is that you don't want to have to provide a full schema when you're interested in just a few columns, you can skip the schema entirely and put it in the GENERATE:

a = load 'input';
b = foreach a generate (chararray) $1 as Col2, (chararray) $2 as Col3;
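
If the input really is comma-separated, as the sample line in the question might suggest, the same approach would need an explicit delimiter, because PigStorage splits on tabs by default. The following is only a sketch under that assumption, since the question does not say how the file is delimited:

```pig
-- Assumption: fields are separated by commas; PigStorage uses tab unless told otherwise.
a = load 'input' using PigStorage(',');

-- Positions are zero-based, so $1 and $2 are the second and third fields.
b = foreach a generate (chararray) $1 as Col2, (chararray) $2 as Col3;

-- Print the schema Pig has inferred for b.
describe b;
```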

answered Oct 05 '22 18:10

reo katoa


Tags: hadoop, apache-pig, mapreduce
