Renaming fields after a JOIN takes time?

Question

In the following code, how much does renaming fields after a join hurt the computation time of the script? Is it optimized in Pig? Or does it really go through every record?

-- tables A: (f1, f2, id)  and B: (g1, g2, id) to be joined by id
C = JOIN A BY id, B BY id;
C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;

Does the FOREACH command go through every record of C? If yes, is there a way to optimize?

Thanks.

cyang · Accepted Answer

Don't worry about optimizing this. There may be a slight overhead in renaming the fields, but it won't trigger an additional Map/Reduce job. The field projection will occur in the reducer, right after your JOIN.

Consider the two pieces of code and the Map Reduce plans given by explain below.

Without Renaming

A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;

store C into 'output';

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------

With Renaming

A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;
C = foreach C generate A::f1 as f1,  -- This
                       A::f2 as f2,  -- section
                       B::id as id,  -- is
                       B::g1 as g1,  -- different
                       B::g2 as g2;  --

store C into 'output';

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------

The difference is in the Reduce plans. Without renaming:

Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false

versus with renaming:

Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false

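With renaming, the reducer gains one New For Each operator that projects five columns from each joined tuple (index 5 is B::id, because the join output lists A's three fields followed by B's three). It runs in the same reduce phase, directly on the output of POJoinPackage, so the records are not read a second time.

In short, there will be other things to optimize in your script before worrying about renaming. The join already goes through every record, so renaming is just a cheap extra step.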
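If you want to check this against your own script, a minimal sketch along these lines should be enough. It assumes the same placeholder inputs 'first' and 'second' used above, and it introduces a separate alias D (not in the original script) so that the joined schema and the renamed schema can be inspected side by side.

```pig
-- Sketch only: 'first' and 'second' are the placeholder paths from the examples above.
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;
describe C;   -- schema of the join: A's fields first, then B's, prefixed with A:: and B::

-- D is a hypothetical alias used here instead of reassigning C.
D = foreach C generate A::f1 as f1, A::f2 as f2, B::id as id, B::g1 as g1, B::g2 as g2;
describe D;   -- same columns under their short names

explain D;    -- prints the plans for D; look for a single "MapReduce node" in the Map Reduce Plan
```

If the Map Reduce Plan printed by explain still contains only one MapReduce node, the FOREACH has been folded into the reducer of the join, as in the plans above.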

Tags:

apache-pig

Navneet
