
Renaming fields after a JOIN takes time?

Tags: apache-pig

In the following code, how much does renaming fields after a join hurt the computation time of the script? Is it optimized in Pig? Or does it really go through every record?

-- tables A: (f1, f2, id)  and B: (g1, g2, id) to be joined by id
C = JOIN A BY id, B BY id;
C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;

Does the FOREACH command go through every record of C? If yes, is there a way to optimize?

Thanks.

Navneet asked Jan 16 '23 21:01


1 Answer

Don't worry about optimizing this. There may be a slight overhead in renaming the fields, but it won't trigger an additional Map/Reduce job. The field projection happens in the reducer, right after your JOIN.

Consider the two scripts below and the Map Reduce plans that explain prints for each.

Without Renaming

A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;

store C into 'output';

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------

With Renaming

A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;
C = foreach C generate A::f1 as f1,  -- This
                       A::f2 as f2,  -- section
                       B::id as id,  -- is
                       B::g1 as g1,  -- different
                       B::g2 as g2;  --

store C into 'output';

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------

The difference is in the Reduce plans. Without renaming:

Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false

versus with renaming:

Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
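
The Project indices in the renamed plan are positions in the joined tuple, whose layout follows from the two load schemas: (A::f1, A::f2, A::id, B::g1, B::g2, B::id). As a rough aid for reading the plan, the same FOREACH could be written with positional references. This is only a sketch; the alias D is made up for the illustration, and the original script does not need to change.

```pig
-- Joined tuple layout: $0 A::f1, $1 A::f2, $2 A::id, $3 B::g1, $4 B::g2, $5 B::id
C = join A by id, B by id;

-- Same projection as the renaming FOREACH, by position (D is a hypothetical alias)
D = foreach C generate $0 as f1, $1 as f2, $5 as id, $3 as g1, $4 as g2;
```

This lines up with the Project[0], [1], [5], [3], [4] entries in the Reduce plan: B::id sits at position 5 because B's fields follow A's three fields.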

In short, other parts of your script will be worth optimizing long before the renaming is. The join already has to go through every record, and the rename is just a cheap projection applied in that same reduce phase.
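
If you want to check this on your own script, the plans can be printed with explain from the Grunt shell. A minimal sketch, reusing the relation names and the placeholder paths 'first' and 'second' from the examples above:

```pig
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;
C = foreach C generate A::f1 as f1, A::f2 as f2, B::id as id, B::g1 as g1, B::g2 as g2;

-- Prints the logical, physical and Map Reduce plans for C without running the job
explain C;
```

If the Map Reduce plan still lists a single MapReduce node, the rename has not added a job.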

cyang answered Feb 16 '23 16:02