Emit multiple pairs in map operation

Question

Let's say I have rows of phone call records in the format:

[CallingUser, ReceivingUser, Duration]

I want to know the total amount of time that a given user has been on the phone (the sum of Duration over all records where the user was either the CallingUser or the ReceivingUser).

Effectively, for a given record, I would like to create 2 pairs (CallingUser, Duration) and (ReceivingUser, Duration).

What is the most efficient way to do this? I can add 2 RDDs together, but I am not sure whether this is a good approach:

# Sample data:
callData = sc.parallelize([["User1", "User2", 2], ["User1", "User3", 4], ["User2", "User1", 8]])

calls = callData.map(lambda record: (record[0], record[2]))

# The potentially inefficient map in question:
calls += callData.map(lambda record: (record[1], record[2]))

reduce = calls.reduceByKey(lambda a, b: a + b)

SoldierOfFortran · Accepted Answer

Use flatMap(), which takes a single input record and can emit multiple output records for it. Here, each call record produces two (user, duration) pairs. Complete code:

callData = sc.parallelize([["User1", "User2", 2], ["User1", "User3", 4], ["User2", "User1", 8]])

calls = callData.flatMap(lambda record: [(record[0], record[2]), (record[1], record[2])])
print(calls.collect())
# prints [('User1', 2), ('User2', 2), ('User1', 4), ('User3', 4), ('User2', 8), ('User1', 8)]

reduce = calls.reduceByKey(lambda a, b: a + b)
print(reduce.collect())
# prints [('User2', 10), ('User3', 4), ('User1', 14)] (key order may vary)
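
If you want to check the logic without a Spark context, here is a rough plain-Python sketch of what the flatMap() and reduceByKey() steps compute on the same sample data. It is only an illustration of the semantics, not how Spark executes the job; the names emit_pairs, pairs, and totals are made up for this example.

```python
from collections import defaultdict
from itertools import chain

call_data = [["User1", "User2", 2], ["User1", "User3", 4], ["User2", "User1", 8]]


def emit_pairs(record):
    """Return one (user, duration) pair for each side of the call."""
    caller, receiver, duration = record
    return [(caller, duration), (receiver, duration)]


# Rough equivalent of flatMap: map each record to a list, then flatten the lists.
pairs = list(chain.from_iterable(emit_pairs(r) for r in call_data))

# Rough equivalent of reduceByKey with addition: sum the durations per user.
totals = defaultdict(int)
for user, duration in pairs:
    totals[user] += duration

print(sorted(totals.items()))
# prints [('User1', 14), ('User2', 10), ('User3', 4)]
```

A named function like emit_pairs can also be passed to callData.flatMap() in place of the lambda, which may read better once a record yields more than two pairs.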

Tags:

apache-spark

pyspark

Jeffrey Marshall
