class ProdsTransformer:
    def __init__(self):
        self.products_lookup_hmap = {}
        self.broadcast_products_lookup_map = None

    def create_broadcast_variables(self):
        # `sc` is the SparkContext defined on the driver.
        self.broadcast_products_lookup_map = sc.broadcast(self.products_lookup_hmap)

    def create_lookup_maps(self):
        # The code here builds the hashmap that maps Prod_ID to another space.
        pass

pt = ProdsTransformer()
pt.create_broadcast_variables()

pairs = distinct_users_projected.map(lambda x: (x.user_id,
                                                pt.broadcast_products_lookup_map.value[x.Prod_ID]))
I get the following error:
"Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063."
Any help with how to deal with the broadcast variables will be great!
Because your map lambda references the object that contains the broadcast variable (pt), Spark tries to serialize that whole object and ship it to the workers. Since the object also holds a reference to the SparkContext (via sc in create_broadcast_variables), you get this error. Instead of this:
pairs = distinct_users_projected.map(lambda x: (x.user_id, pt.broadcast_products_lookup_map.value[x.Prod_ID]))
Try this:
bcast = pt.broadcast_products_lookup_map
pairs = distinct_users_projected.map(lambda x: (x.user_id, bcast.value[x.Prod_ID]))
The latter avoids the reference to the object (pt), so Spark only needs to ship the broadcast variable itself.
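For reference, here is a minimal, self-contained sketch of that pattern. The User namedtuple, the lookup data, and the RDD standing in for distinct_users_projected are made up for illustration; only the broadcast-then-capture-a-local-variable structure is the point.

from collections import namedtuple
from pyspark import SparkContext

# Hypothetical stand-in for the rows of distinct_users_projected.
User = namedtuple("User", ["user_id", "Prod_ID"])

sc = SparkContext(appName="broadcast-demo")

# Build the lookup map once on the driver, then broadcast it.
products_lookup_hmap = {"p1": 101, "p2": 102}  # made-up Prod_ID -> value mapping
bcast = sc.broadcast(products_lookup_hmap)

distinct_users_projected = sc.parallelize([User("u1", "p1"), User("u2", "p2")])

# The lambda closes over the local variable `bcast` only, so Spark serializes
# just the broadcast handle, never an object that holds the SparkContext.
pairs = distinct_users_projected.map(lambda x: (x.user_id, bcast.value[x.Prod_ID]))

print(pairs.collect())  # [('u1', 101), ('u2', 102)]
sc.stop()

The same rule applies to any self.attribute you use inside a transformation: copy it into a local variable first, so the closure does not drag the whole instance (and anything it references, such as sc) along with it.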