print
type(rdd.take(1))
, but it just says <type 'list'>
.(x,1),(x,2),(y,1),(y,3)
and I use
groupByKey
and got (x,(1,2)),(y,(1,3))
. Is there a way to define
(1,2)
and (1,3)
as values where x and y are keys? Or does a key has to be a single value? I noted that if I use reduceByKey
and sum
function to get the data ((x,3),(y,4))
then it becomes much easier to define this data as a key-value pairPython is a dynamically typed language and PySpark doesn't use any special type for key, value pairs. The only requirement for an object being considered a valid data for PairRDD
operations is that it can be unpacked as follows:
k, v = kv
Typically you would use a two element tuple
due to its semantics (immutable object of fixed size) and similarity to Scala Product
classes. But this is just a convention and nothing stops you from something like this:
key_value.py
class KeyValue(object):
def __init__(self, k, v):
self.k = k
self.v = v
def __iter__(self):
for x in [self.k, self.v]:
yield x
from key_value import KeyValue
rdd = sc.parallelize(
[KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])
rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]
and make an arbitrary class behave like a key-value. So once again if something can be correctly unpacked as a pair of objects then it is a valid key-value. Implementing __len__
and __getitem__
magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples
.
Also type(rdd.take(1))
returns a list
of length n
so its type will be always the same.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With