PySpark.RDD.first -> UnpicklingError: NEWOBJ class argument has NULL tp_new

Tags:

pyspark

I am using Python 2.7 with Spark 1.5.1 and I get the error below:

df = sqlContext.read.parquet(".....").cache()
df = df.filter(df.foo == 1).select("a", "b", "c")

def myfun(row):
    return pyspark.sql.Row(....)

rdd = df.map(myfun).cache()
rdd.first()
==> UnpicklingError: NEWOBJ class argument has NULL tp_new

What's wrong?

1 Answer

As usual, the pickling error boiled down to myfun closing over an unpicklable object (in the real code, a pygeoip.GeoIP database reader created at module level).
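To confirm the diagnosis, try pickling the offending object on the driver. This is a minimal sketch, assuming the same pygeoip database used below; the exact exception text may differ from the one Spark reports, but an object that wraps an open file handle will not pickle:

import pickle
import pygeoip

db = pygeoip.GeoIP("/usr/share/GeoIP/GeoIPCity.dat")
try:
    # if this fails, any closure that captures db will fail to ship as well
    pickle.dumps(db)
except Exception as exc:
    print("db is not picklable:", exc)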

As usual, the solution is to use mapPartitions, creating the unpicklable object inside the function so it is constructed on each executor and never has to be serialized:

import pygeoip

def get_geo(rows):
    # Open the GeoIP database once per partition, on the executor,
    # instead of shipping it from the driver.
    db = pygeoip.GeoIP("/usr/share/GeoIP/GeoIPCity.dat")
    for row in rows:
        d = row.asDict()
        d["new"] = db.record_by_addr(row.client_ip) if row.client_ip else "noIP"
        yield d

rdd.mapPartitions(get_geo)
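To check that the error is gone, force evaluation of one element (geo_rdd is just a placeholder name for the result):

geo_rdd = rdd.mapPartitions(get_geo)
print(geo_rdd.first())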

instead of map with the database created at module level on the driver, which forces Spark to pickle db together with get_geo and triggers the error above:

import pygeoip
# Created on the driver; the closure of get_geo has to carry this object
# to the executors, and it does not survive (un)pickling.
db = pygeoip.GeoIP("/usr/share/GeoIP/GeoIPCity.dat")

def get_geo(row):
    d = row.asDict()
    d["new"] = db.record_by_addr(row.client_ip) if row.client_ip else "noIP"
    return d

rdd.map(get_geo)