I had asked a previous question about how to Convert scipy sparse matrix to pyspark.sql.dataframe.DataFrame, and made some progress after reading the answer provided, as well as this article. I eventually came to the following code for converting a scipy.sparse.csc_matrix to a pandas dataframe:
df = pd.DataFrame(csc_mat.todense()).to_sparse(fill_value=0)
df.columns = header
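(For reference, DataFrame.to_sparse has since been removed from pandas; on current pandas versions an equivalent conversion can be sketched with the .sparse accessor. Here csc_mat and header match the names above, and the example matrix is just a stand-in for illustration:)
import pandas as pd
from scipy import sparse

# build a sparse pandas DataFrame directly from the scipy CSC matrix
# (pandas >= 0.25 exposes this via the .sparse accessor)
csc_mat = sparse.random(5, 3, density=0.3, format="csc")  # placeholder matrix
header = ["col_a", "col_b", "col_c"]                      # placeholder column names
df = pd.DataFrame.sparse.from_spmatrix(csc_mat, columns=header)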
I then tried converting the pandas dataframe to a spark dataframe using the suggested syntax:
spark_df = sqlContext.createDataFrame(df)
However, I get back the following error:
ValueError: cannot create an RDD from type: <type 'list'>
I do not believe it has anything to do with the sqlContext as I was able to convert another pandas dataframe of roughly the same size to a spark dataframe, no problem. Any thoughts?
Convert Pandas to PySpark (Spark) DataFrame: Spark provides a createDataFrame(pandas_dataframe) method to convert a pandas DataFrame to a Spark DataFrame, and by default it infers the schema by mapping pandas data types to PySpark data types. If you want all columns as strings, use spark.createDataFrame(pandasDF.astype(str)).
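A minimal sketch of both variants (assuming spark is an existing SparkSession and pandasDF an existing pandas DataFrame):
# schema inferred from the pandas dtypes
spark_df = spark.createDataFrame(pandasDF)
spark_df.printSchema()

# force every column to StringType by casting in pandas first
spark_str_df = spark.createDataFrame(pandasDF.astype(str))
spark_str_df.printSchema()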
In very simple terms, pandas runs operations on a single machine whereas PySpark runs on multiple machines. If you are working on a machine learning application with larger datasets, PySpark is a better fit, as it can run operations many times (even 100x) faster than pandas.
Conclusion: do not try to replace pandas with Spark; they are complementary, each with its own pros and cons. Whether to use pandas or Spark depends on your use case. For most machine learning tasks you will probably end up using pandas, even if you do your preprocessing with Spark.
I am not sure whether this question is still relevant to the current version of pySpark, but here is the solution I worked out a couple of weeks after posting it. The code is rather ugly and possibly inefficient, but I am posting it here due to the continued interest in this question:
from functools import reduce

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
from py4j.protocol import Py4JJavaError

myConf = SparkConf(loadDefaults=True)
sc = SparkContext(conf=myConf)
hc = HiveContext(sc)

def chunks(lst, k):
    """Yield at most k chunks of close to equal size."""
    n = -(-len(lst) // k)  # ceiling division, so no more than k chunks are produced
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def reconstruct_rdd(lst, num_parts):
    """Parallelize lst chunk by chunk and union the pieces into one RDD."""
    prime_rdd = None
    for part, partition in enumerate(chunks(lst, num_parts)):  # partition is a list of lists
        print("Partition", part, "started...")
        if prime_rdd is None:
            prime_rdd = sc.parallelize(partition)
        else:
            prime_rdd = prime_rdd.union(sc.parallelize(partition))
        print("Partition", part, "complete!")
    return prime_rdd

def build_col_name_list(len_cols):
    """Return the default column names Spark assigns: "_1", "_2", ..., "_<len_cols>"."""
    return ["_" + str(i) for i in range(1, len_cols + 1)]

def set_spark_df_header(header, sdf):
    """Rename the default "_1", "_2", ... columns to the supplied header."""
    oldColumns = build_col_name_list(len(sdf.columns))
    newColumns = header
    sdf = reduce(lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
                 range(len(oldColumns)), sdf)
    return sdf

def convert_pdf_matrix_to_sdf(pdf, sdf_header, num_of_parts):
    """Convert a (possibly sparse) pandas DataFrame to a Spark DataFrame."""
    try:
        sdf = hc.createDataFrame(pdf)
    except ValueError:
        lst = pdf.values.tolist()  # need a list of lists to parallelize
        try:
            rdd = sc.parallelize(lst)
        except Py4JJavaError:
            rdd = reconstruct_rdd(lst, num_of_parts)
        sdf = hc.createDataFrame(rdd)
        sdf = set_spark_df_header(sdf_header, sdf)
    return sdf
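For example, with the sparse pandas DataFrame df and the header list from the question, a call might look like this (the chunk count of 10 is an arbitrary choice for illustration):
spark_df = convert_pdf_matrix_to_sdf(df, header, 10)
spark_df.show(5)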