I want to read the CSV files in a directory into PySpark DataFrames and then append them into a single DataFrame. I can't find the PySpark equivalent of what we do in pandas.
For example, in pandas we do:
files = glob.glob(path + '*.csv')
df = pd.DataFrame()
for f in files:
    dff = pd.read_csv(f, delimiter=',')
    df = df.append(dff)
In PySpark I have tried this, but without success:
schema = StructType([])
union_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for f in files:
    dff = sqlContext.read.load(f, format='com.databricks.spark.csv', header='true', inferSchema='true', delimiter=',')
    union_df = union_df.unionAll(dff)
Would really appreciate any help.
Thanks
Define the schema first, and then you can use unionAll to concatenate new DataFrames to the empty one, and even run iterations to combine a bunch of DataFrames together.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(conf=SparkConf())
spark = SparkSession(sc)  # need a SparkSession to call createDataFrame

# the schema must match the columns of the DataFrames you union onto it
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True)
])

empty = spark.createDataFrame(sc.emptyRDD(), schema)
empty = empty.unionAll(addOndata)  # addOndata is the DataFrame you want to append
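Applied to the original question, a minimal sketch of the loop (assuming every CSV file shares the column1/column2 schema above and that path points at your directory) might look like:
import glob

files = glob.glob(path + '*.csv')
union_df = spark.createDataFrame(sc.emptyRDD(), schema)
for f in files:
    dff = spark.read.csv(f, header=True, schema=schema)  # read each file with the same schema
    union_df = union_df.unionAll(dff)                    # append it to the accumulator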
You can do away with the empty DataFrame here. Create an empty list and keep adding the child DataFrames to it. Once you're done adding all the DataFrames that you want to combine, do a reduce with union on the list and it will combine all of them into one DataFrame.
from functools import reduce  # reduce lives in functools in Python 3
from pyspark.sql import DataFrame

list_of_dfs = []
for i in range(number_of_dfs):
    list_of_dfs.append(df_i)  # df_i: the i-th child DataFrame
combined_df = reduce(DataFrame.union, list_of_dfs)
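Tied back to the CSV use case, a hedged sketch (assuming path is the directory and all files share one schema) could be:
import glob
from functools import reduce
from pyspark.sql import DataFrame

files = glob.glob(path + '*.csv')
dfs = [spark.read.csv(f, header=True, inferSchema=True) for f in files]  # one DataFrame per file
combined_df = reduce(DataFrame.union, dfs)                               # fold them into one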
Here is how I do it. I don't create an empty DataFrame.
def concat_spark_iterator(iterator):
    """
    :param iterator: iterator(Spark DataFrame)
    :return: Concatenated Spark DataFrames
    """
    df = next(iterator)
    for _df in iterator:
        df = df.union(_df)
    return df
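One possible way to use it for the CSV case (assuming path and a shared schema across the files) is to pass a generator of reads:
import glob

files = glob.glob(path + '*.csv')
combined = concat_spark_iterator(
    spark.read.csv(f, header=True, inferSchema=True) for f in files
)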
One way of getting this done in Spark 2.1 is as below:
files = glob.glob(path + '*.csv')
for idx, f in enumerate(files):
    if idx == 0:
        df = spark.read.csv(f, header=True, inferSchema=True)
        dff = df
    else:
        df = spark.read.csv(f, header=True, inferSchema=True)
        dff = dff.unionAll(df)
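If all the files share a schema, it may also be possible to skip the loop entirely, since spark.read.csv accepts a list of paths; a sketch of that simpler alternative:
files = glob.glob(path + '*.csv')
dff = spark.read.csv(files, header=True, inferSchema=True)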
The schema should be the same when using unionAll on two DataFrames. Therefore, the schema of the empty DataFrame should match the schema of the CSV files.
For example:
from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("v1", LongType(), True),
    StructField("v2", StringType(), False),
    StructField("v3", StringType(), False)
])
df = sqlContext.createDataFrame([], schema)
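With that in place, a hedged version of the original loop (assuming the CSV files really contain columns v1, v2 and v3) might be:
for f in files:
    dff = sqlContext.read.load(f, format='com.databricks.spark.csv',
                               header='true', schema=schema, delimiter=',')
    df = df.unionAll(dff)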
Or you can do it like this:
f = files.pop(0)
df = sqlContext.read.load(f, format='com.databricks.spark.csv', header='true', inferSchema='true', delimiter=',')
for f in files:
    dff = sqlContext.read.load(f, format='com.databricks.spark.csv', header='true', inferSchema='true', delimiter=',')
    df = df.unionAll(dff)