Reading Excel (.xlsx) file in pyspark

Tags: apache-spark, pyspark, spark-excel

I am trying to read a .xlsx file from a local path in PySpark.

I've written the code below:

from pyspark.shell import sqlContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master('local') \
      .appName('Planning') \
      .enableHiveSupport() \
      .config('spark.executor.memory', '2g') \
      .getOrCreate()

df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()

Error:

TypeError: 'DataFrameReader' object is not callable
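
The message points at the cause: in PySpark, read is a property that returns a DataFrameReader, not a method, so sqlContext.read(...) tries to call the reader object itself. The sketch below reproduces the same mechanism in plain Python. FakeReader and FakeContext are hypothetical stand-ins written for illustration, not PySpark classes.

```python
class FakeReader:
    """Hypothetical stand-in for a reader object with format-specific methods."""

    def csv(self, path):
        return f"would read CSV from {path}"


class FakeContext:
    """Hypothetical stand-in for an object that exposes `read` as a property."""

    @property
    def read(self):
        return FakeReader()


ctx = FakeContext()

try:
    # Wrong: this calls the FakeReader instance itself, which is not callable.
    ctx.read("data.csv")
except TypeError as exc:
    message = str(exc)

# Right: access the property, then call a method on the reader it returns.
result = ctx.read.csv("data.csv")

print(message)
print(result)
```

In real PySpark the equivalent fix is to call a method on the reader (for example format(...) followed by load(...)) instead of calling read directly.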

368

asked Jan 22 '20 07:01

OMG

2 Answers

You can use pandas to read the .xlsx file and then convert the result to a Spark DataFrame.

from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

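# Read one sheet with pandas (reading .xlsx requires an Excel engine such as openpyxl).
# pandas infers column types on its own, so no inferSchema option is needed.
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
# Convert the pandas DataFrame to a Spark DataFrame.
df = spark.createDataFrame(pdf)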

df.show()

131

answered Nov 14 '22 21:11

Ghost

You could use the crealytics spark-excel package.

Add it to Spark either through its Maven coordinates or when starting the Spark shell, as below. The same --packages flag also works with the pyspark shell.

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1

For Databricks users: add it as a library by navigating to Cluster > 'clusterName' > Libraries > Install New, then provide 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.

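# Parentheses let the method chain span several lines in Python.
df = (
    spark.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheet1'!")
    .option("header", "true")
    .option("inferSchema", "true")
    # Raw string, so the "\t" in the Windows path is not read as a tab.
    .load(r"C:\P_DATA\tyco_93_A.xlsx")
)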
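
Windows paths like the one above need care in Python string literals: in a normal string, the \t in \tyco_93_A.xlsx is read as a tab character, so the path is safer written as a raw string or with forward slashes. A small sketch to check this in plain Python, with no Spark needed (the variable names are only illustrative):

```python
from pathlib import PureWindowsPath

# In a normal string literal, "\t" is an escape sequence for a tab.
plain = "P_DATA\tyco_93_A.xlsx"
# In a raw string, the backslash and the "t" stay as two separate characters.
raw = r"P_DATA\tyco_93_A.xlsx"

has_tab_plain = "\t" in plain
has_tab_raw = "\t" in raw

# Forward slashes avoid the escaping problem altogether.
forward = PureWindowsPath(r"C:\P_DATA\tyco_93_A.xlsx").as_posix()

print(has_tab_plain, has_tab_raw)
print(forward)
```

Either the raw string or the forward-slash form can be passed to load(...).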

More options are documented on the GitHub page below.

https://github.com/crealytics/spark-excel

45

answered Nov 14 '22 22:11

Deva
