I am trying to read a .xlsx file from local path in PySpark.
I've written the below code:
from pyspark.shell import sqlContext
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local') \
.appName('Planning') \
.enableHiveSupport() \
.config('spark.executor.memory', '2g') \
.getOrCreate()
df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()
Error:
TypeError: 'DataFrameReader' object is not callable
Read an Excel file into a pandas-on-Spark DataFrame or Series. Support both xls and xlsx file extensions from a local filesystem or URL. Support an option to read a single sheet or a list of sheets. The string could be a URL.
You can use pandas to read .xlsx file and then convert that to spark dataframe.
from pyspark.sql import SparkSession
import pandas
spark = SparkSession.builder.appName("Test").getOrCreate()
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname', inferSchema='true')
df = spark.createDataFrame(pdf)
df.show()
You could use crealytics package.
Need to add it to spark, either by maven co-ordinates or while starting the spark shell as below.
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
For databricks users- need to add it as a library by navigating Cluster - 'clusterName' - Libraries - Install New - Provide 'com.crealytics:spark-excel_2.12:0.13.1' under maven coordinates.
df = spark.read
.format("com.crealytics.spark.excel")
.option("dataAddress", "'Sheet1'!")
.option("header", "true")
.option("inferSchema", "true")
.load("C:\P_DATA\tyco_93_A.xlsx")
More options are available in below github page.
https://github.com/crealytics/spark-excel
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With