
How to read xlsx or xls files as spark dataframe

Can anyone let me know how to read xlsx or xls files directly as a Spark DataFrame, without converting them first?

I have already tried reading the file with pandas and then converting it to a Spark DataFrame, but I got the following error:

Error:

Cannot merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>

Code:

import pandas
import os
df = pandas.read_excel('/dbfs/FileStore/tables/BSE.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)
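
For context, this merge error usually means that one Excel column mixes numbers and text, so pandas stores it with dtype object and Spark cannot infer a single type for it. A minimal sketch with made-up data that reproduces the pattern:

import pandas as pd

# Hypothetical data: a column mixing numbers and strings gets dtype 'object',
# which is what makes spark.createDataFrame trip over conflicting inferred types.
mixed = pd.DataFrame({"price": [10.5, "N/A"]})
print(mixed.dtypes)              # price    object
# spark.createDataFrame(mixed)   # would hit the same merge-type error as above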
Ravi Kiran asked Jun 03 '19

2 Answers

Here is a general version, updated as of April 2021, based on the answers of @matkurek and @Peter Pan.

SPARK

You should install the following two libraries on your Databricks cluster:

  1. Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5

  2. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd

Then you will be able to read your Excel file as follows:

sparkDF = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load(filePath)
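
If you only need part of a sheet, the same dataAddress option also accepts a cell range (the sheet name and range below are placeholders):

sparkDF = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!B3:C35") \
    .load(filePath)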

PANDAS

You should install the following two libraries on your Databricks cluster:

  1. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd

  2. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: openpyxl

Then you will be able to read your Excel file as follows:

import pandas as pd
pandasDF = pd.read_excel(io=filePath, engine='openpyxl', sheet_name='NameOfYourExcelSheet')

Note that you will end up with two different objects: a Spark DataFrame in the first scenario and a pandas DataFrame in the second.
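
If you then need a Spark DataFrame from that pandas DataFrame (as in the original question), casting the columns to string first sidesteps the merge-type error; a minimal sketch:

# Cast everything to string so Spark does not have to reconcile conflicting
# inferred types; re-cast individual columns afterwards if needed.
sparkDF = spark.createDataFrame(pandasDF.astype(str))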

Andrea Baldino answered Sep 29 '22

As mentioned by @matkurek, you can read the Excel file directly with Spark. This is better practice than going through pandas, because otherwise you lose the benefit of Spark's distributed processing.

You can run the same code sample as defined above, just adding the required package to the configuration of your SparkSession.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

Then you can read your Excel file:

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load("your_file")
Jorge Abreu answered Sep 29 '22