
How to read xlsx or xls files as spark dataframe

Can anyone let me know how to read xlsx or xls files directly as a Spark DataFrame, without converting them first?

I have already tried reading the file with pandas and then converting it to a Spark DataFrame, but I got the following error:

Error:

Cannot merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>

Code:

import pandas
import os
df = pandas.read_excel('/dbfs/FileStore/tables/BSE.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)
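
For context, this merge error usually means that one Excel column mixes numbers and text, so pandas stores it with dtype object and Spark cannot infer a single type for it. A minimal sketch with made-up data that reproduces the pattern:

import pandas as pd

# Hypothetical data: a column mixing numbers and strings gets dtype 'object',
# which is what makes spark.createDataFrame trip over conflicting inferred types.
mixed = pd.DataFrame({"price": [10.5, "N/A"]})
print(mixed.dtypes)              # price    object
# spark.createDataFrame(mixed)   # would hit the same merge-type error as above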
Ravi Kiran asked Jun 03 '19

2 Answers

Here is a general version, updated as of April 2021, based on the answers of @matkurek and @Peter Pan.

SPARK

You should install the following two libraries on your Databricks cluster:

  1. Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5

  2. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd

Then you will be able to read your Excel file as follows:

sparkDF = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load(filePath)
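
If you only need part of a sheet, the same dataAddress option also accepts a cell range (the sheet name and range below are placeholders):

sparkDF = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!B3:C35") \
    .load(filePath)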

PANDAS

You should install the following two libraries on your Databricks cluster:

  1. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd

  2. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: openpyxl

Then you will be able to read your Excel file as follows:

import pandas as pd
pandasDF = pd.read_excel(io=filePath, engine='openpyxl', sheet_name='NameOfYourExcelSheet')

Note that you will end up with two different objects: a Spark DataFrame in the first scenario and a pandas DataFrame in the second.
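
If you then need a Spark DataFrame from that pandas DataFrame (as in the original question), casting the columns to string first sidesteps the merge-type error; a minimal sketch:

# Cast everything to string so Spark does not have to reconcile conflicting
# inferred types; re-cast individual columns afterwards if needed.
sparkDF = spark.createDataFrame(pandasDF.astype(str))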

Andrea Baldino answered Sep 29 '22

As mentioned by @matkurek, you can read the Excel file directly with Spark. This is better practice than going through pandas, because otherwise you lose the benefit of Spark's distributed processing.

You can run the same code sample as defined above, just adding the required package to the configuration of your SparkSession.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

Then you can read your Excel file:

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load("your_file")
Jorge Abreu answered Sep 29 '22