I want to read read excel files as spark dataframe with 3rd row as a header.The synatax to read excel files as spark dataframe with 1st row as header is :
s_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(path + 'Sales.xlsx')
and the equivalent syntax to read as pandas dataframe with 3rd row as header is :
p_df = pd.read_excel(path + 'Sales.xlsx',header=3)
I want to do the same thing in pyspark that is to read excel files as spark dataframe with 3rd row as header.
Use the dataAddress option to specify the cell/row where the data is located . As you need to skip two rows, your data (including header) starts from row A3.
s_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema","true") \
.option("dataAddress", "'Sheet1'!A3") \
.load("yourfilepath")
Also, note that if your first two rows are empty, then dataAddress does not have to specified. The leading null rows will be skipped by default.
Check the documentation here
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With