
Why does PySpark throw "AnalysisException: `/path/to/adls/mounted/interim_data.delta` is not a Delta table" even though the file exists?

I am using Databricks on Azure, and PySpark reads data that is dumped into Azure Data Lake Storage (ADLS). Every now and then, when I try to read the data from ADLS like so:
spark.read.format('delta').load('/path/to/adls/mounted/interim_data.delta')

it throws the following error:

AnalysisException: `/path/to/adls/mounted/interim_data.delta` is not a Delta table.

The data definitely exists; the folder contents and files show up when I run
%fs ls /path/to/adls/mounted/interim_data.delta

Right now the only fix is to re-run the script that populated the interim_data.delta table above, which is not a viable fix.
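For anyone debugging the same symptom, here is a minimal sketch of how the path could be probed before reading, assuming a Databricks notebook with the delta-spark library available; the path, retry count, and wait time are illustrative only:

import time
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is already defined

path = "/path/to/adls/mounted/interim_data.delta"  # placeholder path from the question

def read_delta_with_retry(path, retries=3, wait_seconds=30):
    # A location only counts as a Delta table when Spark can see its _delta_log
    # directory, so probe for that first and wait briefly if it is not visible yet.
    for _ in range(retries):
        if DeltaTable.isDeltaTable(spark, path):
            return spark.read.format("delta").load(path)
        time.sleep(wait_seconds)
    raise RuntimeError(f"{path} is still not recognised as a Delta table after {retries} attempts")

df = read_delta_with_retry(path)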

asked Oct 20 '25 by Rony

1 Answer

I am answering my own question...

TL;DR: the root cause of the issue was frequent remounting of ADLS.

There was a section of code that remounted ADLS Gen2 to ADB (Azure Databricks). When other teams ran their scripts, the remounting took 20-45 seconds, and as the number of scripts running on the high-concurrency cluster increased, it was only a matter of time before one of us hit the issue, where a script tried to read data from ADLS while it was being remounted...

This is why the error showed up only intermittently...
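A minimal sketch of what a "mount only if missing" guard could look like, assuming a standard Databricks notebook where `dbutils` is predefined; the mount point, source URI, and configs below are placeholders, not our actual values:

mount_point = "/mnt/adls"  # hypothetical mount point
source_uri = "abfss://container@account.dfs.core.windows.net/"  # hypothetical ADLS Gen2 URI
configs = {}  # the OAuth / access-key settings for the storage account would go here

# Remount only when the path is not mounted yet, so concurrent jobs never see
# the mount disappear mid-read the way an unconditional remount does.
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(source=source_uri, mount_point=mount_point, extra_configs=configs)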

Why was this remounting hack in place? It was put there because we faced an issue where data was not showing up in ADB even though it was visible in ADLS Gen2, and the only way to fix this back then was to force a remount to make that data visible in ADB.

answered Oct 22 '25 by Rony