I am using Databricks on Azure, and PySpark reads data that is dumped into Azure Data Lake Storage (ADLS).
Every now and then, when I try to read the data from ADLS like so:
spark.read.format('delta').load('/path/to/adls/mounted/interim_data.delta')
it throws the following error:
AnalysisException: `/path/to/adls/mounted/interim_data.delta` is not a Delta table.
The data definitely exists; the folder contents and files show up when I run
%fs ls /path/to/adls/mounted/interim_data.delta
Right now the only fix is to re-run the script that populates interim_data.delta, which is not a viable workaround.
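For anyone hitting something similar, this is the kind of check that shows what Spark actually sees at read time (a minimal sketch; the path is the same mounted placeholder as above):

from delta.tables import DeltaTable

path = '/path/to/adls/mounted/interim_data.delta'
# True only if Spark can currently see a valid _delta_log under this path
print(DeltaTable.isDeltaTable(spark, path))
# Listing the directory separately shows whether the raw files are visible at all
display(dbutils.fs.ls(path))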
I am answering my own question...
TL;DR: The root cause of the issue was frequent remounting of ADLS.
There was a section of code that remounted ADLS Gen2 to Azure Databricks (ADB). When other teams ran their scripts, the remount took 20-45 seconds, and as the number of scripts running on the high-concurrency cluster increased, it was only a matter of time before one of us hit the issue: a script tried to read data from ADLS while the storage was being remounted.
That is why the error appeared intermittent.
Why was this remounting hack in place? It was added because we had faced an issue where data was not showing up in ADB even though it was visible in ADLS Gen2, and the only fix we knew of back then was to force a remount to make that data visible in ADB.
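For context, the remount hack was roughly the following (a sketch only; the mount point, container, storage account, and secret scope/key names are placeholders, not our actual values):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="my-scope", key="client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="my-scope", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

mount_point = "/mnt/adls"  # placeholder for the real mount point

# Unmount if already mounted, then mount again. While this runs (20-45 seconds),
# the mount point briefly has no visible _delta_log, and any concurrent read of
# it can fail with the "is not a Delta table" AnalysisException.
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

dbutils.fs.mount(
    source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",
    mount_point=mount_point,
    extra_configs=configs,
)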