I am trying to write data to a csv files and store the file on Azure Data Lake Gen2 and run into job aborted error message. This same code used to work fine previously.
Error Message:
org.apache.spark.SparkException: Job aborted.
Code:
import requests
response = requests.get('https://myapiurl.com/v1/data', auth=('user', 'password'))
data = response.json()
from pyspark.sql import *
df=spark.createDataFrame([Row(**i) for i in data])
df.write.format(source).mode("overwrite").save(path) #error line
I summarize the solution below
If you want to access Azure data lake gen2 in Azure databricks, you have two choices to do that.
Mount Azure data lake gen2 as Azure databricks's file system. After doing that, you can read and write files with the path /mnt/<>. And We just need to run the code one time.
a. Create a service principal and assign Storage Blob Data Contributor to the sp in the scope of the Data Lake Storage Gen2 storage account
az login
az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
--scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
b. code
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<appId>",
"fs.azure.account.oauth2.client.secret": "<clientSecret>",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
mount_point = "/mnt/flightdata",
extra_configs = configs)
Access directly using the storage account access key.
We can add the code spark.conf.set( "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-name>") to our script. Then we can read and write files with path abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/.
for example
from pyspark.sql.types import StringType
spark.conf.set(
"fs.azure.account.key.testadls05.dfs.core.windows.net", "<account access key>")
df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
df.show()
df.coalesce(1).write.format('csv').option('header', True).mode('overwrite').save('abfss://[email protected]/result_csv')

For more details, please refer to here
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With