Azure databricks dataframe write gives job abort error

Question

I am trying to write a DataFrame as a CSV file to Azure Data Lake Gen2, but the write fails with a "Job aborted" error. The same code used to work fine.

Error Message:

org.apache.spark.SparkException: Job aborted.

Code:

import requests
from pyspark.sql import Row

response = requests.get('https://myapiurl.com/v1/data', auth=('user', 'password'))
data = response.json()
df = spark.createDataFrame([Row(**i) for i in data])
df.write.format(source).mode("overwrite").save(path)  # error line

Jim Xu · Accepted Answer

I summarize the solution below.

If you want to access Azure Data Lake Gen2 from Azure Databricks, you have two options.

The first option is to mount Azure Data Lake Gen2 as part of the Azure Databricks file system. After that, you can read and write files with the path /mnt/<>. The mount code only needs to be run once.

a. Create a service principal and assign it the Storage Blob Data Contributor role, scoped to the Data Lake Storage Gen2 storage account.

 az login

 az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
--scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>

b. Mount the storage with the following code, using the service principal's credentials:

 configs = {"fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<appId>",
  "fs.azure.account.oauth2.client.secret": "<clientSecret>",
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
  "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

 dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point = "/mnt/flightdata",
    extra_configs = configs)
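
Because the mount only has to be created once, it can help to check for an existing mount before calling dbutils.fs.mount again, since mounting onto a mount point that is already in use typically raises an error. A minimal sketch, assuming dbutils is the object Databricks provides in notebooks; mount_once is a hypothetical helper name, not part of the Databricks API:

```python
def mount_once(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless something is already mounted there.

    `dbutils` is passed in explicitly so the helper can be exercised outside a notebook.
    Returns True if a new mount was created, False if the mount point was already in use.
    """
    # dbutils.fs.mounts() lists the existing mounts; each entry has a mountPoint attribute.
    already_mounted = any(m.mountPoint == mount_point for m in dbutils.fs.mounts())
    if already_mounted:
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=extra_configs)
    return True
```

In a notebook this could be called as mount_once(dbutils, "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1", "/mnt/flightdata", configs), after which a DataFrame can be written to a path under /mnt/flightdata.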

The second option is to access the storage directly using the storage account access key.

We can add spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>") to our script. The second argument must be the access key itself. After that, we can read and write files with paths of the form abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/.

For example:

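 from pyspark.sql.types import StringType

 # Register the storage account access key for this Spark session
 spark.conf.set(
     "fs.azure.account.key.testadls05.dfs.core.windows.net", "<account access key>")

 # Build a one-column test DataFrame and write it as a single CSV part file with a header
 df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
 df.show()
 df.coalesce(1).write.format('csv').option('header', True).mode('overwrite').save('abfss://<file-system-name>@testadls05.dfs.core.windows.net/result_csv')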
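
The configuration key and the abfss path both follow a fixed pattern, so they can be built from the storage account and file system names instead of being typed by hand. A small sketch; account_key_conf and abfss_uri are hypothetical helper names introduced here for illustration only:

```python
def account_key_conf(storage_account):
    """Return the Spark config name that holds the access key for `storage_account`."""
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


def abfss_uri(file_system, storage_account, path=""):
    """Build an abfss:// URI for a path inside a Data Lake Gen2 file system."""
    base = f"abfss://{file_system}@{storage_account}.dfs.core.windows.net"
    # Drop any leading slash so the result never contains a double slash after the host.
    return f"{base}/{path.lstrip('/')}"


# In a notebook, where `spark` and `df` already exist (placeholders left unfilled):
# spark.conf.set(account_key_conf("testadls05"), "<account access key>")
# df.write.format("csv").option("header", True).mode("overwrite") \
#     .save(abfss_uri("<file-system-name>", "testadls05", "result_csv"))
```

Keeping the account name in one place makes it harder for the key setting and the write path to point at different storage accounts.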



Tags: pyspark, azure-databricks, azure-data-lake-gen2
