
Azure databricks dataframe write gives job abort error

I am trying to write data to a CSV file and store it on Azure Data Lake Gen2, but the write fails with a "Job aborted" error. The same code used to work fine previously.

Error Message:

org.apache.spark.SparkException: Job aborted.   

Code:

import requests
from pyspark.sql import Row

response = requests.get('https://myapiurl.com/v1/data', auth=('user', 'password'))
data = response.json()

# Build a DataFrame from the JSON records returned by the API
df = spark.createDataFrame([Row(**i) for i in data])
df.write.format(source).mode("overwrite").save(path)  # error line
asked Dec 18 '25 by paone

1 Answer

I summarize the solution below.

If you want to access Azure Data Lake Gen2 from Azure Databricks, you have two options.

  1. Mount Azure Data Lake Gen2 as a file system in Azure Databricks. After doing that, you can read and write files with paths under /mnt/<mount-name>. The mount code only needs to be run once. (A short write example using the mount path is sketched after option 2 below.)

    a. Create a service principal and assign the Storage Blob Data Contributor role to it, scoped to the Data Lake Storage Gen2 storage account:

     az login
    
     az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
         --scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
    

    b. Mount the file system with the following code:

     # OAuth settings for the service principal created in step a
     configs = {"fs.azure.account.auth.type": "OAuth",
                "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
                "fs.azure.account.oauth2.client.id": "<appId>",
                "fs.azure.account.oauth2.client.secret": "<clientSecret>",
                "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
                "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

     # Mount the container so it is reachable under /mnt/flightdata
     dbutils.fs.mount(
         source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
         mount_point = "/mnt/flightdata",
         extra_configs = configs)
    
  2. Access directly using the storage account access key.

    We can add spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>") to our script (the second argument is the access key value itself, not its name). Then we can read and write files with paths of the form abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/.

    For example:

     from pyspark.sql.types import StringType

     # Authenticate to the storage account with its access key
     spark.conf.set(
         "fs.azure.account.key.testadls05.dfs.core.windows.net", "<account access key>")

     # Build a small demo DataFrame and write it out as CSV;
     # coalesce(1) produces a single part file inside the result_csv folder
     df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
     df.show()
     df.coalesce(1).write.format('csv').option('header', True).mode('overwrite') \
       .save('abfss://<file-system-name>@testadls05.dfs.core.windows.net/result_csv')
    

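Coming back to option 1: once the file system is mounted, you write through the /mnt path instead of an abfss URI. Below is a minimal sketch assuming the /mnt/flightdata mount point from step b above; the sample rows are illustrative placeholders, not the question's API data.

    from pyspark.sql import Row

    # Assumes /mnt/flightdata was mounted as in step b of option 1
    sample = [Row(id=1, name="alice"), Row(id=2, name="bob")]
    df = spark.createDataFrame(sample)

    # With the mount in place, plain /mnt paths work for both reads and writes
    (df.coalesce(1)
       .write.format('csv')
       .option('header', True)
       .mode('overwrite')
       .save('/mnt/flightdata/result_csv'))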

For more details, please refer to the official documentation on accessing Azure Data Lake Storage Gen2 from Azure Databricks.

answered Dec 21 '25 by Jim Xu


