 

Azure Data Factory - Clean Up Batch Task Files

I'm working with Azure Data Factory v2, using a Batch Account Pool with dedicated nodes to do processing. I'm finding that over time the Batch Activity fails because there's no more space on the D:/ temp drive on the nodes. For each ADF job, a working directory is created on the node, and after the job completes the files aren't cleaned up. Wondering if anybody else has encountered this before and what the best solution to implement is.

EDIT: There now seems to be a file retention setting in ADF that wasn't present when I posted the question. For anybody coming across the same issue in the future, that's a possible solution.

Mike R asked Feb 02 '26

2 Answers

Figured out a solution, posting to hopefully help the next person that comes along.

Using the Azure Batch Python SDK, I created a small script that iterates through all the pools and nodes on an account and deletes any files in the workitems directory that are older than one day.

import azure.batch as batch
import azure.batch.operations
from azure.batch.batch_auth import SharedKeyCredentials
import msrest.service_client
from datetime import datetime

program_datetime = datetime.utcnow()

batch_account = 'batchaccount001'
batch_url = 'https://batchaccount001.westeurope.batch.azure.com'
batch_key = '<BatchKeyGoesHere>'
batch_credentials = SharedKeyCredentials(batch_account, batch_key)

# Create a Batch client with which to do pool/node operations
batch_client = batch.BatchServiceClient(credentials=batch_credentials,
                                        batch_url=batch_url)

service_client = msrest.service_client.ServiceClient(batch_credentials, batch_client.config)

# File operations client, reused across all pools and nodes
fo_client = azure.batch.operations.FileOperations(service_client,
                                                  config=batch_client.config,
                                                  serializer=batch_client._serialize,
                                                  deserializer=batch_client._deserialize)

# Iterate over every node in every pool on the account
for pool in batch_client.pool.list():
    for node in batch_client.compute_node.list(pool.id):
        pool_id = pool.id
        node_id = node.id
        print(f'Pool = {pool_id}, Node = {node_id}')
        files = fo_client.list_from_compute_node(pool_id,
                                                 node_id,
                                                 recursive=True)

        for file in files:
            # Directories do not have a last_modified property, so skip them.
            if not file.is_directory:
                file_datetime = file.properties.last_modified.replace(tzinfo=None)
                file_age_in_seconds = (program_datetime - file_datetime).total_seconds()
                # Delete anything in the workitems directory older than a day (86400 s).
                if file_age_in_seconds > 86400 and file.name.startswith('workitems'):
                    print(f'{file_age_in_seconds} : {file.name}')
                    fo_client.delete_from_compute_node(pool_id, node_id, file.name)
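The age check in the loop can also be factored into a small standalone helper, which makes the cutoff logic easy to test without a Batch account. This is just an illustrative sketch; `is_stale` is not part of the original script:

```python
from datetime import datetime

def is_stale(last_modified, now, max_age_seconds=86400):
    """Return True if last_modified is more than max_age_seconds before now."""
    # Strip timezone info so both datetimes are naive, as in the script above.
    age = (now - last_modified.replace(tzinfo=None)).total_seconds()
    return age > max_age_seconds

# A file modified two days ago is stale; one modified an hour ago is not.
now = datetime(2020, 1, 3, 12, 0, 0)
print(is_stale(datetime(2020, 1, 1, 12, 0, 0), now))  # True
print(is_stale(datetime(2020, 1, 3, 11, 0, 0), now))  # False
```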
Mike R answered Feb 05 '26


I'm an engineer with Azure Data Factory. ADF used an Azure Batch SDK version earlier than 2018-12-01.8.0, so Batch tasks created via ADF defaulted to an infinite retention period, as mentioned earlier. We're rolling out a fix that defaults the retention period for Batch tasks created through ADF to 30 days going forward, and that introduces a property, retentionTimeInDays, in the typeProperties of the custom activity, which customers can set in their ADF pipelines to override the default. Once this has rolled out, the documentation at https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity#custom-activity will be updated with more details. Thank you for your patience.
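Based on the property name above, setting the retention period in a pipeline's custom activity might look something like the following fragment. This is a sketch, not official documentation: the activity name, command, and linked service name are placeholders, and only retentionTimeInDays is taken from the answer above.

```json
{
    "name": "MyCustomActivity",
    "type": "Custom",
    "linkedServiceName": {
        "referenceName": "AzureBatchLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "command": "cmd /c echo hello",
        "retentionTimeInDays": 7
    }
}
```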

Minh Do answered Feb 05 '26


