I would like to register a dataset from ADLS Gen2 in my Azure Machine Learning workspace (azureml-core==1.12.0). Given that service principal information is not listed as required in the Python SDK documentation for .register_azure_data_lake_gen2(), I successfully used the following code to register ADLS Gen2 as a datastore:
import os
from azureml.core import Datastore

adlsgen2_datastore_name = os.environ['adlsgen2_datastore_name']
account_name = os.environ['account_name']  # ADLS Gen2 account name
file_system = os.environ['filesystem']     # ADLS Gen2 filesystem (container) name

adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,  # an azureml.core.Workspace, e.g. ws = Workspace.from_config()
    datastore_name=adlsgen2_datastore_name,
    account_name=account_name,
    filesystem=file_system
)
However, when I try to register a dataset using
from azureml.core import Dataset
adls_ds = Datastore.get(ws, datastore_name=adlsgen2_datastore_name)
data = Dataset.Tabular.from_delimited_files((adls_ds, 'folder/data.csv'))
I get the following error:

Cannot load any data from the specified path. Make sure the path is accessible and contains data. ScriptExecutionException was caused by StreamAccessException. StreamAccessException was caused by AuthenticationException. 'AdlsGen2-ReadHeaders' for '[REDACTED]' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID <CLIENT_REQUEST_ID>, request ID <REQUEST_ID>. Error message: [REDACTED] | session_id=<SESSION_ID>
Do I need to enable a service principal to get this to work? In the ML Studio UI, it appears that a service principal is required even to register the datastore.
Another issue I noticed is that AMLS is trying to access the dataset at:
https://adls_gen2_account_name.**dfs**.core.windows.net/container/folder/data.csv
whereas the actual URI in ADLS Gen2 is:
https://adls_gen2_account_name.**blob**.core.windows.net/container/folder/data.csv
For access control, ADLS Gen2 has two layers: role-based access control (roles) and access control lists (ACLs). There are three primary roles: an owner of a data lake has full control over everything, including other users' permissions; a contributor can read, write, and delete data, but cannot change other users' permissions; and a reader can only read data.
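In practice, the role that matters for reading data through a datastore is Storage Blob Data Reader (or higher) on the storage account. One way to confirm that a service principal actually has it is to read the file directly with the same credentials, outside of AMLS. This is only a minimal sketch, assuming the azure-identity and azure-storage-file-datalake packages are installed and reusing the same account_name, file_system, tenant_id, client_id, and client_secret values that appear in the code below:

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate as the service principal (credential values are assumptions -- use your own).
credential = ClientSecretCredential(tenant_id, client_id, client_secret)

# ADLS Gen2 is addressed through the dfs endpoint, not the blob endpoint.
service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=credential,
)

# A principal with Storage Blob Data Reader can fetch file properties and content.
file_client = service.get_file_system_client(file_system).get_file_client('folder/data.csv')
print(file_client.get_file_properties())

If this raises a 403, the problem is the role assignment (or an ACL) on the storage side, not AMLS.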
According to this documentation, you need to enable the service principal:
1. Register your application and grant the service principal Storage Blob Data Reader access.
2. Try this code:
adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name=adlsgen2_datastore_name,
    account_name=account_name,
    filesystem=file_system,
    tenant_id=tenant_id,          # directory (tenant) ID of the service principal
    client_id=client_id,          # application (client) ID of the service principal
    client_secret=client_secret   # client secret of the service principal
)
adls_ds = Datastore.get(ws, datastore_name=adlsgen2_datastore_name)
dataset = Dataset.Tabular.from_delimited_files((adls_ds, 'sample.csv'))
print(dataset.to_pandas_dataframe())
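Once to_pandas_dataframe() succeeds, the registration step from the original question is one more call. A short sketch using TabularDataset.register from azureml-core; the dataset name 'adls_sample' here is just a placeholder:

# Register the dataset under a workspace name so it can be retrieved
# later with Dataset.get_by_name(ws, 'adls_sample').
registered = dataset.register(
    workspace=ws,
    name='adls_sample',  # hypothetical name -- choose your own
    description='Delimited file read from the ADLS Gen2 datastore',
    create_new_version=True
)
print(registered.name, registered.version)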