
Configure standalone spark for azure storage access

I have a need to be able to run spark on my local machine to access azure wasb and adl urls, but I can't get it to work. I have a stripped down example here:

maven pom.xml (Brand-new pom, only the dependencies have been set):

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure-datalake</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-storage</artifactId>
        <version>6.0.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-data-lake-store-sdk</artifactId>
        <version>2.2.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-storage</artifactId>
        <version>7.0.0</version>
    </dependency>
</dependencies>

Java code (doesn't need to be Java; could be Scala):

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) {
        SparkConf config = new SparkConf();
        config.setMaster("local");
        config.setAppName("app");
        SparkSession spark = new SparkSession(new SparkContext(config));
        spark.read().parquet("wasb://container@host/path");
        spark.read().parquet("adl://host/path");
    }
}

No matter what I try I get:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: wasb

Same for adl. Every doc I can find on this either just says to add the azure-storage dependency, which I have done, or says to use HDInsight.

Any thoughts?

asked May 21 '18 by absmiths



2 Answers

I figured this out and decided to post a working project since that is always what I look for. It is hosted here:

azure-spark-local-sample

The crux of it, though, is what @Shankar Koirala suggested:

For WASB, set the property to allow the url scheme to be recognized:

config.set("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");

Then set the property which authorizes access to the account. You will need one of these for each account you need to access. These are generated through the Azure Portal under the Access Keys section of the Storage Account blade.

    config.set("fs.azure.account.key.[storage-account-name].blob.core.windows.net", "[access-key]");

Now for adl, assign the fs scheme as with WASB:

    config.set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
    // I don't know why this would be needed, but I saw it
    // on an otherwise very helpful page . . .
    config.set("spark.fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");

. . . and finally, set the client access keys in these properties, again for each different account you need to access:

    config.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");

    /* Client ID is generally the application ID from the azure portal app registrations*/
    config.set("fs.adl.oauth2.client.id", "[client-id]");

    /*The client secret is the key generated through the portal*/
    config.set("fs.adl.oauth2.credential", "[client-secret]");

    /*This is the OAUTH 2.0 TOKEN ENDPOINT under the ENDPOINTS section of the app registrations under Azure Active Directory*/
    config.set("fs.adl.oauth2.refresh.url", "[oauth-2.0-token-endpoint]");

I hope this is helpful, and I wish I could give credit to Shankar for the answer, but I also wanted to get the exact details out there.

answered Oct 22 '22 by absmiths


I am not sure about adl since I haven't tested it, but for wasb you need to define the file system to be used in the underlying Hadoop configuration.

Since you are using Spark 2.3 you can use the Spark session to create an entry point:

val spark = SparkSession.builder().appName("read from azure storage").master("local[*]").getOrCreate()

Now define the file system

spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey ")

Now read the parquet file as

val baseDir = "wasb[s]://yourContainer@yourAccount.blob.core.windows.net/"

val dfParquet = spark.read.parquet(baseDir + "pathToParquetFile")
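
Since the code in the question is Java, the same configuration might look like the sketch below in that language; the account, container, key, and path are placeholders, not values from this answer:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WasbReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read from azure storage")
                .master("local[*]")
                .getOrCreate();

        // Define the wasb file system and account key on the underlying
        // Hadoop configuration, as in the Scala snippet above.
        spark.sparkContext().hadoopConfiguration()
                .set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        spark.sparkContext().hadoopConfiguration()
                .set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey");

        Dataset<Row> dfParquet = spark.read()
                .parquet("wasb://yourContainer@yourAccount.blob.core.windows.net/pathToParquetFile");
        dfParquet.show();
    }
}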

Hope this helps!

answered Oct 22 '22 by koiralo