
Configure standalone spark for azure storage access

I have a need to be able to run spark on my local machine to access azure wasb and adl urls, but I can't get it to work. I have a stripped down example here:

maven pom.xml (Brand-new pom, only the dependencies have been set):

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure-datalake</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-storage</artifactId>
        <version>6.0.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-data-lake-store-sdk</artifactId>
        <version>2.2.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-storage</artifactId>
        <version>7.0.0</version>
    </dependency>
</dependencies>

Java code (doesn't need to be Java; could be Scala):

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) {
        SparkConf config = new SparkConf();
        config.setMaster("local");
        config.setAppName("app");
        SparkSession spark = new SparkSession(new SparkContext(config));
        spark.read().parquet("wasb://container@host/path");
        spark.read().parquet("adl://host/path");
    }
}

No matter what I try I get:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: wasb

Same for adl. Every doc I can find on this either just says to add the azure-storage dependency, which I have done, or says to use HDInsight.

Any thoughts?

asked May 21 '18 by absmiths



2 Answers

I figured this out and decided to post a working project since that is always what I look for. It is hosted here:

azure-spark-local-sample

The crux of it, though, is what @Shankar Koirala suggested:

For WASB, set the property to allow the url scheme to be recognized:

config.set("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");

Then set the property which authorizes access to the account. You will need one of these for each account you need to access. These are generated through the Azure Portal under the Access Keys section of the Storage Account blade.

    config.set("fs.azure.account.key.[storage-account-name].blob.core.windows.net", "[access-key]");

Now for adl, assign the fs scheme as with WASB:

    config.set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
    // I don't know why this would be needed, but I saw it
    // on an otherwise very helpful page . . .
    config.set("spark.fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");

. . . and finally, set the client access keys in these properties, again for each different account you need to access:

    config.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");

    /* Client ID is generally the application ID from the azure portal app registrations*/
    config.set("fs.adl.oauth2.client.id", "[client-id]");

    /*The client secret is the key generated through the portal*/
    config.set("fs.adl.oauth2.credential", "[client-secret]");

    /*This is the OAUTH 2.0 TOKEN ENDPOINT under the ENDPOINTS section of the app registrations under Azure Active Directory*/
    config.set("fs.adl.oauth2.refresh.url", "[oauth-2.0-token-endpoint]");

I hope this is helpful, and I wish I could give credit to Shankar for the answer, but I also wanted to get the exact details out there.

answered Oct 22 '22 by absmiths


I am not sure about adl since I haven't tested it, but for wasb you need to define the file system to be used in the underlying Hadoop configuration.

Since you are using Spark 2.3 you can use the Spark session to create an entry point:

val spark = SparkSession.builder().appName("read from azure storage").master("local[*]").getOrCreate()

Now define the file system

spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey ")

Now read the parquet file as

val baseDir = "wasb[s]://yourContainer@yourAccount.blob.core.windows.net/"

val dfParquet = spark.read.parquet(baseDir + "pathToParquetFile")
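
Since the code in the question is Java, the same configuration might look like the sketch below in that language; the account, container, key, and path are placeholders, not values from this answer:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WasbReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read from azure storage")
                .master("local[*]")
                .getOrCreate();

        // Define the wasb file system and account key on the underlying
        // Hadoop configuration, as in the Scala snippet above.
        spark.sparkContext().hadoopConfiguration()
                .set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        spark.sparkContext().hadoopConfiguration()
                .set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey");

        Dataset<Row> dfParquet = spark.read()
                .parquet("wasb://yourContainer@yourAccount.blob.core.windows.net/pathToParquetFile");
        dfParquet.show();
    }
}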

Hope this helps!

answered Oct 22 '22 by koiralo