
Reading data from Azure Blob with Spark

I am having an issue reading data from Azure Blob storage via Spark Streaming.

JavaDStream<String> lines = ssc.textFileStream("hdfs://ip:8020/directory");

Code like the above works for HDFS, but it is unable to read a file from an Azure Blob:

https://blobstorage.blob.core.windows.net/containerid/folder1/

The above is the path shown in the Azure UI, but it doesn't work. Am I missing something? How can we access it?

I know Event Hubs are the ideal choice for streaming data, but my current situation demands using storage rather than queues.

asked Jun 11 '16 by duck


2 Answers

In order to read data from Blob storage, two things need to be done. First, you need to tell Spark which native file system to use in the underlying Hadoop configuration. This also means you need the hadoop-azure JAR to be available on your classpath (note that there may be runtime requirements for additional JARs related to the Hadoop family):

// Requires hadoop-azure (and its azure-storage dependency) on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext ct = new JavaSparkContext();
Configuration config = ct.hadoopConfiguration();
// Register the native Azure file system and the storage account key.
config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey");
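Alternatively, the same two properties can be set cluster-wide in Hadoop's core-site.xml instead of programmatically (a sketch; the account name and key are placeholders):

```xml
<property>
  <name>fs.azure</name>
  <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
</property>
<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>yourkey</value>
</property>
```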

Now, reference the file using the wasb:// prefix (note that the [s] denotes an optional secure connection):

ssc.textFileStream("wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>");
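The HTTPS URL shown in the portal maps onto the wasb(s) scheme mechanically: the first path segment becomes the container, placed before an @ sign, and the rest stays as the path. A small sketch of that translation (the helper name is hypothetical, not part of any Azure SDK):

```java
import java.net.URI;

public class WasbUri {
    // Convert an HTTPS Blob endpoint URL (as shown in the Azure portal)
    // into the wasbs:// URI form expected by the Hadoop Azure connector.
    static String toWasbs(String httpsUrl) {
        URI u = URI.create(httpsUrl);
        String account = u.getHost();                // e.g. blobstorage.blob.core.windows.net
        String[] parts = u.getPath().split("/", 3);  // ["", container, rest-of-path]
        String path = parts.length > 2 ? parts[2] : "";
        return "wasbs://" + parts[1] + "@" + account + "/" + path;
    }

    public static void main(String[] args) {
        // The URL from the question becomes:
        System.out.println(toWasbs("https://blobstorage.blob.core.windows.net/containerid/folder1/"));
        // wasbs://containerid@blobstorage.blob.core.windows.net/folder1/
    }
}
```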

It goes without saying that you'll need the proper permissions set on the location you are querying in Blob storage.
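As a practical note on the classpath requirement above, the connector JARs can also be pulled in at submit time via Maven coordinates; a sketch (the versions and class name are assumptions and should match your Hadoop build and application):

```shell
# Pull the Azure connector and its storage SDK dependency from Maven Central.
# Versions here are illustrative -- match them to your Hadoop distribution.
spark-submit \
  --packages org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:2.0.0 \
  --class com.example.BlobStreamingJob \
  my-streaming-job.jar
```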

answered Sep 30 '22 by Yuval Itzchakov


As a supplement, there is a very helpful tutorial about HDFS-compatible Azure Blob storage with Hadoop; please see https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage.

Meanwhile, there is an official sample on GitHub for Spark Streaming on Azure. Unfortunately, the sample is written in Scala, but I think it's still helpful for you.

answered Sep 30 '22 by Peter Pan