
Submit a Spark job from C# and get results

As per the title, I would like to submit a calculation to a Spark cluster (local or HDInsight in Azure) and get the results back in a C# application.

I am aware of Livy, which I understand is a REST API service that sits on top of Spark and accepts work for it, but I have not found a standard C# client package for it. Is this the right tool for the job? Is it just missing a well-known C# API?

The Spark cluster needs to access Azure Cosmos DB, so I need to be able to submit a job that includes the connector jar library (or its path on the cluster driver) so that Spark can read data from Cosmos.

Stefano d'Antonio asked Jun 30 '17 13:06


2 Answers

As a .NET Spark connector for querying data did not seem to exist, I wrote one:

https://github.com/UnoSD/SparkSharp

It is just a quick implementation, but it also offers a way of querying Cosmos DB using Spark SQL.

It is just a C# client for Livy, but it should be more than enough.

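// Replace the placeholder cluster name and credentials with your own.
using (var client = new HdInsightClient("clusterName", "admin", "password"))
// config is the session configuration; it is not shown in this snippet.
using (var session = await client.CreateSessionAsync(config))
{
    // Run a Scala statement on the cluster and get the result back as an int.
    var sum = await session.ExecuteStatementAsync<int>("val res = 1 + 1\nprintln(res)");

    const string sql = "SELECT id, SUM(json.total) AS total FROM cosmos GROUP BY id";

    // Run a Spark SQL query against a Cosmos DB collection (all arguments are placeholders).
    var cosmos = await session.ExecuteCosmosDbSparkSqlQueryAsync<IEnumerable<Result>>
    (
        "cosmosName",
        "cosmosKey",
        "cosmosDatabase",
        "cosmosCollection",
        "cosmosPreferredRegions",
        sql
    );
}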
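
For anyone who wants to see what a Livy client does underneath, here is a minimal sketch of running one statement through Livy's REST API directly. It is not SparkSharp's implementation. It assumes a session already exists (created with POST /sessions), that the HttpClient's BaseAddress points at the Livy root and ends with "/" (on HDInsight that would be something like https://CLUSTERNAME.azurehdinsight.net/livy/), and that the caller has already set up basic authentication and the X-Requested-By header if the cluster needs them. The class and method names are made up for illustration.

```csharp
#nullable enable
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper, not part of SparkSharp or any official package.
public static class LivyStatementSketch
{
    // JSON body for POST sessions/{id}/statements: {"code": "..."}.
    public static string BuildStatementBody(string code) =>
        JsonSerializer.Serialize(new { code });

    // Returns output.data["text/plain"] from a statement response,
    // or null when the response carries no text output (for example while still running).
    public static string? ExtractTextOutput(string statementJson)
    {
        using var doc = JsonDocument.Parse(statementJson);
        if (doc.RootElement.TryGetProperty("output", out var output) &&
            output.ValueKind == JsonValueKind.Object &&
            output.TryGetProperty("data", out var data) &&
            data.ValueKind == JsonValueKind.Object &&
            data.TryGetProperty("text/plain", out var text))
        {
            return text.GetString();
        }
        return null;
    }

    // Posts the code to an existing session, then polls once per second until Livy
    // reports the statement as "available" and returns its text output.
    public static async Task<string?> RunStatementAsync(HttpClient http, int sessionId, string code)
    {
        using var body = new StringContent(BuildStatementBody(code), Encoding.UTF8, "application/json");
        using var created = await http.PostAsync($"sessions/{sessionId}/statements", body);
        created.EnsureSuccessStatusCode();

        int statementId;
        using (var createdDoc = JsonDocument.Parse(await created.Content.ReadAsStringAsync()))
        {
            statementId = createdDoc.RootElement.GetProperty("id").GetInt32();
        }

        while (true)
        {
            var json = await http.GetStringAsync($"sessions/{sessionId}/statements/{statementId}");
            string? state;
            using (var doc = JsonDocument.Parse(json))
            {
                state = doc.RootElement.GetProperty("state").GetString();
            }

            if (state == "available") return ExtractTextOutput(json);
            if (state == "error" || state == "cancelled")
                throw new InvalidOperationException($"Statement ended in state '{state}'.");

            await Task.Delay(TimeSpan.FromSeconds(1));
        }
    }
}
```

Keeping the payload building and the response parsing as separate pure functions means they can be checked without a cluster; only RunStatementAsync touches the network.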
Stefano d'Antonio answered Sep 21 '22 18:09


If you're just looking for a way to query your Spark cluster using Spark SQL, then this is a way to do it from C#:

https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql/blob/master/Program.cs

The console app requires an ODBC driver to be installed. You can find it here:

https://www.microsoft.com/en-us/download/details.aspx?id=49883

The console app also has a bug in how it builds the connection string. To work around it, find this line:

connectionString = GetDefaultConnectionString();

and add the following line immediately after it:

connectionString = connectionString + "DSN=Sample Microsoft Spark DSN";

If you changed the name of the DSN when you installed the Spark ODBC driver, use that name in the line above instead.
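
If you would rather not hard-code the concatenation, here is a small sketch of a helper that appends the DSN and adds a ";" separator only when the existing string does not already end with one. I have not checked what GetDefaultConnectionString() returns in the sample, so the separator handling is a precaution rather than a confirmed requirement, and the class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper, not part of the Azure sample.
public static class ConnectionStringSketch
{
    // Appends "DSN=<dsnName>" to an ODBC connection string, inserting a ';'
    // separator only if the existing string is non-empty and lacks a trailing one.
    public static string AppendDsn(string connectionString, string dsnName)
    {
        var trimmed = (connectionString ?? string.Empty).Trim();
        if (trimmed.Length > 0 && !trimmed.EndsWith(";", StringComparison.Ordinal))
        {
            trimmed += ";";
        }
        return trimmed + "DSN=" + dsnName;
    }
}
```

In the sample you would then write connectionString = ConnectionStringSketch.AppendDsn(connectionString, "Sample Microsoft Spark DSN"); in place of the plain concatenation.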

Since you need to access data from Cosmos DB, you could open a Jupyter notebook on your cluster, ingest the data into Spark (creating a permanent table of your data there), and then use this console app or your own C# app to query that table.

If you have a Spark job written in Scala or Python and need to submit it from a C# app, then I think Livy is the best way to go. I am not sure whether Mobius supports that.
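
As a rough sketch of the Livy route for a packaged job, the following builds the body for Livy's POST /batches endpoint, including extra jars such as the Cosmos DB connector, and submits it. The field names file, className, jars and args come from Livy's batch API; the class and method names are made up, and the HttpClient is assumed to be configured already (BaseAddress pointing at the Livy root and ending with "/", plus whatever authentication the cluster needs).

```csharp
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper, not an official Livy client.
public static class LivyBatchSketch
{
    // JSON body for POST batches: the application file, its main class,
    // additional jars to put on the classpath, and command-line arguments.
    public static string BuildBatchBody(string file, string className, string[] jars, string[] args) =>
        JsonSerializer.Serialize(new { file, className, jars, args });

    // Submits the batch and returns the id Livy assigned to it.
    public static async Task<int> SubmitBatchAsync(HttpClient http, string batchBody)
    {
        using var content = new StringContent(batchBody, Encoding.UTF8, "application/json");
        using var response = await http.PostAsync("batches", content);
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement.GetProperty("id").GetInt32();
    }
}
```

A call might look like BuildBatchBody("wasbs:///example/jars/my-job.jar", "com.example.MyJob", new[] { "wasbs:///example/jars/cosmos-connector.jar" }, new string[0]), where every path and class name is a placeholder. As far as I know, a batch only reports its state and log through Livy, so the job itself would need to write its results somewhere your C# app can read them.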

stt_code answered Sep 22 '22 18:09