As per title, I would like to request a calculation to a Spark cluster (local/HDInsight in Azure) and get the results back from a C# application.
I acknowledged the existence of Livy which I understand is a REST API application sitting on top of Spark to query it, and I have not found a standard C# API package. Is this the right tool for the job? Is it just missing a well known C# API?
The Spark cluster needs to access Azure Cosmos DB, therefore I need to be able to submit a job including the connector jar library (or its path on the cluster driver) in order for Spark to read data from Cosmos.
As a .NET Spark connector to query data did not seem to exist I wrote one
https://github.com/UnoSD/SparkSharp
It is just a quick implementation, but it does have also a way of querying Cosmos DB using Spark SQL
It's just a C# client for Livy but it should be more than enough.
using (var client = new HdInsightClient("clusterName", "admin", "password"))
using (var session = await client.CreateSessionAsync(config))
{
var sum = await session.ExecuteStatementAsync<int>("val res = 1 + 1\nprintln(res)");
const string sql = "SELECT id, SUM(json.total) AS total FROM cosmos GROUP BY id";
var cosmos = await session.ExecuteCosmosDbSparkSqlQueryAsync<IEnumerable<Result>>
(
"cosmosName",
"cosmosKey",
"cosmosDatabase",
"cosmosCollection",
"cosmosPreferredRegions",
sql
);
}
If your just looking for a way to query your spark cluster using SparkSql then this is a way to do it from C#:
https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql/blob/master/Program.cs
The console app requires an ODBC driver installed. You can find that here:
https://www.microsoft.com/en-us/download/details.aspx?id=49883
Also the console app has a bug: add this line to the code after the part where the connection string is generated. Immediately after this line:
connectionString = GetDefaultConnectionString();
Add this line
connectionString = connectionString + "DSN=Sample Microsoft Spark DSN";
If you change the name of the DSN when you install the spark ODBC Driver you will need to change the name in the above line then.
Since you need to access data from Cosmos DB, you could open a Jupyter Notebook on your cluster and ingest data into spark (create a permanent table of your data there) and then use this console app/your c# app to query that data.
If you have a spark job written in scala/python and need to submit it from a C# app then I guess LIVY is the best way to go. I am unsure if Mobius supports that.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With