Can we connect spark with sql-server? If so, how? I am new to spark, I want to connect the server to spark and work directly from sql-server instead of uploading .txt or .csv file. Please help, Thank you.
Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. Usable in Java, Scala, Python and R. Apply functions to results of SQL queries.
The Apache Spark connector for SQL Server and Azure SQL is a high-performance connector that enables you to use transactional data in big data analytics and persist results for ad-hoc queries or reporting.
// Spark 2.x
import org.apache.spark.SparkContext
// Create dataframe on top of SQLServer database table
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").option("driver" , "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("url", "jdbc:sqlserver://XXXXX.com:port;databaseName=xxx") \
.option("dbtable", "(SELECT * FROM xxxx) tmp") \
.option("user", "xxx") \
.option("password", "xxx") \
.load()
// show sample records from data frame
jdbcDF.show(5)
Here are some code snippets. A DataFrame is used to create the table t2 and insert data. The SqlContext is used to load the data from the t2 table into a DataFrame. I added the spark.driver.extraClassPath and spark.executor.extraClassPath to my spark-default.conf file.
//Spark 1.4.1
//Insert data from DataFrame
case class Conf(mykey: String, myvalue: String)
val data = sc.parallelize( Seq(Conf("1", "Delaware"), Conf("2", "Virginia"), Conf("3", "Maryland"), Conf("4", "South Carolina") ))
val df = data.toDF()
val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val table = "t2"
df.insertIntoJDBC(url, table, true)
//Load from database using SqlContext
val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
val tbl = { sqlContext.load("jdbc", Map( "url" -> url, "driver" -> driver, "dbtable" -> "t2", "partitionColumn" -> "mykey", "lowerBound" -> "0", "upperBound" -> "100", "numPartitions" -> "1" ))}
tbl.show()
Some issue to consider are:
Make sure firewall ports are open for port 1433. If using Microsoft Azure SQL Server DB, tables require a primary key. Some of the methods create the table, but Spark's code is not creating the primary key so the table creation fails.
Other details to take care: https://docs.databricks.com/spark/latest/data-sources/sql-databases.html
source: https://blogs.msdn.microsoft.com/bigdatasupport/2015/10/22/how-to-allow-spark-to-access-microsoft-sql-server/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With