
Establishing a connection between R and a Hive (Hadoop) database

Tags: r, jdbc, hadoop, hive

Does anyone know how to connect R to a Hive (Hadoop) database?

I assume RJDBC would help, but from my (likely naive) understanding, some tweaking is needed to write or adapt a Hive driver for it.

Relevant documentation:

  • http://wiki.apache.org/hadoop/Hive/HiveClient
  • http://cran.r-project.org/web/packages/RJDBC/RJDBC.pdf

Any help or suggestion is welcome! If no one has done this before, I would be happy to code a bit towards a solution, but I know next to no Java.

crayola asked May 19 '11 11:05


1 Answer

R can be interfaced with Hive via RJDBC. However, you'll need a Hive server and drivers.

Hive server:

hive --service hiveserver 1> /dev/null 2> /dev/null &

Drivers: download Toad for Cloud DBs (http://www.toadworld.com/m/freeware/566.aspx) and use the drivers included there: unzip the jars and look for the files listed in hive_jars below.

Below is an R function that you can define to create a connection to a Hive server.

hive_connection <- function( hostname = 'dlhive01.cloud.msrch', port = 10000, lib_dir ){
  library( RJDBC )

  # Jars to put on the class path; all of them must be present in lib_dir.
  hive_jars <- c( 'commons-logging-1.0.4.jar', 'hadoop-core-0.20.2+737.jar',
                  'hive-exec-0.7.1-cdh3u1.jar', 'hive-jdbc-0.7.1-cdh3u1.jar',
                  'hive-metastore-0.7.1-cdh3u1.jar', 'hive-service-0.7.1-cdh3u1.jar',
                  'libfb303.jar', 'libthrift.jar', 'log4j-1.2.15.jar',
                  'slf4j-api-1.6.1.jar', 'slf4j-log4j12-1.6.1.jar' )

  # lib_dir: directory containing the jars above.
  hive_class_path <- file.path( lib_dir, hive_jars )

  # Load the Hive JDBC driver; backtick is the identifier quote character.
  drv <- JDBC( 'org.apache.hadoop.hive.jdbc.HiveDriver', classPath = hive_class_path,
               identifier.quote = '`' )

  # JDBC URL pointing at the 'default' database on the Hive server.
  server <- sprintf( 'jdbc:hive://%s:%s/default', hostname, port )

  return( dbConnect( drv, server ) )
}
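
A missing jar usually only shows up as a Java class-loading error at connection time, so it can help to check the directory first and to wrap queries so the connection is always closed. The following is only a rough sketch: missing_jars and run_query are hypothetical helper names (they are not part of RJDBC), and run_query assumes the DBI package, which RJDBC depends on, is installed.

```r
# Hypothetical helper (not part of RJDBC): return the jar file names
# from `jars` that do not exist in `lib_dir`.
missing_jars <- function( lib_dir, jars ){
  present <- file.exists( file.path( lib_dir, jars ) )
  jars[ !present ]
}

# Hypothetical wrapper: open a connection, run one query, always disconnect.
# `connect` is any zero-argument function returning a DBI connection,
# for example function() hive_connection( lib_dir = '/path/to/jars' ).
run_query <- function( connect, sql ){
  conn <- connect()
  on.exit( DBI::dbDisconnect( conn ), add = TRUE )
  DBI::dbGetQuery( conn, sql )
}
```

With the Hive server running, a call such as run_query( function() hive_connection( lib_dir = '/path/to/jars' ), 'show tables' ) should return the result as a data frame. The path is a placeholder, and the jar names passed to missing_jars would be the same ones listed in hive_jars above.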
Yakov Keselman answered Sep 30 '22 09:09