Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read data from Cassandra with R?

Tags:

r

cassandra

I am using R 2.14.1 and Cassandra 1.2.11, I have a separate program which has written data to a single Cassandra table. I am failing to read them from R.

The Cassandra schema is defined like this:

create table chosen_samples (id bigint , temperature double, primary key(id))

I have first tried the RCassandra package (http://www.rforge.net/RCassandra/)

> # install.packages("RCassandra")
> library(RCassandra)
> rc <- RC.connect(host ="192.168.33.10", port = 9160L)
> RC.use(rc, "poc1_samples")
> cs <- RC.read.table(rc, c.family="chosen_samples")

The connection seems to succeed but the parsing of the table into data frame fails:

> cs
Error in data.frame(..dfd. = c("@\"ffffff", "@(<cc><cc><cc><cc><cc><cd>",  : 
  duplicate row.names: 

I have also tried using JDBC connector, as described here: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive

> # install.packages("RJDBC")
> library(RJDBC)
> cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", "/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar", "`")

But this one fails like this:

Error in .jfindClass(as.character(driverClass)[1]) : class not found

Even though the location to the java driver is correct

$ ls /Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
like image 533
Svend Avatar asked Feb 24 '14 17:02

Svend


4 Answers

You have to download apache-cassandra-2.0.10-bin.tar.gz and cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar.

There is no need to install Cassandra on your local machine; just put the cassandra-jdbc-1.2.5.jar and the cassandra-all-1.1.0.jar files in the lib directory of unziped apache-cassandra-2.0.10-bin.tar.gz. Then you can use

 library(RJDBC)
 drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", 
              list.files("D:/apache-cassandra-2.0.10/lib",
              pattern="jar$",full.names=T))

That is working on my unix but not on my windows machine. Hope that helps.

like image 84
fc9.30 Avatar answered Oct 02 '22 16:10

fc9.30


This question is old now, but since it's the one of the top hits for R and Cassandra I thought I'd leave a simple solution here, as I found frustratingly little up-to-date support for what I thought would be a fairly common task.

Sparklyr makes this pretty easy to do from scratch now, as it exposes a java context so the Spark-Cassandra-Connector can be used directly. I've wrapped up the bindings in this simple package, crassy, but it's not necessary to use.

I mostly made it to demystify the config around how to make sparklyr load the connector, and as the syntax for selecting a subset of columns is a little unwieldy (assuming no Scala knowledge).

Column selection and partition filtering are supported. These were the only features I thought were necessary for general Cassandra use cases, given CQL can't be submitted directly to the cluster.

I've not found a solution to submitting more general CQL queries which doesn't involve writing custom scala, however there's an example of how this can work here.

like image 21
Akhil Nair Avatar answered Oct 02 '22 16:10

Akhil Nair


Right, I found an (admittedly ugly) way, simply by calling python from R, parsing the NA manually and re-assigning the data-frames names in R, like this

# install.packages("rPython")
# (don't forget to "pip install cql")
library(rPython)
python.exec("import sys")
# adding libraries from virtualenv 
python.exec("sys.path.append('/Users/svend/dev/pyVe/playground/lib/python2.7/site-packages/')")
python.exec("import cql")

python.exec("connection=cql.connect('192.168.33.10', cql_version='3.0.0')")
python.exec("cursor = connection.cursor()")
python.exec("cursor.execute('use poc1_samples')")
python.exec("cursor.execute('select * from chosen_samples' )")

# coding python None into NA (rPython seem to just return nothing )
python.exec("rep = lambda x : '__NA__' if x is None else x")
python.exec( "def getData(): return [rep(num) for line in cursor for num in line ]" )
data <- python.call("getData")
df <- as.data.frame(matrix(unlist(data), ncol=15, byrow=T))

names(df) <- c("temperature", "maxTemp", "minTemp",
"dewpoint", "elevation", "gust", "latitude", "longitude",
"maxwindspeed", "precipitation", "seelevelpressure", "visibility", "windspeed")

# and decoding NA's    
parsena <- function (x) if (x=="__NA__") NA else x
df <- as.data.frame(lapply(df,  parsena))

Anybody has a better idea?

like image 35
Svend Avatar answered Oct 02 '22 16:10

Svend


I had the same error message when executing Rscript with RJDBC connection via batch file (R 3.2.4, Teradata driver). Also, when run in RStudio it worked fine in the second run but not first.

What helped was explicitly call:

library(rJava)
.jinit()
like image 33
marb1984 Avatar answered Oct 02 '22 17:10

marb1984