
Python read Cassandra data into pandas


What is the proper and fastest way to read Cassandra data into pandas? I currently use the following code, but it's very slow:

    import pandas as pd

    from cassandra.cluster import Cluster
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.query import dict_factory

    auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
    cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT,
                      auth_provider=auth_provider)

    session = cluster.connect(CASSANDRA_DB)
    session.row_factory = dict_factory

    sql_query = "SELECT * FROM {}.{};".format(CASSANDRA_DB, CASSANDRA_TABLE)

    df = pd.DataFrame()

    for row in session.execute(sql_query):
        df = df.append(pd.DataFrame(row, index=[0]))

    df = df.reset_index(drop=True).fillna(pd.np.nan)

Reading 1000 rows takes 1 minute, and I have a "bit more" than that. If I run the same query in another client, e.g. DBeaver, I get the whole result set (~40k rows) within a minute.

Thank you!!!

asked Dec 20 '16 16:12 by ragesz




1 Answer

I got the answer on the official mailing list (it works perfectly):

Hi,

try to define your own pandas row factory:

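    def pandas_factory(colnames, rows):
        # Build the whole result as one DataFrame instead of row by row
        return pd.DataFrame(rows, columns=colnames)

    session.row_factory = pandas_factory
    session.default_fetch_size = None  # disable paging so all rows arrive in one result

    query = "SELECT ..."
    rslt = session.execute(query, timeout=None)
    df = rslt._current_rows  # the DataFrame returned by pandas_factory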

That's the way I do it, and it should be faster...

If you find a faster method, I'm interested :)

Michael
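
For anyone wondering why this helps: as far as I can tell, the gain comes from building the DataFrame in a single call instead of appending one row at a time, which copies the growing frame on every iteration. Below is a minimal sketch of that idea that runs without a cluster. The column names and rows are hypothetical stand-in data, and `rows_to_dataframe` is a helper name I made up for illustration; it is not part of the driver. It assumes the rows arrive as dictionaries, as they do with `dict_factory`.

```python
import pandas as pd


def pandas_factory(colnames, rows):
    # Same shape as the factory in the answer: it receives the column
    # names and a list of row tuples, and builds one DataFrame per call.
    return pd.DataFrame(rows, columns=colnames)


def rows_to_dataframe(dict_rows):
    # Hypothetical helper for dict-style rows: materialize the iterable
    # once, then construct the DataFrame in a single call.
    return pd.DataFrame(list(dict_rows))


# Hypothetical stand-in data; a real session would supply these values.
colnames = ["id", "name"]
rows = [(1, "a"), (2, "b"), (3, "c")]

df_from_factory = pandas_factory(colnames, rows)
df_from_dicts = rows_to_dataframe(dict(zip(colnames, r)) for r in rows)
```

With a real session and `dict_factory`, the second helper would presumably be called as `rows_to_dataframe(session.execute(query))`, which avoids reading the private `_current_rows` attribute; I have not benchmarked it against the row factory approach.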

answered Sep 22 '22 08:09 by ragesz