Transforming a Cassandra OrderedMapSerializedKey to a Python dictionary

I have a column in Cassandra whose type is a map of lists. When queried with the Python driver, it is returned as an OrderedMapSerializedKey structure. I would like to load the whole query result into pandas.

To extract data from that OrderedMapSerializedKey structure, that is, to use each key as the label of a new column and keep only the first element of its list as the value, I use the approach mentioned here, with some complex/dirty manipulation in the factory before returning the built DataFrame.

A similar question was asked here, but it never really got an answer.

Is there a better way to turn such an OrderedMapSerializedKey structure into a Python dictionary that can be readily loaded into a pandas DataFrame?

Ignacio Vergara Kausel asked Sep 04 '17 12:09


2 Answers

I think the most general solution is to store the Cassandra OrderedMapSerializedKey structure as a dict in a DataFrame column; you can then transform that value / column into whatever you want. It is general because you may not know the actual keys in the Cassandra rows in advance (different rows may contain different keys).

Here is the solution I've tested; you only have to improve the pandas_factory function:


EDIT:

In the previous solution I converted only the first (0th) row of the Cassandra result set (rows is a list of tuples, where every tuple is one Cassandra row). The code below converts every row.

import pandas as pd
from cassandra.util import OrderedMapSerializedKey

def pandas_factory(colnames, rows):

    # Convert each row tuple into a list (tuple elements cannot be replaced in place)
    rows = [list(row) for row in rows]

    # Convert only the 'OrderedMapSerializedKey' values into plain dicts
    for row in rows:
        for idx_value, value in enumerate(row):
            if isinstance(value, OrderedMapSerializedKey):
                row[idx_value] = dict(value)

    return pd.DataFrame(rows, columns=colnames)

You need to add an automatic check that there is at least one value before / after the Cassandra map field, or adapt the script above manually.
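Once the map column holds plain dicts, the goal from the question (each key becomes a column label, and only the first element of each list is kept) could be sketched roughly as follows. This is only an illustration under assumptions: the column names `id` and `attrs`, the sample values, and the helper `first_elements` are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical data: the "attrs" column already holds plain dicts of lists,
# as it would after pandas_factory has converted the Cassandra map values.
df = pd.DataFrame({
    "id": [1, 2],
    "attrs": [{"a": [10, 11], "b": [20]}, {"a": [30], "c": [40, 41]}],
})

def first_elements(mapping):
    """Keep only the first element of each non-empty list in the mapping."""
    return {key: values[0] for key, values in mapping.items() if values}

# Build one column per map key; keys missing from a row become NaN.
expanded = pd.DataFrame(df["attrs"].map(first_elements).tolist(), index=df.index)

# Replace the dict column with the expanded columns.
result = df.drop(columns="attrs").join(expanded)
```

Because rows may carry different keys, the resulting frame contains the union of all keys, with NaN wherever a row lacks a key.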

Nice day!

ragesz answered Oct 18 '22 20:10


The following strategy aims at separating the data conversion stage from the pandas ingestion stage.

To obtain a list of dictionaries from a Cassandra query, you have to use a specific row_factory:

from cassandra.query import (
    dict_factory,
    SimpleStatement
    )

from cassandra.cluster import (
    Cluster,
    ExecutionProfile,
    EXEC_PROFILE_DEFAULT
    )

profile = ExecutionProfile(
    row_factory=dict_factory
    )

hosts = ["127.0.0.1"]
port = 9042

cluster = Cluster(
    hosts,
    port=port,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile}
    )
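For intuition, a row factory is a callable that receives the column names and the raw row tuples, the same signature as pandas_factory(colnames, rows) in the other answer. The following plain-Python sketch shows that shape; it is not the driver's actual code, and `dict_factory_sketch` and the sample data are hypothetical.

```python
def dict_factory_sketch(colnames, rows):
    """Hypothetical stand-in: pair each row tuple with the column names."""
    return [dict(zip(colnames, row)) for row in rows]

# Made-up column names and rows, shaped like what a row factory receives.
colnames = ["id", "attrs"]
rows = [(1, {"a": [10]}), (2, {"b": [20]})]

rows_as_dicts = dict_factory_sketch(colnames, rows)
```

Note that such a factory only maps column names to values; map-typed values inside each row are left as they are, which is why a later conversion step is still needed.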

Then get the data using that cluster:

src_keyspace = "your_keyspace"
src_tbl = "your_table"
N_ROWS = 100

with cluster.connect(src_keyspace) as cass_session:

    res = cass_session.execute(
        SimpleStatement("SELECT * FROM {} LIMIT {}".format(src_tbl,
                                                           N_ROWS))
        )

Then, still inside the with block, convert the remaining OrderedMapSerializedKey values to dict (this needs from cassandra.util import OrderedMapSerializedKey):

    rows_as_dict = [
        { key: (val if not isinstance(val, OrderedMapSerializedKey)
                else dict(val)) for key, val in row.items() }
                    for row in res.current_rows
                    ]

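Then simply pass rows_as_dict to pandas.DataFrame.from_dict (the plain pandas.DataFrame constructor and pandas.DataFrame.from_records also accept a list of dictionaries).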
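The conversion and ingestion steps can be tried without a running cluster with a sketch like the one below. It rests on assumptions: `types.MappingProxyType` merely stands in for OrderedMapSerializedKey, the helper `to_plain` and the sample rows are hypothetical, and the generic check against `collections.abc.Mapping` is an illustration. With the real driver, the explicit isinstance check against OrderedMapSerializedKey shown above is the safer choice.

```python
from collections.abc import Mapping
from types import MappingProxyType

import pandas as pd

def to_plain(value):
    """Convert any non-dict mapping to a dict; leave other values untouched."""
    if isinstance(value, Mapping) and not isinstance(value, dict):
        return dict(value)
    return value

# MappingProxyType is only a stand-in for OrderedMapSerializedKey here.
raw_rows = [
    {"id": 1, "attrs": MappingProxyType({"a": [10, 11]})},
    {"id": 2, "attrs": MappingProxyType({"b": [20]})},
]

# Stage 1: data conversion, independent of pandas.
rows_as_dict = [{key: to_plain(val) for key, val in row.items()} for row in raw_rows]

# Stage 2: pandas ingestion from a list of plain dictionaries.
df = pd.DataFrame(rows_as_dict)
```

Keeping the two stages apart means the conversion can be checked on its own before any DataFrame is built.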

jojo2000 answered Oct 18 '22 22:10