I have a column in Cassandra that is a map of lists. When I query it with the Python driver, it comes back as an OrderedMapSerializedKey structure. I would like to load the whole query result into pandas.
To extract data from that OrderedMapSerializedKey structure, I want to use each key as the label of a new column and keep only the first element of each list as the value. For now I use the approach mentioned here, with some complex/dirty manipulation in the factory before returning the built DataFrame.
A similar question was asked here, but it never received a real answer.
Is there a better way to turn such an OrderedMapSerializedKey structure into a Python dictionary that can be readily loaded into a pandas DataFrame?
I think an ultimate solution could be to store the OrderedMapSerializedKey Cassandra structure as a dict in your DataFrame column; you can then transform this value or column however you want. I say ultimate because you may not know the actual keys in the Cassandra rows (different keys may have been inserted into different rows).
Here is the solution I've tested; you only have to improve the pandas_factory function.
EDIT: my previous solution replaced only the first (0th) row of the Cassandra result set (rows is a list of tuples, where every tuple is one Cassandra row). This version converts every row:
import pandas as pd
from cassandra.util import OrderedMapSerializedKey

def pandas_factory(colnames, rows):
    # Convert each row tuple into a list (elements of tuples cannot be replaced)
    rows = [list(i) for i in rows]
    # Convert only the 'OrderedMapSerializedKey' elements into dict
    for idx_row, i_row in enumerate(rows):
        for idx_value, i_value in enumerate(i_row):
            if isinstance(i_value, OrderedMapSerializedKey):
                rows[idx_row][idx_value] = dict(i_value)
    return pd.DataFrame(rows, columns=colnames)
You have to add an automatic check for whether there is at least one value before or after the Cassandra map field, or manually modify the above script accordingly.
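The conversion step itself can be tried without a running cluster. The sketch below is only an illustration under assumptions: FakeOrderedMap is a hypothetical stand-in for the driver's map type (it is not part of cassandra-driver), and maps_to_dicts is a made-up helper name. It checks against collections.abc.Mapping rather than the exact driver class, which assumes the driver's map values implement the mapping protocol.

```python
from collections.abc import Mapping


class FakeOrderedMap(Mapping):
    """Hypothetical stand-in for the driver's map type (not part of cassandra-driver)."""

    def __init__(self, pairs):
        self._pairs = list(pairs)

    def __getitem__(self, key):
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __iter__(self):
        return (k for k, _ in self._pairs)

    def __len__(self):
        return len(self._pairs)


def maps_to_dicts(rows):
    """Return rows as lists, converting every non-dict mapping value to a plain dict."""
    return [
        [dict(v) if isinstance(v, Mapping) and not isinstance(v, dict) else v
         for v in row]
        for row in rows
    ]


# Rows shaped like what a row factory receives: a list of tuples.
sample_rows = [
    (1, FakeOrderedMap([("temp", [20.5, 21.0])]), "a"),
    (2, FakeOrderedMap([("humidity", [0.4])]), "b"),
]
converted = maps_to_dicts(sample_rows)
```

Because the check does not depend on column position, it should not matter whether the map field is the first, last, or only column in the row.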
Nice day!
The following strategy separates the data conversion stage from the pandas ingestion stage.
To obtain a list of dictionaries from a Cassandra request, you have to use a specific row_factory:
from cassandra.query import (
    dict_factory,
    SimpleStatement
)
from cassandra.cluster import (
    Cluster,
    ExecutionProfile,
    EXEC_PROFILE_DEFAULT
)

profile = ExecutionProfile(
    row_factory=dict_factory
)
hosts = ["127.0.0.1"]
port = 9042
cluster = Cluster(
    hosts,
    port=port,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile}
)
Then get the data using that cluster:
src_keyspace = "your_keyspace"
src_tbl = "your_table"
N_ROWS = 100
with cluster.connect(src_keyspace) as cass_session:
    res = cass_session.execute(
        SimpleStatement("SELECT * FROM {} LIMIT {}".format(src_tbl, N_ROWS))
    )
Then, convert the remaining OrderedMapSerializedKey values to dict:
from cassandra.util import OrderedMapSerializedKey

rows_as_dict = [
    {
        key: (dict(val) if isinstance(val, OrderedMapSerializedKey) else val)
        for key, val in row.items()
    }
    for row in res.current_rows
]
Then simply load rows_as_dict into pandas, for example with pandas.DataFrame.from_dict(rows_as_dict) or pandas.DataFrame(rows_as_dict).
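The question also asks for each map key to become its own column, holding only the first element of its list. Once the rows are plain dicts, that can be done before the pandas step. This is a minimal sketch under assumptions: flatten_map_column is a hypothetical helper, and the column names "id" and "readings" and the sample values are invented for illustration.

```python
def flatten_map_column(row, map_col):
    """Copy `row`, replacing the map column by one entry per map key.

    Each new entry keeps only the first element of the key's list
    (None if the list is empty). Note: a map key that equals an
    existing column name would overwrite that column.
    """
    flat = {k: v for k, v in row.items() if k != map_col}
    # A missing or null map column is treated as an empty map.
    for key, values in (row.get(map_col) or {}).items():
        flat[key] = values[0] if values else None
    return flat


# Invented sample rows, shaped like rows_as_dict after conversion.
sample = [
    {"id": 1, "readings": {"temp": [20.5, 21.0], "humidity": [0.4]}},
    {"id": 2, "readings": {"temp": [19.0]}},
    {"id": 3, "readings": None},
]
flat_rows = [flatten_map_column(r, "readings") for r in sample]
```

Passing flat_rows to pandas.DataFrame(flat_rows) then gives one column per key; rows that lack a given key get a missing value in that column, which handles the case where different rows carry different keys.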