How to set/get Pandas dataframes into Redis using pyarrow

Using:

import pandas as pd

dd = {'ID': ['H576', 'H577', 'H578', 'H600', 'H700'],
      'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC', 'DDDDDD', 'EEEEEEE']}
df = pd.DataFrame(dd)

Before pandas 0.25, the following worked:

set:  redisConn.set("key", df.to_msgpack(compress='zlib'))
get:  pd.read_msgpack(redisConn.get("key"))

Now, there are deprecation warnings:

FutureWarning: to_msgpack is deprecated and will be removed in a future version.
It is recommended to use pyarrow for on-the-wire transmission of pandas objects.

The read_msgpack is deprecated and will be removed in a future version.
It is recommended to use pyarrow for on-the-wire transmission of pandas objects.

How does pyarrow work? And how do I get pyarrow objects into and back from Redis?

Reference: How to set/get pandas.DataFrame to/from Redis?

asked Sep 16 '19 by Merlin




3 Answers

Here's a full example that uses pyarrow to serialize a pandas DataFrame and store it in Redis:

apt-get install python3 python3-pip redis-server
pip3 install pandas pyarrow redis

and then in Python:

import pandas as pd
import pyarrow as pa
import redis

df = pd.DataFrame({'A': [1, 2, 3]})
r = redis.Redis(host='localhost', port=6379, db=0)

# Serialize the DataFrame and store the resulting bytes under "key"
context = pa.default_serialization_context()
r.set("key", context.serialize(df).to_buffer().to_pybytes())

# Fetch the bytes back and reconstruct the DataFrame
context.deserialize(r.get("key"))
#    A
# 0  1
# 1  2
# 2  3

I just submitted PR 28494 to pandas to include this pyarrow example in the docs.

Reference docs:

  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_msgpack.html
  • https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
  • https://arrow.apache.org/docs/python/memory.html#pyarrow-buffer
  • https://stackoverflow.com/a/37957490/4126114
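Note that in more recent pyarrow releases, default_serialization_context (and the pa.serialize machinery) is deprecated in favor of the Arrow IPC format. Below is a minimal sketch of the same round trip using the IPC stream API, assuming the same local Redis instance; the key name "key" is just a placeholder:

import pandas as pd
import pyarrow as pa
import redis

r = redis.Redis(host='localhost', port=6379, db=0)
df = pd.DataFrame({'A': [1, 2, 3]})

# Write the DataFrame into an in-memory Arrow IPC stream
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
r.set("key", sink.getvalue().to_pybytes())

# Read the stream back from Redis and rebuild the DataFrame
df_restored = pa.ipc.open_stream(pa.py_buffer(r.get("key"))).read_all().to_pandas()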
answered Oct 08 '22 by Shadi


Here is how I do it, since default_serialization_context is deprecated and things are a bit simpler:

import pyarrow as pa
import redis

pool = redis.ConnectionPool(host='localhost', port=6379, db=0)
r = redis.Redis(connection_pool=pool)

def storeInRedis(alias, df):
    # Serialize the DataFrame and store the raw bytes under the alias
    df_compressed = pa.serialize(df).to_buffer().to_pybytes()
    res = r.set(alias, df_compressed)
    if res:
        print(f'{alias} cached')

def loadFromRedis(alias):
    # Fetch the stored bytes and rebuild the DataFrame; the key may not exist
    data = r.get(alias)
    try:
        return pa.deserialize(data)
    except Exception:
        print("No data")

storeInRedis('locations', locdf)  # locdf is the DataFrame you want to cache

loadFromRedis('locations')
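Note that pa.serialize and pa.deserialize were themselves deprecated in later pyarrow releases, so on newer versions you may want the IPC sketch above or the Parquet approach below instead.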
answered Oct 08 '22 by ety


If you would like to compress the data in Redis, you can use pandas' built-in support for Parquet and gzip:

import io

import pandas as pd
import redis

REDIS_HOST = 'localhost'  # adjust to your Redis instance
REDIS_PORT = 6379

def openRedisCon():
    pool = redis.ConnectionPool(host=REDIS_HOST, port=REDIS_PORT, db=0)
    r = redis.Redis(connection_pool=pool)
    return r

def storeDFInRedis(alias, df):
    """Store the dataframe object in Redis
    """
    buffer = io.BytesIO()
    df.to_parquet(buffer, compression='gzip')  # needs a parquet engine, e.g. pyarrow
    buffer.seek(0)  # rewind to the beginning before reading
    r = openRedisCon()
    res = r.set(alias, buffer.read())

def loadDFFromRedis(alias, useStale: bool = False):
    """Load the named key from Redis into a DataFrame and return the DF object
    """
    r = openRedisCon()
    try:
        buffer = io.BytesIO(r.get(alias))
        buffer.seek(0)
        df = pd.read_parquet(buffer)
        return df
    except Exception:
        return None
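
A quick usage sketch of the functions above; the DataFrame contents and the 'prices' key name are just placeholders:

df = pd.DataFrame({'price': [9.99, 14.50, 3.25]})
storeDFInRedis('prices', df)
df2 = loadDFFromRedis('prices')  # returns None if the key is missing or unreadable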


answered Oct 08 '22 by rossco