When should I use pandas' Categorical dtype?

My question concerns optimizing memory usage for pandas Series. The docs note,

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data.

My understanding is that pandas Categorical data is effectively a mapping to unique (downcast) integers that represent categories, where the integers themselves occupy (presumably) fewer bytes than the strings that make up the object dtype.

My question: is there any rule-of-thumb for when using pd.Categorical will not save memory over object? How direct is the aforementioned proportionality, and doesn't it also depend on the length of each element (string) in the Series?

In the test below, pd.Categorical seems to win by a long shot.

import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(444)
%matplotlib inline

def mem_usage(obj, index=False, total=True, deep=True):
    """Memory usage of pandas Series or DataFrame."""
    # Ported from https://www.dataquest.io/blog/pandas-big-data/
    usg = obj.memory_usage(index=index, deep=deep)
    if isinstance(obj, pd.DataFrame) and total:
        usg = usg.sum()
    # Bytes to megabytes
    return usg / 1024 ** 2

catgrs = tuple(string.printable)

lengths = np.arange(1, 10001, dtype=np.uint16)
sizes = []
for length in lengths:
    obj = pd.Series(np.random.choice(catgrs, size=length))
    cat = obj.astype('category')
    sizes.append((mem_usage(obj), mem_usage(cat)))
sizes = np.array(sizes)

fig, ax = plt.subplots()
ax.plot(sizes)
ax.set_ylabel('Size (MB)')
ax.set_xlabel('Series length')
ax.legend(['object dtype', 'category dtype'])
ax.set_title('Memory usage of object vs. category dtype')

[plot: memory usage (MB) of object vs. category dtype as the Series length grows to 10,000]

That said, for n < 125, pd.Categorical is slightly larger.

fig, ax = plt.subplots()
ax.plot(sizes[:200])
ax.set_ylabel('Size (MB)')
ax.set_xlabel('Series length')
ax.legend(['object dtype', 'category dtype'])
ax.set_title('Memory usage of object vs. category dtype')

[plot: the same comparison zoomed in on the first 200 Series lengths]
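
The test above only varies the Series length. A quick follow-up sketch (string lengths and n chosen arbitrarily) that varies the element length at fixed cardinality suggests the object dtype grows with string length while the category dtype barely moves, since each distinct string is stored only once:

import numpy as np
import pandas as pd

n = 10_000
for strlen in (1, 10, 100):
    cats = [c * strlen for c in "abc"]  # three categories, each of length `strlen`
    obj = pd.Series(np.random.choice(cats, size=n))
    cat = obj.astype('category')
    print(strlen,
          int(obj.memory_usage(index=False, deep=True)),
          int(cat.memory_usage(index=False, deep=True)))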

asked by Brad Solomon

2 Answers

is there any rule-of-thumb for when using pd.Categorical will not save memory over object?

I would say: "no". And, to be extreme: "opinion based". In both cases, a specific use-case context should be given.


  • What happens if your categorical data is built from complex objects, long strings, or very big numbers (hashes)? Or from a complex pd.cut function?
  • What is the total number of rows in your data? 1_000? 100_000? 10_000_000_000?
  • What is the category / nb_of_data ratio? Is it stable from one data set to the other?


The following code tries to observe where the two curves intersect for different category string lengths and different nb_categories / nb_of_data ratios.

import hashlib
import itertools
import math
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

np.random.seed(444)


def mem_usage(obj, index=False, total=True, deep=True):
    """Memory usage of pandas Series or DataFrame."""
    # Ported from https://www.dataquest.io/blog/pandas-big-data/
    usg = obj.memory_usage(index=index, deep=deep)
    if isinstance(obj, pd.DataFrame) and total:
        usg = usg.sum()
    # Bytes to megabytes
    return usg / 1024 ** 2


def do_test(max_dataset_length):
    lengths = np.arange(1, max_dataset_length, dtype=np.uint16)
    obj_vals = []
    cat_vals = []
    for length in tqdm(lengths, desc="generate samples"):
        obj = pd.Series(np.random.choice(categories, size=length))
        cat = obj.astype('category')
        obj_vals.append(mem_usage(obj))
        cat_vals.append(mem_usage(cat))
    objs_arr = np.array(obj_vals)
    cats_arr = np.array(cat_vals)
    arr = objs_arr - cats_arr
    first_zero_index = (arr > 0).argmax(axis=0)  # index of intersection between the 2 data sets
    return first_zero_index


results_path = Path.cwd() / "results.csv"

if results_path.exists():
    results_df = pd.read_csv(results_path, index_col=0)
    results_df.to_csv(results_path)
else:
    i = 1
    l_cat_nbs = [10, 100, 500]  # , 1_000, 2_000, 5_000]
    # l_hash_func = [hashlib.sha1, hashlib.sha3_512]
    l_cat_data_ratios = [0.1, 0.2, 0.5, 0.7, 0.9]
    l_hash_sizes = [4, 16, 32, 64, 128]
    measure_cat_str_len = []
    measure_cat_size = []
    measure_nb_samples = []
    measure_ratios = []
    measure_threshold = []
    measure_nb_cats = []
    hash_function = hashlib.sha3_512
    all_hashes = [hash_function(bytes(e)).hexdigest() for e in
                  tqdm(np.random.random_sample(max(l_cat_nbs)), desc="generate hashes")]
    for nb_cat, hash_size in itertools.product(  # hash_function
            l_cat_nbs,
            l_hash_sizes
            # l_hash_func
    ):

        categories = [e[:hash_size] for e in all_hashes]
        example_hash = categories[0]
        original_cat_size = sys.getsizeof(example_hash)
        original_cat_str_len = len(example_hash)
        print(f"{hash_function.__name__} => {example_hash} - len={original_cat_str_len} - size={original_cat_size}")
        cat_df = pd.DataFrame({"hash": categories})
        cat_df["cat"] = cat_df["hash"].astype('category')
        print(f"Category mem size={sys.getsizeof(cat_df['cat'].dtype)}")
        for ratio in l_cat_data_ratios:
            max_length = int(math.floor(nb_cat / ratio))
            if threshold := do_test(max_dataset_length=max_length):
                measure_nb_cats.append(nb_cat)
                measure_cat_str_len.append(original_cat_str_len)
                measure_cat_size.append(original_cat_size)
                measure_nb_samples.append(max_length)
                measure_ratios.append(ratio)
                measure_threshold.append(threshold)

    results_df = pd.DataFrame(
            {"original cat str len"     : measure_cat_str_len,
             "original cat data size"   : measure_cat_size,
             "nb samples"               : measure_nb_samples,
             "nb cat / nb samples ratio": measure_ratios,
             "nb samples threshold"     : measure_threshold,
             "nb cat"                   : measure_nb_cats
             }
    )
    results_df.to_csv(results_path)

results_df["nb cat / nb samples ratio"] = results_df["nb cat / nb samples ratio"].astype('category')
g = sns.FacetGrid(
        results_df,
        col="nb cat / nb samples ratio",
)
g.map(sns.lineplot, "original cat data size", "nb samples threshold")
g.map(sns.scatterplot, "original cat data size", "nb samples threshold")
plt.show()

  • original cat data size : memory size of the original categorical data (hash hex string of various sizes)

Which gives:

[plot: seaborn FacetGrid of "nb samples threshold" vs. "original cat data size", one panel per "nb cat / nb samples ratio"]


Categorical is essentially a way of expressing a kind of enum with some additional data and properties (see the link you gave: https://pandas.pydata.org/docs/user_guide/categorical.html => semantics, ordering, ...).
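
A tiny sketch of those extra semantics, using made-up size labels:

import pandas as pd

shirt_sizes = pd.Series(["small", "large", "medium", "small"]).astype(
    pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
)
print(shirt_sizes.min(), shirt_sizes.max())  # ordering follows the declared category order
print(shirt_sizes > "small")                 # comparisons work against a known category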

When nb_cat ~= nb_elements, which seems to be the worst-case scenario according to the pandas docs, one can start to wonder:

what is the category useful for when there is one category per data value?

colA colA_categorical
1    1
2    2
3    3
...  ...
n    n
# eeeeew !
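
To put that worst case in numbers (a sketch; exact byte counts vary by platform and pandas version):

import pandas as pd

# worst case: every row has its own category
unique_vals = pd.Series([f"id_{i:08d}" for i in range(100_000)])
print(int(unique_vals.memory_usage(index=False, deep=True)))
print(int(unique_vals.astype('category').memory_usage(index=False, deep=True)))
# the categorical stores all 100_000 unique strings *plus* an integer code per row,
# so it comes out larger than the plain object Series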

The rest is "in between", hence my "no"/"opinion based" answer.

Only a specific benchmark on your own use case may give some useful insight, and some sense of scalability of the possible implementation choices.

The pandas docs speak specifically of the string => categorical transformation because it is probably the most common case: some raw string input (CSV, JSON, an external API, a tool, a lab measurement machine, ...) is transformed into a DataFrame, and some columns are indeed categorical.

The other cases may be pd.cut, pd.qcut and similar, but how predictable is the result for optimization purposes?
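
For reference, pd.cut and pd.qcut do hand back categorical data whose categories are the bin intervals; a quick sketch:

import numpy as np
import pandas as pd

values = pd.Series(np.random.randn(1_000))
binned = pd.cut(values, bins=4)     # four equal-width bins
print(binned.dtype)                 # category
print(binned.cat.categories)        # an IntervalIndex of the four bins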

answered by LoneWanderer


Very loosely, you can expect to get memory savings when using a Categorical on Series/arrays that contain strings with low cardinality.

If you want to go a bit deeper, your intuition is correct - a Categorical is essentially a mapping of integral keys to the unique string values. So if your input was something like:

["abc", "abc", "def"]

A categorical would, behind the scenes, replace that with an array of [0, 0, 1] along with a mapping of {0: "abc", 1: "def"}.

In the majority of use cases, [0, 0, 1] would be represented as a numpy array with dtype=np.int8. Once you have more than roughly 127 unique values (the codes are signed 8-bit integers, with -1 reserved for missing values), the dtype of that array is widened appropriately, of course at the expense of more memory usage.
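
You can watch the code dtype widen as the number of categories grows; a quick sketch:

import pandas as pd

few = pd.Series([f"c{i}" for i in range(100)], dtype="category")
many = pd.Series([f"c{i}" for i in range(1_000)], dtype="category")
print(few.cat.codes.dtype)   # int8
print(many.cat.codes.dtype)  # int16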

Numpy works with contiguous arrays in memory, so ignoring the overhead of the Python numpy object itself, your int8 array is condensed to occupy just 3 bytes of memory (one byte per element). Alongside the array you now have the mapping of codes to categories. You can inspect these two pieces using the .cat accessor of the Series:

>>> ser = pd.Series(["abc", "abc", "def"], dtype="category")
>>> ser.cat.codes
0    0
1    0
2    1
dtype: int8
>>> ser.cat.categories
Index(['abc', 'def'], dtype='object')

In a simplistic world, we could compare the size of our now internally managed numpy array + mapping to the original input data and see if we've gotten any savings (a naive version of that comparison is sketched below). However, things still aren't quite that simple...
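
A naive version of that comparison, reusing the three-element Series from above (just a sketch; it ignores the 8-byte object pointers and container overhead on both sides):

import sys
import pandas as pd

ser = pd.Series(["abc", "abc", "def"], dtype="category")

codes_bytes = ser.cat.codes.to_numpy().nbytes                         # 3 x int8 = 3 bytes
categories_bytes = sum(sys.getsizeof(c) for c in ser.cat.categories)  # the two unique strings
object_bytes = sum(sys.getsizeof(v) for v in ser.astype(object))      # one string per row

print(codes_bytes + categories_bytes, "vs", object_bytes)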

Python itself employs a concept similar to categoricals, often referred to as string interning. You can see this in action below:

>>> text1 = "abc"
>>> text2 = "abc"
>>> id(text1)
93971347966496
>>> id(text2)
93971347966496
>>> sys.getsizeof(text1)
52
>>> sys.getsizeof(text2)
52

These strings have effectively been interned, meaning they both share the same memory address. This behavior depends on both your Python version (the examples shown use 3.11) and the string itself. Note that the next example isn't interned for me:

>>> text1 = "Hello, World!"
>>> text2 = "Hello, World!"
>>> id(text1)
139869410630000
>>> id(text2)
139869410629360
>>> sys.getsizeof(text1)
62
>>> sys.getsizeof(text2)
62

What is interesting is that sys.getsizeof in both cases returns the same value for text1/text2 even though we know the first example set was interned whereas the second was not. This means that you would likely overstate your memory usage in the first case if you calculated your overall usage as 52 + 52 bytes, but would likely be spot on in the second case calculating your usage as 62 + 62 bytes.
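
If you wanted to correct for that double counting, one rough workaround is to count each distinct string object only once; approx_unique_size below is a made-up helper, not a general-purpose memory profiler:

import sys

def approx_unique_size(strings):
    """Sum sys.getsizeof over distinct string objects only, so interned strings count once."""
    distinct = {id(s): s for s in strings}  # keep references so the ids stay valid
    return sum(sys.getsizeof(s) for s in distinct.values())

a = "abc"
b = "abc"                           # interned: same object as `a`
c = "".join(["Hello, ", "World!"])
d = "".join(["Hello, ", "World!"])  # built at runtime: two distinct objects
print(approx_unique_size([a, b]), approx_unique_size([c, d]))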

There's also the caveat of getting a true representation of memory usage from containers in the Python runtime. Note the following example:

>>> sys.getsizeof(["this is a one element list"])
64
>>> sys.getsizeof(["this is a one element list"][0])
75

As you can see, the value reported by sys.getsizeof on the list object does not deeply inspect the size of its contents, which is documented behavior:

https://docs.python.org/3/library/sys.html#sys.getsizeof
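
For what it's worth, pandas' own memory_usage(deep=True) (which the mem_usage helper in the question relies on) does interrogate the element objects, so it is usually the more practical yardstick here:

import pandas as pd

ser = pd.Series(["this is a one element list"])
print(int(ser.memory_usage(index=False, deep=False)))  # 8: just the pointer in the object array
print(int(ser.memory_usage(index=False, deep=True)))   # 8 + sys.getsizeof of the string itself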

I highlight these issues just to show that the calculations here are not as scientific as you might hope. More tooling could of course be developed to handle this (and may even already exist), but it's worth asking yourself where the tradeoff lies between getting an exact answer and something that is good enough.

So with that, I'd reiterate my suggestion that using Categorical makes sense when you have a lot of values with low cardinality. In cases where there is no clear winner between the two, I don't think the difference would matter all that much to the runtime of your program, and getting an exact answer may not be worth the time and effort.

answered by Will Ayd


