I have a Jupyter notebook with ~400 cells. The total file size is 8MB, so I'd like to suppress the output cells that have a large size in order to reduce the overall file size.
There are quite a few output cells that could be causing this (mainly matplotlib and seaborn plots), so to avoid spending time on trial and error, is there a way of finding the size of each output cell? I'd like to keep as many output plots as possible, as I'll be pushing the work to GitHub for others to see.
My idea: use nbformat to iterate over the cells in your notebook and check which ones have the largest base64-encoded output. This is spelled out in the code below, meant to be run in a Jupyter notebook cell; it lists the code cell numbers from largest to smallest output. (It fetches an example notebook first so there's something to try it on; after establishing that it works, place the script alongside your own .ipynb notebook file and substitute that file name.)
############### Get test notebook ########################################
import os
notebook_example = "matplotlib3d-scatter-plots.ipynb"
if not os.path.isfile(notebook_example):
    !curl -OL https://raw.githubusercontent.com/fomightez/3Dscatter_plot-binder/master/matplotlib3d-scatter-plots.ipynb

### Use nbformat to get estimate of output size from code cells. #########
import nbformat as nbf
ntbk = nbf.read(notebook_example, nbf.NO_CONVERT)
size_estimate_dict = {}
for cell in ntbk.cells:
    if cell.cell_type == 'code':
        size_estimate_dict[cell.execution_count] = len(str(cell.outputs))
out_size_info = [k for k, v in sorted(size_estimate_dict.items(), key=lambda item: item[1], reverse=True)]
out_size_info
(To have a place to easily run that code, go here and click on the launch binder button. When the session spins up, open a new notebook, paste in the code, and run it. A static form of the notebook is here.)
The example I tried didn't include Plotly, but the approach seemed to do something similar on a notebook containing all Plotly plots. I don't know how it will handle a mix of output types, though; the sorting may not be perfect if different kinds are present.
Hopefully, this gives you an idea of how to do what you asked. The code example could be further expanded to use the retrieved size estimates to have nbformat make a copy of the input notebook with the output stripped for, say, the ten largest code cells.
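A minimal sketch of that expansion, using only the nbformat read/write API shown above. The function name `strip_largest_outputs`, the paths, and the `top_n` default are placeholders of my own, not part of the original answer:

```python
import nbformat as nbf

def strip_largest_outputs(nb_path: str, out_path: str, top_n: int = 10) -> int:
    """Clear the outputs of the `top_n` code cells with the largest outputs
    and write the result to `out_path`. Returns the number of cells cleared."""
    ntbk = nbf.read(nb_path, nbf.NO_CONVERT)
    # Pair each code cell with the same rough size estimate used above.
    sized = [
        (len(str(cell.outputs)), cell)
        for cell in ntbk.cells
        if cell.cell_type == "code"
    ]
    # Largest outputs first.
    sized.sort(key=lambda pair: pair[0], reverse=True)
    for _, cell in sized[:top_n]:
        cell.outputs = []
        cell.execution_count = None
    nbf.write(ntbk, out_path)
    return min(top_n, len(sized))
```

For example, `strip_largest_outputs("my_notebook.ipynb", "my_notebook_stripped.ipynb", top_n=10)` would write a copy with the ten largest outputs removed, leaving the original untouched.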
I had a similar problem and created my own script based on Wayne's answer. You can pass it the path to a Jupyter notebook and it will print the code cells with the largest outputs, ordered by size.
For easier reference, the cell number, the size of the output it produced, and the first few lines of its code are printed. You can step through the code cells from largest to smallest output by hitting enter :)
Please be aware that you will need to run this script from the command line (otherwise the input() part won't work):
import nbformat as nbf
from typing import TypedDict


class CodeCellMeta(TypedDict):
    cell_num: int
    output_size_bytes: int
    first_lines: list[str]


def get_code_cell_metadata(nb_path: str) -> list[CodeCellMeta]:
    ntbk = nbf.read(nb_path, nbf.NO_CONVERT)
    cell_metas: list[CodeCellMeta] = []
    for i, cell in enumerate(ntbk.cells):
        cell_num = i + 1
        if cell.cell_type == "code":
            meta: CodeCellMeta = {
                "output_size_bytes": len(str(cell.outputs)),
                "cell_num": cell_num,
                "first_lines": cell.source.split("\n")[:5],
            }
            cell_metas.append(meta)
    return cell_metas


def human_readable_size(size_bytes: int) -> str:
    size_current_unit: float = size_bytes
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_current_unit < 1024:
            return f"{size_current_unit:.2f} {unit}"
        size_current_unit /= 1024.0
    return f"{size_current_unit:.2f} PB"


def show_large_cells(nb_path: str):
    code_cell_meta = get_code_cell_metadata(nb_path)
    cell_meta_by_size_est = sorted(
        code_cell_meta, key=lambda x: x["output_size_bytes"], reverse=True
    )
    bytes_remaining = sum(el["output_size_bytes"] for el in cell_meta_by_size_est)
    for i, el in enumerate(cell_meta_by_size_est):
        print(f"Cell #{el['cell_num']}: {human_readable_size(el['output_size_bytes'])}")
        print("\n".join(el["first_lines"]))
        print("\n")
        bytes_remaining -= el["output_size_bytes"]
        if i != len(cell_meta_by_size_est) - 1:
            input(
                f"Remaining cell outputs account for {human_readable_size(bytes_remaining)} in total. Hit enter to view info for the next cell."
            )
        else:
            print("No more cells to view.")


if __name__ == "__main__":
    import sys

    try:
        nb_path = sys.argv[1]
    except IndexError:
        raise ValueError("Please provide a path to a Jupyter notebook file.")
    if not nb_path.endswith(".ipynb"):
        raise ValueError("Please provide a path to a Jupyter notebook file.")
    show_large_cells(nb_path)