Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: How to store cProfile output in a pandas DataFrame?

There already exist some posts discussing python profiling using cProfile, as well as the challenges of analyzing the output due to the fact that the output file restats from the sample code below is not a plain text file. The snippet below is only a sample from docs.python.org/2/library/profile, and not directly reproducible.

import cProfile
import re
cProfile.run('re.compile("foo|bar")', 'restats')

There is one discussion here: Profile a python script using cProfile into an external file, and on the docs.python.org there are more details on how to analyze the output using pstats.Stats (still only a sample, and not reproducible):

import pstats
p = pstats.Stats('restats')
p.strip_dirs().sort_stats(-1).print_stats()

I may be missing some very important details here, but I would really like to store the output in a pandas DataFrame and do further analysis from there.

I thought it would be pretty simple since the output in iPython from running cProfile.run() looks fairly tidy:

In[]:
cProfile.run('re.compile("foo|bar")'

Out[]:

enter image description here

Any suggestions on how to get this into a pandas DataFrame in the same format?

like image 653
vestland Avatar asked Jun 01 '17 08:06

vestland


4 Answers

I know this already has an answer, but for anyone who doesn't want to go to the trouble of downloading another module, here's a rough and ready script that should get close:

%%capture profile_results    ## uses %%capture magic to send stdout to variable
cProfile.run("your_function( **run_parms )")

Run the above first, to populate profile_results with the contents of stout, which contains the usual printed output of cProfile.

## Parse the stdout text and split it into a table
data=[]
started=False

for l in profile_results.stdout.split("\n"):
    if not started:
        if l=="   ncalls  tottime  percall  cumtime  percall filename:lineno(function)":
            started=True
            data.append(l)
    else:
        data.append(l)
content=[]
for l in data:
    fs = l.find(" ",8)
    content.append(tuple([l[0:fs] , l[fs:fs+9], l[fs+9:fs+18], l[fs+18:fs+27], l[fs+27:fs+36], l[fs+36:]]))
prof_df = pd.DataFrame(content[1:], columns=content[0])

It wont win any prizes for elegance or pleasant style, but it does force that table of results into a filterable dataframe format.

prof_df 

enter image description here

like image 51
Thomas Kimber Avatar answered Sep 17 '22 12:09

Thomas Kimber


It looks like https://github.com/ssanderson/pstats-view might do what you want (albeit with unnecessary dependencies related to visualising the data and making it interactive):

>>> from pstatsviewer import StatsViewer
>>> sv = StatsViewer("/path/to/profile.stats")
>>> sv.timings.columns
Index(['lineno', 'ccalls', 'ncalls', 'tottime', 'cumtime'], dtype='object')
like image 22
joeln Avatar answered Sep 17 '22 12:09

joeln


In case people don't want to use %%capture or go through a CSV, below a cobbled-together solution, in this case comparing multiple cProfiles in the same folder by (1) sorting each cProfile by cumulative time and (2) adding only the top result (pstats.Stats(f, stream = p_output).sort_stats("cumulative").print_stats(1)) from each .prof to a data frame (along with a portion of the .prof filename to identify which profile the measurement came from).

See here for some of the original code (which does use CSVs as the intermediary).

import io
import pstats
import pandas as pd
import glob

all_files = glob.glob(profiledir + "/*.prof")

li = []

for f in all_files:
    
    p_output = io.StringIO()

    prof_stats = pstats.Stats(f, stream = p_output).sort_stats("cumulative").print_stats(1)

    p_output = p_output.getvalue()
    p_output = 'ncalls' + p_output.split('ncalls')[-1]
    result = '\n'.join([','.join(line.rstrip().split(None,5)) for line in p_output.split('\n')])

    df = pd.read_csv(io.StringIO(result), sep=",", header=0)
    df['module name'] = f.split(' ')[0].split('\\')[1] # differs depending on your file naming convention
    li.append(df) 

df = pd.concat(li, axis=0, ignore_index=True)
like image 23
grg Avatar answered Sep 18 '22 12:09

grg


You can use this function to accomplish this task

def convert_to_df(path, offset=6):
    """
    path: path to file
    offset: line number from where the columns start
    """
    with open(path, "r") as f:
        core_profile = f.readlines()
    core_profile = core_profile[offset:]
    cols = core_profile[0].split()
    n = len(cols[:-1])
    data = [_.split() for _ in core_profile[1:]]
    data = [_ if len(_)==n+1 else _[:n]+[" ".join(_[n+1:])] for _ in data]
    data_ = pd.DataFrame(data, columns=cols)
    return data_
like image 22
Akhil Vempali Avatar answered Sep 18 '22 12:09

Akhil Vempali