
Determine all files read into and written from an ipython notebook

This is a generalization of this question: Way to extract pickles coming in and out of ipython / jupyter notebook

At the highest level, I'm looking for a way to automatically summarize what goes on in an ipython notebook. One way I see of simplifying the problem is to treat all the data manipulations that go on inside the notebook as a black box and focus only on its inputs and outputs. So, given the filepath to an ipython notebook, is there a way to easily determine all the files/websites it reads into memory, and also all the files that it later writes/dumps? I'm thinking there could be a function that scans the file, parses it for inputs and outputs, and saves them in a dictionary for easy access:

summary_dict = summerize_file_io(ipynb_filepath)

print summary_dict["inputs"] 
> ["../Resources/Data/company_orders.csv", "http://special_company.com/company_financials.csv" ]

print summary_dict["outputs"]
> ["orders_histogram.jpg","data_consolidated.pickle"]

I'm wondering how to do this easily for more than just pickle objects, covering formats like txt, csv, jpg, png, etc., and also cases that read data directly from the web into the notebook itself.

Afflatus asked Feb 17 '17 02:02


1 Answer

You can check which files you have opened or modified by patching the builtin open, as JRG suggested. If you also want to track web access, extend the same approach to patch whichever functions you use to connect to websites.

import builtins


modified = {}
old_open = builtins.open


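def new_open(name, mode='r', *args, **kwargs):
    # record the mode used for this file, then defer to the real open()
    modified[name] = mode
    # pass mode positionally: mode=mode combined with extra positional
    # args (e.g. buffering) would raise a TypeError
    return old_open(name, mode, *args, **kwargs)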


# patch builtin open
builtins.open = new_open


# check modified
def whats_modified():
    print('Session has opened/modified the following files:')
    for name in sorted(modified):
        mode = modified[name]
        print(mode.ljust(8) + name)

If we execute this in the interpreter (or import it as a module), we can see which files we've touched and how each was opened.

In [4]: with open('ex.txt') as file:
   ...:     print('ex.txt:', file.read())
   ...:     
ex.txt: some text.



In [5]: with open('other.txt', 'w') as file:
   ...:     file.write('Other text.\n')
   ...:     

In [6]: whats_modified()
Session has opened/modified the following files:
r       ex.txt
w       other.txt

This is somewhat limited, though: the recorded mode is overwritten whenever a file is reopened. That can be fixed with some extra checks performed in new_open.

Tankobot answered Nov 06 '22 06:11