
How to reset Tensorboard when it tries to reuse a killed Windows PID

Apologies if two days' frustration leaks through...

Problem: can't reliably run Tensorboard in jupyter notebook (actually, in Jupyter Lab) with

%tensorboard --logdir {logdir}

and if I kill the tensorboard process and start it again in the notebook, it says it is reusing the dead process and port, but the process is dead, and netstat -ano | findstr :6006 shows nothing, so the port looks closed too.

Question: How in the name of $deity do I get tensorboard to restart from scratch and forget what it thinks it knows about processes, ports etc.? If I could do that I could hack away at residual path etc. issues...

Known issues already addressed (I think): need to escape backslashes in Python string to get proper path and other OS gremlins; avoid spaces in path, ensure correct capitalisation...
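
For reference, the backslash-escaping issue looks roughly like this (illustrative only, using the same path as below):

# Equivalent ways to write a Windows path in a Python string
logdir = 'A:\\tensorboard\\20200102-112749'   # backslashes escaped
logdir = r'A:\tensorboard\20200102-112749'    # or a raw string, no escaping needed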

Environment: Win 64-bit Home with Anaconda and Tensorflow-GPU 2 installed via conda install - TF is working and writes data to the specified path given via the callback

tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1) # logdir is the full path
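
(For context, the surrounding setup is roughly the following sketch; the model and fit call are placeholders, not my actual code:)

import datetime
import tensorflow as tf

# Timestamped run directory, matching the folder names shown below
logdir = 'A:\\tensorboard\\' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
# model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_callback])  # placeholder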

But I'm damned if I can start Tensorboard reliably within the notebook.

I found that if I started an Anaconda command window and invoked tensorboard from there, tensorboard started ok...

(TF2GPU_Anaconda) C:\Users\Julian>tensorboard --logdir "a:\tensorboard\20200102-112749"
2020-01-02 11:53:58.478848: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.0.0 at http://localhost:6006/ (Press CTRL+C to quit)

It was accessible in Chrome at localhost:6006 as stated (specifically http://localhost:6006/#scalars&run=20200102-112749%5Ctrain) (I'll ignore the other problems with tensorboard such as refresh failures on scalars, odd message on graph, etc.) and

%tensorboard --logdir {logdir}

then shows tensorboard in the notebook and in the separate chrome tab.

However! Whilst tensorboard reports in the notebook that it is reusing the old, dead PID, it is in fact running on a completely different, new PID.

What have I been doing wrong, and how do I reset tensorboard completely?

PS the last (successful!) invocation was in fact with

%tensorboard --logdir {makeWindowsCmdPath('A:\\tensorboard\\20200102-112749')}

where makeWindowsCmdPath is defined as

def makeWindowsCmdPath(path):
    # wrap the path in literal double quotes so the tensorboard CLI
    # treats it as a single argument even if it contains spaces
    return '\"' + str(path) + '\"'

UPDATE 2020-01-03: An MWE of eventual success has been uploaded in a comment at GitHub, in response to an issue that includes the PID-referencing errors of tensorboard.

Julian Moore asked Jan 02 '20



1 Answer

Hey—sorry to hear that you’re running into issues. It’s entirely plausible that everything that you describe is both accurate and my fault. :-)

How in the name of $deity do I get tensorboard to restart from scratch and forget what it thinks it knows about processes, ports etc.? If I could do that I could hack away at residual path etc. issues...

There is a directory called .tensorboard-info in your temp directory that maintains a best-effort registry of the TensorBoard jobs that we think are running. When TensorBoard launches (in any manner, including with %tensorboard), it writes an “info file” to that directory, and when you use %tensorboard we first check to see if a “compatible instance” (same working directory and CLI args) is still running, and if so reuse it instead. When a TensorBoard instance shuts down cleanly, it removes its own info file. The idea is that as long as TensorBoard is shut down cleanly we should always have an accurate record of which processes are live, and since this registry is in a temp directory any errors due to hard shutdowns will be short-lived.
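
If it helps with debugging, you can peek at what that registry currently claims. Something like the following sketch (this assumes the tensorboard.manager module that ships with TensorBoard 2.x) reads the same info files that %tensorboard consults:

import tempfile
from tensorboard import manager

# The info files live under <temp dir>/.tensorboard-info
print(tempfile.gettempdir())

# Each record was read from an info file; a stale record here is exactly
# what makes %tensorboard claim to be "reusing" a dead process.
for info in manager.get_all():
    print(info.pid, info.port, info.logdir)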

But this is where I erred: coming from the POSIX world and not being very familiar with Windows application development, I didn’t realize that the Windows temp directory is not actually automatically deleted, ever. Therefore, any bookkeeping errors persist indefinitely.

So, the answer to your question is, “remove the .tensorboard-info directory located under tempfile.gettempdir()” (preferably when you don’t have any actively running TensorBoard instances).
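
In Python terms, that cleanup is roughly the following (run it when no TensorBoard instances are active):

import os
import shutil
import tempfile

# Forget every TensorBoard instance that the registry thinks is running
info_dir = os.path.join(tempfile.gettempdir(), '.tensorboard-info')
if os.path.isdir(info_dir):
    shutil.rmtree(info_dir)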

There are ways that we can plausibly work around this in TensorBoard core: see https://github.com/tensorflow/tensorboard/issues/2483 for a start, and I’ve also considered amortized approaches like letting each TensorBoard instance perform some cleanup of other instances at start time. We haven’t yet gotten around to implementing these.

Let me know if this is helpful or if it fails to address your question.

wchargin answered Sep 20 '22