Apologies if two days' frustration leaks through...
Problem: I can't reliably run TensorBoard in a Jupyter notebook (actually, in JupyterLab) with
%tensorboard --logdir {logdir}
and if I kill the TensorBoard process and start again in the notebook, it says it is reusing the dead process and port, but the process is dead and `netstat -ano | findstr :6006` shows nothing, so the port looks closed too.
Question: How in the name of $deity do I get tensorboard to restart from scratch and forget what it thinks it knows about processes, ports etc.? If I could do that I could hack away at residual path etc. issues...
Known issues already addressed (I think): escaping backslashes in the Python string to get a proper path, and other OS gremlins; avoiding spaces in the path; ensuring correct capitalisation...
Environment: Win 64-bit Home with Anaconda and TensorFlow-GPU 2 installed via conda install. TF is working and writes data to the specified path given via the callback
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1) # logdir is the full path
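For context, this is roughly how that callback is wired into training; the toy model and data below are illustrative only, not my actual network:

```python
import datetime
import tensorflow as tf

# Timestamped run directory under the same A:\tensorboard root as above.
logdir = "A:\\tensorboard\\" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# Illustrative toy model and data, just to show the callback in use.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# The callback writes event files (scalars, histograms, graph) under logdir.
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

x = tf.random.normal((64, 4))
y = tf.random.normal((64, 1))
model.fit(x, y, epochs=2, callbacks=[tensorboard_callback])
```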
But I'm damned if I can start TensorBoard reliably within the notebook.
I found that if I started an Anaconda command window and invoked tensorboard from there, TensorBoard started OK...
(TF2GPU_Anaconda) C:\Users\Julian>tensorboard --logdir "a:\tensorboard\20200102-112749"
2020-01-02 11:53:58.478848: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.0.0 at http://localhost:6006/ (Press CTRL+C to quit)
It was accessible in Chrome at localhost:6006 as stated (specifically http://localhost:6006/#scalars&run=20200102-112749%5Ctrain) (I'll ignore the other problems with TensorBoard such as refresh failures on scalars, an odd message on the graph tab, etc.), and
%tensorboard --logdir {logdir}
then shows TensorBoard in the notebook and in the separate Chrome tab.
However! Whilst TensorBoard reports in the notebook that it is reusing the old, dead PID, it is in fact running on a completely different, new PID.
What have I been doing wrong, and how do I reset tensorboard completely?
PS the last (successful!) invocation was in fact with
%tensorboard --logdir {makeWindowsCmdPath('A:\\tensorboard\\20200102-112749')}
where makeWindowsCmdPath is defined as
def makeWindowsCmdPath(path):
    # Wrap the path in double quotes so backslashes and any spaces survive the shell.
    return '"' + str(path) + '"'
UPDATE 2020-01-03: An MWE of eventual success has been uploaded in a comment on GitHub, in response to an issue that covers the PID-referencing errors of TensorBoard.
To close it, I just close the TensorBoard tab in my browser and, in the Jupyter notebook, click Interrupt Kernel.
In the navigation pane, click Workspaces, then select Jupyter and launch a new workspace. From the Files tab in the workspace, click New > TensorBoard. You can access TensorBoard from the Running tab.
Hey—sorry to hear that you’re running into issues. It’s entirely plausible that everything that you describe is both accurate and my fault. :-)
How in the name of $deity do I get tensorboard to restart from scratch and forget what it thinks it knows about processes, ports etc.? If I could do that I could hack away at residual path etc. issues...
There is a directory called `.tensorboard-info` in your temp directory that maintains a best-effort registry of the TensorBoard jobs that we think are running. When TensorBoard launches (in any manner, including with `%tensorboard`), it writes an “info file” to that directory, and when you use `%tensorboard` we first check to see if a “compatible instance” (same working directory and CLI args) is still running, and if so reuse it instead. When a TensorBoard instance shuts down cleanly, it removes its own info file. The idea is that as long as TensorBoard is shut down cleanly we should always have an accurate record of which processes are live, and since this registry is in a temp directory any errors due to hard shutdowns will be short-lived.
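If you want to see what that registry currently claims, you can read the info files directly; here is a rough sketch (the exact file names and JSON fields are internal details, so treat them as an assumption that may change between versions):

```python
import json
import os
import tempfile

# Each TensorBoard instance records itself as a small JSON "info file" here.
info_dir = os.path.join(tempfile.gettempdir(), ".tensorboard-info")

names = os.listdir(info_dir) if os.path.isdir(info_dir) else []
for name in names:
    with open(os.path.join(info_dir, name)) as f:
        info = json.load(f)
    # Fields include (roughly) the pid, port, and logdir the instance was started with.
    print(name, info.get("pid"), info.get("port"), info.get("logdir"))
```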
But this is where I erred: coming from the POSIX world and not being very familiar with Windows application development, I didn’t realize that the Windows temp directory is not actually automatically deleted, ever. Therefore, any bookkeeping errors persist indefinitely.
So, the answer to your question is: remove the `.tensorboard-info` directory located under `tempfile.gettempdir()` (preferably when you don’t have any actively running TensorBoard instances).
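In code, that cleanup is roughly the following (run it while no TensorBoard instances are active, since it wipes the whole registry):

```python
import os
import shutil
import tempfile

# Remove TensorBoard's instance registry so the next launch starts from scratch.
info_dir = os.path.join(tempfile.gettempdir(), ".tensorboard-info")
shutil.rmtree(info_dir, ignore_errors=True)
```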
There are ways that we can plausibly work around this in TensorBoard core: see https://github.com/tensorflow/tensorboard/issues/2483 for a start, and I’ve also considered amortized approaches like letting each TensorBoard instance perform some cleanup of other instances at start time. We haven’t yet gotten around to implementing these.
Let me know if this is helpful or if it fails to address your question.