
Why does TensorFlow just output "Killed"?

Tags:

tensorflow

When I run my TensorFlow app, it just outputs "Killed". How do I debug this?

source code

root@8e4a3a65184e:~/tensorflow# python sample_cnn.py 
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_model_dir': 'data/convnet_model', '_save_summary_steps': 100}
INFO:tensorflow:Create CheckpointSaverHook.
2017-08-17 12:56:53.160481: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-17 12:56:53.160536: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-17 12:56:53.160545: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-17 12:56:53.160550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-17 12:56:53.160555: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Killed
reachlin asked Aug 17 '17 13:08


1 Answer

When I run your code I get the same behavior. If you run dmesg afterwards, you'll see a trace like the following, which confirms what gdelab was hinting at:

[38607.234089] python3 invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[38607.234090] python3 cpuset=/ mems_allowed=0
[38607.234094] CPU: 3 PID: 1420 Comm: python3 Tainted: G           O    4.9.0-3-amd64 #1 Debian 4.9.30-2+deb9u2
[38607.234094] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.2.4 03/29/2017
[38607.234096]  0000000000000000 ffffffffa9f28414 ffffa50090317cf8 ffff940effa5f040
[38607.234097]  ffffffffa9dfe050 0000000000000000 0000000000000000 0101ffffa9d82dd0
[38607.234098]  e09c7db7f06d0ac2 00000000ffffffff 0000000000000000 0000000000000000
[38607.234100] Call Trace:
[38607.234104]  [<ffffffffa9f28414>] ? dump_stack+0x5c/0x78
[38607.234106]  [<ffffffffa9dfe050>] ? dump_header+0x78/0x1fd
[38607.234108]  [<ffffffffa9d8047a>] ? oom_kill_process+0x21a/0x3e0
[38607.234109]  [<ffffffffa9d800fd>] ? oom_badness+0xed/0x170
[38607.234110]  [<ffffffffa9d80911>] ? out_of_memory+0x111/0x470
[38607.234111]  [<ffffffffa9d85b4f>] ? __alloc_pages_slowpath+0xb7f/0xbc0
[38607.234112]  [<ffffffffa9d85d8e>] ? __alloc_pages_nodemask+0x1fe/0x260
[38607.234113]  [<ffffffffa9dd7c3e>] ? alloc_pages_vma+0xae/0x260
[38607.234115]  [<ffffffffa9db39ba>] ? handle_mm_fault+0x111a/0x1350
[38607.234117]  [<ffffffffa9c5fd84>] ? __do_page_fault+0x2a4/0x510
[38607.234118]  [<ffffffffaa207658>] ? page_fault+0x28/0x30
...
[38607.234158] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
...
[38607.234332] [ 1396]  1000  1396  4810969  3464995    6959      21        0             0 python3
[38607.234332] Out of memory: Kill process 1396 (python3) score 568 or sacrifice child
[38607.234357] Killed process 1396 (python3) total-vm:19243876kB, anon-rss:13859980kB, file-rss:0kB, shmem-rss:0kB
[38607.720757] oom_reaper: reaped process 1396 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

This means that Python was consuming too much memory (about 13 GiB resident, going by the anon-rss figure above), so the kernel's OOM killer terminated the process. If you add some prints to your code, you'll see that mnist_classifier.train() is the function that is active when this happens. However, some quick tests (such as removing the logging and lowering the number of steps) did not seem to help here.
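To go one step beyond plain prints, the checkpoints can also report how much memory the process has used so far. The following is only a minimal sketch using Python's standard resource module (Unix only); the helper names peak_rss_mib and log_memory are made up for illustration and are not part of TensorFlow or of the original script.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(label):
    """Print the peak memory reached so far, tagged with a checkpoint label."""
    mib = peak_rss_mib()
    # flush=True so the line is written out immediately; buffered output
    # would be lost if the kernel kills the process shortly afterwards.
    print(f"[mem] {label}: peak RSS {mib:.1f} MiB", flush=True)
    return mib


if __name__ == "__main__":
    log_memory("start")
    data = [0] * 5_000_000  # stand-in for loading data or building the model
    log_memory("after allocation")
```

Calling log_memory() after loading the data and again just before mnist_classifier.train() would show how much memory is already in use when training starts. Note that this reports a high-water mark, so it only ever grows, and the last line printed before "Killed" tells you which stage the process had reached.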

amo-ej1 answered Sep 25 '22 00:09