
TensorFlow 2.0 Keras save model to HDFS: Can't decrement id ref count

I have mounted an HDFS drive via HDFS FUSE, so I can access HDFS through the path /hdfs/xxx.

After training a model with Keras, I want to save it to /hdfs/model.h5 by calling model.save("/hdfs/model.h5").

I get the following error:

2020-02-26T10:06:51.83869705Z   File "h5py/_objects.pyx", line 193, in h5py._objects.ObjectID.__dealloc__
2020-02-26T10:06:51.838791107Z RuntimeError: Can't decrement id ref count (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838796288Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x7f20d000ddc8, total write size = 512, bytes this sub-write = 512, bytes actually written = 18446744073709551615, offset = 298264)
2020-02-26T10:06:51.838802442Z Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
2020-02-26T10:06:51.838807122Z Traceback (most recent call last):
2020-02-26T10:06:51.838811833Z   File "h5py/_objects.pyx", line 193, in h5py._objects.ObjectID.__dealloc__
2020-02-26T10:06:51.838816793Z RuntimeError: Can't decrement id ref count (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838821942Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x7f20d000ddc8, total write size = 512, bytes this sub-write = 512, bytes actually written = 18446744073709551615, offset = 298264)
2020-02-26T10:06:51.838827917Z Traceback (most recent call last):
2020-02-26T10:06:51.838832755Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 117, in save_model_to_hdf5
2020-02-26T10:06:51.838838098Z     f.flush()
2020-02-26T10:06:51.83885453Z   File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 452, in flush
2020-02-26T10:06:51.838859816Z     h5f.flush(self.id)
2020-02-26T10:06:51.838864401Z   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2020-02-26T10:06:51.838869302Z   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2020-02-26T10:06:51.838874126Z   File "h5py/h5f.pyx", line 146, in h5py.h5f.flush
2020-02-26T10:06:51.838879016Z RuntimeError: Can't flush cache (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838885827Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x4e5b018, total write size = 4, bytes this sub-write = 4, bytes actually written = 18446744073709551615, offset = 34552)

But I can write a file directly to the same path:

with open("/hdfs/a.txt", "w") as f:
    f.write("1")

I've also figured out a tricky workaround, and it works:

from shutil import move

model.save("temp.h5")                # save locally first
move("temp.h5", "/hdfs/model.h5")    # then move the file onto the HDFS mount

So maybe the problem lies in the Keras API? It can save the model locally but cannot save directly to an HDFS path.

Any idea how to fix the problem?

Hao Tan asked Dec 20 '25 02:12


1 Answer

In my particular case, this exact error came from a full drive: a background backup had filled the disk with a temporary file.

In previous production systems, for long-running tasks that continuously generated files, I would monitor the amount of free drive space every minute. If it got low, the monitor would raise an alert so the issue could be fixed, and if free space dropped to 1 GB the process would exit. Better to exit with 1 GB free than with 0 GB free, which can really break everything else running on the system.

It's a choice between sacrificing the entire system and sacrificing one task that is destined to fail within minutes anyway.
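A minimal sketch of that kind of guard in Python, using the standard-library `shutil.disk_usage`. The function names and the warning threshold are illustrative assumptions; only the 1 GB exit cut-off comes from the description above:

```python
import shutil
import sys

WARN_BYTES = 5 * 1024**3   # alert below 5 GB free (assumed threshold)
EXIT_BYTES = 1 * 1024**3   # bail out below 1 GB free, as described above

def check_free_space(path="/"):
    """Return the number of free bytes on the filesystem containing `path`."""
    return shutil.disk_usage(path).free

def guard_disk_space(path="/"):
    """Warn when space is low; exit before the drive is completely full."""
    free = check_free_space(path)
    if free < EXIT_BYTES:
        # Exiting here leaves headroom so the rest of the system keeps working.
        sys.exit(f"Free space below 1 GB ({free} bytes); exiting.")
    if free < WARN_BYTES:
        print(f"WARNING: only {free / 1024**3:.1f} GB free", file=sys.stderr)
```

In a long-running training loop, `guard_disk_space()` would be called periodically (e.g. once a minute from a background thread or at the start of each epoch).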

Contango answered Dec 21 '25 21:12
