
How to load checkpoints across different versions of pytorch (1.3.1 and 1.6.x) using ppc64le and x86?

As I outlined here, I am stuck using old versions of PyTorch and torchvision because of my hardware, e.g. IBM ppc64le architectures.

For this reason, I am having issues sending and receiving checkpoints between different computers, clusters and my personal Mac. Is there a way to load models that avoids this issue, e.g. saving models in both the old and the new format when using 1.6.x? Of course, going from 1.3.1 to the 1.6.x format is impossible, but at least I was hoping something would work.

Any advice? Of course, my ideal solution is that I don't have to worry about it and can load and save my checkpoints, and everything else I usually pickle, uniformly across all my hardware.


The first error I got was a zip jit error:

RuntimeError: /home/miranda9/data/f.pt is a zip archive (did you mean to use torch.jit.load()?)

so I used that (and other pickle libraries):

# %%
import torch
from pathlib import Path


def load(path):
    import torch
    import pickle
    import dill

    path = str(path)
    try:
        db = torch.load(path)
        f = db['f']
    except Exception as e:
        db = torch.jit.load(path)
        f = db['f']
        #with open():
        # db = pickle.load(open(path, "r+"))
        # db = dill.load(open(path, "r+"))
        #raise ValueError(f'FAILED: {e}')
    return db, f

p = "~/data/f.pt"
path = Path(p).expanduser()

db, f = load(path)

Din, nb_examples = 1, 5
x = torch.distributions.Normal(loc=0.0, scale=1.0).sample(sample_shape=(nb_examples, Din))

y = f(x)

print(y)
print('Success!\a')

but I get complaints about the different PyTorch versions, which I am forced to use:

Traceback (most recent call last):
  File "hal_pg.py", line 27, in <module>
    db, f = load(path)
  File "hal_pg.py", line 16, in load
    db = torch.jit.load(path)
  File "/home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/jit/__init__.py", line 239, in load
    cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
RuntimeError: version_number <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 1. Your PyTorch installation may be too old. (init at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbc (0x7fff7b527b9c in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1d98 (0x7fff1d293c78 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x88 (0x7fff1d2950d8 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::import_ir_module(std::shared_ptr<torch::jit::script::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x64 (0x7fff1e624664 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x70e210 (0x7fff7c0ae210 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x28efc4 (0x7fff7bc2efc4 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: <unknown function> + 0x25280 (0x7fff84b35280 in /lib64/libc.so.6)
frame #27: __libc_start_main + 0xc4 (0x7fff84b35474 in /lib64/libc.so.6)

Any ideas on how to make everything consistent across the clusters? I can't even open the pickle files.


Maybe this is just impossible with the current PyTorch version I am forced to use :(

RuntimeError: version_number <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 1. Your PyTorch installation may be too old. (init at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbc (0x7fff83ba7b9c in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1d98 (0x7fff25993c78 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x88 (0x7fff259950d8 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::import_ir_module(std::shared_ptr<torch::jit::script::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x64 (0x7fff26d24664 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x70e210 (0x7fff8472e210 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x28efc4 (0x7fff842aefc4 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: <unknown function> + 0x25280 (0x7fff8d335280 in /lib64/libc.so.6)
frame #24: __libc_start_main + 0xc4 (0x7fff8d335474 in /lib64/libc.so.6)

using this code:

from pathlib import Path

import torch

path = '/home/miranda9/data/dataset/'
path = Path(path).expanduser() / 'fi_db.pt'
path = str(path)

# db = torch.load(path)
# torch.jit.load(path)
db = torch.jit.load(str(path))

print(db)

Related links:

  • https://discuss.pytorch.org/t/how-to-load-checkpoints-across-different-versions-of-pytorch-1-3-1-and-1-6-x-using-ppc64le-and-x86/97829
  • related gitissue: https://github.com/pytorch/pytorch/issues/43766
  • reddit: https://www.reddit.com/r/pytorch/comments/jvza7v/how_to_load_checkpoints_across_different_versions/
Charlie Parker asked Sep 30 '20 15:09



1 Answer

I believe the developers intend for you to pass a flag when you want to save in the old pickle format; only the default behavior changed.

For previously checkpointed files, reload the zip-format weights in the newer environment (with PyTorch >= 1.6), then checkpoint them again as a pickle (no need to retrain).

Update your code to pass the flag on future saves.
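The re-save step described above could look roughly like the sketch below. It assumes it runs in the newer environment (PyTorch >= 1.6) and that the checkpoint holds plain tensors or a state_dict. The names `legacy_name` and `resave_as_legacy` are hypothetical helpers, not part of PyTorch. The only PyTorch-specific piece is the `_use_new_zipfile_serialization=False` flag from the release note quoted in this answer.

```python
from pathlib import Path


def legacy_name(src):
    """Hypothetical naming helper: 'f.pt' -> 'f_legacy.pt' in the same directory."""
    src = Path(src)
    return src.with_name(src.stem + "_legacy" + src.suffix)


def resave_as_legacy(src):
    """Load a checkpoint with PyTorch >= 1.6 and write a pickle-format copy."""
    # Imported here so the naming helper also works on machines without torch.
    import torch

    dst = legacy_name(src)
    # map_location="cpu" avoids needing the device the checkpoint was saved on.
    obj = torch.load(str(src), map_location="cpu")
    # Flag taken from the release note quoted in this answer.
    torch.save(obj, str(dst), _use_new_zipfile_serialization=False)
    return dst
```

Calling `resave_as_legacy("mymod.pt")` on the newer machine should leave a `mymod_legacy.pt` next to the original, which can then be copied to the older environment. Whether the old version can actually unpickle the contents still depends on what was stored, so treat this as a starting point.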

Deprecation note from version 1.6:

We have switched torch.save to use a zip file-based format by default rather than the old Pickle-based format. torch.load has retained the ability to load the old format, but use of the new format is recommended. The new format is:

  • more friendly for inspection and building tooling for manipulating the save files
  • fixes a long-standing issue wherein serialization (__getstate__, __setstate__) functions on Modules that depended on serialized Tensor values were getting the wrong data
  • the same as the TorchScript serialization format, making serialization more consistent across PyTorch

Usage is as follows:

m = MyMod()
torch.save(m.state_dict(), 'mymod.pt') # Saves a zipfile to mymod.pt

To use the old format, pass the flag _use_new_zipfile_serialization=False:

m = MyMod()
torch.save(m.state_dict(), 'mymod.pt', _use_new_zipfile_serialization=False) # Saves pickle
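To check which of the two formats an existing checkpoint file uses before moving it between machines, a small standard-library check may be enough. This is only a sketch: `checkpoint_format` is a hypothetical helper, and `zipfile.is_zipfile` is a general zip test rather than the check PyTorch itself performs, so treat the result as a hint.

```python
import pickle
import tempfile
import zipfile
from pathlib import Path


def checkpoint_format(path):
    """Return 'zip' if the file is a zip archive, else 'legacy-or-other'."""
    return "zip" if zipfile.is_zipfile(str(path)) else "legacy-or-other"


if __name__ == "__main__":
    # Demo with stand-in files; real checkpoints would come from torch.save.
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "new_style.pt"
        with zipfile.ZipFile(str(zip_path), "w") as zf:
            zf.writestr("data.pkl", b"placeholder")

        pkl_path = Path(tmp) / "old_style.pt"
        with open(str(pkl_path), "wb") as fh:
            pickle.dump({"w": [1.0, 2.0]}, fh)

        print(zip_path.name, checkpoint_format(zip_path))
        print(pkl_path.name, checkpoint_format(pkl_path))
```

A file reported as 'zip' is the kind that triggers the "is a zip archive" error in the question when opened with an older PyTorch, so it is a candidate for re-saving with the flag above.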
Saleem Ahmed answered Oct 09 '22 09:10