
unable to mmap 1024 bytes - Cannot allocate memory - even though there is more than enough RAM

I'm currently working on a seminar paper on NLP, specifically summarization of source-code function documentation. For this I created my own dataset with ca. 64000 samples (37453 of which form the training set), and I want to fine-tune the BART model. I use the simpletransformers package, which is built on top of the Hugging Face transformers package. My dataset is a pandas dataframe. An example of my dataset:

[screenshot: a few example rows of the dataset, with a 'text' and a 'summary' column]
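
Since the screenshot can't be reproduced here, below is a purely hypothetical illustration of what a row looks like before the rename in the code that follows; the contents are made up, the real data pairs a function's source code ('text') with its documentation summary ('summary'):

import pandas as pd

# Hypothetical rows for illustration only; the real entries are longer.
example_df = pd.DataFrame({
    "text": ["def add(a, b):\n    return a + b"],
    "summary": ["Add two numbers and return the result."],
})
print(example_df.head())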

My code:

import logging

import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

train_df = pd.read_csv(train_path, index_col=0)
train_df.rename(columns={'text':'input_text', 'summary':'target_text'}, inplace=True)

# Logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Hyperparameters
model_args = Seq2SeqArgs()

model_args.num_train_epochs = 10
# bart-base = 32, bart-large-cnn = 16
model_args.train_batch_size = 16
# model_args.no_save = True
# model_args.evaluate_generated_text = True
model_args.evaluate_during_training = True
model_args.evaluate_during_training_verbose = True

model_args.overwrite_output_dir = True
model_args.save_model_every_epoch = False
model_args.save_eval_checkpoints = False
model_args.save_optimizer_and_scheduler = False
model_args.save_steps = -1
best_model_dir = 'drive/MyDrive/outputs/bart-large-cnn/best_model/'
model_args.best_model_dir = best_model_dir
   
# Initialize model
model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-base",
    args=model_args,
    use_cuda=True,
)


# Train the model
model.train_model(
    train_df, 
    # eval_data=eval_df, 
    # matches=count_matches,
)

Everything is fine so far, but I get this error when I start the training.


Here is the error from a run I did on a Colab notebook:

Exception in thread Thread-14:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 470, in _handle_results
    task = get()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/reductions.py", line 287, in rebuild_storage_fd
    storage = cls._new_shared_fd(fd, size)
RuntimeError: unable to mmap 1024 bytes from file <filename not specified>: Cannot allocate memory (12)

One would think that I simply don't have enough memory, but this was my System Monitor ca. 3 seconds after the error:

[screenshot: System Monitor ca. 3 seconds after the error, still showing free memory and swap]

and this was the lowest my available/free memory got between starting the training and getting the error:

[screenshot: System Monitor at the point of lowest available memory during training]

After a lot of tuning I found out that, for some reason, everything works fine as long as I train the model on a dataset of at most ca. 21000 samples. It doesn't matter whether I train the "base" or the "large-cnn" version of the BART model; it only depends on the size of my dataset. The error always occurs during the "Creating features from dataset file at cache_dir/" step.
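
As a side note, a minimal sketch of how such a threshold can be reproduced by training on random subsets of the full dataframe (the sample sizes here are arbitrary illustrations, not my exact runs):

# Train on increasingly large random subsets to narrow down the failing size.
# The sizes are arbitrary; for me the error started at roughly 21000 samples.
for n in (5000, 10000, 21000, 25000):
    subset_df = train_df.sample(n=n, random_state=42)
    model.train_model(subset_df)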

So what have I already tried:

  • I added a lot of swap memory (as you can see in the screenshot of my System Monitor)

  • reduced the number of workers to 1 (see the sketch after this list)

  • I increased both the hard and the soft limit of my system's open-files limit (ulimit -n) to 86000

[screenshot: terminal output confirming the raised open-files limits]
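
Regarding the worker reduction: a sketch of the simpletransformers args I assume are the relevant ones (process_count for the feature-conversion processes, dataloader_num_workers for the PyTorch DataLoader):

# Reduce worker counts in simpletransformers; these two args are the ones
# I assume control feature-creation processes and DataLoader workers.
model_args.process_count = 1
model_args.dataloader_num_workers = 1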

I also tried to train the model in a Google Colab notebook, but I had the same issue: if the dataset size goes over ca. 21000, the training fails. This happened even after I doubled the memory of my Colab session while keeping the dataset size just a little bit over the 21000 limit.

Desktop:

transformers 4.6.0

simpletransformers 0.61.4

ubuntu 20.04.2 LTS

After trying to solve this by myself for literally weeks, I would be more than happy if any of you have an idea how I can solve this :)

(I am aware of the post "mmap returns can not allocate memory, even though there is enough"; unfortunately it couldn't solve my problem. My vm.max_map_count is at 860000.)

asked Nov 01 '25 by ThyR4n

2 Answers

While I do not know how to deal with this problem directly, I had a somewhat similar issue (and solved it). The differences are:

  • I use fairseq
  • I can run my code on google colab with 1 GPU
  • I got RuntimeError: unable to mmap 280 bytes from file </torch_40419_282117887>: Cannot allocate memory (12) immediately when I tried to run it on multiple GPUs.

From other people's code, I found that they use python -m torch.distributed.launch -- ... to run fairseq-train. I added it to my bash script, the RuntimeError is gone, and training runs.

So I guess if you can run with 21000 samples, you may be able to use torch.distributed to split the whole dataset into smaller batches and distribute them to several workers. A generic sketch of that pattern is below.
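
This is not the fairseq command itself, just a minimal, generic sketch of the torch.distributed pattern I mean; the world size, backend, and address/port are illustrative assumptions:

import os

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Minimal process-group setup; backend/address/port are illustrative choices.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... each process would build its model and train on its own data shard here ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # e.g. one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)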

answered Nov 03 '25 by ds ASDW


So I just found a simple workaround. You can just set the model's use_multiprocessing argument to False:

model_args.use_multiprocessing = False

Now I can run with my whole dataset.
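
For completeness, a sketch of where the flag goes relative to the code in the question (all other arguments stay as in the question):

model_args = Seq2SeqArgs()
model_args.use_multiprocessing = False  # feature creation then runs in a single process
# ... remaining args as in the question ...

model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-base",
    args=model_args,
    use_cuda=True,
)
model.train_model(train_df)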

answered Nov 03 '25 by ThyR4n


