
change specific indexes in string to same value python

Goal

Read in a massive binary file (approximately 1.3GB), change certain bits, and write the result to a separate file (the original file cannot be modified).

Method

When I read in the binary file, it gets stored in a massive hex-encoded string, which is immutable since I am using Python.

My algorithm loops through the entire file and stores in a list all the indexes of the string that need to be modified. The catch is that all of those indexes need to be set to the same value. I cannot do this in place because strings are immutable. I cannot convert the string into a list of chars because that blows past my memory constraints and takes a hell of a lot of time. The only viable option seems to be building a separate string, but because strings are immutable I have to create a ton of string objects and keep concatenating them.

I used some ideas from https://waymoot.org/home/python_string/, but they don't give me good performance. Any ideas? The goal is to copy an existing super long string exactly into another, except at certain positions determined by the values in the index list.

john smith asked Feb 01 '16 17:02



1 Answer

So, to be honest, you shouldn't be reading your file into a string. You especially shouldn't be writing anything but the bytes you actually change. That is just a waste of resources, since you only seem to be reading linearly through the file, noting down the places that need to be modified.
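One way to read that advice, sketched under assumptions: copy the original once, then overwrite only the bytes at the recorded offsets in the copy. The function name `patch_copy`, its parameters, and the file paths are hypothetical, and `offsets` stands for the index list from the question, expressed as byte offsets into the raw file rather than positions in a hex string.

```python
import shutil


def patch_copy(src_path, dst_path, offsets, new_byte):
    """Copy src_path to dst_path, then set each byte at `offsets` to new_byte.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # The original stays untouched; all edits go to the copy.
    shutil.copyfile(src_path, dst_path)
    replacement = bytes([new_byte])  # a single byte, value 0..255
    # "r+b" opens the copy for in-place binary updates without truncating it.
    with open(dst_path, "r+b") as out_f:
        for offset in sorted(offsets):  # ascending order keeps seeks sequential
            out_f.seek(offset)
            out_f.write(replacement)
```

Apart from the one-time copy, this only ever writes the bytes that change, and it never holds more than a single byte of file content in a Python object.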

On all OSes with some level of mmap support (that is, Unixes, among them Linux, OS X, *BSD, and other OSes like Windows), you can use Python's mmap module to just open the file in read/write mode, scan through it and edit it in place, without ever needing to load it into RAM completely and then write it back out. Stupid example, replacing all 12-valued bytes with something position-dependent:

Note: this code is mine, and not MIT-licensed. It's for text-enhancement purposes and thus covered by CC-by-SA. Thanks SE for making this stupid statement necessary.

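import mmap

# Binary, read-only: the original file is never modified.
with open("infilename", "rb") as in_f:
    # length = 0: complete file mapping
    in_view = mmap.mmap(in_f.fileno(), 0, access=mmap.ACCESS_READ)
    length = in_view.size()
    # "w+b": a writable mapping needs a file opened for reading and writing
    with open("outfilename", "w+b") as out_f:
        out_f.truncate(length)  # the file needs its final size before it is mapped
        out_view = mmap.mmap(out_f.fileno(), length)
        for i in range(length):
            # Python 3: indexing an mmap yields an int in 0..255
            if in_view[i] == 12:
                out_view[i] = in_view[i] + i % 10
            else:
                out_view[i] = in_view[i]
        out_view.flush()
        out_view.close()
    in_view.close()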
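Looping over every index in pure Python can be slow for a file of around 1.3GB. A possible variation, offered as a sketch rather than something tested at that scale: copy the input, map the copy, and let `mmap.find` jump straight to each byte that needs changing. The helper name `patch_bytes_mmap` and the file paths are hypothetical, and it assumes Python 3, a non-empty file (an empty file cannot be mapped), and a `target` value of at most 246 so the result still fits in a byte.

```python
import mmap
import shutil


def patch_bytes_mmap(src_path, dst_path, target=12):
    """Copy src_path to dst_path and rewrite every `target` byte in the copy.

    Hypothetical helper; returns the number of positions visited.
    Assumes a non-empty file and target <= 246.
    """
    shutil.copyfile(src_path, dst_path)  # the original is left unmodified
    needle = bytes([target])
    changed = 0
    with open(dst_path, "r+b") as out_f:
        # Length 0 maps the whole file; the default access is read/write.
        with mmap.mmap(out_f.fileno(), 0) as view:
            pos = view.find(needle)
            while pos != -1:
                # Same position-dependent rule as the example above.
                view[pos] = target + pos % 10
                changed += 1
                # Resume the search just past the byte that was patched.
                pos = view.find(needle, pos + 1)
            view.flush()
    return changed
```

The search itself runs inside `mmap.find`, so the Python-level loop only executes once per matching byte instead of once per byte in the file.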
Marcus Müller answered Oct 18 '22 08:10