Recently, I was handling a large text file (~10GB) and trying to replace some characters in Python.
I tried these two versions:
# Version 1: chained calls
f = open('myFile.txt', 'r')
filedata = f.read()
filedata = filedata.replace(',', ' ').replace('-', ' ').replace('_', ' ')
# Version 2: one call per statement
f = open('myFile.txt', 'r')
filedata = f.read()
filedata = filedata.replace(',', ' ')
filedata = filedata.replace('-', ' ')
filedata = filedata.replace('_', ' ')
When I tried the first one, the process was killed during the replace call. With the second one, the process was not killed:
>>> f = open('myFile.txt', 'r')
... filedata = f.read()
... filedata = filedata.replace(',', ' ').replace('-', ' ').replace('_', ' ')
Killed
>>> f = open('myFile.txt', 'r')
... filedata = f.read()
... filedata = filedata.replace(',', ' ')
... filedata = filedata.replace('-', ' ')
... filedata = filedata.replace('_', ' ')
... print("Success!")
Success!
I don't think there is a significant difference in time and space complexity. Does anyone know what is going on under the hood?
This isn't a problem with chained calls in general. It happens here because this line keeps a reference to the original string around:
filedata = f.read()
So:
filedata = filedata.replace(',', ' ').replace('-', ' ').replace('_', ' ')
The original str read from the file has to stay in memory, alongside the intermediate .replace results, until the assignment at the end rebinds filedata and the original's reference count finally reaches 0. A single replace that doesn't change the length of the string needs twice the memory of the original, because the original string and the new string exist at the same time. So during the second replace you have three strings in memory: the original, the once-replaced string, and the twice-replaced string being built.
On the other hand,
filedata = filedata.replace(',', ' ')
filedata = filedata.replace('-', ' ')
filedata = filedata.replace('_', ' ')
Here, each step needs at most twice the memory of the original string. Each assignment rebinds filedata, so the previous string's reference count drops to 0 and it is freed before the next .replace starts. Importantly, the original doesn't stay in memory.
If what I say is true, then the following should work:
filedata = f.read().replace(',', ' ').replace('-', ' ').replace('_', ' ')
But the Pythonic way to do this is to avoid .replace altogether in this case, because you are making several single-character replacements.
For that, you should use str.translate:
filedata = f.read()
table = {ord(','): ' ', ord('-'): ' ', ord('_'): ' '}
filedata = filedata.translate(table)
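For a file too large to hold in memory at all, the same table can be applied chunk by chunk. This is a minimal sketch, assuming every replacement is a single character (so a chunk boundary can never split a match). `translate_stream` and `chunk_size` are names made up for this example, and `str.maketrans` is just a shorter way to build the same table:

```python
import io

# Same mapping as the dict above: each listed character becomes a space.
TABLE = str.maketrans(",-_", "   ")

def translate_stream(src, dst, table=TABLE, chunk_size=1 << 20):
    """Copy text from src to dst, translating one chunk at a time.

    Hypothetical helper for illustration; src and dst are text-mode
    file-like objects.
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty string means end of input
            break
        dst.write(chunk.translate(table))

# Small in-memory demonstration; real code would pass two open text files.
src = io.StringIO("a,b-c_d")
dst = io.StringIO()
translate_stream(src, dst, chunk_size=3)
print(dst.getvalue())
```

With this approach only one chunk and its translated copy are alive at a time, so memory use no longer depends on the file size.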
To check the memory claims above, you can measure the three variants with tracemalloc:
import tracemalloc
tracemalloc.start()
result = "abcdefghij"*1_000_000
result = (
    result.replace('a', '*')
    .replace('b', '*')
    .replace('c', '*')
)
size, peak = tracemalloc.get_traced_memory()
print(f"{size=}, {peak=}")
del result
tracemalloc.reset_peak()
result = "abcdefghij"*1_000_000
result = result.replace('a', '*')
result = result.replace('b', '*')
result = result.replace('c', '*')
size, peak = tracemalloc.get_traced_memory()
print(f"{size=}, {peak=}")
del result
tracemalloc.reset_peak()
result = ("abcdefghij"*1_000_000).replace('a', '*').replace('b', '*').replace('c', '*')
size, peak = tracemalloc.get_traced_memory()
print(f"{size=}, {peak=}")
The above outputs what I would expect:
size=10000625, peak=30000723
size=10000681, peak=20000730
size=10000681, peak=20000730
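The same kind of measurement can be pointed at str.translate. This is only a sketch: the sample string and its length are arbitrary choices for the example, and the exact numbers will vary with the Python version. Since translate makes a single pass and builds one new string, I would expect the peak to be roughly two copies of the text, but run it yourself rather than taking that on trust:

```python
import tracemalloc

# One table covering all three single-character replacements.
table = str.maketrans(",-_", "   ")

tracemalloc.start()
text = "a,b-c_d e," * 100_000       # 1,000,000 ASCII characters
text = text.translate(table)        # one pass, one new string
size, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{len(text)=}, {size=}, {peak=}")
```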
Consider these two functions:
>>> import dis
>>> def chain(s):
...     return s.replace(',', ' ').replace('-', ' ').replace('_', ' ')
...
>>> def line(s):
...     s = s.replace(',', ' ')
...     s = s.replace('-', ' ')
...     s = s.replace('_', ' ')
...     return s
...
Here are the bytecodes:
>>> dis.dis(chain)
  2           0 LOAD_FAST                0 (s)
              2 LOAD_METHOD              0 (replace)
              4 LOAD_CONST               1 (',')
              6 LOAD_CONST               2 (' ')
              8 CALL_METHOD              2
             10 LOAD_METHOD              0 (replace)
             12 LOAD_CONST               3 ('-')
             14 LOAD_CONST               2 (' ')
             16 CALL_METHOD              2
             18 LOAD_METHOD              0 (replace)
             20 LOAD_CONST               4 ('_')
             22 LOAD_CONST               2 (' ')
             24 CALL_METHOD              2
             26 RETURN_VALUE
>>> dis.dis(line)
  2           0 LOAD_FAST                0 (s)
              2 LOAD_METHOD              0 (replace)
              4 LOAD_CONST               1 (',')
              6 LOAD_CONST               2 (' ')
              8 CALL_METHOD              2
             10 STORE_FAST               0 (s)
  3          12 LOAD_FAST                0 (s)
             14 LOAD_METHOD              0 (replace)
             16 LOAD_CONST               3 ('-')
             18 LOAD_CONST               2 (' ')
             20 CALL_METHOD              2
             22 STORE_FAST               0 (s)
  4          24 LOAD_FAST                0 (s)
             26 LOAD_METHOD              0 (replace)
             28 LOAD_CONST               4 ('_')
             30 LOAD_CONST               2 (' ')
             32 CALL_METHOD              2
             34 STORE_FAST               0 (s)
  5          36 LOAD_FAST                0 (s)
             38 RETURN_VALUE
The only difference between the two is the STORE_FAST / LOAD_FAST pair that line executes between consecutive calls, for example:
             10 STORE_FAST               0 (s)
  3          12 LOAD_FAST                0 (s)
So, as @juanpa.arrivillaga describes in his answer, the only remaining difference must be related to memory usage. If the program still holds a reference to an object, that object's memory cannot be freed, even if it is never used again. This is what happens in chain, where s (and any reference the caller holds) keeps the original string alive for the whole expression.
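Opcode names change between CPython releases (newer versions no longer print LOAD_METHOD and CALL_METHOD, for instance), so the listing above may not match what you see. A more portable way to check the same point is to look only for instructions that store into a local variable. This is a sketch, and `store_ops` is a helper name invented for this example. It matches on the `STORE_FAST` prefix on the assumption that any combined store instruction a given version emits keeps that prefix:

```python
import dis

def chain(s):
    return s.replace(',', ' ').replace('-', ' ').replace('_', ' ')

def line(s):
    s = s.replace(',', ' ')
    s = s.replace('-', ' ')
    s = s.replace('_', ' ')
    return s

def store_ops(func):
    """Names of the instructions in func that store into a local variable."""
    return [ins.opname for ins in dis.get_instructions(func)
            if ins.opname.startswith("STORE_FAST")]

# chain never rebinds s, so it should have no store instructions at all;
# line rebinds s after every call.
print("chain:", store_ops(chain))
print("line: ", store_ops(line))
```

Each store in line is the point where the previous string loses its reference from s and can be freed, which chain never gives the interpreter a chance to do.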