Replacing the empty strings in a string

Tags: performance, python, string

I accidentally found that in Python, an operation of the form

string1.join(string2)

can be equivalently expressed as

string2.replace('', string1)[len(string1):-len(string1)]

Furthermore, after trying timeit with a few differently sized inputs, this weird way of joining seems to be more than twice as fast.

Why should the join method be slower?
Is replacing the empty string like this a safe/well-defined thing to do?

467

asked Jan 21 '13 02:01

wim

2 Answers

So first of all, let's break down why this works.

>>> string1 = "foo"
>>> string2 = "bar"
>>> string1.join(string2)
'bfooafoor'

This is the operation of putting string1 between every item (character) of string2.

Replacing the empty string does something interesting: it treats the gap before, between, and after the characters as an occurrence of the empty string, so it does essentially the same job, except that it adds an extra separator at the start and at the end:

>>> string2.replace('', string1)
'foobfooafoorfoo'

Slicing off those two extra separators produces the same result as str.join():

>>> string2.replace('', string1)[len(string1):-len(string1)]
'bfooafoor'
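
As a quick sanity check, here is a minimal sketch that compares the two forms over a few inputs. The helper name join_via_replace is made up for illustration. One thing it brings out is that the slice only works when the separator is non-empty: with an empty separator, len(string1) is 0 and [0:-0] is the same as [0:0], which is always empty.

```python
def join_via_replace(sep, s):
    """Hypothetical helper: emulate sep.join(s) with the replace-and-slice trick."""
    return s.replace('', sep)[len(sep):-len(sep)]

# The trick agrees with str.join() for these non-empty separators.
for sep in ["foo", "-", ", "]:
    for s in ["bar", "a", "", "hello world"]:
        assert join_via_replace(sep, s) == sep.join(s)

# With an empty separator the slice collapses to [0:0].
empty_sep_result = join_via_replace("", "bar")
print(repr(empty_sep_result))   # what the trick gives
print(repr("".join("bar")))     # what str.join() gives
```

So even ignoring readability, the two expressions are not strictly interchangeable.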

Obviously, this solution is much, much less readable than str.join(), and so I'd always recommend against it. str.join() has also been developed to be efficient on all platforms. Replacing the empty string might be far less efficient on some versions of Python (I don't know if that's the case, but it's a possibility - just as repeated concatenation is reasonably fast in CPython, but that's not necessarily the case elsewhere.)

I can't even find anything in the documentation that says replacing the empty string should behave this way. The docs for str.replace() simply say:

Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

I see no reason why we should presume that the gaps in between letters should count as an occurrence of the empty string (arguably, you could fit infinite empty strings anywhere in the string), and as such, relying on this behaviour might be a bad idea.
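
For what it's worth, CPython does at least appear to be self-consistent about this. The sketch below shows behaviour I would expect from a current CPython 3 interpreter, not something to rely on elsewhere: str.count('') reports one occurrence per gap (including the one before the first character and the one after the last), and the count argument of replace() consumes those gaps from left to right.

```python
s = "bar"

# Number of "empty string" occurrences found in s.
gaps = s.count('')

# Replace only the first two occurrences, left to right.
first_two = s.replace('', '-', 2)

# Replace all of them.
all_gaps = s.replace('', '-')

print(gaps)
print(first_two)
print(all_gaps)
```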

This operation is also pretty rare - it's more common to have a sequence of strings to join together - joining individual characters of a string isn't something I have personally had to do often.

Interestingly, this x.replace("", y) appears to be special-cased in the Python source:

/* Algorithms for different cases of string replacement */

/* len(self)>=1, from="", len(to)>=1, maxcount>=1 */
Py_LOCAL(PyStringObject *)
replace_interleave(PyStringObject *self,
                   const char *to_s, Py_ssize_t to_len,
                   Py_ssize_t maxcount)
{
...

It may well be this special casing causes it to perform well. Again, as it's not mentioned in the documentation, this is an implementation detail, and assuming it will work as quickly (or at all) in other Python versions would be a mistake.
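
If you want to see how the two compare on your own interpreter, a rough timeit sketch along these lines should do. The input size and repeat count are arbitrary choices of mine, and the numbers will vary between Python versions and machines, so treat any result as a local observation only.

```python
import timeit

string1 = "foo"
string2 = "bar" * 1000  # arbitrary test input

def with_join():
    return string1.join(string2)

def with_replace():
    return string2.replace('', string1)[len(string1):-len(string1)]

# Both forms must agree before the timings mean anything.
assert with_join() == with_replace()

join_time = timeit.timeit(with_join, number=1000)
replace_time = timeit.timeit(with_replace, number=1000)
print("join:    %.4f s" % join_time)
print("replace: %.4f s" % replace_time)
```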

123

answered Sep 20 '22 23:09

Gareth Latty

As Lattyware mentioned, replacing the empty string is a special case handled by replace_interleave. It is a straightforward loop that alternately copies one character from the source string and one copy of the replacement string into the result. The loop is coded to be as fast as possible.

count = self_len+1;

count -= 1;
Py_MEMCPY(result_s, to_s, to_len);
result_s += to_len;
for (i=0; i<count; i++) {
    *result_s++ = *self_s++;
    Py_MEMCPY(result_s, to_s, to_len);
    result_s += to_len;
}

/* Copy the rest of the original string */
Py_MEMCPY(result_s, self_s, self_len-i);
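
To make the control flow easier to follow, here is a rough pure-Python model of that loop. The function name interleave is hypothetical, and the way maxcount clamps the count is not part of the excerpt above, so take that line as an assumption. Like the C comment, it assumes maxcount >= 1 when it is given.

```python
def interleave(source, to, maxcount=None):
    """Hypothetical pure-Python model of the interleaving loop."""
    count = len(source) + 1          # one gap before each char, plus one at the end
    if maxcount is not None and maxcount < count:
        count = maxcount             # assumed handling of the count limit
    parts = [to]                     # the first copy of `to` is always laid down
    count -= 1
    i = 0
    while i < count:
        parts.append(source[i])      # one character from the source...
        parts.append(to)             # ...followed by one copy of the replacement
        i += 1
    parts.append(source[i:])         # copy the rest of the original string
    return ''.join(parts)

print(interleave("bar", "foo"))
```

The point is only that each iteration does a one-character copy plus a fixed-size copy, with no per-item object handling.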

The join method also has a loop, but there is room for improvement in it (though I have not worked out every reason why it was coded the following way), and that is where the bottleneck comes from.

char *sep = PyString_AS_STRING(self);
seq = PySequence_Fast(orig, "");
/* Catenate everything. */
p = PyString_AS_STRING(res);
for (i = 0; i < seqlen; ++i) {
    size_t n;
    item = PySequence_Fast_GET_ITEM(seq, i);
    n = PyString_GET_SIZE(item);
    Py_MEMCPY(p, PyString_AS_STRING(item), n);
    p += n;
    if (i < seqlen - 1) {
        Py_MEMCPY(p, sep, seplen);
        p += seplen;
    }
}

As you may see here, inside the loop

each item of the sequence is indexed
the size of the item is determined
the indexed item is converted to a C string (char *)

The above three operations, even though they may be inlined, have considerable overhead. Note that this also explains why using a list gives a different result from using a string, as observed by Blended.
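
A rough pure-Python model of the join loop makes those per-item steps visible. The name join_model is hypothetical, list() merely stands in for PySequence_Fast, and io.StringIO stands in for the preallocated result buffer, so this is a sketch of the shape of the loop rather than of the real implementation.

```python
import io

def join_model(sep, iterable):
    """Hypothetical pure-Python model of the copy loop in the join excerpt."""
    seq = list(iterable)             # stand-in for PySequence_Fast
    seqlen = len(seq)
    buf = io.StringIO()              # stand-in for the result buffer
    for i in range(seqlen):
        item = seq[i]                # each item is indexed
        if not isinstance(item, str):
            raise TypeError("sequence item %d: expected str" % i)
        buf.write(item)              # the item's characters are copied
        if i < seqlen - 1:
            buf.write(sep)           # separator goes between items only
    return buf.getvalue()

print(join_model("foo", "bar"))
```

When the input is a string, every single character becomes its own item, so the per-item work is paid once per character.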

Also, comparing the two loops,

the former

can easily be auto-vectorized
is cache-friendly

Final Note

str.join was written with all kinds of iterables and sequences in mind, not just strings. Without going into much detail, it is quite expected that a generalized routine will not perform as fast as a routine specialized for one particular form of data.
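
A small illustration of that generality, using made-up inputs: join accepts any iterable of strings, whereas the replace trick only has a single string to work on, so it can only reproduce the character-by-character case.

```python
sep = "-"

from_string = sep.join("abc")                         # characters of a string
from_list = sep.join(["ab", "cd", "ef"])              # list of longer strings
from_generator = sep.join(str(n) for n in range(3))   # any iterable of strings

# The replace trick can only mimic the first of these.
via_replace = "abc".replace('', sep)[len(sep):-len(sep)]

print(from_string)
print(from_list)
print(from_generator)
print(via_replace)
```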

answered Sep 17 '22 23:09

Abhijit

