Normalize whitespace with Python

Question

I'm building a data extract using scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:

  Sapphire RX460 OC  2/4GB

Notice the two groups of two spaces: one preceding the string and one between OC and 2.

Python provides strip(), as described in How do I trim whitespace with Python?, but that won't handle the two spaces between OC and 2, which I need collapsed into a single space.

I've tried using normalize-space() from XPath while extracting data with my scrapy Selector, and that works, but the assignment is verbose with strong rightward drift:

product_title = product.css('h3').xpath('normalize-space((text()))').extract_first()

Is there an elegant way to normalize whitespace using Python? If not a one-liner, is there a way I can break the above line into something easier to read without throwing an indentation error, e.g.

product_title = product.css('h3')
    .xpath('normalize-space((text()))')
    .extract_first()

Tom Karzes · Accepted Answer

You can use:

" ".join(s.split())

where s is your string.
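As a minimal sketch of why this works: str.split() with no arguments splits on any run of whitespace and drops leading and trailing whitespace, so joining the pieces with a single space both trims and collapses. The helper name normalize_whitespace is only illustrative. The commented lines at the end show the question's own scrapy chain wrapped in parentheses, which lets Python continue the expression across lines; product is assumed to be the Selector from the question.

```python
def normalize_whitespace(s):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    # split() with no arguments splits on any whitespace run and
    # discards leading/trailing whitespace, so no empty pieces appear.
    return " ".join(s.split())


raw = "  Sapphire RX460 OC  2/4GB"
print(normalize_whitespace(raw))            # Sapphire RX460 OC 2/4GB
print(normalize_whitespace("a\t b\n\nc "))  # a b c

# Breaking the chained call from the question across lines: wrapping the
# whole expression in parentheses avoids the indentation error.
# (`product` is assumed to be the scrapy Selector from the question.)
#
# product_title = (
#     product.css('h3')
#     .xpath('normalize-space((text()))')
#     .extract_first()
# )
```

Tabs and newlines are collapsed as well, since split() treats them as whitespace too.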

hd1 · Answer

Instead of using regexes for this, a more efficient solution is to use the join/split approach. Observe:

>>> import re, timeit
>>> timeit.Timer((lambda:' '.join(' Sapphire RX460 OC  2/4GB'.split()))).timeit()
0.7263979911804199

>>> def f():
        return re.sub(" +", ' ', "  Sapphire RX460 OC  2/4GB").split()

>>> timeit.Timer(f).timeit()
4.163465976715088
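Note that f() above ends in .split(), so it returns a list rather than a normalized string, and the two calls use slightly different inputs. A like-for-like comparison might look like the sketch below. The function names and the precompiled \s+ pattern are my own choices, not part of the measurements above, and no timings are quoted because they vary by machine and Python version.

```python
import re
import timeit

RAW = "  Sapphire RX460 OC  2/4GB"
WS_RUN = re.compile(r"\s+")  # matches any run of whitespace


def with_split_join(s=RAW):
    return " ".join(s.split())


def with_regex(s=RAW):
    # The substitution leaves one space at each end; strip() removes it.
    return WS_RUN.sub(" ", s).strip()


# Both approaches should agree on this input before timings are compared.
assert with_split_join() == with_regex() == "Sapphire RX460 OC 2/4GB"

if __name__ == "__main__":
    for fn in (with_split_join, with_regex):
        # A modest iteration count keeps the run short.
        print(fn.__name__, timeit.timeit(fn, number=100_000))
```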

Tags:

python

string

xpath

scrapy

vhs
