I'm building a data extract using scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:
  Sapphire RX460 OC  2/4GB
Notice the two groups of two spaces: one preceding the string literal and one between OC and 2.
Python provides str.strip(), as described in "How do I trim whitespace with Python?", but that won't handle the two spaces between OC and 2, which I need collapsed into a single space.
I've tried using normalize-space() from XPath while extracting data with my scrapy Selector, and that works, but the assignment is verbose, with strong rightward drift:
product_title = product.css('h3').xpath('normalize-space((text()))').extract_first()
Is there an elegant way to normalize whitespace using Python? If not a one-liner, is there a way I can break the above line into something easier to read without throwing an indentation error, e.g.
product_title = product.css('h3')
.xpath('normalize-space((text()))')
.extract_first()
You can use:
" ".join(s.split())
where s is your string.
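As a minimal sketch of how this might be used, the snippet below wraps the split/join idiom in a helper and applies it to the sample string from the question. The names normalize_whitespace, raw_title, and words are hypothetical, chosen only for illustration. It also shows that wrapping an expression in parentheses lets a method chain span several lines; the scrapy version appears only as a comment because it is untested here.

```python
def normalize_whitespace(s):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading/trailing whitespace, so join() rebuilds a clean string.
    return " ".join(s.split())


raw_title = "  Sapphire RX460 OC  2/4GB"  # sample value from the question
product_title = normalize_whitespace(raw_title)
print(product_title)  # Sapphire RX460 OC 2/4GB

# Parentheses allow a chain to be broken across lines without a
# backslash or an indentation error (shown on a plain string here).
words = (
    raw_title
    .strip()
    .split()
)
print(words)  # ['Sapphire', 'RX460', 'OC', '2/4GB']

# The same wrapping should apply to the scrapy chain from the question
# (untested sketch, assumes `product` is a scrapy Selector):
# product_title = (
#     product.css('h3')
#     .xpath('normalize-space((text()))')
#     .extract_first()
# )
```

If extract_first() can return None (no h3 text found), the helper would need a guard before calling split().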
Instead of using a regex for this, a more efficient solution is the split/join approach. Observe:
>>> import re, timeit
>>> timeit.Timer((lambda:' '.join('  Sapphire RX460 OC  2/4GB'.split()))).timeit()
0.7263979911804199
>>> def f():
...     return re.sub(" +", ' ', "  Sapphire RX460 OC  2/4GB").split()
...
>>> timeit.Timer(f).timeit()
4.163465976715088
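Timing aside, the two approaches are not quite interchangeable, so it may help to check their output as well. The sketch below compares them on the sample string; with_split_join, with_regex, and tabbed are hypothetical names introduced here. It swaps the pattern " +" for r"\s+" plus a final strip() so that the regex version returns a string with the same content as the split/join version.

```python
import re

raw_title = "  Sapphire RX460 OC  2/4GB"  # sample value from the question


def with_split_join(s):
    # Splits on any whitespace run and drops leading/trailing whitespace.
    return " ".join(s.split())


def with_regex(s):
    # \s+ matches any whitespace run; strip() removes the single
    # leading/trailing space that the substitution can leave behind.
    return re.sub(r"\s+", " ", s).strip()


# Both produce the same normalized title for the sample input.
assert with_split_join(raw_title) == with_regex(raw_title)
print(with_split_join(raw_title))  # Sapphire RX460 OC 2/4GB

# The pattern " +" only collapses runs of spaces, so a tab followed by
# a single space is left untouched, while split/join handles it.
tabbed = "OC\t 2/4GB"
print(re.sub(" +", " ", tabbed) == tabbed)  # True
print(with_split_join(tabbed))  # OC 2/4GB
```

Exact timings will vary by machine and Python version, so it is worth re-running timeit locally before relying on the ratio.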