I have a list of URLs as follows:
urls = [
    'http://www.example.com/search?q=Term&page=0',
    'http://www.example.com/search?q=Term&page=1',
    'http://www.example.com/search?q=Term&page=2'
]
where Term can be any search term: Europe, London, etc.
The relevant part of my code is the following:
for url in urls:
    file_name = url.replace('http://www.example.com/search?q=', '').replace('=', '').replace('&', '')
    file_name = file_name + '.html'
which results in:
Termpage0.html
Termpage1.html
and so on.
How can I strip the Term from each URL so that the result is:
page0.html
page1.html
and so on?
You could use urllib.parse to parse the URL and then its query string. The benefit of this approach is that it works the same if the order of the query parameters changes or new parameters are added:
from urllib import parse
urls = [
    'www.example.com/search?q=Term&page=0',
    'www.example.com/search?q=Term&page=1',
    'www.example.com/search?q=Term&page=2'
]
for url in urls:
    # Split the URL into components, then parse the query string into a dict
    parts = parse.urlparse(url)
    query = parse.parse_qs(parts.query)
    print('page{}.html'.format(query['page'][0]))
Output:
page0.html
page1.html
page2.html
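As a quick check of the claim about parameter order, here is a minimal sketch that wraps the same two calls in a helper. The name page_file_name and the reordered URLs are made up for illustration; they are not part of the original code.

```python
from urllib import parse


def page_file_name(url):
    """Build 'page<N>.html' from the 'page' query parameter (hypothetical helper)."""
    query = parse.parse_qs(parse.urlparse(url).query)
    return 'page{}.html'.format(query['page'][0])


# Illustrative URLs: same page, different parameter order, plus an extra parameter.
print(page_file_name('www.example.com/search?q=Term&page=0'))
print(page_file_name('www.example.com/search?page=0&q=Term'))
print(page_file_name('www.example.com/search?q=Term&lang=en&page=0'))
```

All three calls should give the same file name, since the lookup is by key rather than by position in the string.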
Here urlparse returns a ParseResult object that contains the URL components:
>>> from urllib import parse
>>> parts = parse.urlparse('www.example.com/search?q=Term&page=0')
>>> parts
ParseResult(scheme='', netloc='', path='www.example.com/search', params='', query='q=Term&page=0', fragment='')
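Note that netloc is empty in this result and the host name ends up in path, because the URL has no scheme. The query is still split off correctly, so the approach works either way. A small sketch of the difference, assuming the real URLs start with http:// as the replace call in the question suggests:

```python
from urllib import parse

# With a scheme, the host goes into netloc and the path is just '/search'.
with_scheme = parse.urlparse('http://www.example.com/search?q=Term&page=0')
print(with_scheme.netloc)  # www.example.com
print(with_scheme.path)    # /search
print(with_scheme.query)   # q=Term&page=0
```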
Then parse_qs returns a dict of the query parameters, where each value is a list:
>>> query = parse.parse_qs(parts.query)
>>> query
{'q': ['Term'], 'page': ['0']}
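The values are lists because a key may appear more than once in a query string, which is why the code indexes with [0]. The sketch below shows that, along with one possible way to handle a URL that has no page parameter; the fallback value '0' is an assumption for illustration, not something the question specifies.

```python
from urllib import parse

# A repeated key yields several values, which is why parse_qs returns lists.
repeated = parse.parse_qs('q=Term&page=0&page=1')
print(repeated['page'])  # ['0', '1']

# Hypothetical fallback: use '0' when the query has no 'page' parameter at all,
# instead of letting query['page'] raise a KeyError.
no_page = parse.parse_qs('q=Term')
page = no_page.get('page', ['0'])[0]
print('page{}.html'.format(page))  # page0.html
```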