
Strip random characters from url

I have a list of urls as follows:

urls = [
    'www.example.com?search?q=Term&page=0',
    'www.example.com?search?q=Term&page=1',
    'www.example.com?search?q=Term&page=2'
]

where Term can be any search term: Europe, London, etc.

The relevant part of my code is the following:

for url in urls:
  file_name = url.replace('http://www.example.com/search?q=','').replace('=','').replace('&','')
  file_name = file_name+('.html')

which results in:

Termpage0.html
Termpage1.html
and so on..

How can I strip Term from each URL so that the result is:

page0.html
page1.html
and so on?
Yannis Dran asked Dec 20 '25 00:12


1 Answer

You could use urllib.parse to parse the URL and then its query string. The benefit of this approach is that it works the same way if the order of the query parameters changes or new parameters are added:

from urllib import parse

urls = [
    'www.example.com?search?q=Term&page=0',
    'www.example.com?search?q=Term&page=1',
    'www.example.com?search?q=Term&page=2'
]

for url in urls:
    parts = parse.urlparse(url)
    query = parse.parse_qs(parts.query)
    print('page{}.html'.format(query['page'][0]))

Output:

page0.html
page1.html
page2.html

In the code above, urlparse returns a ParseResult object that contains the URL components:

>>> from urllib import parse
>>> parts = parse.urlparse('www.example.com/search?q=Term&page=0')
>>> parts
ParseResult(scheme='', netloc='', path='www.example.com/search', params='', query='q=Term&page=0', fragment='')
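
Note that netloc is empty in this result: the URL has no scheme, so the host name ends up in path. The query is still extracted correctly, which is all that matters here. A minimal sketch of the difference, assuming an http:// prefix that is not part of the original list:

```python
from urllib import parse

# Without a scheme, urlparse treats the host name as part of the path.
bare = parse.urlparse('www.example.com/search?q=Term&page=0')

# With a scheme (assumed here for illustration), the host goes into netloc.
full = parse.urlparse('http://www.example.com/search?q=Term&page=0')

print(repr(bare.netloc), repr(bare.path))
print(repr(full.netloc), repr(full.path))

# The query component is identical in both cases.
print(bare.query == full.query)
```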

Then parse_qs returns a dict of the query parameters, where each value is a list:

>>> query = parse.parse_qs(parts.query)
>>> query
{'page': ['0'], 'q': ['Term']}
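
The two steps could be wrapped in a small helper if the file names are needed in several places. This is only a sketch: page_file_name is a hypothetical name, the URLs with reordered or extra parameters are made up for illustration, and raising ValueError when page is missing is one possible choice.

```python
from urllib import parse


def page_file_name(url):
    """Return 'page<N>.html' for a URL that has a page parameter (hypothetical helper)."""
    query = parse.parse_qs(parse.urlparse(url).query)
    pages = query.get('page')
    if not pages:
        # parse_qs omits keys that are absent from the query string.
        raise ValueError('no page parameter in {!r}'.format(url))
    # Values are lists; take the first occurrence of page.
    return 'page{}.html'.format(pages[0])


# Illustrative URLs: the parameter order and extra parameters do not matter.
urls = [
    'www.example.com/search?q=Term&page=0',
    'www.example.com/search?page=1&q=Europe',
    'www.example.com/search?q=London&page=2&lang=en',
]

file_names = [page_file_name(url) for url in urls]
print(file_names)
```

Because the lookup is by parameter name, the search term never appears in the file name, whatever it is.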
niemmi answered Dec 21 '25 12:12


