I have a huge list of URLs that all look like this:
http://www.example.com/site/section1/VAR1/VAR2
Here VAR1 and VAR2 are the dynamic parts of the URL. I want to extract only VAR1 from this URL string. I've tried urlparse, but the output looks like this:
ParseResult(scheme='http', netloc='www.example.com', path='/site/section1/VAR1/VAR2', params='', query='', fragment='')
You can split the line on spaces and then use the os module to get the filename from the path.
Source code: Lib/urllib/parse.py. This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”
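To make the three operations in that description concrete, here is a minimal sketch assuming Python 3's urllib.parse. The relative reference '../other' is a made-up value used only for illustration.

```python
from urllib.parse import urlparse, urlunparse, urljoin

url = 'http://www.example.com/site/section1/VAR1/VAR2'

# Break the URL up into its components.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Combine the components back into a URL string.
rebuilt = urlunparse(parts)
print(rebuilt)

# Resolve a relative URL against a base URL ('../other' is illustrative).
absolute = urljoin(url, '../other')
print(absolute)
```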
As a general rule, the different parts of a URL can be obtained with urlparse. Here you can get the path with urlparse(url).path and then pull out the desired variable with the split() function:
>>> from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
>>> url = 'http://www.example.com/site/section1/VAR1/VAR2'
>>> urlparse(url)
ParseResult(scheme='http', netloc='www.example.com', path='/site/section1/VAR1/VAR2', params='', query='', fragment='')
>>> urlparse(url).path
'/site/section1/VAR1/VAR2'
>>> urlparse(url).path.split('/')[-2]
'VAR1'
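Since the question mentions a huge list of URLs, the same idea can be wrapped in a small helper and applied to each one. This is only a sketch, assuming Python 3 and assuming VAR1 is always the second-to-last path segment. The name extract_var1 and the second sample URL are hypothetical.

```python
from urllib.parse import urlparse

def extract_var1(url):
    """Return the second-to-last path segment, or None if the path is too short."""
    # Drop empty strings produced by leading or trailing slashes.
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[-2] if len(segments) >= 2 else None

urls = [
    'http://www.example.com/site/section1/VAR1/VAR2',
    'http://www.example.com/site/section1/foo/bar/',  # hypothetical, with trailing slash
]
var1_values = [extract_var1(u) for u in urls]
print(var1_values)
```

Filtering out the empty segments means a trailing slash does not shift the index, which the bare split('/')[-2] would not handle.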