Python: Get URL path sections

Tags: python, url

How do I get specific path sections from a URL? For example, I want a function that operates on this:

http://www.mydomain.com/hithere?image=2934

and returns "hithere"

or operates on this:

http://www.mydomain.com/hithere/something/else

and returns the same thing ("hithere")

I know this will probably use urllib or urllib2, but I can't figure out from the docs how to get only one section of the path.

asked Oct 25 '11 18:10

zakdances

4 Answers

Extract the path component of the URL with urlparse (this is the Python 2 module; in Python 3 it is urllib.parse):

>>> import urlparse
>>> path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
>>> path
'/hithere/something/else'

Split the path into components with os.path.split:

>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')

The dirname and basename functions give you the two pieces of the split; perhaps use dirname in a while loop:

>>> while os.path.dirname(path) != '/':
...     path = os.path.dirname(path)
... 
>>> path
'/hithere'
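
For Python 3, the same dirname loop can be wrapped in a function. This is only a sketch and not the answerer's code: the name first_section is hypothetical, posixpath is used instead of os.path so the result does not depend on the platform, and the loop condition is widened (an assumption about the desired behaviour) so that a URL with no path returns an empty string instead of looping forever.

```python
import posixpath
from urllib.parse import urlparse


def first_section(url):
    """Return the first path section of url, or '' if there is none."""
    # .path excludes the query string and fragment.
    path = urlparse(url).path
    # Climb towards the root; stop at '/' (absolute path) or '' (no path).
    while posixpath.dirname(path) not in ('/', ''):
        path = posixpath.dirname(path)
    return posixpath.basename(path)


print(first_section('http://www.mydomain.com/hithere?image=2934'))      # hithere
print(first_section('http://www.mydomain.com/hithere/something/else'))  # hithere
```

Both URLs from the question give the same result because urlparse keeps ?image=2934 in the query attribute, not in the path.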

answered Nov 12 '22 02:11

Josh Lee

Python 3.4+ solution:

from urllib.parse import unquote, urlparse
from pathlib import PurePosixPath

url = 'http://www.example.com/hithere/something/else'

PurePosixPath(
    unquote(
        urlparse(
            url
        ).path
    )
).parts[1]

# returns 'hithere' (the same for the URL with parameters)

# parts holds ('/', 'hithere', 'something', 'else')
#               0    1          2            3
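
If the URL might have no path at all, parts[1] raises IndexError. One possible way to guard against that is sketched below; the function name path_section and its index and default parameters are made up for illustration and are not part of pathlib or urllib.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def path_section(url, index=0, default=None):
    """Return the index-th path section of url (0-based), or default."""
    path = PurePosixPath(unquote(urlparse(url).path))
    # For an absolute path, parts[0] is the root, which is not a section.
    sections = path.parts[1:] if path.is_absolute() else path.parts
    try:
        return sections[index]
    except IndexError:
        return default


url = 'http://www.example.com/hithere/something/else'
print(path_section(url))      # hithere
print(path_section(url, 2))   # else
print(path_section('http://www.example.com'))  # None
```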

answered Nov 12 '22 04:11

Navin

The best option is to use the posixpath module when working with the path component of URLs. It has the same interface as os.path, and it always treats its input as a POSIX path, whether the code runs on a POSIX or a Windows NT based platform.

Sample Code:

#!/usr/bin/env python3

import urllib.parse
import sys
import posixpath
import ntpath

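def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    # Peel off one component per pass, from the right. Stop when split()
    # makes no more progress (at the root "/" or at an empty string), so an
    # empty or relative path cannot loop forever.
    while True:
        ( head, item ) = module.split( tmp )
        if head == tmp:
            break
        result.insert( 0, item )
        tmp = head
    return result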

def dump_array( array ):
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n  --[n={},m={}]-->\n    {}\n".format( 
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )

test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )

Code output:

http://eg.com/hithere/something/else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=False,m=posixpath]-->
    [ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=posixpath]-->
    [ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=ntpath]-->
    [ "see", "if", "this", "works" ]

Notes:

On Windows NT based platforms, os.path is ntpath.
On Unix/POSIX based platforms, os.path is posixpath.
ntpath does not handle backslashes (\) correctly for URL paths: it treats them as path separators (see the last two cases in the code and output), which is why posixpath is recommended.
Remember to decode the path with urllib.parse.unquote.
Consider normalizing the path with posixpath.normpath.
The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath collapses multiple adjacent path separators (i.e. it treats ///, // and / the same); the one exception is that posixpath.normpath leaves exactly two leading slashes in place.
Even though POSIX and URL paths have similar syntax and semantics, they are not identical.
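
The notes on backslashes and normalization can be checked with a few lines. This is an illustrative sketch that reuses the eg.com test URL from the sample code above; the normpath input is made up for the example.

```python
import ntpath
import posixpath
from urllib.parse import unquote, urlparse

# %5C decodes to a backslash, so unquote() must run before splitting.
path = unquote(urlparse('http://eg.com/see%5C/if%5C/this%5C/works').path)
print(path)                   # /see\/if\/this\/works

# posixpath keeps the backslash as part of the section name ...
print(posixpath.split(path))  # ('/see\\/if\\/this\\', 'works')
# ... while ntpath treats it as one more separator and drops it.
print(ntpath.split(path))     # ('/see\\/if\\/this', 'works')

# normpath resolves '.', '..' and repeated or trailing slashes.
print(posixpath.normpath('/hithere//something/./else/../'))  # /hithere/something
```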

Normative References:

IEEE Std 1003.1, 2013 - Vol. 1: Base Definitions - Section 4.12: Pathname Resolution
The GNU C Library Reference Manual - Section 11.2: File Names
IETF RFC 3986: Uniform Resource Identifier (URI): Generic Syntax - Section 3.3: Path
IETF RFC 3986: Uniform Resource Identifier (URI): Generic Syntax - Section 6: Normalization and Comparison
Wikipedia: URL normalization

answered Nov 12 '22 04:11

Iwan Aucamp

Note that in Python 3 the import has changed to from urllib.parse import urlparse (see the documentation). Here is an example:

>>> from urllib.parse import urlparse
>>> url = 's3://bucket.test/my/file/directory'
>>> p = urlparse(url)
>>> p
ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
>>> p.scheme
's3'
>>> p.netloc
'bucket.test'
>>> p.path
'/my/file/directory'
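
Applied to the first URL from the question, the same call shows why the query string does not get in the way: it lands in its own attribute. A small sketch follows; taking the first section with strip and split is just one simple option, not something this answer prescribes.

```python
from urllib.parse import parse_qs, urlparse

p = urlparse('http://www.mydomain.com/hithere?image=2934')
print(p.path)             # /hithere
print(p.query)            # image=2934
print(parse_qs(p.query))  # {'image': ['2934']}

# First path section: drop the surrounding slashes, then split on '/'.
first = p.path.strip('/').split('/')[0]
print(first)              # hithere
```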

answered Nov 12 '22 02:11

Aziz Alto
