Find http:// and/or www. and strip from domain, leaving domain.com

Tags:

python

url

urlparse

I'm quite new to Python. I'm trying to parse a file of URLs to leave only the domain name.

Some of the URLs in my log file begin with http://, some begin with www., and some begin with both.

This is the part of my code which strips the http:// part. What do I need to add to it so that it looks for both http:// and www. and removes both?

line = re.findall(r'(https?://\S+)', line)

Currently, when I run the code, only http:// is stripped. If I change the code to the following:

line = re.findall(r'(https?://www.\S+)', line)

Only domains starting with both are affected. I need the code to be more conditional. TIA

edit... here is my full code...

import re
import sys
from urlparse import urlparse

f = open(sys.argv[1], "r")

for line in f.readlines():
 line = re.findall(r'(https?://\S+)', line)
 if line:
  parsed=urlparse(line[0])
  print parsed.hostname
f.close()

I mistagged my original post as regex. It is in fact using urlparse.


asked Jan 31 '13 12:01

Paul Tricklebank

2 Answers

It might be overkill for this specific situation, but I'd generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

url = 'www.python.org'

# URLs must have a scheme
# www.python.org is an invalid URL
# http://www.python.org is valid

if not re.match(r'http(s?)\:', url):
    url = 'http://' + url

# url is now 'http://www.python.org'

parsed = urlsplit(url)

# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' (an empty string), since the URL has no path component

host = parsed.netloc  # www.python.org

# Removing www.
# This is a bad idea, because www.python.org could 
# resolve to something different than python.org

if host.startswith('www.'):
    host = host[4:]
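
Applied to the file from the question, the same idea could be wrapped in a small helper. This is only a sketch for Python 3: the name extract_host, the strip_www flag and the sample list are made up for illustration, and it uses the hostname attribute rather than netloc so that a port or credentials are dropped as well.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "https://".
SCHEME = re.compile(r'[a-zA-Z][a-zA-Z0-9+.-]*://')

def extract_host(raw, strip_www=True):
    """Return the hostname of a URL-like string, or None if there is none."""
    raw = raw.strip()
    if not raw:
        return None
    # urlsplit only fills in the network location when a scheme is present,
    # so add one for bare entries such as "www.bar.com/?q=res".
    if not SCHEME.match(raw):
        raw = 'http://' + raw
    host = urlsplit(raw).hostname  # lowercased, without port or credentials
    if host and strip_www and host.startswith('www.'):
        host = host[4:]
    return host

# Hypothetical sample lines, similar to what the log file might contain.
samples = [
    'http://foo.com/index.html',
    'http://www.foobar.com',
    'www.bar.com/?q=res',
    'https://www.foobar.com:8080/a',
]
hosts = [extract_host(s) for s in samples]
print('\n'.join(hosts))
```

Passing strip_www=False keeps the www. prefix, which is the safer choice when www.example.org and example.org may point to different servers.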


answered Sep 28 '22 13:09

Markus Unterwaditzer

You can do without regexes here.

with open("file_path","r") as f:
    lines = f.read()
    lines = lines.replace("http://","")
    lines = lines.replace("www.", "") # May replace some false positives ('www.com')
    urls = [url.split('/')[0] for url in lines.split()]
    print('\n'.join(urls))  # the parenthesized form runs on Python 2 and 3

Example file input:

http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com

Output:

foo.com
foobar.com
bar.com
foobar.com

Edit:

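There could be a tricky URL like foobarwww.com, and the above approach would strip the www. from the middle of it. In that case we have to fall back to regexes.

Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'(www\.)(?!com\b)', r'', lines), and add import re at the top. The dot is escaped so that it matches only a literal dot, and the \b keeps the lookahead from also skipping hosts such as www.company.org. Of course, every possible TLD should be listed in the not-match pattern.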
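
An alternative to listing TLDs might be to anchor the pattern at the start of each entry, so that www. is only removed when it is a real prefix. The following is a rough sketch under that assumption, written for Python 3. The names PREFIX and strip_prefix and the sample text are invented for illustration, and the lookahead is just one heuristic: it leaves www. alone when no further dot follows in the host, as in www.com.

```python
import re

# Optional scheme, then optional "www.", both only at the very start.
# The lookahead requires another dot later in the host, so "www.com" is kept.
PREFIX = re.compile(r'^(?:https?://)?(?:www\.(?=[^/?#]*\.))?', re.IGNORECASE)

def strip_prefix(token):
    """Drop a leading scheme and www. prefix, then cut off path, query and fragment."""
    rest = PREFIX.sub('', token, count=1)
    return re.split(r'[/?#]', rest, maxsplit=1)[0]

# Hypothetical input, one URL per whitespace-separated token.
text = """http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
foobarwww.com
https://www.company.org/about"""

domains = [strip_prefix(t) for t in text.split()]
print('\n'.join(domains))
```

Because the match is anchored with ^, a www. that appears in the middle of a host name, as in foobarwww.com, is never touched.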

answered Sep 28 '22 14:09

sidi
