
Python - parse IPv4 addresses from string (even when censored)

Objective: Write Python 2.7 code to extract IPv4 addresses from string.

String content example:


The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).


As the example above shows, I need to parse a text file that may contain IPs written in several "censored" forms (used to prevent hyperlinking).

I'm thinking that a regular expression is the way to go: match any group of four integers, 0-255 or 000-255, separated by anything in a 'separators list' consisting of periods, brackets, parentheses, or any of the other forms shown above. That way, the 'separators list' could be updated as needed.
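As a rough sketch of the 'separators list' idea (not taken from any answer), the pattern could be assembled from an editable list of literal separators. The names `SEPARATORS`, `build_separator_regex` and `find_ips` are hypothetical, and the list contents are only an assumption based on the example string. It uses nothing beyond the standard `re` module, so it should run on Python 2.7 as well as Python 3.

```python
import re

# One octet: 0-255, with or without leading zeroes (000-255).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Hypothetical, editable list of literal separators (assumed from the example text).
SEPARATORS = [".", "[.]", "(.)", "[dot]", "(dot)", " .", ". "]


def build_separator_regex(separators):
    # Longest first, so that ". " is tried before the bare ".".
    ordered = sorted(separators, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"


def find_ips(text, separators=SEPARATORS):
    sep = build_separator_regex(separators)
    # An octet followed by three "separator + octet" groups.
    pattern = re.compile(OCTET + "(?:" + sep + OCTET + "){3}")
    # Replace whichever separator was used with a plain dot.
    return [re.sub(sep, ".", raw) for raw in pattern.findall(text)]


sample = "a 192.168.1[.]1 b 192[dot]168(dot)1. 1 c 8.8.8.8"
found = find_ips(sample)  # ['192.168.1.1', '192.168.1.1', '8.8.8.8']
```

Supporting a new form of censorship then only means appending another string to the list. This sketch does nothing to reject out-of-range numbers that are embedded in longer digit runs.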

I'm not sure whether this is the right approach, or even possible, so any help is greatly appreciated.


Update: Thanks to recursive's answer below, I now have the following code working for the above example. It will...

  • find the IPs
  • place them into a list
  • clean them of the spaces/braces/etc
  • and replace the uncleaned list entry with the cleaned one.

Caveat: The code below does not account for invalid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it drops the trailing 6 and 3 from those two examples. If the first octet is invalid (e.g. 256.10.10.10), it drops the leading 2 (resulting in 56.10.10.10).

import re

def extractIPs(fileContent):
    # One octet (0-255, leading zeroes allowed) followed by three "separator + octet" groups.
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    # findall returns one tuple per match; element 0 is the whole address.
    ips = [each[0] for each in re.findall(pattern, fileContent)]
    # Strip spaces, parentheses and brackets, then turn "dot" into ".".
    return [re.sub(r"[ ()\[\]]", "", item).replace("dot", ".") for item in ips]

# The with-block closes the file once it has been read.
with open('***INSERT FILE PATH HERE***') as myFile:
    fileContent = myFile.read()

IPs = extractIPs(fileContent)
print "Original file content:\n{0}".format(fileContent)
print "--------------------------------"
print "Parsed results:\n{0}".format(IPs)
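One possible way to address part of the caveat above is to add lookarounds so that a match may not touch another digit on either side. The following is only a sketch under that assumption; `STRICT_IP` and `extract_ips_strict` are hypothetical names. With these guards, 192.168.0.256 and 256.10.10.10 are skipped instead of being truncated. A five-group string such as 192.168.1.2.3 still yields 192.168.1.2, so that case would need a separate check.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"

# (?<![0-9]) and (?![0-9]) require that no digit touches either end of the match.
STRICT_IP = re.compile(
    r"(?<![0-9])" + OCTET + "(?:" + SEP + OCTET + "){3}" + r"(?![0-9])"
)


def extract_ips_strict(text):
    cleaned = []
    for match in STRICT_IP.finditer(text):
        # Same cleanup as before: drop spaces/brackets, then "dot" -> ".".
        ip = re.sub(r"[ ()\[\]]", "", match.group(0))
        cleaned.append(ip.replace("dot", "."))
    return cleaned


valid = extract_ips_strict("see 192.168.1[.]1 and 8.8.8.8")  # ['192.168.1.1', '8.8.8.8']
rejected = extract_ips_strict("192.168.0.256 or 256.10.10.10")  # []
```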
asked Jun 26 '13 at 18:06 by nephos



2 Answers

Here is a regex that works:

import re
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
text = "The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. "
ips = [match[0] for match in re.findall(pattern, text)]
print ips

# output: ['192.168.1.1', '8.8.8.8', '101.099.098.000', '192.168.1[.]1', '192.168.1(.)1', '192.168.1[dot]1', '192.168.1(dot)1', '192 .168 .1 .1', '192. 168. 1. 1']

The regex has a few main parts, which I will explain here:

  • ([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])
    This matches the numerical parts of the ip address. | means "or". The first case handles numbers from 0 to 199 with or without leading zeroes. The second two cases handle numbers over 199.
  • [ (\[]?(\.|dot)[ )\]]?
    This matches the "dot" parts. There are three sub-components:
    • [ (\[]? The "prefix" for the dot. Either a space, an opening parenthesis, or an opening square bracket. The trailing ? means that this part is optional.
    • (\.|dot) Either "dot" or a period.
    • [ )\]]? The "suffix". Same logic as the prefix.
  • {3} means repeat the previous component 3 times.
  • The final element is another number, which is the same as the first, except it is not followed by a dot.
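The parts described above can also be assembled from named pieces, which may make the structure easier to follow. This is a sketch, not the answer's exact code. The constant names are hypothetical, and the groups are non-capturing so that `findall` returns whole matches rather than tuples. One deliberate difference: the specific alternatives (`25[0-5]`, `2[0-4][0-9]`) are listed first. With the general alternative first, a final octet such as 255 is matched as 25, because nothing after it forces the regex engine to try the later alternatives.

```python
import re

# Number part, with the specific alternatives first (a choice made for this sketch).
NUMBER = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"
PREFIX = r"[ (\[]?"    # optional space, "(" or "["
DOT = r"(?:\.|dot)"    # a period or the word "dot"
SUFFIX = r"[ )\]]?"    # optional space, ")" or "]"

# Three "number + dot" groups, then a final number with no dot after it.
PATTERN = re.compile("(?:" + NUMBER + PREFIX + DOT + SUFFIX + "){3}" + NUMBER)

text = "a 192.168.1[dot]1 b 192 .168 .1 .1 c 10.0.0.255"
ips = PATTERN.findall(text)
# ips == ['192.168.1[dot]1', '192 .168 .1 .1', '10.0.0.255']
```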
answered Oct 17 '22 at 11:10 by recursive


Description

This regex matches each of the four octets of what looks like an IP address. Each octet is placed into its own capture group for collection.

(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])


Given the following sample text, this regex will match all 10 embedded IP strings in their entirety, including the first one. Working example: http://www.rubular.com/r/1MbGZOhuj5

The following are IP addresses: 192.168.1.222, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).

The resulting matches could be iterated over and a properly formatted IP string could be constructed by joining the 4 capture groups with a dot.
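In Python, that iteration could look roughly like the sketch below. `normalized_ips` is a hypothetical helper name; the pattern is the same expression as above, built by joining four copies of the octet group with `\D{1,5}`. Note that `\D{1,5}` accepts any one to five non-digit characters, so an unrelated sequence such as "1, 2, 3, 4" is also reported as an address.

```python
import re

# Same expression as above: four captured octets separated by 1-5 non-digits.
OCTET = r"(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])"
PATTERN = re.compile(r"\D{1,5}".join([OCTET] * 4))


def normalized_ips(text):
    # Each match carries four capture groups; join them with plain dots.
    return [".".join(m.groups()) for m in PATTERN.finditer(text)]


sample = "192.168.1[.]1 or 192[dot]168(dot)1. 1"
result = normalized_ips(sample)  # ['192.168.1.1', '192.168.1.1']
```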

answered Oct 17 '22 at 11:10 by Ro Yo Mi