
Find all occurrences of integer within text in Python

The purpose of this code is to extract all the integers from the text and sum them.

I have been looking for a way to pluck out all the integers in a line of text. I saw some solutions suggesting \D and \b, but I have only just started with regular expressions and am still unsure how they fit into my code. Please help :(

import re
import urllib2

data = urllib2.urlopen("http://python-data.dr-chuck.net/regex_sum_179860.txt")
aList = []

for word in data:
    data = (str(w) for w in data)
    s = re.findall(r'[\d]+', word)
    if len(s) != 1: continue
    num = int(s[0])
    aList.append(num)

print aList
Kelvinlimjk asked Dec 16 '15 15:12


2 Answers

  1. You need to call read() on the return value of urllib2.urlopen. That return value is not a string but a connection object (a file-like object).
  2. Just apply re.findall to the whole of the data.
  3. Square brackets around \d are not necessary.

import re
import urllib2

data = urllib2.urlopen("http://python-data.dr-chuck.net/regex_sum_179860.txt").read()
int_list = map(int, re.findall(r'\d+', data))

>>> int_list
[3524, 9968, 6177, 3133, 6508, 7940, 3738, 1112, 6179, 4570, 6127, 9150,
 9883, 418, 3538, 2992, 8527, 1150, 2049, 2834, 2630, 3840, 2638, 3800,
 9144, 5866, 6742, 588, 6918, 7802, 8229, 7947, 8992, 1339, 2119, 846,
 3820, 4070, 9356, 9708, 3238, 9380, 5572, 9491, 3038, 7434, 7771, 288,
 8632, 3962, 9136, 8106, 7295, 3699, 4136, 3459, 8120, 6018, 8963, 5779,
 3635, 3984, 4850, 9633, 2588, 7631, 9591, 1067, 7182, 1301, 8041, 1361,
 5425, 8326, 7094, 8155, 2581, 7199, 6125, 42]
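
For anyone on Python 3, where urllib2 no longer exists (urlopen lives in urllib.request and read() returns bytes that need decoding), the same idea can be sketched without the network call. This is only an illustration: sample_text and extract_ints are made-up names, and the sample text is invented rather than taken from the real file.

```python
import re

# Hypothetical sample text standing in for the downloaded file contents.
sample_text = "abc 12 def 7\n3 apples and 40 pears\nno numbers here\n"

def extract_ints(text):
    """Return every run of digits in text as an int, in order of appearance."""
    return [int(m) for m in re.findall(r'\d+', text)]

int_list = extract_ints(sample_text)
print(int_list)       # [12, 7, 3, 40]
print(sum(int_list))  # 62
```

Note that lines containing two numbers contribute both of them, which the `len(s) != 1` check in the question would have skipped.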
falsetru answered Sep 17 '22 22:09


You can do it line by line: call findall with the pattern \d+ (one or more digits) on each line and extend your output list with the results:

import re
import urllib2

data = urllib2.urlopen("http://python-data.dr-chuck.net/regex_sum_179860.txt")
r = re.compile(r"\d+")
l = []
for line in data:
    l.extend(map(int,r.findall(line)))

Output:

[3524, 9968, 6177, 3133, 6508, 7940, 3738, 1112, 6179, 4570, 6127, 9150,
 9883, 418, 3538, 2992, 8527, 1150, 2049, 2834, 2630, 3840, 2638, 3800,
 9144, 5866, 6742, 588, 6918, 7802, 8229, 7947, 8992, 1339, 2119, 846,
 3820, 4070, 9356, 9708, 3238, 9380, 5572, 9491, 3038, 7434, 7771, 288,
 8632, 3962, 9136, 8106, 7295, 3699, 4136, 3459, 8120, 6018, 8963, 5779,
 3635, 3984, 4850, 9633, 2588, 7631, 9591, 1067, 7182, 1301, 8041, 1361,
 5425, 8326, 7094, 8155, 2581, 7199, 6125, 42]

You could also use str.isdigit:

l = []
for line in data:
    l.extend(map(int, (w for w in line.split() if w.isdigit())))
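
One caveat worth knowing: the two approaches are not interchangeable in general. str.isdigit only accepts whitespace-separated tokens made up entirely of digits, while \d+ picks up digits attached to other characters too. A small sketch of the difference, using an invented line (not from the real file) and Python 3 syntax:

```python
import re

# Hypothetical line mixing standalone numbers with digits glued to other characters.
line = "room42 has 3 chairs, 15 desks and 7."

# Token-based: only whitespace-separated words made up entirely of digits count.
by_isdigit = [int(w) for w in line.split() if w.isdigit()]

# Regex-based: every run of digits counts, wherever it appears.
by_regex = [int(m) for m in re.findall(r'\d+', line)]

print(by_isdigit)  # [3, 15]
print(by_regex)    # [42, 3, 15, 7]
```

Which behaviour is right depends on whether something like "room42" or "7." should count as containing a number in your data.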

If you just want to sum the numbers, you don't need to store all the numbers at all:

print(sum(sum(map(int,(w for w in line.split() if w.isdigit()))) for line in data))

Output:

435239

Or using a regex:

print(sum(sum(map(int,r.findall(line))) for line in data))

This is probably irrelevant in your case, but if you want to avoid any intermediate lists in Python 2, you could use itertools.imap:

from itertools import imap
print(sum(sum(imap(int,r.findall(line))) for line in data))
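
On Python 3, itertools.imap is gone because map is already lazy. A rough equivalent might use re.finditer so that no list of matches is built either. In this sketch io.StringIO stands in for the file-like object, its contents are made up, and the names digits and total are just for illustration:

```python
import io
import re

digits = re.compile(r'\d+')

# io.StringIO stands in for the file-like object; the contents are invented.
data = io.StringIO("abc 12 def 7\n3 apples and 40 pears\nno numbers here\n")

# One flat generator: each match is converted and added as it is found,
# so no intermediate list of numbers is ever stored.
total = sum(int(m.group()) for line in data for m in digits.finditer(line))
print(total)  # 62
```

Keep in mind that a file-like object can only be iterated once. A second pass over data yields nothing unless it is reopened or rewound, so the alternatives above each need a fresh data object.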
Padraic Cunningham answered Sep 16 '22 22:09