
Find which lines in a file contain certain characters

Is there a way to find out if a string contains any one of the characters in a set in Python?

It's straightforward to do it with a single character, but I need to check and see if a string contains any one of a set of bad characters.

Specifically, suppose I have a string:

s = 'amanaplanacanalpanama~012345'

and I want to see if the string contains any vowels:

bad_chars = 'aeiou'

and do this in a for loop for each line in a file:

if [any one or more of the bad_chars] in s:
    do something

I am scanning a large file, so if there is a faster way to do this, that would be ideal. Also, not every bad character has to be checked: as soon as one is encountered, that is enough to end the search.

I'm not sure if there is a builtin function or easy way to implement this, but I haven't come across anything yet. Any pointers would be much appreciated!

BFTM asked May 03 '12 22:05


1 Answer

any((c in badChars) for c in yourString)

or

any((c in yourString) for c in badChars)  # extensionally equivalent, slower

or

set(yourString) & set(badChars)  # extensionally equivalent, slower

"so long as one is encountered that is enough to end the search." - This will be true if you use the first method.
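As a minimal sketch, here are the three variants applied to the string and bad characters from the question. The helper name contains_any and the result variables are made up for illustration; only any(), in, and set come from the answer above.

```python
s = 'amanaplanacanalpanama~012345'
bad_chars = 'aeiou'


def contains_any(text, chars):
    """Return True as soon as one character of `text` is found in `chars`."""
    # any() stops consuming the generator at the first True value,
    # so the scan ends at the first bad character in `text`.
    return any(c in chars for c in text)


# First method: walk the string, test each character against bad_chars.
found_first = contains_any(s, bad_chars)

# Second method: walk bad_chars, test each one against the whole string.
found_second = any(c in s for c in bad_chars)

# Third method: set intersection; the result is a set, so convert to bool.
found_third = bool(set(s) & set(bad_chars))

# A string without vowels gives False.
clean = contains_any('12345~', bad_chars)
```

Note that the third variant always builds both sets in full before intersecting them, so it cannot stop early.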

You say you are concerned with performance: performance should not be an issue unless you are dealing with a huge amount of data. If you encounter issues, you can try:


Regexes

Edit: I previously had a section here on using regexes via the re module: programmatically generating a regex consisting of a single character class [...] and using .finditer, with the caveat that simply putting a backslash before every character might not work correctly. After testing, that is indeed the case, and I would definitely not recommend this method. Doing it by hand would require reverse-engineering the entire (slightly complex) sub-grammar of regex character classes: characters such as \, ], [ and - are special inside a class, and blindly escaping everything can give an ordinary character a new meaning (for example, w becomes \w).
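The problem described above is with hand-rolled escaping. If a regex is still wanted, one possible alternative (not the approach from the removed section) is to let the standard library's re.escape do the escaping, since it leaves letters and digits alone and backslash-escapes \, ], [, - and ^. The sketch below assumes that behaviour is sufficient for your character set; the function name make_bad_char_pattern is hypothetical.

```python
import re


def make_bad_char_pattern(bad_chars):
    """Compile a single character class matching any one of `bad_chars`.

    Assumption: re.escape is trusted to escape every character that is
    special inside [...]; alphanumerics are left unescaped, so 'w' stays
    a literal 'w' rather than turning into the \\w class.
    """
    if not bad_chars:
        # '[]' is not a valid character class, so reject the empty case.
        raise ValueError('bad_chars must not be empty')
    return re.compile('[' + re.escape(bad_chars) + ']')


vowel_pattern = make_bad_char_pattern('aeiou')
# Characters that would break a naively built character class.
tricky_pattern = make_bad_char_pattern(']-\\^w')

# search() stops at the first match, so it also ends the scan early.
has_vowel = vowel_pattern.search('amanaplanacanalpanama~012345') is not None
has_tricky = tricky_pattern.search('a-b') is not None
no_tricky = tricky_pattern.search('a1_') is None
```

Whether this is faster than the any() variants depends on the data, so it is worth timing on a sample of the real file before switching.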


Sets

A membership test against a str is a linear scan, O(len(badChars)), whereas a membership test against a set is O(1) on average. If you have many badChars, it may therefore be justifiable to convert them into a set first so that each in operation is O(1):

badCharSet = set(badChars)  # build the set once
any((c in badCharSet) for c in yourString)

(Build the set once, outside the generator expression. In a one-liner such as any((c in set(badChars)) for c in yourString), CPython rebuilds the set for every character of yourString, which throws away the benefit.)
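Applied to the loop over lines from the question, the set version might look like the sketch below. The generator name lines_with_bad_chars is hypothetical, and io.StringIO stands in for a real file opened with open().

```python
import io

bad_chars = 'aeiou'
# Build the lookup set once, before the loop over lines.
bad_char_set = frozenset(bad_chars)


def lines_with_bad_chars(lines, bad_set):
    """Yield (line_number, line) for each line containing a bad character."""
    for number, line in enumerate(lines, start=1):
        # any() stops at the first bad character found in the line.
        if any(c in bad_set for c in line):
            yield number, line.rstrip('\n')


# Stand-in for a real file object; iterating it yields one line at a time.
sample_file = io.StringIO('xyz\namanaplan\n~012345\nsky and rhythm\n')
hits = list(lines_with_bad_chars(sample_file, bad_char_set))
# hits == [(2, 'amanaplan'), (4, 'sky and rhythm')]
```

An equivalent per-line test is not bad_set.isdisjoint(line), which pushes the loop into the set implementation.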


Do you really need to do this line-by-line?

It may be faster to do this once for the entire file, O(#badchars) searches, than once for every line in the file, O(#lines*#badchars) searches, though the constant factors may be such that it won't matter.
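One way to combine both ideas, sketched under the assumption that the file contents fit in memory as a single string: check the whole text once, and only fall back to a per-line scan when that check finds something. The function name find_bad_lines is hypothetical.

```python
bad_char_set = frozenset('aeiou')


def find_bad_lines(text, bad_set):
    """Return the 1-based numbers of lines in `text` with a bad character."""
    # Single pass over the whole text; stops at the first bad character.
    if not any(c in bad_set for c in text):
        return []  # nothing bad anywhere, so skip the per-line work
    # Only now pay for the line-by-line scan to locate the offending lines.
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if any(c in bad_set for c in line)
    ]


# In real use `text` would come from something like open(path).read().
dirty = find_bad_lines('xyz\namanaplan\n~012345\n', bad_char_set)
clean = find_bad_lines('xyz\n~012345\n', bad_char_set)
```

This only pays off when most files are clean; if nearly every file contains a bad character, the first pass is wasted work.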

ninjagecko answered Sep 27 '22 19:09