
Most efficient way to identify substrings in a string in Python?

I need to search a fairly lengthy string for CPV (Common Procurement Vocabulary) codes.

At the moment I'm doing this with a simple for loop and str.find().

The problem is that if a CPV code is written in a slightly different format, this approach won't find it.

What's the most efficient way to search for all the different variants of a code within the string? Is it simply a case of reformatting each of the up to 10,000 CPV codes and calling str.find() for each variant?

An example of the different formatting could be as follows:

30124120-1 
301241201 
30124120 - 1
30124120 1
30124120.1

etc.

Thanks :)

asked Nov 19 '25 22:11 by significance


2 Answers

Try a regular expression:

>>> import re
>>> cpv = re.compile(r'([0-9]+[-\. ]?[0-9])')
>>> print(cpv.findall('foo 30124120-1 bar 21966823.1 baz'))
['30124120-1', '21966823.1']

(Modify until it matches the CPVs in your data closely.)
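As one possible refinement, here is a sketch of a pattern that covers the five formats listed in the question and rewrites every hit into a single canonical form. It assumes a CPV code is always eight digits followed by one check digit; the names CPV_PATTERN and find_cpv_codes are made up for this example.

```python
import re

# Assumption: eight digits, then one check digit, optionally separated by
# a hyphen or a dot, with optional spaces around the separator.
# The lookarounds stop the pattern from matching inside a longer number.
CPV_PATTERN = re.compile(r'(?<!\d)(\d{8})\s*[-.]?\s*(\d)(?!\d)')

def find_cpv_codes(text):
    """Return every code found in text, normalized to 'NNNNNNNN-N'."""
    return ['{}-{}'.format(main, check)
            for main, check in CPV_PATTERN.findall(text)]

sample = 'a 30124120-1 b 301241201 c 30124120 - 1 d 30124120 1 e 30124120.1'
print(find_cpv_codes(sample))
```

Note that the space-separated form means an eight-digit number followed by an unrelated single digit would also match, so tighten the pattern if that occurs in your data.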

answered Nov 21 '25 12:11 by Fred Foo


Try using the functions in the re module (regular expressions for Python). See the docs for more info.

You can craft a regular expression that accepts a number of different formats for these codes, and then use re.findall or something similar to extract them. I'm not certain what a CPV is, so I don't have a regular expression for it (though maybe you could see if Google has any?)
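On the efficiency part of the question: rather than calling str.find() for every variant of every code, one option is to scan the text once for anything shaped like a code, normalize each candidate, and check it against a set of the known codes. The sketch below assumes codes are eight digits plus one check digit; find_known_codes and the sample values are hypothetical, not real CPV entries.

```python
import re

# Assumed shape of a code: eight digits, optional separator, one check digit.
CANDIDATE = re.compile(r'(?<!\d)(\d{8})\s*[-.]?\s*(\d)(?!\d)')

def find_known_codes(text, known_codes):
    """Return (normalized_code, offset) for each known code found in text."""
    known = set(known_codes)  # set membership is O(1) on average
    hits = []
    for match in CANDIDATE.finditer(text):
        # Normalize whatever format was found to 'NNNNNNNN-N'.
        code = match.group(1) + '-' + match.group(2)
        if code in known:
            hits.append((code, match.start()))
    return hits

# Placeholder values taken from this thread, not verified CPV entries.
known_codes = ['30124120-1', '21966823-1']
text = 'foo 30124120 - 1 bar 99999999.9 baz 219668231'
print(find_known_codes(text, known_codes))
```

This way the cost depends on the length of the text and the number of candidates found, not on how many codes are in the list.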

answered Nov 21 '25 13:11 by Rafe Kettler


