Say I have the following inputs in Python:
1) "$ 1,350,000"
2) "1.35 MM $"
3) "$ 1.35 M"
4) 1350000 (this one is a numeric value, not a string)
Obviously the number is the same although the string representation is different. How can I match these inputs, or in other words classify them as equal?
One way would be to model the possible patterns using regular expressions. However, there might be a case that I haven't thought of.
Does someone see an NLP solution to this problem?
Thanks
This is not an NLP problem, just a job for regexes, plus some code to ignore token order and to look up a dictionary (or ontology) of known abbreviations like "MM".
Here's some working code:
def parse_numeric_string(s):
    # Accept numeric inputs as well as strings
    if isinstance(s, (int, float)):
        s = str(s)
    amount = None
    currency = ''
    multiplier = 1.0
    for token in s.split():
        token = token.lower()
        if token in ['$', '€', '£', '¥']:
            currency = token
            continue
        # Extract multipliers from their string names/abbrevs
        if token in ['million', 'm', 'mm']:
            multiplier = 1e6
            continue
        # ... or you could use a dict:
        # multiplier = {'million': 1e6, 'm': 1e6...}.get(token, 1.0)
        # Assume anything else is some string format of number/int/float/scientific
        try:
            amount = float(token.replace(',', ''))
        except ValueError:
            pass  # Process your parse failures...
    if amount is None:
        return (currency, None)  # No number was found in the input
    # Return a tuple, or whatever you prefer
    return (currency, amount * multiplier)

parse_numeric_string("$ 1,350,000")
parse_numeric_string("1.35 MM $")
parse_numeric_string("$ 1.35 M")
parse_numeric_string(1350000)
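To get from parsing to the original goal of classifying inputs as equal, one option is to reduce every input to a plain number and compare the numbers. The following is only a sketch under stated assumptions: the names to_amount, same_amount and MULTIPLIERS are made up for illustration, the multiplier table is incomplete, and the regex assumes "," as thousands separator and "." as decimal point.

```python
import math
import re

# Hypothetical lookup table of multiplier abbreviations; extend as needed.
MULTIPLIERS = {"k": 1e3, "m": 1e6, "mm": 1e6, "million": 1e6,
               "bn": 1e9, "billion": 1e9}

# A number with optional comma thousands separators and an optional decimal
# part, optionally followed (after whitespace) by an alphabetic word.
_NUMBER_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([a-z]+)?")


def to_amount(value):
    """Return the numeric amount in `value`, or None if no number is found."""
    if isinstance(value, (int, float)):
        return float(value)
    match = _NUMBER_RE.search(value.lower())
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    # Unknown or missing words fall back to a multiplier of 1.0.
    return number * MULTIPLIERS.get(match.group(2), 1.0)


def same_amount(a, b):
    """True if both inputs parse to (approximately) the same number."""
    x, y = to_amount(a), to_amount(b)
    return x is not None and y is not None and math.isclose(x, y)
```

With this, same_amount("$ 1,350,000", "1.35 MM $") and same_amount("$ 1.35 M", 1350000) both evaluate to True. The comparison uses math.isclose rather than == because the multiplication is done in floating point. Currency is deliberately ignored here; compare it separately if it matters.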
Note that "," and "." as thousands separator and decimal point can be switched, or "'" can appear as an (Arabic) thousands separator. There's also a third-party Python package 'parse', e.g. parse.parse('{fn}', '1,350,000') (it's the reverse of format()). Another input form to consider: USD1.3m
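For the switched-separator case, a heuristic pre-processing step can rewrite the numeric part before it is passed to float(). This is a sketch, not a general solution: normalize_separators is a hypothetical helper, and a lone "," is inherently ambiguous ("1,350" could mean 1350 or 1.35), so the rule used for it below is an assumption you may need to change for your data.

```python
def normalize_separators(number_text):
    """Heuristically rewrite a number so that float() can parse it."""
    # Apostrophes and spaces are only ever used as grouping characters.
    text = number_text.replace("'", "").replace(" ", "")
    last_comma, last_dot = text.rfind(","), text.rfind(".")
    if last_comma != -1 and last_dot != -1:
        # Both present: whichever comes last is taken as the decimal mark.
        if last_comma > last_dot:
            decimal, thousands = ",", "."
        else:
            decimal, thousands = ".", ","
        return text.replace(thousands, "").replace(decimal, ".")
    for sep in (",", "."):
        if text.count(sep) > 1:
            # A repeated separator can only be a thousands separator.
            return text.replace(sep, "")
    if text.count(",") == 1:
        head, tail = text.split(",")
        # Assumption: exactly three trailing digits means thousands,
        # anything else means the comma is a decimal mark.
        return head + tail if len(tail) == 3 else head + "." + tail
    # No separator, or a single ".", which is left as a decimal point.
    return text
```

For example, float(normalize_separators("1.350.000,50")) and float(normalize_separators("1,350,000.50")) give the same value. If you know the locale of each input in advance, using that knowledge is more reliable than guessing from the digits.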