I have a nested list of strings from which I would like to extract the dates. The date format is:
two digits (from 01 to 12), a hyphen, three letters (a valid month abbreviation), a hyphen, and two digits, for example: 08-Jan—07 or 03-Oct—01
I tried to use the following regex:
r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}'
Then I tested it as follows:
import pandas as pd
df = pd.DataFrame({'blobs':['6-Feb- 1 4 Facebook’s virtual-reality division created a 3-EBÚ7 11 network of 500 free demo stations in Best Buy stores to give people a taste of VR using the Oculus Rift 90 GT 48 headset. But according to a Wednesday report from Business Insider, about 200 of the demo stations will close after low interest from consumers. 17-Feb-2014',
'I think in a store environment getting people to sit down and go through that experience of getting a headset on and getting set up is quite a difficult thing to achieve,” said Geoff Blaber, a CCS Insight analyst. 29—Oct-2012 Blaber 32 FAX 2978 expects that it will get easier when companies can convince 18-Oct-12 credit cards. '
]})
df
Then:
df['blobs'].str.extractall(r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}')
However, it does not work. The regex above does not return the dates, only hyphens (-) or NaN:
Col
0 NaN
1 -
2 -
3 NaN
4 NaN
5 -
...
n -
How can I fix it in order to get the following?
Col
0 6-Feb-14, 17-Feb-2014
1 29—Oct-2012, 18-Oct-12
UPDATE
I also tried:
import re
df['col'] = df.blobs.apply(lambda x: re.findall('\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}',x))
s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
s.name = "col"
df = df.drop('col')
df
However, I got the following error:
ValueError Traceback (most recent call last)
<ipython-input-4-5e9a34bd159f> in <module>()
3 s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
4 s.name = "col"
----> 5 df = df.drop('col')
6 df
/usr/local/lib/python3.5/site-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)
1905 new_axis = axis.drop(labels, level=level, errors=errors)
1906 else:
-> 1907 new_axis = axis.drop(labels, errors=errors)
1908 dropped = self.reindex(**{axis_name: new_axis})
1909 try:
/usr/local/lib/python3.5/site-packages/pandas/indexes/base.py in drop(self, labels, errors)
3260 if errors != 'ignore':
3261 raise ValueError('labels %s not contained in axis' %
-> 3262 labels[mask])
3263 indexer = indexer[~mask]
3264 return self.delete(indexer)
ValueError: labels ['col'] not contained in axis
When you use Series.str.extract or Series.str.extractall, only the captured substrings are returned, not the whole matches. So you need to capture (i.e. put ( and ) around) the part of the pattern you want to grab.
Since each row can contain several expected matches, this is awkward to do with extractall. You can use Series.str.findall instead, which returns the whole matches when the pattern contains no capturing group.
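The group behaviour described above can be checked with the standard library alone, since Series.str.findall is documented as equivalent to applying re.findall to each element. This is only a minimal sketch: the names MONTHS and text are made up for illustration, and the sample string is a shortened version of the second row from the question.

```python
import re

# Hypothetical helper name; the month list is the one used in the question.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
text = "analyst. 29—Oct-2012 Blaber expects 18-Oct-12 credit cards."

# Capturing group around the separator: findall returns only that group.
with_group = re.findall(r"\d{2}(—|-)(?:" + MONTHS + r")-\d{2,4}", text)

# Non-capturing group instead: findall returns the whole match.
without_group = re.findall(r"\d{2}(?:—|-)(?:" + MONTHS + r")-\d{2,4}", text)

print(with_group)
print(without_group)
```

The first call reproduces the "just hyphens" result from the question, because the only capturing group in that pattern is the separator.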
Use
rx = r'\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[-–—](?:\d{4}|\d{2})\b'
df['Col'] = df['blobs'].str.findall(rx).apply(','.join)
The .apply(','.join) call converts each list of matches into a comma-separated string in the Col column.
The pattern means:
\b - a word boundary
\d{1,2} - 1 or 2 digits
[-–—] - a hyphen, an en dash, or an em dash
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) - any of the 12 abbreviated month names
[-–—] - a hyphen, an en dash, or an em dash
(?:\d{4}|\d{2}) - 4 or 2 digits
\b - a word boundary
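To see how these pieces work together, the pattern can be tried with plain re on a few strings. This is a sketch under assumptions: samples and results are illustrative names, the first two strings are shortened from the question's data, and the third is an invented case that shows the effect of the trailing \b.

```python
import re

rx = re.compile(
    r"\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[-–—](?:\d{4}|\d{2})\b"
)

samples = [
    "6-Feb- 1 4 demo stations will close. 17-Feb-2014",
    "analyst. 29—Oct-2012 Blaber expects 18-Oct-12 credit cards.",
    "ref 18-Oct-123",  # invented: a three-digit year must not match
]

# Same idea as .str.findall(rx).apply(','.join), one string at a time.
results = [",".join(rx.findall(s)) for s in samples]
for r in results:
    print(repr(r))
```

Note that the broken fragment "6-Feb- 1 4" is not picked up, because the pattern requires the year digits to follow the second separator directly. Recovering it would need a looser pattern or a cleanup step on the text first.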