I am working with a pandas dataframe. I am trying to split a column after the date and time from the rest of the string.
df
data
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
Desired output:
df
date machine
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
If I try something like df["data"].str.extract('^(.*? [0-9]{2}) (.*)$') it just strips everything after the 22(day)
You can also pass the exact form of date\time:
df['data'].str.extract('(\w* \d* \d*:\d*:\d*) (.*)')
output:
0 1
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
Using positive lookbehind to split on {semicolon}{two numbers}{space}:
Details:
(?<=) is positive lookbehind (check if anything is before the string):\d{2} is pattern: {semicolon}{two numbers}
\s is white spaceConclusion: we split on a whitespace but only if its preceeded by the pattern described above.
s = df['data'].str.split('(?<=:\d{2})\s')
df['date'] = s.str[0]
df['machine'] = s.str[1]
df = df.drop(columns='data')
Or as piRSquared & jezrael suggest in the comments, in a one-liner:
df['date'], df['machine'] = zip(*df.pop('data').str.split('(?<=:\d{2})\s'))
Output
date machine
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
This depends on the data format always being 15 characters.
Also, since we're going to have to drop a column 'data' anyway, I thought it would be better to simply create a dataframe from scratch.
pd.DataFrame([[s[:15], s[16:]] for s in df.data], columns=['date', 'machine'])
date machine
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
rsplitDepends on 'machine' name never having spaces.
This works because the string accessor provided by pandas.Series.str is an iterable and can be used in an assignment statement similar to x, y = (1, 2)
Also note that I unapologetically took the idea to use pop in this instance from @jezrael
df['date'], df['machine'] = df.pop('data').str.rsplit(n=1).str
df
date machine
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With