I am working with a pandas dataframe. I am trying to split a column after the date and time from the rest of the string.
df
data
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
Desired output:
df
date machine
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
If I try something like df["data"].str.extract('^(.*? [0-9]{2}) (.*)$')
it just strips everything after the 22(day)
You can also pass the exact form of date\time:
df['data'].str.extract('(\w* \d* \d*:\d*:\d*) (.*)')
output:
0 1
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
Using positive lookbehind
to split on {semicolon}{two numbers}{space}
:
Details:
(?<=)
is positive lookbehind (check if anything is before the string):\d{2}
is pattern: {semicolon}{two numbers}
\s
is white spaceConclusion: we split on a whitespace
but only if its preceeded by the pattern described above.
s = df['data'].str.split('(?<=:\d{2})\s')
df['date'] = s.str[0]
df['machine'] = s.str[1]
df = df.drop(columns='data')
Or as piRSquared
& jezrael
suggest in the comments, in a one-liner:
df['date'], df['machine'] = zip(*df.pop('data').str.split('(?<=:\d{2})\s'))
Output
date machine
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
This depends on the data format always being 15 characters.
Also, since we're going to have to drop a column 'data'
anyway, I thought it would be better to simply create a dataframe from scratch.
pd.DataFrame([[s[:15], s[16:]] for s in df.data], columns=['date', 'machine'])
date machine
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
rsplit
Depends on 'machine'
name never having spaces.
This works because the string accessor provided by pandas.Series.str
is an iterable and can be used in an assignment statement similar to x, y = (1, 2)
Also note that I unapologetically took the idea to use pop
in this instance from @jezrael
df['date'], df['machine'] = df.pop('data').str.rsplit(n=1).str
df
date machine
0 Oct 22 12:56:52 server1
1 Oct 22 12:56:52 server2
2 Oct 22 12:56:53 server2
3 Oct 22 12:56:54 server2
4 Oct 22 12:56:56 comp2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With