
Pandas split after month day time from rest of string

I am working with a pandas dataframe. I am trying to split a column so that the date and time are separated from the rest of the string.

df
   data
0  Oct 22 12:56:52 server1
1  Oct 22 12:56:52 server2
2  Oct 22 12:56:53 server2
3  Oct 22 12:56:54 server2
4  Oct 22 12:56:56 comp2

Desired output:

df
   date              machine
0  Oct 22 12:56:52   server1
1  Oct 22 12:56:52   server2
2  Oct 22 12:56:53   server2
3  Oct 22 12:56:54   server2
4  Oct 22 12:56:56   comp2

If I try something like df["data"].str.extract('^(.*? [0-9]{2}) (.*)$'), it splits right after the 22 (the day) instead of after the time.

sectechguy asked Oct 25 '19 13:10


3 Answers

You can also pass the exact form of the date/time:

df['data'].str.extract(r'(\w* \d* \d*:\d*:\d*) (.*)')

output:

                 0        1
0  Oct 22 12:56:52  server1
1  Oct 22 12:56:52  server2
2  Oct 22 12:56:53  server2
3  Oct 22 12:56:54  server2
4  Oct 22 12:56:56    comp2
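
As a small extension of the extract approach above, named groups can supply the column names directly, so the result no longer has the default 0 and 1 headers. This is only a sketch: the sample frame is rebuilt from the question, and the slightly stricter `\w+`/`\d+` quantifiers are my own choice rather than part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"data": [
    "Oct 22 12:56:52 server1",
    "Oct 22 12:56:52 server2",
    "Oct 22 12:56:53 server2",
    "Oct 22 12:56:54 server2",
    "Oct 22 12:56:56 comp2",
]})

# Named groups (?P<name>...) become the column names of the result.
pattern = r"(?P<date>\w+ \d+ \d+:\d+:\d+) (?P<machine>.*)"
out = df["data"].str.extract(pattern)
print(out)
```

Rows that do not match the pattern come back as NaN in both columns, which makes malformed lines easy to spot.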
Quang Hoang answered Oct 01 '22 04:10


Using a positive lookbehind to split on {colon}{two digits}{space}:

Details:

  • (?<=...) is a positive lookbehind (it checks that the enclosed pattern occurs immediately before the current position, without consuming it)
  • :\d{2} is the pattern: {colon}{two digits}
  • \s is whitespace

Conclusion: we split on a whitespace character, but only if it is preceded by the pattern described above.
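
To see the lookbehind in isolation, here is a minimal sketch using Python's standard `re` module on a single line from the question, outside of pandas. The variable names are only illustrative.

```python
import re

line = "Oct 22 12:56:52 server1"

# Split on whitespace only when it directly follows a colon and two digits.
pattern = re.compile(r"(?<=:\d{2})\s")
parts = pattern.split(line)
print(parts)

# For comparison, a plain whitespace split breaks the date apart as well.
naive = line.split()
print(naive)
```

Only the space after ":52" satisfies the lookbehind, so the date stays in one piece.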

s = df['data'].str.split(r'(?<=:\d{2})\s')

df['date'] = s.str[0]
df['machine'] = s.str[1]
df = df.drop(columns='data')

Or as piRSquared & jezrael suggest in the comments, in a one-liner:

df['date'], df['machine'] = zip(*df.pop('data').str.split(r'(?<=:\d{2})\s'))

Output

              date  machine
0  Oct 22 12:56:52  server1
1  Oct 22 12:56:52  server2
2  Oct 22 12:56:53  server2
3  Oct 22 12:56:54  server2
4  Oct 22 12:56:56    comp2
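
A possible variation on the same split, sketched here under the assumption that every row contains exactly one date/machine boundary: passing `expand=True` makes `str.split` return a DataFrame, so neither the `.str[0]`/`.str[1]` indexing nor `zip` is needed. The sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"data": [
    "Oct 22 12:56:52 server1",
    "Oct 22 12:56:52 server2",
    "Oct 22 12:56:53 server2",
    "Oct 22 12:56:54 server2",
    "Oct 22 12:56:56 comp2",
]})

# expand=True returns one column per piece; n=1 caps it at two pieces.
parts = df["data"].str.split(r"(?<=:\d{2})\s", n=1, expand=True)
parts.columns = ["date", "machine"]
print(parts)
```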
Erfan answered Oct 01 '22 05:10


comprehension

This depends on the date portion always being exactly 15 characters long.
Also, since we have to drop the 'data' column anyway, I thought it would be simpler to create a new dataframe from scratch.

pd.DataFrame([[s[:15], s[16:]] for s in df.data], columns=['date', 'machine'])

              date  machine
0  Oct 22 12:56:52  server1
1  Oct 22 12:56:52  server2
2  Oct 22 12:56:53  server2
3  Oct 22 12:56:54  server2
4  Oct 22 12:56:56    comp2
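
The same fixed-width idea can also be written with the vectorised `.str` slicing that pandas provides. This is a sketch rather than a replacement for the comprehension: it adds a sanity check (my own addition) that the character at position 15 really is the separating space before trusting the slice.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"data": [
    "Oct 22 12:56:52 server1",
    "Oct 22 12:56:52 server2",
    "Oct 22 12:56:53 server2",
    "Oct 22 12:56:54 server2",
    "Oct 22 12:56:56 comp2",
]})

# Guard the fixed-width assumption: index 15 must be the separator.
assert (df["data"].str[15] == " ").all()

out = pd.DataFrame({
    "date": df["data"].str[:15],     # first 15 characters
    "machine": df["data"].str[16:],  # everything after the separator
})
print(out)
```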

rsplit

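Depends on the 'machine' name never containing spaces.

In older pandas versions this worked because the string accessor provided by pandas.Series.str was iterable and could be used in an assignment statement similar to x, y = (1, 2). Iterating over the accessor has since been deprecated and no longer works in pandas 2.0 and later, so this version passes expand=True and assigns both columns at once.

Also note that I unapologetically took the idea to use pop in this instance from @jezrael

df[['date', 'machine']] = df.pop('data').str.rsplit(n=1, expand=True)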

df

              date  machine
0  Oct 22 12:56:52  server1
1  Oct 22 12:56:52  server2
2  Oct 22 12:56:53  server2
3  Oct 22 12:56:54  server2
4  Oct 22 12:56:56    comp2
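
To make the "no spaces in the machine name" caveat concrete, the sketch below compares rsplit with the lookbehind split from the earlier answer. The second row, with the machine name "my comp", is a hypothetical value invented for illustration and does not appear in the question's data.

```python
import pandas as pd

s = pd.Series([
    "Oct 22 12:56:52 server1",
    "Oct 22 12:56:56 my comp",  # hypothetical name containing a space
])

# Splitting once from the right cuts inside the machine name.
by_rsplit = s.str.rsplit(n=1, expand=True)

# Splitting after the time keeps the whole machine name together.
by_lookbehind = s.str.split(r"(?<=:\d{2})\s", n=1, expand=True)

print(by_rsplit)
print(by_lookbehind)
```

If machine names may contain spaces, anchoring the split on the time format is the safer of the two.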
piRSquared answered Oct 01 '22 04:10