How to extract part of a string in pandas dataframe cell and create a new column with that string inside it

Question

I have a dataframe in which one of the columns contains a lengthy string which contains a lot of information which I need to break up into separate columns and add these to the dataframe. Its similar to this How to add a new column with multiple string contain conditions in python pandas other than using np.where? but I can't see how to adapt that.

I can create the empty columns but I don't know if the string can have elements extracted or if it can be separated into columns.

e.g. Line of data

0    Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs

Desired output

RowNumber, Volts, Wfm, Sclk, Image, Segment

1 , 17 , BF27, 100 , 1in24, 24

Data

                                              Comments  Image
0    Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs      0
1    Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs      0
2    Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs      0
3    Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs      0
4    Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs      0
..                                                 ...    ...
706  Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs      0
707  Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs      0
708  Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs      0
709  Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs      0
710  Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs      0

Code

import pandas as pd
import numpy as np

path = "/Users/.../Desktop/tk_gui_grid/"
file = "orig_data.txt"
filepath = path+file

df = pd.read_csv(filepath, sep='	', lineterminator='
')

com = df.loc[:,['Comments']]
dfLen = len(com)

image = [0]*dfLen
com['Image'] = image

print(com)

S3DEV · Accepted Answer

Here is a quick solution using regex with named capture groups.

Benefit of regex over `split`:

Some have commented that regex is not required, and this is a true statement. However, from the standpoint of data validation, using regex helps to prevent 'stray' data from creeping in. Using a 'blind' split() function splits the data on (a character); but what if the source data has changed? The split function is blind to this. Whereas, using regex will help to highlight an issue as the pattern simply won't match. Yes, you may get an error message - but this is a good thing as you'll be alerted to a data format change, providing the opportunity to address the issue, or update the regex pattern.

Source data:

Additional rows simulated for demonstration.

0    Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs
1    Row 2 Ch475 Vi 17.1V BF27 Sclk 101ns 1in24 25segs
2    Row 3 Ch475 Vi 17.2V BF27 Sclk 102ns 1in24 26segs
3    Row 4 Ch475 Vi 17.3V BF27 Sclk 103ns 1in24 27segs
4    Row 5 Ch475 Vi 17.4V BF27 Sclk 104ns 1in24 28segs

Code:

import pandas as pd
import re

path = './orig_data.txt'
cols = ['rownumber', 'volts', 'wfm', 'sclk', 'image', 'segment']
exp = re.compile(r'^\d+\s+Row\s'
                 r'(?P<rownumber>\d+).*\s'
                 r'(?P<volts>\d+\.\d+)V\s'
                 r'(?P<wfm>\w+)\sSclk\s'
                 r'(?P<sclk>\d+)ns\s'
                 r'(?P<image>\w+)\s'
                 r'(?P<segment>\d+)segs.*$')

df = pd.read_csv(path, sep='|', header=None, names=['comment'])
df[cols] = df['comment'].str.extract(exp, expand=True)

Output:

                                             comment rownumber volts   wfm  \
0  0    Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in2...         1  17.0  BF27   
1  1    Row 2 Ch475 Vi 17.1V BF27 Sclk 101ns 1in2...         2  17.1  BF27   
2  2    Row 3 Ch475 Vi 17.2V BF27 Sclk 102ns 1in2...         3  17.2  BF27   
3  3    Row 4 Ch475 Vi 17.3V BF27 Sclk 103ns 1in2...         4  17.3  BF27   
4  4    Row 5 Ch475 Vi 17.4V BF27 Sclk 104ns 1in2...         5  17.4  BF27   

  sclk  image segment  
0  100  1in24      24  
1  101  1in24      25  
2  102  1in24      26  
3  103  1in24      27  
4  104  1in24      28

How to extract part of a string in pandas dataframe cell and create a new column with that string inside it

Tags:

python

regex

split

pandas

Windy71

1 Answers

Benefit of regex over `split`:

Source data:

Code:

Output:

S3DEV

Recent Activity

Donate For Us

How to extract part of a string in pandas dataframe cell and create a new column with that string inside it

Tags:

python

regex

split

pandas

Windy71

1 Answers

Benefit of regex over split:

Source data:

Code:

Output:

S3DEV

Related questions

Recent Activity

Donate For Us

Benefit of regex over `split`: