I have a dataframe in which one of the columns contains a lengthy string which contains a lot of information which I need to break up into separate columns and add these to the dataframe. Its similar to this How to add a new column with multiple string contain conditions in python pandas other than using np.where? but I can't see how to adapt that.
I can create the empty columns but I don't know if the string can have elements extracted or if it can be separated into columns.
e.g. Line of data
0 Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs
Desired output
RowNumber, Volts, Wfm, Sclk, Image, Segment
1 , 17 , BF27, 100 , 1in24, 24
Data
Comments Image
0 Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs 0
1 Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs 0
2 Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs 0
3 Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs 0
4 Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs 0
.. ... ...
706 Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs 0
707 Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs 0
708 Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs 0
709 Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs 0
710 Row 2 Ch475 Vi 17.5V BF27 Sclk 100ns 1in24 24segs 0
Code
import pandas as pd
import numpy as np
path = "/Users/.../Desktop/tk_gui_grid/"
file = "orig_data.txt"
filepath = path+file
df = pd.read_csv(filepath, sep='\t', lineterminator='\r')
com = df.loc[:,['Comments']]
dfLen = len(com)
image = [0]*dfLen
com['Image'] = image
print(com)
Here is a quick solution using regex with named capture groups.
split:Some have commented that regex is not required, and this is a true statement. However, from the standpoint of data validation, using regex helps to prevent 'stray' data from creeping in. Using a 'blind' split() function splits the data on (a character); but what if the source data has changed? The split function is blind to this. Whereas, using regex will help to highlight an issue as the pattern simply won't match. Yes, you may get an error message - but this is a good thing as you'll be alerted to a data format change, providing the opportunity to address the issue, or update the regex pattern.
Additional rows simulated for demonstration.
0 Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in24 24segs
1 Row 2 Ch475 Vi 17.1V BF27 Sclk 101ns 1in24 25segs
2 Row 3 Ch475 Vi 17.2V BF27 Sclk 102ns 1in24 26segs
3 Row 4 Ch475 Vi 17.3V BF27 Sclk 103ns 1in24 27segs
4 Row 5 Ch475 Vi 17.4V BF27 Sclk 104ns 1in24 28segs
import pandas as pd
import re
path = './orig_data.txt'
cols = ['rownumber', 'volts', 'wfm', 'sclk', 'image', 'segment']
exp = re.compile(r'^\d+\s+Row\s'
r'(?P<rownumber>\d+).*\s'
r'(?P<volts>\d+\.\d+)V\s'
r'(?P<wfm>\w+)\sSclk\s'
r'(?P<sclk>\d+)ns\s'
r'(?P<image>\w+)\s'
r'(?P<segment>\d+)segs.*$')
df = pd.read_csv(path, sep='|', header=None, names=['comment'])
df[cols] = df['comment'].str.extract(exp, expand=True)
comment rownumber volts wfm \
0 0 Row 1 Ch475 Vi 17.0V BF27 Sclk 100ns 1in2... 1 17.0 BF27
1 1 Row 2 Ch475 Vi 17.1V BF27 Sclk 101ns 1in2... 2 17.1 BF27
2 2 Row 3 Ch475 Vi 17.2V BF27 Sclk 102ns 1in2... 3 17.2 BF27
3 3 Row 4 Ch475 Vi 17.3V BF27 Sclk 103ns 1in2... 4 17.3 BF27
4 4 Row 5 Ch475 Vi 17.4V BF27 Sclk 104ns 1in2... 5 17.4 BF27
sclk image segment
0 100 1in24 24
1 101 1in24 25
2 102 1in24 26
3 103 1in24 27
4 104 1in24 28
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With