Below is my dataframe having a column that are merged together,
PLUGS\nDESIGN\nGEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
I want to split the columns into three different columns namely PLUGS, DESIGN, GEAR. Is there any way to do this?
Below is the code which i tried:
new_df[['PLUGS', 'DESIGN', 'GEAR']] = new_df['PLUGS\nDESIGN\nGEAR'].str.split(' ')
print(new_df)
expected output:
PLUGS DESIGN GEAR
0 700 Daewoo 8000 Gearless
1 300 Hyundai 4400 Gearless
2 600 STX 2600 Gearless
3 200 B170 Geared
4 362 Wenchong 1700 Mk II Geared
5 252 RichMax 1550 Gearless
6 220 CV 1100 Plus Geared
7 232 Orskov Mk VII Gearless
8 119 Kouan 1000 Gearless
9 100 Hanjin 700 Gearless
As Suggested in the comment section, the regex should work pretty well here,
>>> df
PLUGS\nDESIGN\nGEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
Just removing the newline char from the column name to make readability easy for use as well.
>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)
Now, Column name does not have any special cars:
>>> df
PLUGS DESIGN GEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
Now, we can use pandas.Series.str.extract. While using regex method, all the Named groups () will become column names in the result.
As, the named group will become columns with predefined names like 0,1,2 thus we can rename them altogether with desired names to get the desired result as follows:
>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})
>>> print(df)
PLUGS DESIGN GEAR
0 700 Daewoo 8000 Gearless
1 300 Hyundai 4400 Gearless
2 600 STX 2600 Gearless
3 200 B170 Geared
4 362 Wenchong 1700 Mk II Geared
5 252 RichMax 1550 Gearless
6 220 CV 1100 Plus Geared
7 232 Orskov Mk VII Gearless
8 119 Kouan 1000 Gearless
9 100 Hanjin 700 Gearless
regex Explanation:
You can check at regex101.com
(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\|^Gear][a-z]+)
1st Capturing Group (\d+)
\d matches a digit (equivalent to [0-9])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [\\n\s]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
n matches the character n literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
2nd Capturing Group ([^\]+)
Match a single character not present in the list below [^\\]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
Match a single character present in the list below [\\n\s]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
n matches the character n literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
3rd Capturing Group ([|^Gear][a-z]+)
Match a single character present in the list below [\|^Gear]
\| matches the character | literally (case sensitive)
^Gear matches a single character in the list ^Gear (case sensitive)
Match a single character present in the list below [a-z]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
Starting from your Dataframe :
>>> import pandas as pd
>>> df = pd.DataFrame({'PLUGS\nDESIGN\nGEAR': ['700\nDaewoo 8000 Gearless', '300\nHyundai 4400 Gearless', '600\nSTX 2600 Gearless', '200\nB170 \nGeared', '362 Wenchong 1700 Mk II \nGeared', '252\nRichMax 1550 Gearless'], },
... index = [0, 1, 2, 3, 4, 5])
>>> df
PLUGS\nDESIGN\nGEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
You can indeed use the split method on several separators, here \n and space:
>>> df = pd.DataFrame(df['PLUGS\nDESIGN\nGEAR'].str.split('\n| '))
PLUGS\nDESIGN\nGEAR
0 [700, Daewoo, 8000, , Gearless]
1 [300, Hyundai, 4400, , Gearless]
2 [600, STX, 2600, , Gearless]
3 [200, B170, , Geared]
4 [362, Wenchong, 1700, Mk, II, , Geared]
5 [252, RichMax, 1550, , Gearless]
Then, you can assign the first and last element to the correct column, and the rest to the DESIGN column :
>>> df['PLUGS'] = df['PLUGS\nDESIGN\nGEAR'].str[0]
>>> df['DESIGN'] = df['PLUGS\nDESIGN\nGEAR'].str[1:-1]
>>> df['GEAR'] = df['PLUGS\nDESIGN\nGEAR'].str[-1]
>>> df
PLUGS\nDESIGN\nGEAR PLUGS DESIGN GEAR
0 [700, Daewoo, 8000, , Gearless] 700 [Daewoo, 8000, ] Gearless
1 [300, Hyundai, 4400, , Gearless] 300 [Hyundai, 4400, ] Gearless
2 [600, STX, 2600, , Gearless] 600 [STX, 2600, ] Gearless
3 [200, B170, , Geared] 200 [B170, ] Geared
4 [362, Wenchong, 1700, Mk, II, , Geared] 362 [Wenchong, 1700, Mk, II, ] Geared
5 [252, RichMax, 1550, , Gearless] 252 [RichMax, 1550, ] Gearless
The last thing to do is to improve the DESIGN column to map it as a string instead of a list using the join method, and drop the PLUGS\nDESIGN\nGEAR column like so :
>>> df['DESIGN'] = df['DESIGN'].apply(lambda x: ' '.join(map(str, x)))
>>> df.drop(['PLUGS\nDESIGN\nGEAR'], axis=1)
PLUGS DESIGN GEAR
0 700 Daewoo 8000 Gearless
1 300 Hyundai 4400 Gearless
2 600 STX 2600 Gearless
3 200 B170 Geared
4 362 Wenchong 1700 Mk II Geared
5 252 RichMax 1550 Gearless
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With