I want to split the data contained in a cell into multiple rows using Python. An example is given below:
This is my data:
fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
ethanol/gas FC SUV 6/8 9/14 15/20 1/16 yes
ethanol/gas FC SUV 6/3 1/14 14/19 10/16 no
I want to convert it into this form:
fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
ethanol FC SUV 6 9 15 1 yes
gas FC SUV 8 14 20 16 yes
ethanol FC SUV 6 1 14 10 no
gas FC SUV 3 14 19 16 no
The following code is returning an error:
import numpy as np
import pandas as pd
from itertools import chain

# return list from series of '/'-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split('/')))

# calculate lengths of splits
lens = df_08['fuel'].str.split('/').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({
    'cert_region': np.repeat(df_08['cert_region'], lens),
    'veh_class': np.repeat(df_08['veh_class'], lens),
    'smartway': np.repeat(df_08['smartway'], lens),
    'fuel': chainer(df_08['fuel']),
    'air_pollution': chainer(df_08['air_pollution']),
    'city_mpg': chainer(df_08['city_mpg']),
    'hwy_mpg': chainer(df_08['hwy_mpg']),
    'cmb_mpg': chainer(df_08['cmb_mpg'])})
It gives me this error:
TypeError Traceback (most recent call last)
<ipython-input-31-916fed75eee2> in <module>()
20 'fuel': chainer(df_08['fuel']),
21 'air_pollution_score': chainer(df_08['air_pollution_score']),
---> 22 'city_mpg': chainer(df_08['city_mpg']),
23 'hwy_mpg': chainer(df_08['hwy_mpg']),
24 'cmb_mpg': chainer(df_08['cmb_mpg']),
<ipython-input-31-916fed75eee2> in chainer(s)
4 # return list from series of comma-separated strings
5 def chainer(s):
----> 6 return list(chain.from_iterable(s.str.split('/')))
7
8 # calculate lengths of splits
TypeError: 'float' object is not iterable
But city_mpg has the object data type:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2404 entries, 0 to 2403
Data columns (total 14 columns):
fuel 2404 non-null object
cert_region 2404 non-null object
veh_class 2404 non-null object
air_pollution 2404 non-null object
city_mpg 2205 non-null object
hwy_mpg 2205 non-null object
cmb_mpg 2205 non-null object
smartway 2404 non-null object
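A likely cause, going by the info() output above: city_mpg, hwy_mpg and cmb_mpg each have 2205 non-null values out of 2404, so 199 entries are missing. An object column can hold strings and float NaN side by side, which would explain why the dtype looks fine while chainer still hits a float. Here is a minimal sketch of that situation; the two-row series is a made-up stand-in for df_08, and dropping the missing rows is only one possible workaround, not necessarily what you want for your data:

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for df_08['city_mpg']: the second value is missing.
city_mpg = pd.Series(["9/14", np.nan], dtype=object, name="city_mpg")

split = city_mpg.str.split("/")
# The missing entry is passed through as NaN (a float), not turned into a list.
nan_survives = isinstance(split.iloc[1], float)

try:
    list(chain.from_iterable(split))
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# One possible workaround (an assumption about the desired result):
# drop the missing values before splitting.
clean = city_mpg.dropna()
flat = list(chain.from_iterable(clean.str.split("/")))
print(flat)
```

On the full frame the rows would have to be dropped from the whole dataframe, e.g. with df_08.dropna(subset=[...]) before computing lens, so that the repeated and the chained columns stay the same length.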
To split a cell into multiple rows in a pandas dataframe, we can call apply with a lambda function that calls str.split on each string value x, and then call explode to fill new rows with the split values.
To split text in a single column into multiple rows, we can use the str.split method on that column and then explode the result.
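The split-then-explode idea described above can be sketched as follows. This is only an illustration on the two sample rows from the question: the frame is rebuilt by hand, split_cols is a name chosen here, and passing a list of columns to explode assumes pandas 1.3 or newer.

```python
import pandas as pd

# The two sample rows from the question, rebuilt by hand.
df = pd.DataFrame({
    "fuel": ["ethanol/gas", "ethanol/gas"],
    "cert_region": ["FC", "FC"],
    "veh_class": ["SUV", "SUV"],
    "air_pollution": ["6/8", "6/3"],
    "city_mpg": ["9/14", "1/14"],
    "hwy_mpg": ["15/20", "14/19"],
    "cmb_mpg": ["1/16", "10/16"],
    "smartway": ["yes", "no"],
})

# Columns whose cells hold '/'-separated values (chosen for this sample).
split_cols = ["fuel", "air_pollution", "city_mpg", "hwy_mpg", "cmb_mpg"]

out = df.copy()
# Turn each '/'-separated string into a list of parts.
out[split_cols] = out[split_cols].apply(lambda col: col.str.split("/"))
# Explode all list columns together; the other columns are repeated.
out = out.explode(split_cols, ignore_index=True)
print(out)
```

As far as I know, exploding several columns at once requires the lists in each row to have the same number of elements, so rows where some columns are missing or have a different number of parts need to be cleaned up first.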
My suggestion is to step out of pandas, do your computation, and put the result back into a dataframe. In my opinion it is much easier to manipulate, and I'd like to believe faster:
import pandas as pd
from itertools import chain
Step 1: convert to a list of dicts, one per row:
M = df.to_dict('records')
Step 2: do a list comprehension and split the values:
res = [[(key, *value.split('/'))
        for key, value in d.items()]
       for d in M]
Step 3: find the length of the longest entry. We need this to ensure all entries are the same length:
longest = max(len(line) for line in chain(*res))
print(longest)
#3
Step 4: the longest entry has length 3, so shorter entries are padded by repeating their value:
explode = [[(entry[0], entry[-1], entry[-1])
            if len(entry) < longest else entry for entry in box]
           for box in res]
print(explode)
[[('fuel', 'ethanol', 'gas'),
('cert_region', 'FC', 'FC'),
('veh_class', 'SUV', 'SUV'),
('air_pollution', '6', '8'),
('city_mpg', '9', '14'),
('hwy_mpg', '15', '20'),
('cmb_mpg', '1', '16'),
('smartway', 'yes', 'yes')],
[('fuel', 'ethanol', 'gas'),
('cert_region', 'FC', 'FC'),
('veh_class', 'SUV', 'SUV'),
('air_pollution', '6', '3'),
('city_mpg', '1', '14'),
('hwy_mpg', '14', '19'),
('cmb_mpg', '10', '16'),
('smartway', 'no', 'no')]]
Step 5: now we can pair each key with its values to get a dictionary:
result = {start[0]: (*start[1:], *end[1:])
          for start, end in zip(*explode)}
print(result)
{'fuel': ('ethanol', 'gas', 'ethanol', 'gas'),
'cert_region': ('FC', 'FC', 'FC', 'FC'),
'veh_class': ('SUV', 'SUV', 'SUV', 'SUV'),
'air_pollution': ('6', '8', '6', '3'),
'city_mpg': ('9', '14', '1', '14'),
'hwy_mpg': ('15', '20', '14', '19'),
'cmb_mpg': ('1', '16', '10', '16'),
'smartway': ('yes', 'yes', 'no', 'no')}
Read result into dataframe:
pd.DataFrame(result)
fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
0 ethanol FC SUV 6 9 15 1 yes
1 gas FC SUV 8 14 20 16 yes
2 ethanol FC SUV 6 1 14 10 no
3 gas FC SUV 3 14 19 16 no
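Note that the zip(*explode) step above unpacks exactly two records (start, end), so as written it fits the two-row sample. A rough sketch of the same pure-Python idea for any number of rows is given below; split_records is a hypothetical helper name, and treating non-string values (such as NaN) as a single part is an assumption about how missing data should behave:

```python
def split_records(records, sep="/"):
    """Expand each record into one row per sep-separated part (hypothetical helper)."""
    rows = []
    for record in records:
        # Strings are split; anything else (numbers, NaN, None) is kept as one part.
        parts = {
            key: value.split(sep) if isinstance(value, str) else [value]
            for key, value in record.items()
        }
        width = max(len(p) for p in parts.values())
        for i in range(width):
            # Short entries repeat their last part, like the padding step above.
            rows.append({key: p[i] if i < len(p) else p[-1]
                         for key, p in parts.items()})
    return rows


# The two sample rows from the question, as plain dicts.
records = [
    {"fuel": "ethanol/gas", "cert_region": "FC", "veh_class": "SUV",
     "air_pollution": "6/8", "city_mpg": "9/14", "hwy_mpg": "15/20",
     "cmb_mpg": "1/16", "smartway": "yes"},
    {"fuel": "ethanol/gas", "cert_region": "FC", "veh_class": "SUV",
     "air_pollution": "6/3", "city_mpg": "1/14", "hwy_mpg": "14/19",
     "cmb_mpg": "10/16", "smartway": "no"},
]

rows = split_records(records)
for row in rows:
    print(row)
```

With a dataframe, records would come from df.to_dict('records'), and pd.DataFrame(rows) would turn the list back into a frame.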
I think you're better off constructing a new dataframe:
result = pd.DataFrame(columns=df_08.columns)
for index, series in df_08.iterrows():
    temp1 = {}
    temp2 = {}
    for key, value in dict(series).items():
        # only strings can contain '/'; missing values (float NaN) are copied as-is
        if isinstance(value, str) and '/' in value:
            val1, val2 = value.split('/')
            temp1[key] = [val1]
            temp2[key] = [val2]
        else:
            temp1[key] = temp2[key] = [value]
    # append the two new rows built from this original row
    result = pd.concat([result, pd.DataFrame(data=temp1),
                        pd.DataFrame(data=temp2)], axis=0, ignore_index=True)