
Split cell data into multiple rows using Python

I want to split the data contained in a cell into multiple rows using Python. An example is given below:

This is my data:

fuel          cert_region   veh_class   air_pollution     city_mpg     hwy_mpg    cmb_mpg   smartway
ethanol/gas   FC            SUV         6/8               9/14         15/20      1/16      yes
ethanol/gas   FC            SUV         6/3               1/14         14/19      10/16     no

I want to convert it into this form:

fuel          cert_region   veh_class   air_pollution     city_mpg     hwy_mpg    cmb_mpg   smartway
ethanol       FC            SUV         6                 9            15         1         yes
gas           FC            SUV         8                 14           20         16        yes
ethanol       FC            SUV         6                 1            14         10        no
gas           FC            SUV         3                 14           19         16        no

The following code is returning an error:

import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split('/')))

# calculate lengths of splits
lens = df_08['fuel'].str.split('/').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({
    'cert_region': np.repeat(df_08['cert_region'], lens),
    'veh_class': np.repeat(df_08['veh_class'], lens),
    'smartway': np.repeat(df_08['smartway'], lens),
    'fuel': chainer(df_08['fuel']),
    'air_pollution': chainer(df_08['air_pollution']),
    'city_mpg': chainer(df_08['city_mpg']),
    'hwy_mpg': chainer(df_08['hwy_mpg']),
    'cmb_mpg': chainer(df_08['cmb_mpg'])})

It gives me this error:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-31-916fed75eee2> in <module>()
 20                     'fuel': chainer(df_08['fuel']),
 21                     'air_pollution_score': chainer(df_08['air_pollution_score']),
 ---> 22                     'city_mpg': chainer(df_08['city_mpg']),
 23                    'hwy_mpg': chainer(df_08['hwy_mpg']),
 24                    'cmb_mpg': chainer(df_08['cmb_mpg']),

  <ipython-input-31-916fed75eee2> in chainer(s)
  4 # return list from series of comma-separated strings
  5 def chainer(s):
  ----> 6     return list(chain.from_iterable(s.str.split('/')))
  7 
  8 # calculate lengths of splits

  TypeError: 'float' object is not iterable

But city_mpg has the Object data type:

   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 2404 entries, 0 to 2403
   Data columns (total 14 columns):
  fuel                    2404 non-null object
  cert_region             2404 non-null object
  veh_class               2404 non-null object
  air_pollution           2404 non-null object
  city_mpg                2205 non-null object
  hwy_mpg                 2205 non-null object
  cmb_mpg                 2205 non-null object
  smartway                2404 non-null object
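
A likely reading of the output above: city_mpg has 2205 non-null values out of 2404 entries, so 199 cells are missing. An object column stores a missing cell as NaN, which is a float, and .str.split('/') leaves it as NaN instead of returning a list. chain.from_iterable then tries to iterate over that float. The sketch below reproduces this with a made-up two-cell Series (the data is an assumption, not taken from df_08):

```python
import numpy as np
import pandas as pd
from itertools import chain

# Made-up sample: the second cell is missing, as in 199 of the city_mpg cells.
s = pd.Series(['9/14', np.nan], dtype=object)

split = s.str.split('/')
# The string became a list, the missing cell stayed a float NaN.
kinds = [type(v).__name__ for v in split]

try:
    list(chain.from_iterable(split))
    error = None
except TypeError as exc:
    error = str(exc)

print(kinds)
print(error)
```

If this is the cause, the missing cells have to be filled, dropped, or skipped before the split values are chained.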
Asked by Umair Mukhtar, Apr 10 '20 at 22:04



2 Answers

My suggestion is to step out of pandas, do your computation, and put the result back into a DataFrame. In my opinion this is much easier to manipulate, and I'd like to believe it is faster:

from itertools import chain


Step 1: convert to a list of dicts, one per row:

M = df.to_dict('records')


Step 2: do a list comprehension and split the values:

res = [[(key,*value.split('/'))
       for key,value in d.items()]
       for d in M]
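
For reference, here is a small sketch of what Steps 1 and 2 produce on the two sample rows from the question. The DataFrame is rebuilt by hand, which is an assumption about how the data was loaded. Cells containing a slash yield 3-tuples and the others yield 2-tuples, which is why the next steps even out the lengths:

```python
import pandas as pd

# The two sample rows from the question, rebuilt by hand (assumed layout).
df = pd.DataFrame({
    'fuel': ['ethanol/gas', 'ethanol/gas'],
    'cert_region': ['FC', 'FC'],
    'veh_class': ['SUV', 'SUV'],
    'air_pollution': ['6/8', '6/3'],
    'city_mpg': ['9/14', '1/14'],
    'hwy_mpg': ['15/20', '14/19'],
    'cmb_mpg': ['1/16', '10/16'],
    'smartway': ['yes', 'no'],
})

M = df.to_dict('records')
res = [[(key, *value.split('/'))
        for key, value in d.items()]
       for d in M]

print(res[0][0])  # ('fuel', 'ethanol', 'gas')
print(res[0][1])  # ('cert_region', 'FC')
```

Note that value.split assumes every cell is a string; a missing cell (NaN) would raise an AttributeError here.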


Step 3: find the length of the longest entry. We need this to ensure all entries are the same length:

longest = max(len(line) for line in chain(*res))
print(longest)
# 3


Step 4: the longest entry has length 3, so entries shorter than that are padded by repeating their value:

explode = [[(entry[0], entry[-1], entry[-1])
            if len(entry) < longest else entry for entry in box]
            for box in res]

print(explode)

[[('fuel', 'ethanol', 'gas'),
  ('cert_region', 'FC', 'FC'),
  ('veh_class', 'SUV', 'SUV'),
  ('air_pollution', '6', '8'),
  ('city_mpg', '9', '14'),
  ('hwy_mpg', '15', '20'),
  ('cmb_mpg', '1', '16'),
  ('smartway', 'yes', 'yes')],
 [('fuel', 'ethanol', 'gas'),
  ('cert_region', 'FC', 'FC'),
  ('veh_class', 'SUV', 'SUV'),
  ('air_pollution', '6', '3'),
  ('city_mpg', '1', '14'),
  ('hwy_mpg', '14', '19'),
  ('cmb_mpg', '10', '16'),
  ('smartway', 'no', 'no')]]


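Step 5: Now we can pair each key with its values across all records to get a dictionary:

# gather each key's values from every record, however many records there are
result = {entries[0][0]: tuple(chain.from_iterable(entry[1:] for entry in entries))
          for entries in zip(*explode)}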

print(result)

{'fuel': ('ethanol', 'gas', 'ethanol', 'gas'),
 'cert_region': ('FC', 'FC', 'FC', 'FC'),
 'veh_class': ('SUV', 'SUV', 'SUV', 'SUV'),
 'air_pollution': ('6', '8', '6', '3'),
 'city_mpg': ('9', '14', '1', '14'),
 'hwy_mpg': ('15', '20', '14', '19'),
 'cmb_mpg': ('1', '16', '10', '16'),
 'smartway': ('yes', 'yes', 'no', 'no')}


Read result into dataframe:

pd.DataFrame(result)

      fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
0  ethanol          FC       SUV             6        9      15       1      yes
1      gas          FC       SUV             8       14      20      16      yes
2  ethanol          FC       SUV             6        1      14      10       no
3      gas          FC       SUV             3       14      19      16       no
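
As an alternative that stays inside pandas, here is a minimal sketch using DataFrame.explode, which accepts a list of columns in pandas 1.3 and later. It is not part of the answer above. The helper split_cell and the sample frame are hypothetical, and the sketch assumes that fuel has no missing cells and that every slash-separated cell in a row has as many parts as that row's fuel cell:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two rows from the question, one city_mpg cell missing.
df_08 = pd.DataFrame({
    'fuel': ['ethanol/gas', 'ethanol/gas'],
    'cert_region': ['FC', 'FC'],
    'city_mpg': ['9/14', np.nan],
    'smartway': ['yes', 'no'],
})

split_cols = ['fuel', 'city_mpg']  # columns that may hold 'a/b' values


def split_cell(value, n):
    """Split an 'a/b' string into parts; repeat any other value n times."""
    if isinstance(value, str) and '/' in value:
        return value.split('/')
    return [value] * n


out = df_08.copy()
# fuel is assumed complete, so it gives the number of rows per record
n_parts = [len(parts) for parts in out['fuel'].str.split('/')]
for col in split_cols:
    cells = [split_cell(v, n) for v, n in zip(out[col], n_parts)]
    out[col] = pd.Series(cells, index=out.index)

# exploding several columns at once needs pandas >= 1.3
out = out.explode(split_cols, ignore_index=True)
print(out)
```

Columns that are not exploded, such as cert_region and smartway, are repeated automatically, and missing cells stay missing in both output rows.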
Answered by sammywemmy, Sep 19 '22 at 23:09


I think you're better off constructing a new DataFrame:

# start from an empty frame with the same columns as the source
result = pd.DataFrame(columns=df_08.columns)
for index, series in df_08.iterrows():
    temp1 = {}
    temp2 = {}
    for key, value in dict(series).items():
        # only strings can be split; NaN (a float) is copied to both rows
        if isinstance(value, str) and '/' in value:
            val1, val2 = value.split('/')
            temp1[key] = [val1]
            temp2[key] = [val2]
        else:
            temp1[key] = temp2[key] = [value]

    result = pd.concat([result, pd.DataFrame(data=temp1),
                        pd.DataFrame(data=temp2)], axis=0, ignore_index=True)
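
Calling pd.concat inside the loop copies the growing result on every iteration, which may be slow for a few thousand rows. A possible variant, sketched below, collects plain dicts in a list and builds the DataFrame once at the end. The function name split_rows and the sample frame are hypothetical; like the loop above, it emits two output rows for every input row and assumes at most one slash per cell:

```python
import numpy as np
import pandas as pd


def split_rows(frame, sep='/'):
    """Hypothetical helper: turn each row into two rows, splitting 'a/b' cells."""
    rows = []
    for record in frame.to_dict('records'):
        first, second = {}, {}
        for key, value in record.items():
            if isinstance(value, str) and sep in value:
                # maxsplit=1 guarantees exactly two parts
                first[key], second[key] = value.split(sep, 1)
            else:
                # non-strings (including NaN) and plain strings are copied
                first[key] = second[key] = value
        rows.extend([first, second])
    return pd.DataFrame(rows, columns=frame.columns)


# Made-up sample with one missing city_mpg cell.
sample = pd.DataFrame({
    'fuel': ['ethanol/gas', 'ethanol/gas'],
    'city_mpg': ['9/14', np.nan],
    'smartway': ['yes', 'no'],
})
print(split_rows(sample))
```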
Answered by Michael Hsi, Sep 20 '22 at 23:09