Split a cell data into multiple rows in using python

Tags:

I want to split the data contained in a cell into multiple rows in using python. Such an example is given below:

This is my data:

fuel          cert_region   veh_class   air_pollution      city_mpg     hwy_mpg    cmb_mpg  smartway
ethanol/gas    FC              SUV          6/8              9/14        15/20      1/16      yes
ethanol/gas    FC              SUV          6/3              1/14        14/19      10/16     no

I want to convert it into this form:

fuel          cert_region   veh_class   air_pollution     city_mpg     hwy_mpg    cmb_mpg   smartway
ethanol         FC             SUV          6               9           15          1          yes
 gas            FC             SUV          8               14          20          16         yes
ethanol         FC             SUV          6               1           14          10         no  
 gas            FC             SUV          3               14          19          16         no

The following code is returning an error:

import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
return list(chain.from_iterable(s.str.split('/')))

# calculate lengths of splits
lens = df_08['fuel'].str.split('/').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({
                'cert_region': np.repeat(df_08['cert_region'], lens),
                'veh_class': np.repeat(df_08['veh_class'], lens),
                'smartway': np.repeat(df_08['smartway'], lens),
                'fuel': chainer(df_08['fuel']),
                'air_pollution': chainer(df_08['air_pollution']),
                'city_mpg': chainer(df_08['city_mpg']),
               'hwy_mpg': chainer(df_08['hwy_mpg']),
               'cmb_mpg': chainer(df_08['cmb_mpg'])})

It gives me this error:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-31-916fed75eee2> in <module>()
 20                     'fuel': chainer(df_08['fuel']),
 21                     'air_pollution_score': chainer(df_08['air_pollution_score']),
 ---> 22                     'city_mpg': chainer(df_08['city_mpg']),
 23                    'hwy_mpg': chainer(df_08['hwy_mpg']),
 24                    'cmb_mpg': chainer(df_08['cmb_mpg']),

  <ipython-input-31-916fed75eee2> in chainer(s)
  4 # return list from series of comma-separated strings
  5 def chainer(s):
  ----> 6     return list(chain.from_iterable(s.str.split('/')))
  7 
  8 # calculate lengths of splits

  TypeError: 'float' object is not iterable

But city_mpg has the Object data type:

   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 2404 entries, 0 to 2403
   Data columns (total 14 columns):
  fuel                    2404 non-null object
  cert_region             2404 non-null object
  veh_class               2404 non-null object
  air_pollution           2404 non-null object
  city_mpg                2205 non-null object
  hwy_mpg                 2205 non-null object
  cmb_mpg                 2205 non-null object
  smartway                2404 non-null object

942

asked Apr 10 '20 22:04

Umair Mukhtar

2 Answers

my suggestion is to step out of pandas, do ur computation and put the result back into a dataframe. in my opinion, it is much easier to manipulate, and I'd like to believe faster :

from itertools import chain

Step 1: convert to dict :

M = df.to_dict('records')

Step 2: do a list comprehension and split the values:

res = [[(key,*value.split('/'))
       for key,value in d.items()]
       for d in M]

Step 3: find the length of the longest row. We need this to ensure all rows are the same length:

 longest = max(len(line) for line in chain(*res))
 print(longest)
 #3

Step 4: the longest entry is 3; we need to ensure that the lines less than 3 are adjusted :

explode = [[(entry[0], entry[-1], entry[-1])
            if len(entry) < longest else entry for entry in box]
            for box in res]

print(explode)

[[('fuel', 'ethanol', 'gas'),
  ('cert_region', 'FC', 'FC'),
  ('veh_class', 'SUV', 'SUV'),
  ('air_pollution', '6', '8'),
  ('city_mpg', '9', '14'),
  ('hwy_mpg', '15', '20'),
  ('cmb_mpg', '1', '16'),
  ('smartway', 'yes', 'yes')],
 [('fuel', 'ethanol', 'gas'),
  ('cert_region', 'FC', 'FC'),
  ('veh_class', 'SUV', 'SUV'),
  ('air_pollution', '6', '3'),
  ('city_mpg', '1', '14'),
  ('hwy_mpg', '14', '19'),
  ('cmb_mpg', '10', '16'),
  ('smartway', 'no', 'no')]]

Step 4: Now we can pair the keys, with respective values to get a dictionary:

result = {start[0] :(*start[1:],*end[1:])
          for start,end in zip(*explode)}

print(result)

{'fuel': ('ethanol', 'gas', 'ethanol', 'gas'),
 'cert_region': ('FC', 'FC', 'FC', 'FC'),
 'veh_class': ('SUV', 'SUV', 'SUV', 'SUV'),
 'air_pollution': ('6', '8', '6', '3'),
 'city_mpg': ('9', '14', '1', '14'),
 'hwy_mpg': ('15', '20', '14', '19'),
 'cmb_mpg': ('1', '16', '10', '16'),
 'smartway': ('yes', 'yes', 'no', 'no')}

Read result into dataframe:

pd.DataFrame(result)

    fuel    cert_region veh_class   air_pollution   city_mpg    hwy_mpg cmb_mpg smartway
0   ethanol     FC       SUV           6       9            15             1     yes
1   gas         FC       SUV           8       14           20             16    yes
2   ethanol     FC       SUV           6       1            14             10    no
3   gas         FC       SUV           3       14           19             16    no

109

answered Sep 19 '22 23:09

sammywemmy

I think you're better off constructing a new dataframe

result = pd.DataFrame(columns=[your_columns])
for index, series in df_08.iterrows():
    temp1 = {}
    temp2 = {}
    for key, value in dict(series).items():
        if '/' in value:
            val1, val2 = value.split('/')
            temp1[key] = [val1]
            temp2[key] = [val2]
        else:
            temp1[key] = temp2[key] = [value]

    result = pd.concat([result, pd.DataFrame(data=temp1), 
                        pd.DataFrame(data=temp2)], axis=0, ignore_index=True)

answered Sep 20 '22 23:09

Michael Hsi

Related questions
                            
                                Add scoped service when creating a new scope
                            
                                Is it possible to get full url (including origin) from route in VueJS?
                            
                                org.apache.druid.java.util.common.ISE: No default server found
                            
                                error: cannot bind non-const lvalue reference of type ‘bool&’ to an rvalue of type ‘bool’
                            
                                Beautifulsoup Python unable to scrape data from a website
                            
                                How to plot plotly gauge charts next to each other?
                            
                                getting each line text in navbar and add as id
                            
                                Creating a TimeseriesGenerator with multiple inputs
                            
                                what is the difference between if-else statement and torch.where in pytorch?
                            
                                Apache HttpClient failing with Java 11 on macOS
                            
                                Running async functions on Google Apps Script
                            
                                Swift Decodable parse part of JSON

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With