 

How to split a dataframe column into more columns, conditional to another column value?

I am stuck because I cannot split a dataframe column into more columns conditional on another column's value. I have a pandas dataframe which I generated straight from a '.csv' file with more than 100K rows.

Excerpt1:

(screenshot of the dataframe)

I want to split column dca by ',' (comma) into more columns. The number of splits will be constrained by the values in n_mppts.

Edited on 2023-04-12:

I could successfully perform the column-split operation on the dataframe generated from this .csv file with the following code (thanks to @Abdulmajeed's solution):

def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)

df_dca_dcv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418643 entries, 0 to 418642
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   pipe_id      418643 non-null  int64 
 1   date         418643 non-null  object
 2   inverter_id  418643 non-null  object
 3   n_mppts      418643 non-null  int64 
 4   dca          418538 non-null  object
 5   dcv          418538 non-null  object
dtypes: int64(2), object(4)
memory usage: 19.2+ MB

df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('{', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('}', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)
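As an aside, the two chained replace calls can be collapsed into a single str.strip, which removes the leading '{' and trailing '}' in one pass (the same works for '[' and ']'). A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical values in the '{...}' format from Excerpt 1
s = pd.Series(['{2.3,2.3}', '{2.6,2.6}', None])

# str.strip('{}') removes any leading/trailing braces;
# missing values simply stay missing
cleaned = s.str.strip('{}')
print(cleaned.tolist()[:2])  # ['2.3,2.3', '2.6,2.6']
```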

Excerpt2:

(screenshot of the dataframe)

mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
mppts_dca['dca_mppt_0'] = pd.to_numeric(mppts_dca[0], errors='coerce')
mppts_dca['dca_mppt_1'] = pd.to_numeric(mppts_dca[1], errors='coerce')
mppts_dca['dca_mppt_2'] = pd.to_numeric(mppts_dca[2], errors='coerce')
mppts_dca['dca_mppt_3'] = pd.to_numeric(mppts_dca[3], errors='coerce')
mppts_dca['dca_mppt_4'] = pd.to_numeric(mppts_dca[4], errors='coerce')
mppts_dca['dca_mppt_5'] = pd.to_numeric(mppts_dca[5], errors='coerce')
mppts_dca['dca_mppt_6'] = pd.to_numeric(mppts_dca[6], errors='coerce')
mppts_dca['dca_mppt_7'] = pd.to_numeric(mppts_dca[7], errors='coerce')
mppts_dca['dca_mppt_8'] = pd.to_numeric(mppts_dca[8], errors='coerce')
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
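The nine repeated to_numeric assignments above can also be written as a loop over the positional columns; a sketch on a small hypothetical split result:

```python
import pandas as pd

# Hypothetical split result: integer-labeled columns as produced by
# df.apply(split_dca, axis=1)
mppts_dca = pd.DataFrame({0: ['2.3', '6.0'], 1: ['2.3', 'x']})

# Convert each positional column and give it a descriptive name in one pass;
# tolist() snapshots the columns so adding new ones doesn't affect the loop
for i in mppts_dca.columns.tolist():
    mppts_dca[f'dca_mppt_{i}'] = pd.to_numeric(mppts_dca[i], errors='coerce')
print(mppts_dca['dca_mppt_0'].tolist())  # [2.3, 6.0]
```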

Excerpt3:

(screenshot of the dataframe)

However, I am facing a problem when I generate the dataframe from a pandas SQL query specifying inverter_id='a2', where the current solution won't succeed (the issue also persists with other inverter_id values):

df_dca_dcv = pd.read_sql_query("select pipe_id,created_at as date,inverter_id,n_mppts,dca,dcv from inverters where inverter_id = 'a2' order by pipe_id, inverter_id, date;", con=con) # connected to a PostgreSQL db
df_dca_dcv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16507 entries, 0 to 16506
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   pipe_id      16507 non-null  object        
 1   date         16507 non-null  datetime64[ns]
 2   inverter_id  16507 non-null  object        
 3   n_mppts      16507 non-null  int64         
 4   dca          16428 non-null  object        
 5   dcv          16428 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 773.9+ KB

Column dca's Dtype is still object, but now its values are wrapped in "[ ]" instead of "{ }" (unlike in Excerpt 1), and when I perform this:

df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('[', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace(']', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)

I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 2
      1 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("[", "")
----> 2 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("]", "")

File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

File ~\Anaconda3\lib\site-packages\pandas\core\accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
    179 if obj is None:
    180     # we're accessing the attribute of the class, i.e., Dataset.geo
    181     return self._accessor
--> 182 accessor_obj = self._accessor(obj)
    183 # Replace the property with the accessor object. Inspired by:
    184 # https://www.pydanny.com/cached-property.html
    185 # We need to use object.__setattr__ because we overwrite __setattr__ on
    186 # NDFrame
    187 object.__setattr__(obj, self._name, accessor_obj)

File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:177, in StringMethods.__init__(self, data)
    174 def __init__(self, data):
    175     from pandas.core.arrays.string_ import StringDtype
--> 177     self._inferred_dtype = self._validate(data)
    178     self._is_categorical = is_categorical_dtype(data.dtype)
    179     self._is_string = isinstance(data.dtype, StringDtype)

File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:231, in StringMethods._validate(data)
    228 inferred_dtype = lib.infer_dtype(values, skipna=True)
    230 if inferred_dtype not in allowed_types:
--> 231     raise AttributeError("Can only use .str accessor with string values!")
    232 return inferred_dtype

AttributeError: Can only use .str accessor with string values!

I then moved the ".astype(str)" operation to before the ".str.replace(...)" operations. However, when I look at the dataframe now

Excerpt4:

(screenshot of the dataframe)

the dca column values are not in the same format as in Excerpt 2 (e.g. "Decimal('2.2'),Decimal('2.2')..."). When I go ahead and execute

mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
df_dca_dcv['date'] = df_dca_dcv['date'].astype('datetime64[ns]')
df_dca_dcv['dca_mppt_0'] = pd.to_numeric(df_dca_dcv[0], errors='coerce')
df_dca_dcv['dca_mppt_1'] = pd.to_numeric(df_dca_dcv[1], errors='coerce')

the dca values are not passed to the newly split columns, which (I suppose) is because "pd.to_numeric(" can't read "Decimal(...)":

Excerpt5:

(screenshot of the dataframe)
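Since read_sql_query hands back each dca cell as a Python list of decimal.Decimal objects (or None), one way to avoid the string round-trip entirely is to convert the lists to floats directly. A sketch, assuming the cells really are lists or None:

```python
from decimal import Decimal
import pandas as pd

# Hypothetical cells as returned by the SQL query: lists of Decimal, or None
s = pd.Series([[Decimal('2.3'), Decimal('2.3')], None])

# Convert each list elementwise; treat None as an empty list so downstream
# padding logic (like split_dca) still has something to work with
floats = s.apply(lambda lst: [float(x) for x in lst] if lst is not None else [])
print(floats.tolist())  # [[2.3, 2.3], []]
```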

I've tried all the following methods to convert dca column to string:

METHOD1: df_dca_dcv['dca'] = df_dca_dcv['dca'].map(str) #produced same output format as before
METHOD2: df_dca_dcv['dca'] = df_dca_dcv['dca'].apply(str) #produced same output format as before
METHOD3: df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str) #generated the following error:

ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str)
ValueError: setting an array element with a sequence

METHOD4: df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str) #generated same error as METHOD3

METHOD5: df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str)  #generated the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str)
File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'applymap'

METHOD6:
def convert_float_string(row):
    float_list = row['dca']
    if len(float_list) > 0:
        string_list = ["%.2f" % i for i in float_list]
    else:
        string_list = float('NaN')
    return string_list
df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1) #generated the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
File ~\Anaconda3\lib\site-packages\pandas\core\frame.py:8839, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   8828 from pandas.core.apply import frame_apply
   8830 op = frame_apply(
   8831     self,
   8832     func=func,
   (...)
   8837     kwargs=kwargs,
   8838 )
-> 8839 return op.apply().__finalize__(self, method="apply")
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:727, in FrameApply.apply(self)
    724 elif self.raw:
    725     return self.apply_raw()
--> 727 return self.apply_standard()
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:851, in FrameApply.apply_standard(self)
    850 def apply_standard(self):
--> 851     results, res_index = self.apply_series_generator()
    853     # wrap results
    854     return self.wrap_results(results, res_index)
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:867, in FrameApply.apply_series_generator(self)
    864 with option_context("mode.chained_assignment", None):
    865     for i, v in enumerate(series_gen):
    866         # ignore SettingWithCopy here in case the user mutates
--> 867         results[i] = self.f(v)
    868         if isinstance(results[i], ABCSeries):
    869             # If we have a view on v, we need to make a copy because
    870             #  series_generator will swap out the underlying data
    871             results[i] = results[i].copy(deep=False)
Cell In[8], line 1, in <lambda>(row)
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
Cell In[6], line 3, in convert_float_string(row)
      1 def convert_float_string(row):
      2     float_list = row['dca']
----> 3     if len(float_list) > 0:
      4         string_list = ["%.2f" % i for i in float_list]
      5     else:
TypeError: object of type 'NoneType' has no len()

...and if I simply skip converting dca to string and use

df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("[", "")
df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("]", "")

the replacement does not take place.
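That is expected: without regex=True, Series.replace matches whole cell values rather than substrings, so a cell like '[2.3]' is never equal to '[' and nothing changes. A minimal illustration:

```python
import pandas as pd

s = pd.Series(['[2.3]', '[2.6]'])

# Series.replace compares the entire cell value against '[' -- no match
untouched = s.replace('[', '')
# str.replace works on substrings; regex=False because '[' is a regex metachar
stripped = s.str.replace('[', '', regex=False).str.replace(']', '', regex=False)
print(untouched.tolist())  # ['[2.3]', '[2.6]']
print(stripped.tolist())   # ['2.3', '2.6']
```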

I would appreciate any suggestions on how to fix that issue.

Asked Feb 12 '26 by NigelBlainey
2 Answers

I am not sure if I understood your question correctly, but you can use a custom function with apply. Example below; hope it helps:

def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)

split_columns = df.apply(split_dca, axis=1)
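For example, on a small hypothetical frame (dca already stripped of braces), rows shorter than n_mppts get padded with NaN and longer ones get truncated:

```python
import pandas as pd

def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)

# Hypothetical sample data
df = pd.DataFrame({'dca': ['2.3,2.3', '6.0,5.9,4.1', None],
                   'n_mppts': [3, 2, 2]})
split_columns = df.apply(split_dca, axis=1)
# Row 0 is padded to 3 values, row 1 truncated to 2, the None row is all-NaN
print(split_columns.shape)  # (3, 3)
```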
Answered Feb 15 '26 by Abdulmajeed


Update

They can have 0 to 11 elements, and the split operation should keep only the first n elements from left to right, where n = row['n_mppts']

Since dca has variable length, you can use this code:

import numpy as np

# Part 0: fix special cases
mask = df['dca'].isna()
df.loc[mask, 'dca'] = df.loc[mask, 'dca'].apply(lambda x: [])
lens = df['dca'].str.len().values  # get the length of each array
n_mppts = df['n_mppts'].mask(df['n_mppts'].gt(lens), lens)

# Part 1: pad each array to be stacked
nrows, ncols = len(df), int(lens.max())
dca = np.zeros((nrows, ncols))  # create a 0s target array
mask = lens[:, None] > np.arange(ncols)
dca[mask] = np.concatenate(df['dca']).astype(float)  # copy data

# Part 2: keep values according n_mppts
mask = n_mppts.values[:, None] <= np.arange(ncols)
dca[mask] = np.nan
dca_df = pd.DataFrame(dca).add_prefix('dca_mppt_')
dca_df

Output:

   dca_mppt_0  dca_mppt_1  dca_mppt_2  dca_mppt_3  dca_mppt_4  dca_mppt_5  dca_mppt_6  dca_mppt_7  dca_mppt_8  dca_mppt_9  dca_mppt_10
0         2.3         2.3         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
1         2.6         2.6         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
2         2.9         2.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
3         6.0         5.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
4         3.9         3.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN

What you receive from read_sql_query is a list of Decimal instances.

It seems dca and dcv each hold 11 items. You can use numpy to get your expected output in a vectorized way:

import numpy as np

dca = np.vstack(df['dca']).astype(float)
mask = df['n_mppts'].values[:, None] <= np.arange(dca.shape[1])
dca[mask] = np.nan
dca_df = pd.DataFrame(dca).add_prefix('dca_mppt_')

Output:

>>> dca_df
   dca_mppt_0  dca_mppt_1  dca_mppt_2  dca_mppt_3  dca_mppt_4  dca_mppt_5  dca_mppt_6  dca_mppt_7  dca_mppt_8  dca_mppt_9  dca_mppt_10
0         2.3         2.3         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
1         2.6         2.6         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
2         2.9         2.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
3         6.0         5.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
4         3.9         3.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
Answered Feb 15 '26 by Corralien


