Changing values in pandas dataframe does not work

Tags: pandas, dataframe

I'm having a problem changing values in a DataFrame. I would also like advice on a problem I need to solve and the proper way to use pandas to solve it. I'd appreciate help on both. I have a file containing information about how well audio files match speakers. The file looks something like this:

     wave_path                  spk_name  spk_example#  score  mark  comments  isUsed
190  122_65_02.04.51.800.wav    idoD      idoD          88     NaN   NaN       False
191  121_110_20.17.27.400.wav   idoD      idoD          87     NaN   NaN       False
192  121_111_00.34.57.300.wav   idoD      idoD          87     NaN   NaN       False
193  103_31_18.59.12.800.wav    idoD      idoD_0        99     HIT   VP        False
194  131_101_02.08.06.500.wav   idoD      idoD_0        96     HIT   VP        False

What I need to do is a somewhat sophisticated kind of counting. I need to group the results by speaker and compute a value for each speaker. I then proceed with the speaker that got the best result, but before proceeding I need to mark all the files that were used in the calculation as used, i.e. set isUsed to True in every row where they appear (files can appear more than once). Then I run another iteration: compute for each speaker, mark the used files, and so on until no speakers are left.

I thought a lot about how to implement this process using pandas. It is quite easy to implement in plain Python, but that would take a lot of looping and data structuring, which I guess would slow the process down significantly. I'm also using this process to learn pandas' abilities more deeply.

I came up with the following solution. As preparation, I'll group by speaker name and set the file name as the index with the set_index method. I will then iterate over the groupby object and apply the calculation function, which will return the selected speaker and the files to be marked as used.

Then I'll iterate over the files and mark them as used (this should be fast and simple since I set them as the index beforehand), and so on until I finish calculating.

First, I'm not sure about this solution, so feel free to tell me your thoughts on it. Second, I tried implementing it and ran into trouble:

First I indexed by file name, no problem here:

In [53]:

    marked_results['isUsed'] = False
    ind_res = marked_results.set_index('wave_path')
    ind_res.head()

Out[53]:
                              spk_name  spk_example#  score  mark  comments  isUsed
    wave_path
    103_31_18.59.12.800.wav   idoD      idoD          99     HIT   VP        False
    131_101_02.08.06.500.wav  idoD      idoD          99     HIT   VP        False
    144_35_22.46.38.700.wav   idoD      idoD          96     HIT   VP        False
    41_09_17.10.11.700.wav    idoD      idoD          93     HIT   TEST      False
    122_188_03.19.20.400.wav  idoD      idoD          93     NaN   NaN       False

Then I chose a file and checked that I get the entries relevant to that file:

In [54]:

    example_file = ind_res.index[0];
    ind_res.ix[example_file]

Out[54]:
                             spk_name  spk_example#  score  mark  comments  isUsed
    wave_path
    103_31_18.59.12.800.wav  idoD      idoD          99     HIT   VP        False
    103_31_18.59.12.800.wav  idoD      idoD_0        99     HIT   VP        False
    103_31_18.59.12.800.wav  idoD      idoD_1        97     HIT   VP        False
    103_31_18.59.12.800.wav  idoD      idoD_2        95     HIT   VP        False

No problems here either. Then I tried to change the isUsed value for that file to True, and that's where I ran into the problem:

In [56]:

    ind_res.ix[example_file]['isUsed'] = True
    ind_res.ix[example_file].isUsed = True
    ind_res.ix[example_file]

Out[56]:
                             spk_name  spk_example#  score  mark  comments  isUsed
    wave_path
    103_31_18.59.12.800.wav  idoD      idoD          99     HIT   VP        False
    103_31_18.59.12.800.wav  idoD      idoD_0        99     HIT   VP        False
    103_31_18.59.12.800.wav  idoD      idoD_1        97     HIT   VP        False
    103_31_18.59.12.800.wav  idoD      idoD_2        95     HIT   VP        False

So, you see the problem: nothing has changed. What am I doing wrong? Should the problem described above be solved using pandas at all?

Also: how can I access a specific group in a groupby object? I ask because I thought that, instead of setting the files as the index, I could group by file and then use that groupby object to apply a modifying function to all of a file's occurrences. But I didn't find a way to access a specific group by passing the group name as a parameter, and calling apply on all the groups and then acting on only one of them didn't seem "right" to me.

I hope this is not too long... :)

asked Aug 01 '13 13:08

idoda

1 Answer

Indexing pandas objects can return two fundamentally different objects: a view or a copy.

If mask is a basic slice, then df.ix[mask] returns a view of df. Views share the same underlying data as the original object (df), so modifying the view also modifies the original object.

If mask is something more complicated, such as an arbitrary sequence of indices, then df.ix[mask] returns a copy of some rows in df. Modifying the copy has no effect on the original.

In your case, since the rows which share the same wave_path occur at arbitrary locations, ind_res.ix[example_file] returns a copy. So

ind_res.ix[example_file]['isUsed'] = True

has no effect on ind_res.

Instead, you could use

ind_res.ix[example_file, 'isUsed'] = True

to modify ind_res. However, see below for a groupby suggestion which I think might be closer to what you really want.
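Note that .ix was removed in pandas 1.0; in current pandas, .loc is the label-based equivalent. The following is a minimal sketch of the same single-call assignment using .loc. The small frame here is a made-up stand-in for ind_res (the file names and scores are illustrative assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical stand-in for ind_res: the index contains a repeated label,
# just as one wave_path can appear in several rows of the real data.
ind_res = pd.DataFrame(
    {"score": [99, 88, 97], "isUsed": [False, False, False]},
    index=pd.Index(["a.wav", "b.wav", "a.wav"], name="wave_path"),
)

example_file = ind_res.index[0]

# Row label and column name go into ONE indexing call, so the assignment
# is applied to ind_res itself rather than to a temporary object.
ind_res.loc[example_file, "isUsed"] = True

print(ind_res)
```

Every row labelled with example_file is updated, while the other rows keep isUsed as False.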

Jeff has already provided a link to the Pandas docs which state that

The rules about when a view on the data is returned are entirely dependent on NumPy.

Here are the (complicated) rules which describe when a view or a copy is returned. Basically, however, the rule is this: if the index requests a regularly spaced slice of the underlying array, a view is returned; otherwise a copy (out of necessity) is returned.

Here is a simple example which uses a basic slice. A view is returned by df.ix, so modifying subdf modifies df as well:
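At the NumPy level this rule can be checked directly. The sketch below uses np.shares_memory on a small made-up array to compare a basic slice with an integer-list ("fancy") index; the variable names are only for illustration:

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

basic = arr[0:2]      # basic slice: regularly spaced rows, so a view
fancy = arr[[0, 2]]   # integer-list index: arbitrary rows, so a copy

print(np.shares_memory(arr, basic))  # the view shares arr's buffer
print(np.shares_memory(arr, fancy))  # the copy does not

basic[0, 0] = 100     # writes through to arr
fancy[1, 0] = -1      # changes only the copy; arr[2, 0] is untouched
print(arr)
```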

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4,3), 
         columns=list('ABC'), index=[0,1,2,3])

subdf = df.ix[0]
print(subdf.values)
# [0 1 2]
subdf.values[0] = 100
print(subdf)
# A    100
# B      1
# C      2
# Name: 0, dtype: int32

print(df)           # df is modified
#      A   B   C
# 0  100   1   2
# 1    3   4   5
# 2    6   7   8
# 3    9  10  11

Here is a simple example which uses "fancy indexing" (arbitrary rows selected). A copy is returned by df.ix. So modifying subdf does not affect df.

df = pd.DataFrame(np.arange(12).reshape(4,3), 
         columns=list('ABC'), index=[0,1,0,3])

subdf = df.ix[0]
print(subdf.values)
# [[0 1 2]
#  [6 7 8]]

subdf.values[0] = 100
print(subdf)
#      A    B    C
# 0  100  100  100
# 0    6    7    8

print(df)          # df is NOT modified
#    A   B   C
# 0  0   1   2
# 1  3   4   5
# 0  6   7   8
# 3  9  10  11

Notice that the only difference between the two examples is that in the first, where a view is returned, the index was [0,1,2,3], whereas in the second, where a copy is returned, the index was [0,1,0,3].

Since we are selecting rows where the index is 0, in the first example we can do that with a basic slice. In the second example, the rows where the index equals 0 could appear at arbitrary locations, so a copy has to be returned.

Despite having ranted on about the subtlety of Pandas/NumPy slicing, I really don't think that

ind_res.ix[example_file, 'isUsed'] = True

is what you are ultimately looking for. You probably want to do something more like

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4,3), 
                  columns=list('ABC'))
df['A'] = df['A']%2
print(df)
#    A   B   C
# 0  0   1   2
# 1  1   4   5
# 2  0   7   8
# 3  1  10  11

def calculation(grp):
    grp['C'] = True
    return grp

newdf = df.groupby('A').apply(calculation)
print(newdf)

which yields

   A   B     C
0  0   1  True
1  1   4  True
2  0   7  True
3  1  10  True
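For the side question about reaching one particular group, and for marking files as used without setting them as the index, here is a hedged sketch. It uses the question's column names, but the data, the chosen speaker, and the "files picked by the calculation" are all made-up assumptions standing in for the real calculation:

```python
import pandas as pd

# Toy data (assumed for illustration); a file may appear in several rows.
results = pd.DataFrame({
    "wave_path": ["f1.wav", "f2.wav", "f1.wav", "f3.wav"],
    "spk_name": ["idoD", "idoD", "spkB", "spkB"],
    "score": [99, 87, 95, 90],
    "isUsed": [False, False, False, False],
})

grouped = results.groupby("spk_name")

# get_group returns the rows of a single group, selected by its key.
one_speaker = grouped.get_group("idoD")

# Placeholder for the real calculation: pretend it used all of this
# speaker's files.
used_files = one_speaker["wave_path"].unique()

# Boolean mask plus column name in one .loc call: every row whose file was
# used is marked, wherever it appears and whichever speaker it belongs to.
results.loc[results["wave_path"].isin(used_files), "isUsed"] = True

print(results)
```

Because the mask is built from the wave_path column, f1.wav is also marked in the row that belongs to the other speaker, which matches the requirement that a file be marked in every row where it appears.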

answered Oct 24 '22 22:10

unutbu
