
How to replace certain values in a pandas column with the mean column value of similar rows?

The Problem

I currently have a pandas dataframe with property information from this Kaggle dataset. The following is an example dataframe from that set:

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Annadale      | 5       | 5425  | 2015       | ... |
| Woodside      | 4       | 2327  | 1966       | ... |
| Alphabet City | 1       | 396   | 1985       | ... |
| Alphabet City | 1       | 405   | 1996       | ... |
| Alphabet City | 1       | 396   | 1986       | ... |
| Alphabet City | 1       | 396   | 1992       | ... |
| Alphabet City | 1       | 396   | 0          | ... |
| Alphabet City | 1       | 396   | 1990       | ... |
| Alphabet City | 1       | 396   | 1984       | ... |
| Alphabet City | 1       | 396   | 0          | ... |

What I want to do is take every row where the value in the "year built" column equals zero and replace that value with the mean of the non-zero "year built" values in the rows with the same neighborhood, borough, and block. In some cases, multiple rows within a {neighborhood, borough, block} set have a zero in the "year built" column, as the example dataframe above shows.

To illustrate the problem, I put these two rows in the example dataframe:

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 0          | ... |
| Alphabet City | 1       | 396   | 0          | ... |

To solve the problem, I want to fill the "year built" value in each of those rows with the mean of the "year built" values from all the other rows that have the same neighborhood, borough, and block. For the example rows the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would use the following matching rows from the example dataframe to calculate the mean:

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 1985       | ... |
| Alphabet City | 1       | 396   | 1986       | ... |
| Alphabet City | 1       | 396   | 1992       | ... |
| Alphabet City | 1       | 396   | 1990       | ... |
| Alphabet City | 1       | 396   | 1984       | ... |

I would take the mean of the "year built" column from those rows (which is 1987.4) and replace the zeros with the mean. The rows that originally had zeros would end up looking like this:

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 1987.4     | ... |
| Alphabet City | 1       | 396   | 1987.4     | ... |
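
The figure of 1987.4 quoted above can be checked by hand. A minimal sketch in plain Python, using only the five non-zero years listed for block 396 (the variable names are illustrative):

```python
# Non-zero "year built" values for {Alphabet City, 1, 396} from the example
years = [1985, 1986, 1992, 1990, 1984]

# Arithmetic mean: 9937 / 5
mean_year = sum(years) / len(years)
print(mean_year)  # 1987.4
```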

The code I have so far

All I've managed to do so far is chop out rows with zeros in the "year built" column and find the mean year of every {neighborhood, borough, block} set. The original dataframe is stored in raw_data and it looks like the example dataframe at the very top of this post. The code looks like this:

# create a copy of the data
temp_data = raw_data.copy()

# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]

# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD","BOROUGH","BLOCK"], as_index = False)["YEAR BUILT"].mean()

and the output looks like this:

| neighborhood  | borough | block | year built | 
------------------------------------------------
| ....          | ...     | ...   | ...        |
| Alphabet City | 1       | 390   | 1985.342   | 
| Alphabet City | 1       | 391   | 1986.76    | 
| Alphabet City | 1       | 392   | 1992.8473  | 
| Alphabet City | 1       | 393   | 1990.096   | 
| Alphabet City | 1       | 394   | 1984.45    | 

So how can I take those average "year built" values from the mean_year_by_location dataframe and replace the zeros in the original raw_data dataframe?

I apologize for the long post. I just wanted to be really clear.

kpsgf7 asked Mar 08 '23 04:03



2 Answers

Use set_index + replace, and then fillna with the per-group mean.

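import numpy as np

# Index by the grouping keys and mark the placeholder zeros as missing
v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)

# Fill each NaN with its group's mean (Series.mean(level=...) was removed in pandas 2.0)
df = v.fillna(v.groupby(level=[0, 1, 2]).mean()).reset_index()
df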

    neighborhood  borough  block  year built
0       Annadale        5   5425      2015.0
1       Woodside        4   2327      1966.0
2  Alphabet City        1    396      1985.0
3  Alphabet City        1    405      1996.0
4  Alphabet City        1    396      1986.0
5  Alphabet City        1    396      1992.0
6  Alphabet City        1    396      1987.4
7  Alphabet City        1    396      1990.0
8  Alphabet City        1    396      1984.0
9  Alphabet City        1    396      1987.4

Details

First, set the index, and replace 0s with NaNs so that the forthcoming mean calculation is not affected by these values -

v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)   

v 

neighborhood   borough  block
Annadale       5        5425     2015.0
Woodside       4        2327     1966.0
Alphabet City  1        396      1985.0
                        405      1996.0
                        396      1986.0
                        396      1992.0
                        396         NaN
                        396      1990.0
                        396      1984.0
                        396         NaN
Name: year built, dtype: float64

Next, calculate the mean -

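# groupby(level=...) replaces v.mean(level=...), which was removed in pandas 2.0;
# sort=False keeps the groups in order of first appearance
m = v.groupby(level=[0, 1, 2], sort=False).mean()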
m

neighborhood   borough  block
Annadale       5        5425     2015.0
Woodside       4        2327     1966.0
Alphabet City  1        396      1987.4
                        405      1996.0
Name: year built, dtype: float64

This serves as a mapping, which we'll pass to fillna. fillna replaces the NaNs introduced earlier with the corresponding mean values, matched by index. Once that's done, just reset the index to get our original structure back.

v.fillna(m).reset_index()

    neighborhood  borough  block  year built
0       Annadale        5   5425      2015.0
1       Woodside        4   2327      1966.0
2  Alphabet City        1    396      1985.0
3  Alphabet City        1    405      1996.0
4  Alphabet City        1    396      1986.0
5  Alphabet City        1    396      1992.0
6  Alphabet City        1    396      1987.4
7  Alphabet City        1    396      1990.0
8  Alphabet City        1    396      1984.0
9  Alphabet City        1    396      1987.4
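
A possible variant of the same idea, sketched here with groupby + transform so that no index juggling is needed. The toy frame below just re-creates the example rows from the question, and the names keys, years, group_mean and filled are illustrative, not part of the original answer:

```python
import pandas as pd

# Toy frame mirroring the example rows from the question
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

keys = ["neighborhood", "borough", "block"]

# Turn the placeholder zeros into NaN so they are skipped by the mean
years = df["year built"].where(df["year built"] != 0)

# transform("mean") returns one value per original row, aligned to df's index
group_mean = years.groupby([df[k] for k in keys]).transform("mean")

# Fill only the missing years; recorded years are left untouched
filled = df.assign(**{"year built": years.fillna(group_mean)})
print(filled)
```

Note that a group whose years are all zero has no mean to fall back on, so its rows stay NaN with this sketch (and with the fillna approach above).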
cs95 answered May 10 '23 14:05



I'll use mask within a groupby.apply. I only do this because I like the way it flows. I don't make any claims for it being particularly speedy. Nevertheless, this answer may provide some perspective on what alternatives may be possible.

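gidx = ['neighborhood', 'borough', 'block']

def fill_with_mask(s):
    # Mean of the non-zero years in this group
    mean = s.loc[lambda x: x != 0].mean()
    # Swap each zero for that mean; other values pass through unchanged
    return s.mask(s.eq(0), mean)

# group_keys=False keeps the result on the original row index
# (pandas 2.0+ would otherwise prepend the group keys)
df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)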

0    2015.0
1    1966.0
2    1985.0
3    1996.0
4    1986.0
5    1992.0
6    1987.4
7    1990.0
8    1984.0
9    1987.4
Name: year built, dtype: float64

We can then create a copy of the dataframe with pd.DataFrame.assign

df.assign(**{'year built': df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)})

    neighborhood  borough  block  year built
0       Annadale        5   5425      2015.0
1       Woodside        4   2327      1966.0
2  Alphabet City        1    396      1985.0
3  Alphabet City        1    405      1996.0
4  Alphabet City        1    396      1986.0
5  Alphabet City        1    396      1992.0
6  Alphabet City        1    396      1987.4
7  Alphabet City        1    396      1990.0
8  Alphabet City        1    396      1984.0
9  Alphabet City        1    396      1987.4

The same task could've been done in place with column assignment:

df['year built'] = df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)

Or

df.update(df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask))
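
Another option, closer to the code already in the question, is to merge the mean_year_by_location lookup table back onto raw_data. This is only a sketch under assumptions: the small raw_data frame below is made up for illustration, and the "MEAN YEAR" helper column, merged and result are hypothetical names that do not appear in the original post.

```python
import pandas as pd

# Made-up sample using the upper-case column names from the question's code
raw_data = pd.DataFrame({
    "NEIGHBORHOOD": ["Alphabet City"] * 4 + ["Woodside"],
    "BOROUGH": [1, 1, 1, 1, 4],
    "BLOCK": [396, 396, 396, 405, 2327],
    "YEAR BUILT": [1985, 1990, 0, 1996, 1966],
})
keys = ["NEIGHBORHOOD", "BOROUGH", "BLOCK"]

# Same lookup table as in the question: per-group mean over the non-zero years
mean_year_by_location = (
    raw_data[raw_data["YEAR BUILT"] > 0]
    .groupby(keys, as_index=False)["YEAR BUILT"]
    .mean()
    .rename(columns={"YEAR BUILT": "MEAN YEAR"})  # hypothetical helper column
)

# Left merge keeps every original row, in the original order
merged = raw_data.merge(mean_year_by_location, on=keys, how="left")

# Keep the recorded year where it is non-zero, otherwise take the group mean
merged["YEAR BUILT"] = merged["YEAR BUILT"].where(
    merged["YEAR BUILT"] != 0, merged["MEAN YEAR"]
)
result = merged.drop(columns="MEAN YEAR")
print(result)
```

The merge costs an extra temporary column, but it reuses the lookup table as-is, which may be convenient if that table is needed elsewhere.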
piRSquared answered May 10 '23 15:05
