
Change one value based on another value in pandas

Tags: python, pandas

I'm trying to reproduce my Stata code in Python, and I was pointed in the direction of Pandas. I am, however, having a hard time wrapping my head around how to process the data.

Let's say I want to iterate over all values in the column headed 'ID'. If an ID matches a specific number, then I want to change the two corresponding values, FirstName and LastName.

In Stata it looks like this:

replace FirstName = "Matt" if ID==103
replace LastName =  "Jones" if ID==103

So this changes every value in FirstName to Matt in the rows where ID == 103.

In Pandas, I'm trying something like this:

import pandas as pd
df = pd.read_csv("test.csv")
for i in df['ID']:
    if i ==103:
          ...

Not sure where to go from here. Any ideas?

495

asked Oct 07 '22 06:10

Parseltongue

3 Answers

One option is to use pandas' indexing with a boolean mask to find the rows where your condition holds and overwrite the data there.

Assuming you can load your data directly into pandas with pandas.read_csv, the following code might be helpful for you.

import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"

As mentioned in the comments, you can also do the assignment to both columns in one shot:

df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'

Note that you'll need pandas version 0.11 or newer to use loc for overwrite assignment. In older versions such as 0.8, chained assignment is the correct way to do it (despite what critics of chained assignment may say), which is why it is useful to know about even though it should be avoided in more modern versions of pandas.
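To try the loc approach without a CSV file, here is a minimal sketch. The small in-memory frame and its values are assumptions standing in for test.csv, and the mask variable name is only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for test.csv, built in memory for illustration
df = pd.DataFrame({
    "ID": [101, 103, 104, 103],
    "FirstName": ["Ann", "x", "Bob", "y"],
    "LastName": ["Lee", "x", "Ray", "y"],
})

# Boolean mask: True for every row whose ID equals 103
mask = df["ID"] == 103

# Overwrite both columns on the matching rows in a single assignment
df.loc[mask, ["FirstName", "LastName"]] = ["Matt", "Jones"]

print(df)
```

Storing the mask in a variable avoids evaluating df.ID == 103 twice and keeps the condition in one place if it changes later.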

Another way to do it is to use what is called chained assignment. Its behavior is less stable, so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:

import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"

292

answered Oct 24 '22 09:10

ely

You can use map; it can map values from a dictionary or even a custom function.

Suppose this is your df:

    ID First_Name Last_Name
0  103          a         b
1  104          c         d

Create the dicts:

fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}

And map:

df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)

The result will be:

    ID First_Name Last_Name
0  103       Matt     Jones
1  104         Mr         X

Or use a custom function:

names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
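One thing to watch with map: an ID that is missing from the dict becomes NaN, and the lambda version raises a KeyError instead. If only some IDs should change, one possible sketch is to fall back to the existing value with fillna. The three-row frame and the mapped variable below are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [103, 104, 105],
    "First_Name": ["a", "c", "e"],
})

# Only ID 103 has a replacement; 104 and 105 are deliberately missing
fnames = {103: "Matt"}

# A plain map turns every unmapped ID into NaN
mapped = df["ID"].map(fnames)

# Fall back to the existing value wherever the dict had no entry
df["First_Name"] = mapped.fillna(df["First_Name"])

print(df)
```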

answered Oct 24 '22 10:10

Rutger Kassies

The original question addresses a specific, narrow use case. For those who need more generic answers, here are some examples:

Creating a new column using data from other columns

Given the dataframe below:

import pandas as pd
import numpy as np

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

In [1]: df
Out[1]:
  animal     type  age
----------------------
0    dog    hound    5
1    cat  ragdoll    1

Below we add a new description column as a concatenation of other columns, using the + operator, which is overloaded for Series. Fancy string formatting, f-strings, etc. won't work here, since the + operates element-wise on whole Series and not on individual 'primitive' values:

df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal

In [2]: df
Out[2]:
  animal     type  age                description
-------------------------------------------------
0    dog    hound    5    A 5 years old hound dog
1    cat  ragdoll    1  A 1 years old ragdoll cat

We get 1 years for the cat (instead of 1 year), which we will fix below using conditionals.
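If you would rather use an f-string, it becomes possible once you iterate over scalar values instead of whole Series. The sketch below is one way to do that with zip and a list comprehension; it deliberately reproduces the same output as the + version, including the "1 years" wording. The loop variable kind is just an illustrative name chosen to avoid shadowing Python's built-in type.

```python
import pandas as pd

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

# Build each description from scalar values, one row at a time
df['description'] = [
    f"A {age} years old {kind} {animal}"
    for animal, kind, age in zip(df['animal'], df['type'], df['age'])
]

print(df['description'].tolist())
```

This is a plain Python loop under the hood, so the vectorized + version is likely the better fit for large frames.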

Modifying an existing column with conditionals

Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:

# append 's' to 'year' if age is greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
    df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')

In [3]: df
Out[3]:
                 animal     type  age
-------------------------------------
0   dog, hound, 5 years    hound    5
1  cat, ragdoll, 1 year  ragdoll    1
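The same np.where pattern can also be applied to the original question. This is a sketch assuming a small in-memory frame in place of test.csv; is_target is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data
df = pd.DataFrame({
    "ID": [101, 103],
    "FirstName": ["Ann", "x"],
    "LastName": ["Lee", "y"],
})

is_target = df["ID"] == 103

# Where the condition holds take the new value, otherwise keep the old one
df["FirstName"] = np.where(is_target, "Matt", df["FirstName"])
df["LastName"] = np.where(is_target, "Jones", df["LastName"])

print(df)
```

Unlike df.loc assignment, np.where rebuilds the whole column, so the third argument is what preserves the rows that should stay unchanged.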

Modifying multiple columns with conditionals

A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:

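def transform_row(r):
    r = r.copy()  # work on a copy so the input row is left untouched
    animal = r['animal']  # keep the original value before it is overwritten
    r['animal'] = 'wild ' + r['type']
    r['type'] = animal + ' creature'
    r['age'] = "{} year{}".format(r['age'], 's' if r['age'] > 1 else '')
    return r

# apply returns a new DataFrame, so assign the result back
df = df.apply(transform_row, axis=1)

In [4]: df
Out[4]: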
         animal            type      age
----------------------------------------
0    wild hound    dog creature  5 years
1  wild ragdoll    cat creature   1 year

In the code above, the transform_row(r) function takes a Series object representing a given row (indicated by axis=1; the default, axis=0, passes a Series object for each column instead). This simplifies processing, since you can access the actual 'primitive' values in the row by column name and can see the other cells in the same row or column.
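To make the difference between the two axis values concrete, here is a small sketch on the original three-column frame. The lambdas and the names row_labels and col_first are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

# axis=1: the function receives one row at a time, indexed by column name
row_labels = df.apply(lambda r: f"{r['animal']}:{r['age']}", axis=1)

# axis=0 (the default): the function receives one column at a time,
# indexed by row label; here we take the first entry of each column
col_first = df.apply(lambda c: c.iloc[0])

print(row_labels.tolist())
print(col_first.to_dict())
```

With axis=1 the result has one entry per row; with axis=0 it has one entry per column.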

answered Oct 24 '22 10:10

ccpizza
