substring of an entire column in pandas dataframe

I have a pandas dataframe "df" with multiple columns, one of which I need to take a substring of. Let's say the column name is "col". I can run a "for" loop like the one below to substring the column:

for i in range(0,len(df)):
  df.iloc[i].col = df.iloc[i].col[:9]

But I wanted to know if there is an option where I don't have to use a "for" loop and can do it directly using an attribute. I have a huge amount of data, and if I do it this way, it will take a very long time to process.

Use the str accessor with square brackets:

df['col'] = df['col'].str[:9]

Or str.slice:

df['col'] = df['col'].str.slice(0, 9)

In case the column isn't a string, use astype to convert it:

df['col'] = df['col'].astype(str).str[:9]
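As a minimal sketch of how these three variants behave, the example below uses made-up sample data (the column "num" and all values are hypothetical, not taken from the question). It suggests that the str accessor leaves missing values missing, and that a non-string column needs the astype conversion first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the asker's actual dataframe
df = pd.DataFrame({
    "col": ["2020-12-08 10:15", "2021-01-31 23:59", np.nan],
    "num": [20201208, 20210131, 20220715],
})

first_nine = df["col"].str[:9]                  # missing values stay missing
same_thing = df["col"].str.slice(0, 9)          # equivalent to str[:9]
year_from_int = df["num"].astype(str).str[:4]   # convert the integers to strings first

print(first_nine.tolist())
print(year_from_int.tolist())
```

All three run on the whole column at once, so no explicit "for" loop is needed.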

Since OP's exact dataframe is unknown, one can create one to use as a test.

import pandas as pd

df = pd.DataFrame({'col': {0: '2020-12-08', 1: '2020-12-08', 2: '2020-12-08', 3: '2020-12-08', 4: '2020-12-08', 5: '2020-12-08', 6: '2020-12-08', 7: '2020-12-08', 8: '2020-12-08', 9: '2020-12-08'}})

[Out]:
          col
0  2020-12-08
1  2020-12-08
2  2020-12-08
3  2020-12-08
4  2020-12-08
5  2020-12-08
6  2020-12-08
7  2020-12-08
8  2020-12-08
9  2020-12-08

Assuming one wants to store the result in the same dataframe df, in a new column called col_substring, and keep only the first 4 characters, there are various options.

Option 1

Using pandas.Series.str

df['col_substring'] = df['col'].str[:4]

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Option 2

Using pandas.Series.str.slice as follows

df['col_substring'] = df['col'].str.slice(0, 4)

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

or, equivalently, passing only stop:

df['col_substring'] = df['col'].str.slice(stop=4)
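The start and stop arguments of str.slice can also pick out a substring from the middle or the end of each value. A small sketch, assuming date-like strings such as the ones in the test dataframe (the Series "s" here is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical date-like strings for illustration
s = pd.Series(["2020-12-08", "2021-01-31"])

year = s.str.slice(stop=4)    # same as s.str[:4]
month = s.str.slice(5, 7)     # same as s.str[5:7]
day = s.str.slice(start=8)    # same as s.str[8:]

print(year.tolist(), month.tolist(), day.tolist())
```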

Option 3

Using a custom lambda function

df['col_substring'] = df['col'].apply(lambda x: x[:4])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Option 4

Using a custom lambda function with a regular expression (with re)

import re

df['col_substring'] = df['col'].apply(lambda x: re.findall(r'^.{4}', x)[0])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020
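One caveat with Option 4: re.findall returns an empty list for a value shorter than 4 characters, so indexing with [0] raises an IndexError. If that can happen in the data, one possible alternative (a swapped-in technique, not part of the original answer) is the str accessor's extract method, which yields a missing value for rows that do not match. The sample Series below is hypothetical.

```python
import pandas as pd

# Hypothetical data including a value shorter than 4 characters
s = pd.Series(["2020-12-08", "abc"])

# str.extract returns the first capture group; non-matching rows become missing
prefix = s.str.extract(r"^(.{4})", expand=False)

# Plain slicing, by contrast, keeps whatever characters are available
sliced = s.str[:4]

print(prefix.tolist())
print(sliced.tolist())
```

Which behaviour is preferable depends on whether short values should be kept as they are or flagged as missing.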

Option 5

Using numpy.vectorize

import numpy as np

df['col_substring'] = np.vectorize(lambda x: x[:4])(df['col'])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Note:

The ideal solution would depend on the use case, constraints, and the dataframe.
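Since the question is about speed on a large dataframe, it may be worth timing the options on data that resembles your own. The following is only a rough sketch: the row count, the number of repetitions, and the names "candidates" and "results" are arbitrary choices for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test column: 10,000 identical date strings
df = pd.DataFrame({"col": ["2020-12-08"] * 10_000})

# Each candidate returns the first 4 characters of every value
candidates = {
    "str[:4]": lambda: df["col"].str[:4],
    "str.slice": lambda: df["col"].str.slice(0, 4),
    "apply": lambda: df["col"].apply(lambda x: x[:4]),
    "np.vectorize": lambda: np.vectorize(lambda x: x[:4])(df["col"]),
}

# Run each candidate once to collect its output for comparison
results = {name: fn() for name, fn in candidates.items()}

# Time each candidate over a few repetitions
repeats = 5
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{name:>12}: {seconds / repeats:.5f} s per run")
```

Note that apply and numpy.vectorize still call a Python function once per element, so they are conveniences rather than true vectorization; the measured ranking on your own data is what matters.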

Tags:

python

pandas

dataframe

thenakulchawla

3 Answers

ayhan

Elton da Mata

Gonçalo Peres
