I have a pandas dataframe "df". In this dataframe I have multiple columns, one of which I have to substring. Lets say the column name is "col". I can run a "for" loop like below and substring the column:
for i in range(0,len(df)):
df.iloc[i].col = df.iloc[i].col[:9]
But I wanted to know, if there is an option where I don't have to use a "for" loop, and do it directly using an attribute.I have huge amount of data, and if I do this, the data will take a very long time process.
To select a single column, use square brackets [] with the column name of the column of interest.
Use the str
accessor with square brackets:
df['col'] = df['col'].str[:9]
Or str.slice:
df['col'] = df['col'].str.slice(0, 9)
In case the column isn't a string, use astype
to convert it:
df['col'] = df['col'].astype(str).str[:9]
As one doesn't know exactly OP's dataframe, one can create one to be used as test.
df = pd.DataFrame({'col': {0: '2020-12-08', 1: '2020-12-08', 2: '2020-12-08', 3: '2020-12-08', 4: '2020-12-08', 5: '2020-12-08', 6: '2020-12-08', 7: '2020-12-08', 8: '2020-12-08', 9: '2020-12-08'}})
[Out]:
col
0 2020-12-08
1 2020-12-08
2 2020-12-08
3 2020-12-08
4 2020-12-08
5 2020-12-08
6 2020-12-08
7 2020-12-08
8 2020-12-08
9 2020-12-08
Assuming one wants to store the column in the same dataframe df
, and that we want to keep only 4 characters, on a column called col_substring
, there are various options one can do.
Option 1
Using pandas.Series.str
df['col_substring'] = df['col'].str[:4]
[Out]:
col col_substring
0 2020-12-08 2020
1 2020-12-08 2020
2 2020-12-08 2020
3 2020-12-08 2020
4 2020-12-08 2020
5 2020-12-08 2020
6 2020-12-08 2020
7 2020-12-08 2020
8 2020-12-08 2020
9 2020-12-08 2020
Option 2
Using pandas.Series.str.slice
as follows
df['col_substring'] = df['col'].str.slice(0, 4)
[Out]:
col col_substring
0 2020-12-08 2020
1 2020-12-08 2020
2 2020-12-08 2020
3 2020-12-08 2020
4 2020-12-08 2020
5 2020-12-08 2020
6 2020-12-08 2020
7 2020-12-08 2020
8 2020-12-08 2020
9 2020-12-08 2020
or like this
df['col_substring'] = df['col'].str.slice(stop=4)
Option 3
Using a custom lambda function
df['col_substring'] = df['col'].apply(lambda x: x[:4])
[Out]:
col col_substring
0 2020-12-08 2020
1 2020-12-08 2020
2 2020-12-08 2020
3 2020-12-08 2020
4 2020-12-08 2020
5 2020-12-08 2020
6 2020-12-08 2020
7 2020-12-08 2020
8 2020-12-08 2020
9 2020-12-08 2020
Option 4
Using a custom lambda function with a regular expression (with re
)
import re
df['col_substring'] = df['col'].apply(lambda x: re.findall(r'^.{4}', x)[0])
[Out]:
col col_substring
0 2020-12-08 2020
1 2020-12-08 2020
2 2020-12-08 2020
3 2020-12-08 2020
4 2020-12-08 2020
5 2020-12-08 2020
6 2020-12-08 2020
7 2020-12-08 2020
8 2020-12-08 2020
9 2020-12-08 2020
Option 5
Using numpy.vectorize
df['col_substring'] = np.vectorize(lambda x: x[:4])(df['col'])
[Out]:
col col_substring
0 2020-12-08 2020
1 2020-12-08 2020
2 2020-12-08 2020
3 2020-12-08 2020
4 2020-12-08 2020
5 2020-12-08 2020
6 2020-12-08 2020
7 2020-12-08 2020
8 2020-12-08 2020
9 2020-12-08 2020
Note:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With