Get first letter of a string from column

Tags:

pandas

I'm fighting with pandas and for now I'm loosing. I have source table similar to this:

import pandas as pd

a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']

I would like to add new column to this data frame with first digit from values in column 'First': a) change number to string from column 'First' b) extracting first character from newly created string c) Results from b save as new column in data frame

I don't know how to apply this to the pandas data frame object. I would be grateful for helping me with that.

685

asked Feb 22 '16 11:02

michalk

2 Answers

Cast the dtype of the col to str and you can perform vectorised slicing calling str:

In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df

Out[29]:
   First  Second new_col
0    123     234       1
1     22    4353       2
2     32     355       3
3    453     453       4
4     45     345       4
5    453     453       4
6     56      56       5

if you need to you can cast the dtype back again calling astype(int) on the column

128

answered Oct 19 '22 02:10

EdChum

`.str.get`

This is the simplest to specify string methods

# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df

        A    B
0     xyz  123
1     abc  456
2  foobar  789

df.dtypes

A    object
B     int64
dtype: object

For string (read:object) type columns, use

df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)

.str handles NaNs by returning NaN as the output.

For non-numeric columns, an .astype conversion is required beforehand, as shown in @Ed Chum's answer.

# Note that this won't work well if the data has NaNs. 
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]

df
        A    B  C  D
0     xyz  123  x  1
1     abc  456  a  4
2  foobar  789  f  7

List Comprehension and Indexing

There is enough evidence to suggest a simple list comprehension will work well here and probably be faster.

# For string columns
df['C'] = [x[0] for x in df['A']]

# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]

df
        A    B  C  D
0     xyz  123  x  1
1     abc  456  a  4
2  foobar  789  f  7

If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension,

df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2

        A      B
0     xyz  123.0
1     NaN  456.0
2  foobar    NaN

# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]

# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]

        A      B    C    D
0     xyz  123.0    x    1
1     NaN  456.0  NaN    4
2  foobar    NaN    f  NaN

Let's do some timeit tests on some larger data.

df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True) 

%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])

%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])

12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

List comprehensions are 4x faster.

answered Oct 19 '22 02:10

cs95

Related questions
                            
                                Replace all occurrences of a string in a pandas dataframe (Python)
                            
                                Is it possible to append Series to rows of DataFrame without making a list first?
                            
                                Monitoring contents of files/directories? [duplicate]
                            
                                How does polymorphism work in Python?
                            
                                ImportError: No module named Cython.Distutils
                            
                                Ruby equivalent for Python's "try"?
                            
                                Remove default apps from Django-admin
                            
                                Last element in OrderedDict
                            
                                What's the official way of storing settings for Python programs?
                            
                                Financial technical analysis in python [closed]
                            
                                How to plot time series in python
                            
                                Programmatically searching google in Python using custom search
                            
                                Python-like list comprehension in Java
                            
                                Plotting networkx graph with node labels defaulting to node name
                            
                                Python sum, why not strings? [closed]
                            
                                How to sort OrderedDict of OrderedDict?
                            
                                Access a function variable outside the function without using "global"
                            
                                How can I tell which python implementation I'm using?
                            
                                Pandas: Return Hour from Datetime Column Directly
                            
                                Writing an mp4 video using python opencv

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With