vectorized string manipulation in pandas dataframe

Tags: python, pandas

I have a large DataFrame, something like

import pandas as pd

sqldate = pd.Series(["2014-0-1", "2015-10-10", "1990-23-2"])
pdf = pd.Series(["2014.pdf", "2015.pdf", "1999.pdf"])

df = pd.DataFrame({"sqldate":sqldate, "pdf": pdf})

I want to create a boolean column that indicates whether the year of sqldate is same as year of the pdf name.

This is another case where a for loop would be easy to write, but I'd like to vectorize it for speed and cleanliness, and I cannot figure out how.

I have tried simpler approaches, such as creating df['newcol'] from the first four characters of the date with df['newcol'] = df['sqldate'][0:4], but that fails. It sets only the first four rows of newcol to the sqldate values and leaves the remaining rows as NaN, because [0:4] is interpreted as a row selector rather than a string slice.
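To make the difference concrete, here is a minimal sketch using the sample sqldate values from above. The names rows and years are illustrative only, and the slice [0:2] is used instead of [0:4] so that the row selection is visible on a three-row Series:

```python
import pandas as pd

sqldate = pd.Series(["2014-0-1", "2015-10-10", "1990-23-2"])

# Plain slicing on a Series selects rows by position,
# so this returns the first two rows unchanged.
rows = sqldate[0:2]

# Slicing through the .str accessor is applied to each string,
# so this returns the first four characters of every row.
years = sqldate.str[0:4]

print(rows.tolist())
print(years.tolist())
```

Assigning a shorter Series such as rows to a new column aligns on the index, which is why the unmatched rows end up as NaN.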

Any suggestions for a more elegant, vectorized way to use manipulated string values on a dataframe?

user3556757 asked Jan 08 '23 05:01



1 Answer

You can use the Series.str accessor to apply string functions to each value in a column. Thus df['sqldate'].str[0:4] extracts the first four characters of each value (or fewer, if the string is shorter). The following checks whether the first four characters of both columns (pdf and sqldate) are the same and stores the result in 'newcol':

df['newcol'] = df['sqldate'].str[0:4]==df['pdf'].str[0:4]
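For reference, a self-contained sketch of the above applied to the sample data from the question. The second column, newcol_regex, is not part of the original answer: it is a hypothetical variant that assumes the year is always the leading four digits and pulls it out with Series.str.extract, which may be more forgiving if the values are not strictly fixed-width.

```python
import pandas as pd

df = pd.DataFrame({
    "sqldate": ["2014-0-1", "2015-10-10", "1990-23-2"],
    "pdf": ["2014.pdf", "2015.pdf", "1999.pdf"],
})

# Compare the first four characters of each column, row by row.
df["newcol"] = df["sqldate"].str[0:4] == df["pdf"].str[0:4]

# Hypothetical variant (an assumption, not from the answer):
# extract a leading four-digit year with a regex, then compare.
# Rows where either side has no match become NaN and compare as False.
year_sql = df["sqldate"].str.extract(r"^(\d{4})", expand=False)
year_pdf = df["pdf"].str.extract(r"^(\d{4})", expand=False)
df["newcol_regex"] = year_sql == year_pdf

print(df)
```

With the sample data, the first two rows match on the year and the third (1990 versus 1999) does not.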

See more about the string functions:

http://pandas.pydata.org/pandas-docs/stable/text.html

agold answered Jan 22 '23 01:01
