I have a large DataFrame, something like this:
import pandas as pd
sqldate = pd.Series(["2014-0-1", "2015-10-10", "1990-23-2"])
pdf = pd.Series(["2014.pdf", "2015.pdf", "1999.pdf"])
df = pd.DataFrame({"sqldate":sqldate, "pdf": pdf})
I want to create a boolean column that indicates whether the year in sqldate is the same as the year in the pdf name.
This is another situation where a for loop is easy to write, but I'd like to vectorize it for speed and cleanliness, and I cannot figure out how.
I have tried simpler approaches, such as building df['newcol'] from just the first four characters of the date, like df['newcol'] = df['sqldate'][0:4], but that fails. It only sets the first four rows of newcol to sqldate and leaves the rest of the rows NaN, because it interprets [0:4] as an index selector.
Any suggestions for a more elegant, vectorized way to use manipulated string values on a dataframe?
You can use Series.str to apply string functions to each value in the column. Thus df['sqldate'].str[0:4] extracts the first 4 characters of each value (or fewer, if the string is shorter). The following checks whether the first four characters of both columns (pdf and sqldate) are the same and puts the result in 'newcol':
df['newcol'] = df['sqldate'].str[0:4]==df['pdf'].str[0:4]
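For reference, here is a minimal runnable sketch of the line above applied to the sample data from the question. The intermediate names row_slice, sql_year and pdf_year are only illustrative; they are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "sqldate": ["2014-0-1", "2015-10-10", "1990-23-2"],
    "pdf": ["2014.pdf", "2015.pdf", "1999.pdf"],
})

# Plain [0:4] slices ROWS of the Series (only three exist here),
# which is why the attempt in the question did not work.
row_slice = df["sqldate"][0:4]

# .str[0:4] slices each STRING, giving the first four characters per row.
sql_year = df["sqldate"].str[0:4]
pdf_year = df["pdf"].str[0:4]

# Element-wise comparison produces the boolean column.
df["newcol"] = sql_year == pdf_year
print(df)
```

With this sample the first two rows compare equal and the third does not, since "1990" differs from "1999".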
See more about the string functions:
http://pandas.pydata.org/pandas-docs/stable/text.html
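As one example of those other string functions, a stricter variant could use Series.str.extract with a regular expression instead of a fixed slice. This is only a sketch under the assumption that the year is always a leading run of four digits; the column name same_year is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "sqldate": ["2014-0-1", "2015-10-10", "1990-23-2"],
    "pdf": ["2014.pdf", "2015.pdf", "1999.pdf"],
})

# Capture exactly four leading digits; expand=False returns a Series
# rather than a one-column DataFrame.
sql_year = df["sqldate"].str.extract(r"^(\d{4})", expand=False)
pdf_year = df["pdf"].str.extract(r"^(\d{4})", expand=False)

# same_year is a hypothetical column name for this sketch.
df["same_year"] = sql_year == pdf_year
print(df)
```

Unlike a plain slice, rows that do not start with four digits come back as missing values instead of arbitrary four-character strings.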