Problem: I have a pandas DataFrame from which I'm trying to extract specific rows: the rows that contain a date, plus the row immediately following each of them. Importantly, I want to move the information from the line that follows the date into a new column on the row that contains the date. That way, one person's information ends up on a single line. To be clear: find the rows containing dates, and move the information from the following row into a new column on the date row.
From this:
Col0 Col1 Col2 Col3 Col4 Col5
0 NaN NaN NaN NaN NaN
1 *1/23/20 Joe G USA NaN G5 paper
2 NaN get_me NaN NaN NaN
3 +1/5/20 Frank F CAN NaN F4 Paper
4 NaN get_me_2 NaN NaN NaN
To this:
Col0 Col1 Col2 Col3 Col4 Col5 Col6(New column)
0 1/23/20 Joe G USA NaN G5 paper get_me
1 1/5/20 Frank F CAN NaN F4 paper get_me_2
Stated another way: I want each date row to grab the information from the next line, so that for each date there is a person and all of their information is on one line. It is fine if everything from the second line lands in a single column of the row preceding it.
Things to keep in mind: there is often (but not always) a "*" or "+" character preceding the dates (e.g., **1/12/12 or +5/5/20). I first tried to match rows that contain a date. There is only one date per row, but one date has a name "attached" (e.g., *1/1/20Dev). I would also like to know whether the column containing the date (dates are always in the same column) has any other "crap" in it. That would be icing on the cake, but it is not the core issue I'm having.
There is usually only one item in the second row; if there are more, I can deal with those later. I just need each person's information on the same line. I'm reading the original data in from a PDF and trying to clean it up.
What I've tried: I began by trying to match strings that contain a date-like substring. In reality these will all be rows of a pandas DataFrame, but regex does seem suitable for finding just the rows containing dates (after which I can take the row immediately after each one and move its contents up).
import re

search_in = '*1/4/13'
wanted_regex = r'(\d+/\d+/\d+)'
match = re.search(wanted_regex, search_in)
match.group(1)
# output: '1/4/13'
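For reference, the same capture group already copes with the messier cases mentioned above (leading "*"/"+" and a name glued onto the date), since it only captures the digits-and-slashes part; the sample strings here are just the examples from this question:

```python
import re

wanted_regex = r'(\d+/\d+/\d+)'

# The capture group skips any leading "*" or "+" and ignores trailing
# text, so a date with a name attached still extracts cleanly.
for raw in ['*1/4/13', '+5/5/20', '**1/12/12', '*1/1/20Dev']:
    mo = re.search(wanted_regex, raw)
    print(raw, '->', mo.group(1) if mo else None)
```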
A good example:
def regex_filter(myregex, val):
    if val:
        mo = re.search(myregex, val)
        if mo:
            return True
        else:
            return False
    else:
        return False

# regex_filter takes two arguments, so it has to be wrapped (e.g. in a
# lambda) to pass the pattern through apply:
df_filtered = df[df['col'].apply(lambda v: regex_filter(wanted_regex, v))]
What gives? The above is a good example of what I believe I'm trying to do, but I'm really stumped: I don't know where in the code I should grab the next row and move it up. I see a lot of similar problems, but I can't tell whether I should be grouping, filtering, querying...? If you could offer a brief explanation of why you chose your approach to this problem, it would really help me think about this in the future. This is where I'm at now and I could use some suggestions. Thank you.
First off, start with pandas.Series.str.extract to get the date-like string:

s = df["Col1"].str.extract(r"(\d+/\d+/\d+)", expand=False)
Then use pandas.to_datetime to keep only valid dates:

s = pd.to_datetime(s, errors="coerce")
# errors="coerce" turns invalid strings into NaT
# (the dates are month-first, e.g. 1/23/20, so the default parsing is fine)

which so far yields:

0 NaT
1 2020-01-23
2 NaT
3 2020-01-05
4 NaT
Name: Col1, dtype: datetime64[ns]
Then use pandas.Series.ffill with limit=1 to propagate each valid date to the row immediately after it:
df["Col1"] = s.ffill(limit=1)
df = df.dropna(subset=["Col1"])
print(df)
So we have the desired rows and their next rows:

Col0 Col1 Col2 Col3 Col4 Col5
1 1 2020-01-23 Joe G USA NaN G5 paper
2 2 2020-01-23 get_me NaN NaN NaN
3 3 2020-01-05 Frank F CAN NaN F4 Paper
4 4 2020-01-05 get_me_2 NaN NaN NaN
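As a standalone illustration of what ffill(limit=1) does (on a toy Series, not the question's data):

```python
import pandas as pd

# Toy Series: one valid date among NaT rows.
s = pd.Series([pd.NaT, pd.Timestamp("2020-01-23"), pd.NaT, pd.NaT])

# limit=1 forward-fills at most one row below each valid value, so only
# the row immediately after a date inherits it; later NaT rows stay NaT.
print(s.ffill(limit=1))
```

This is exactly why the subsequent dropna keeps only the date row and its immediate follower.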
Finally, use pandas.DataFrame.groupby to iterate over each date and unmelt Col2 only:

dfs = []
for k, d in df.groupby("Col1"):
    dfs.append(d.assign(tmp=["Col2", "Col6"])
                .pivot(index="Col1", columns="tmp", values="Col2")
                .merge(d))
new_df = pd.concat(dfs).sort_index(axis=1).reset_index(drop=True)
print(new_df)
Final output (groupby sorts by date, so the earlier date comes first):

Col1 Col2 Col3 Col4 Col5 Col6
0 2020-01-05 Frank F CAN NaN F4 Paper get_me_2
1 2020-01-23 Jooe G USA NaN G5 paper get_me
Logic behind the groupby section:

groupby: pivots a subset of the dataframe for each date
d.assign(...): keeps the original column name Col2 and names the new column as desired, Col6
pivot: unmelts Col2. With assign and pivot, each subset looks like:
tmp Col2 Col6
Col1
2020-01-05 Frank F get_me_2
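To make the whole answer reproducible end to end, here is the full pipeline as one self-contained sketch. The DataFrame below reconstructs the question's example by hand, so the exact values are assumptions taken from the tables above:

```python
import pandas as pd

# Rebuild the example frame from the question (values assumed).
df = pd.DataFrame({
    "Col1": [None, "*1/23/20", None, "+1/5/20", None],
    "Col2": [None, "Joe G", "get_me", "Frank F", "get_me_2"],
    "Col3": [None, "USA", None, "CAN", None],
    "Col4": [None, None, None, None, None],
    "Col5": [None, "G5 paper", None, "F4 Paper", None],
})

# Extract date-like strings, parse them (invalid -> NaT),
# and tag each date's follower row via ffill(limit=1).
s = df["Col1"].str.extract(r"(\d+/\d+/\d+)", expand=False)
df["Col1"] = pd.to_datetime(s, errors="coerce").ffill(limit=1)
df = df.dropna(subset=["Col1"])

# For each date, pivot the two-row subset so the follower's Col2
# value lands in a new Col6 column, then merge back the other columns.
dfs = []
for k, d in df.groupby("Col1"):
    wide = (d.assign(tmp=["Col2", "Col6"])
             .pivot(index="Col1", columns="tmp", values="Col2"))
    dfs.append(wide.merge(d))
new_df = pd.concat(dfs).sort_index(axis=1).reset_index(drop=True)
print(new_df)
```

The merge joins on Col2 (the only column the pivoted frame shares with the subset), which matches the person row and discards the follower row, leaving one line per person.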