How to use regex to select a row and a fixed number of rows following a row containing a specific substring in a pandas dataframe

Tags:

pandas

Problem: I have a pandas dataframe from which I'm trying to extract specific rows: the rows that contain a date, plus the row immediately following each of them. I want to move the information from the line that follows the date into a new column on the row that contains the date, so that one person's information ends up on a single line.

From this:

    Col0  Col1       Col2      Col3   Col4    Col5
    0     NaN        NaN       NaN    NaN     NaN
    1     *1/23/20   Joe G     USA    NaN     G5 paper
    2     NaN        get_me    NaN    NaN     NaN
    3     +1/5/20    Frank F   CAN    NaN     F4 Paper
    4     NaN        get_me_2  NaN    NaN     NaN

To this:

    Col0  Col1     Col2      Col3   Col4    Col5      Col6 (new column)
    0     1/23/20  Joe G     USA    NaN     G5 paper  get_me
    1     1/5/20   Frank F   CAN    NaN     F4 Paper  get_me_2

Stated another way: I want every date row to grab the information from the next line, so that for each date there is one person with all of their information on one line. It is fine if everything from the second line lands in a single column of the row that precedes it.

Things to keep in mind: there is often (but not always) a "*" or "+" character preceding the dates (e.g. **1/12/12 or +5/5/20). My first attempt, shown below, was to match rows that contain a date. There is only one date per row, but one date has a name "attached" to it (e.g. *1/1/20Dev). I would also like to know whether the column containing the date (the dates are always in the same column) has any other "crap" in it. That would be icing on the cake, but it is not the core issue I'm having.

There is usually only one item in the second row, but if there are more, I can deal with those later. I just need each person's information on the same line. I'm reading the original data in from a PDF and trying to clean it up.

What I've tried: I began by trying to match "date-like" strings. In reality these will all live in rows of a pandas dataframe, but regex does seem suitable for picking out just the rows that contain dates (after which I can take the row immediately following each one and move its contents up onto the date row).

    import re
    search_in = '*1/4/13'
    wanted_regex = r'(\d+/\d+/\d+)'
    match = re.search(wanted_regex, search_in)
    match.group(1)

Output: '1/4/13'

  • Summary: a good start, but I still need some way to iterate over each row containing a date and move the information from the following row onto the date row.
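
As for the side question of extra "crap" around a date (the *1/1/20Dev case), one sketch, reusing the pattern above, strips the matched date and the optional leading markers and checks whether anything is left:

    import re

    cell = '*1/1/20Dev'                      # example cell from above
    m = re.search(r'\d+/\d+/\d+', cell)
    if m:
        # drop leading * or + markers, then the date itself;
        # whatever remains is the extra "crap"
        leftover = re.sub(r'^[*+]+', '', cell).replace(m.group(0), '')
        print(bool(leftover), leftover)      # prints: True Dev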

A good example of roughly what I'm after:

    import re

    def regex_filter(myregex, val):
        # only try the regex on actual strings; NaN cells (which are floats)
        # would otherwise crash re.search
        if isinstance(val, str):
            return bool(re.search(myregex, val))
        return False

    # apply() passes only the cell value, so bind the pattern with a lambda
    df_filtered = df[df['col'].apply(lambda v: regex_filter(r'\d+/\d+/\d+', v))]
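
An alternative to the lambda, if it reads better: functools.partial can bind the pattern up front (a sketch under the same assumptions):

    from functools import partial

    # bind the pattern so apply() only has to supply the cell value
    is_date = partial(regex_filter, r'\d+/\d+/\d+')
    df_filtered = df[df['col'].apply(is_date)]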

What gives? The above looks like an example of what I'm trying to do, but I'm really stumped and don't know where in the code I should grab the next row and move it up. I see a lot of similar problems, but I can't determine whether I should be grouping, filtering, querying...? If you could offer a brief explanation of why you chose your approach, it would really help me reason about problems like this in the future. This is where I'm at now and I could really use some suggestions. Thank you.

Asked Oct 14 '22 by Kevin Zehnder

1 Answer
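
(For reference, the snippets below assume the question's sample frame, reconstructed roughly like this; NaN stands in for the empty PDF cells:)

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "Col0": [0, 1, 2, 3, 4],
        "Col1": [np.nan, "*1/23/20", np.nan, "+1/5/20", np.nan],
        "Col2": [np.nan, "Joe G", "get_me", "Frank F", "get_me_2"],
        "Col3": [np.nan, "USA", np.nan, "CAN", np.nan],
        "Col4": [np.nan] * 5,
        "Col5": [np.nan, "G5 paper", np.nan, "F4 Paper", np.nan],
    })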

First off, start with pandas.Series.str.extract to get the date-like string:

s = df["Col1"].str.extract("(\d+/\d+/\d+)", expand=False)

Then use pandas.to_datetime to turn those strings into real dates; the sample dates are US-style month/day/year (1/23/20 can only be month-first), so pass an explicit format rather than guessing:

    s = pd.to_datetime(s, format="%m/%d/%y", errors="coerce")
    # errors="coerce" turns anything that does not parse into NaT

which so far yields:

    0          NaT
    1   2020-01-23
    2          NaT
    3   2020-01-05
    4          NaT
    Name: Col1, dtype: datetime64[ns]

Then use pandas.Series.ffill with limit=1 to propagate each valid date onto the row immediately after it:

df["Col1"] = s.ffill(limit=1)
df = df.dropna(subset=["Col1"])
print(df)

So we have the desired rows together with the row that follows each one:

       Col0       Col1      Col2 Col3  Col4      Col5
    1     1 2020-01-23     Joe G  USA   NaN  G5 paper
    2     2 2020-01-23    get_me  NaN   NaN       NaN
    3     3 2020-01-05   Frank F  CAN   NaN  F4 Paper
    4     4 2020-01-05  get_me_2  NaN   NaN       NaN

Finally, use pandas.DataFrame.groupby to iterate over each date's pair of rows and unmelt Col2 only:

    dfs = []
    for k, d in df.groupby("Col1"):
        # label the pair's two rows, spread their Col2 values across
        # columns Col2 and Col6, then merge the rest of the date row back on
        wide = d.assign(tmp=["Col2", "Col6"]).pivot(index="Col1", columns="tmp", values="Col2")
        dfs.append(wide.merge(d))
    new_df = (pd.concat(dfs)
                .drop(columns="Col0")      # Col0 only held row numbers
                .sort_index(axis=1)
                .reset_index(drop=True))

    print(new_df)

Final output:

            Col1     Col2 Col3  Col4      Col5      Col6
    0 2020-01-23    Joe G  USA   NaN  G5 paper    get_me
    1 2020-01-05  Frank F  CAN   NaN  F4 Paper  get_me_2

Logic behind the groupby section:

  • groupby: to pivot each date's subset of the dataframe separately

  • d.assign(...): to label the pair's two rows with the desired output column names: Col2 keeps its original name and the second row becomes the new Col6

  • pivot: to unmelt Col2. With assign and pivot, each subset looks like:

      tmp            Col2      Col6
      Col1                         
      2020-01-05  Frank F  get_me_2
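
Worth noting as a simpler alternative (a sketch, not part of the original answer): because each extra row immediately follows its date row, pandas .shift can pull the next row's Col2 up without any grouping, starting from the coerced series s (i.e. before the ffill step):

    df["Col1"] = s                          # coerced dates; NaT elsewhere
    df["Col6"] = df["Col2"].shift(-1)       # pull Col2 up from the row below
    out = df[df["Col1"].notna()].reset_index(drop=True)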
Answered Jan 04 '23 by Chris