Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split column data based on Condition Pandas Dataframe

I have a large data set and looking for something that will split my Street Address into two columns Street Number and Street Name.

I am trying to figure out how can I do this efficiently since I first need to process the street address and then check if the first index of the split has a digit or not.

So far I have a working code that looks like this. I created a two function one for extracting street number data from the street address, while the other one replaces the first occurrence of that street number from the street address.

def extract_street_number(row):
      if any(map(str.isdigit, row.split(" ")[0])):
        return row.split(" ")[0]


def extract_street_name(address, streetnumber):
    if streetnumber:
        return address.replace(streetnumber, "", 1)
    else:
        return address

Then using the apply function to have the two columns.

df[street_number] = df.apply(lambda row: extract_street_number(row[address_col]), axis=1)
df[street_name] = df.apply(lambda row: extract_street_name(row[address_col], row[street_number]), axis=1)

I'm wondering if there is a more efficient way to do this? Based on this current routine I need to build first the Street Number Column before I process the street name column.

I'm thinking of something like building the two series on the first iteration of the address column. The pseudo-code is something like this I just can't figure it out how can I code it in python.

Pseudocode:

  1. Split Address into two columns based on first space that encounters a non-numeric character: street_data = address.split(" ", maxsplit=1)

  2. If street_data[0] has digits then return the columns on this way:

      df[street_number] = street_data[0]
      df[street_name] = street_data[1]
  1. Else if street_data[0] is not digit then return the columns on this way:
      df[street_number] = ""
      df[street_name] = street_data[0] + " " + street_data[1]
      # or just simply the address
      df[street_name] = address

By the way this is the working sample of the data:

# In
df = pd.DataFrame({'Address':['111 Rubin Center', 'Monroe St', '513 Banks St', '5600 77 Center Dr', '1013 1/2 E Main St', '1234C Main St', '37-01 Fair Lawn Ave']})

# Out
   Street_Number    Street_Name
0           111     Rubin Center
1                   Monroe St
2           513     Banks St
3        560 77     Center Dr
4      1013 1/2     E Main St
5         1234C     Main St
6         37-01     Fair Lawn Ave
like image 970
BLitE.exe Avatar asked Nov 06 '22 08:11

BLitE.exe


1 Answers

TL;DR: This can be achieved in three steps-

Step 1-

df['Street Number'] = [street_num[0] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]

Step 2-

df['Street Address'] = [street_num[1] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]

Step 3-

df['Street Address'].loc[df['Street Address'].str.contains("N/A") == True] = df1['Address'].loc[df1['Street Address'].str.contains("N/A") == True]

Explanation-

Added two more test cases in the dataframe for code flexibility (Row 7,8)- Initial Dataframe

Step 1 - We separate the street numbers from the address here. This is done by slicing the first element from the list after splitting the address string and initialising to Street Number column. If the first element doesn't contain a number, N/A is appended in the Street Number column.

Step 1

Step 2 - As the first element in the sliced string contains the Street Number, the second element has to be the Street Address hence is appended to the Street Address column.

Step 2

Step 3 - Due to step two, the Street Address become 'N/A' for the 'Address` that do not contain a number and that is resolved by this -

Step 3

Hence, we can solve this in three steps after hours of struggle put in.

like image 56
rohetoric Avatar answered Nov 15 '22 06:11

rohetoric