I have a large data set and I'm looking for a way to split my Street Address column into two columns, Street Number and Street Name.
I am trying to figure out how to do this efficiently, since I first need to process the street address and then check whether the first element of the split contains a digit.
So far I have working code that looks like this. I created two functions: one extracts the street number from the street address, and the other removes the first occurrence of that street number from the street address.
def extract_street_number(row):
    if any(map(str.isdigit, row.split(" ")[0])):
        return row.split(" ")[0]

def extract_street_name(address, streetnumber):
    if streetnumber:
        return address.replace(streetnumber, "", 1)
    else:
        return address
Then I use the apply function to build the two columns.
df[street_number] = df.apply(lambda row: extract_street_number(row[address_col]), axis=1)
df[street_name] = df.apply(lambda row: extract_street_name(row[address_col], row[street_number]), axis=1)
I'm wondering if there is a more efficient way to do this. With the current routine I have to build the Street Number column first, before I can process the Street Name column.
I'm thinking of something like building the two series in a single pass over the address column. The pseudocode is something like this; I just can't figure out how to code it in Python.
Pseudocode:
Split Address into two parts at the first space:
    street_data = address.split(" ", maxsplit=1)
If street_data[0] has digits, then return the columns this way:
    df[street_number] = street_data[0]
    df[street_name] = street_data[1]
Else:
    df[street_number] = ""
    df[street_name] = street_data[0] + " " + street_data[1]
    # or just simply the address
    df[street_name] = address
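The pseudocode above can be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: split_address is a hypothetical helper name, the column names Street_Number and Street_Name are chosen for illustration, and the three-row dataframe is a made-up sample. It uses str.partition instead of split(" ", maxsplit=1) so that an address without any space does not raise an IndexError, and it walks the address column once to build both columns.

```python
import pandas as pd


def split_address(address):
    """Return (street_number, street_name) for a single address string."""
    # partition splits on the first space only; rest is "" if there is no space
    first, _, rest = address.partition(" ")
    if any(ch.isdigit() for ch in first):
        return first, rest
    # No digit in the first token: keep the whole address as the street name
    return "", address


# Small made-up sample, for illustration only
df = pd.DataFrame({"Address": ["111 Rubin Center", "Monroe St", "1234C Main St"]})

# One pass over the column, then unzip the pairs into two lists
numbers, names = zip(*(split_address(a) for a in df["Address"]))
df["Street_Number"] = list(numbers)
df["Street_Name"] = list(names)
```

Because both values come out of the same call, the Street Number column no longer has to exist before the Street Name column is built.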
By the way, this is a working sample of the data:
# In
df = pd.DataFrame({'Address':['111 Rubin Center', 'Monroe St', '513 Banks St', '5600 77 Center Dr', '1013 1/2 E Main St', '1234C Main St', '37-01 Fair Lawn Ave']})
# Out
   Street_Number  Street_Name
0  111            Rubin Center
1                 Monroe St
2  513            Banks St
3  5600           77 Center Dr
4  1013           1/2 E Main St
5  1234C          Main St
6  37-01          Fair Lawn Ave
TL;DR: This can be achieved in three steps-
Step 1-
df['Street Number'] = [street_num[0] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]
Step 2-
df['Street Address'] = [street_num[1] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]
Step 3-
df.loc[df['Street Address'] == 'N/A', 'Street Address'] = df['Address']
Explanation-
Two more test cases were added to the dataframe (rows 7 and 8) to check the code's flexibility.
Step 1 - We separate the street numbers from the address here. Each address string is split on the first space, and the first element of the resulting list is assigned to the Street Number column. If the first element doesn't contain a digit, N/A is stored in the Street Number column instead.
Step 2 - When the first element of the split string is the street number, the second element has to be the street address, so it is stored in the Street Address column.
Step 3 - After step 2, Street Address is 'N/A' for every Address that does not start with a number. This is resolved by copying the original Address back into those rows.
Hence, after hours of struggle, we can solve this in three steps.
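As an alternative to the three steps, the same split can probably be done in one vectorized call with Series.str.extract. The following is a sketch, not a benchmarked solution: the regular expression and the column names Street_Number and Street_Name are my own assumptions, and the sample rows are the ones from the question. One behavioural difference to be aware of: an address that is a single token with no space (for example just "123") ends up entirely in Street_Name.

```python
import pandas as pd

df = pd.DataFrame({'Address': ['111 Rubin Center', 'Monroe St', '513 Banks St',
                               '5600 77 Center Dr', '1013 1/2 E Main St',
                               '1234C Main St', '37-01 Fair Lawn Ave']})

# Optional first group: a leading space-free token that contains at least one
# digit, followed by whitespace. Everything after it is the street name.
pattern = r"^(?:(?P<Street_Number>\S*\d\S*)\s+)?(?P<Street_Name>.*)$"

# Named groups become the column names of the returned DataFrame
parts = df["Address"].str.extract(pattern)

# Rows without a leading number have a missing value in the first group
parts["Street_Number"] = parts["Street_Number"].fillna("")

df = df.join(parts)
```

Since both columns come out of a single extract call, there is no need for a third step that patches the rows without a street number.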