I am looking into trying to create a full address, but the data I have comes in the form of:
Line 1 | Line 2 | Postcode
1, First Street, City, X13
1, First Street First Street, City X13
1 1, First Street, City, X13 X13
There are a few other permutations of how this data is created, but I want to be able to merge all this into one string where there is no overlap. So I want to create the string: 1, First Street, City, X13
But not 1, First Street, First Street, City, X13
etc.
How can I concat or merge these without duplicating data already there? There are also some cells like on the top line where there is no information past the first cell.
If you have plain text, you can split it on '\n' to get the lines, and then split each line on ',' to get the separate fields:
>>> s = """1, First Street, City, X13
... 1, First Street First Street, City, X13
... 1 1, First Street, City, X13 X13"""
>>>
>>> lines = s.split('\n')
>>>
>>> splitted_lines = [line.split(',') for line in lines]
Note that a more Pythonic way is to use the csv module to read your text, specifying the comma , as the delimiter.
import csv

with open('file_name') as f:
    # csv.reader is lazy, so read all rows before the file is closed
    splitted_lines = list(csv.reader(f, delimiter=','))
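As a small illustration of the csv approach, here is a sketch that parses the sample text from memory instead of a file. The use of io.StringIO and the skipinitialspace=True option are assumptions added for this example; the latter drops the space after each comma so the fields come out already trimmed.

```python
import csv
import io

# Sample text copied from the example above; io.StringIO stands in for a real file.
text = """1, First Street, City, X13
1, First Street First Street, City, X13
1 1, First Street, City, X13 X13"""

# skipinitialspace=True ignores the space that follows each comma.
rows = list(csv.reader(io.StringIO(text), delimiter=',', skipinitialspace=True))

for row in rows:
    print(row)
```

Each row is a list of four fields, so the rows can be passed straight to zip(*rows) to work column by column.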
Then you can use the following list comprehension to get the unique fields in each column:
>>> import re
>>> ' '.join([set([set(re.split(r'\s{2,}',i)).pop() for i in column]).pop() for column in zip(*splitted_lines)])
'1 First Street City'
Note that here you can get the columns using the zip() function and then split the items with re.split() using the regex r'\s{2,}', which splits the string on two or more whitespace characters; then you can use set() to keep only the unique items.
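The same steps (zip() for columns, re.split() on runs of whitespace, then dropping repeats) may be easier to follow when unrolled into a function. This is only a sketch: merge_columns and unique_in_order are hypothetical helper names, and the sample rows assume that repeated values inside a cell are separated by two or more spaces, which is what the regex expects.

```python
import re

# Assumed sample data: duplicates inside a cell are separated by two spaces.
rows = [
    ["1", "First Street", "City", "X13"],
    ["1", "First Street  First Street", "City", "X13"],
    ["1  1", "First Street", "City", "X13  X13"],
]

def unique_in_order(items):
    """Return the items without repeats, keeping first-seen order."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen

def merge_columns(rows):
    """Collapse each column to its unique values and join the columns."""
    merged = []
    for column in zip(*rows):  # zip(*rows) yields one tuple per column
        parts = []
        for cell in column:
            # Split on runs of two or more whitespace characters.
            parts.extend(p for p in re.split(r"\s{2,}", cell.strip()) if p)
        merged.append(" ".join(unique_in_order(parts)))
    return ", ".join(merged)

print(merge_columns(rows))  # 1, First Street, City, X13
```

Note that this treats every row as a variant of the same address, as the one-liner does; it is not meant for a file holding many different addresses.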
Note: If you care about the order, you can use collections.OrderedDict instead of set:
>>> from collections import OrderedDict
>>>
>>> d = OrderedDict()
>>> ' '.join([list(d.fromkeys([set(re.split(r'\s{2,}', i)).pop() for i in column]))[0] for column in zip(*splitted_lines)])
'1 First Street City X13'
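To show the order-preserving idea on a single row, here is a hedged sketch that merges the cells of one record. The function name merge_address and the way the sample row is divided into cells are assumptions for illustration; the real cell boundaries in the question's data are not known.

```python
from collections import OrderedDict

def merge_address(cells):
    """Join the comma-separated parts of several cells, skipping repeats."""
    parts = []
    for cell in cells:
        for part in cell.split(","):
            part = part.strip()
            if part:  # ignore empty cells and stray commas
                parts.append(part)
    # OrderedDict.fromkeys keeps the first occurrence of each part, in order.
    return ", ".join(OrderedDict.fromkeys(parts))

# Assumed cell layout: Line 1, Line 2, Postcode.
print(merge_address(["1, First Street", "First Street, City", "X13"]))
# A row where only the first cell is filled in.
print(merge_address(["1, First Street, City, X13", "", ""]))
```

Both calls print 1, First Street, City, X13. Repeats that are not separated by a comma, such as "X13 X13", would pass through unchanged with this version.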
If you don't mind losing punctuation:
from collections import OrderedDict
from string import punctuation

od = OrderedDict()
with open("test.txt") as f:
    next(f)  # skip the header line
    # keep each word once, in first-seen order, with punctuation stripped
    print(" ".join(od.fromkeys(word.strip(punctuation) for line in f
                               for word in line.split())))
1 First Street City X13
If an address legitimately contains repeated words, you won't be able to use this approach. Based on your input, though, there is no way to know which combinations are possible, unless the second line is always intact, in which case you would just need to pull the second line.
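The limitation with repeated words can be seen in a short sketch. The helper name dedupe_words and the second address ("New Street, New Town") are made up for this example.

```python
from collections import OrderedDict
from string import punctuation

def dedupe_words(text):
    """Keep each word once, in first-seen order, with punctuation stripped."""
    words = (w.strip(punctuation) for w in text.split())
    return " ".join(OrderedDict.fromkeys(w for w in words if w))

# Works when every repeat is unwanted.
print(dedupe_words("1, First Street First Street, City X13"))
# 1 First Street City X13

# Hypothetical address where "New" is legitimately repeated: one copy is lost.
print(dedupe_words("2, New Street, New Town, X13"))
# 2 New Street Town X13
```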