Python - Merging two strings that overlap

Tags: python

I am trying to build a full address, but the data I have comes in this form:

Line 1                     | Line 2                     | Postcode
1, First Street, City, X13 |                            |
1, First Street            | First Street, City         | X13
1                          | 1, First Street, City, X13 | X13

There are a few other permutations of how this data is laid out, but I want to merge all of it into one string with no overlap. So I want to create the string:
1, First Street, City, X13

But not 1, First Street, First Street, City, X13, etc.

How can I concatenate or merge these without duplicating data that is already there? Some rows, like the top one, have no information past the first cell.

Abi asked Dec 10 '15 10:12


2 Answers

If you have plain text, you can split it on \n to get the lines, then split each line on , to get the separate fields:

>>> s = """1, First Street, City, X13
... 1, First Street             First Street, City,          X13 
... 1                           1, First Street, City, X13  X13"""
>>> 
>>> lines = s.split('\n')
>>> 
>>> splitted_lines = [line.split(',') for line in lines]

Note that, as a more Pythonic way, you can use the csv module to read your text, specifying the comma , as the delimiter. Wrap the reader in list() so the rows are read before the file is closed:

import csv
with open('file_name') as f:
    splitted_lines = list(csv.reader(f, delimiter=','))
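As a small self-contained sketch of the csv route, the snippet below reads the sample from an in-memory string instead of a file (io.StringIO stands in for the file here, which is an assumption made so the example runs on its own). It also passes skipinitialspace=True, which is not in the code above, so the whitespace after each comma is dropped.

```python
import csv
import io

# Sample rows from above; io.StringIO stands in for a real file here.
sample = (
    "1, First Street, City, X13\n"
    "1, First Street             First Street, City,          X13 \n"
    "1                           1, First Street, City, X13  X13\n"
)

# skipinitialspace=True drops whitespace that immediately follows a delimiter.
rows = list(csv.reader(io.StringIO(sample), delimiter=",", skipinitialspace=True))

for row in rows:
    print(row)
```

Each row comes back as a list of four fields; whitespace inside a field or at its end is left untouched.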

Then you can use the following list comprehension to get the unique fields in each column:

>>> import re
>>> ' '.join([set([set(re.split(r'\s{2,}',i)).pop() for i in column]).pop() for column in zip(*splitted_lines)])
'1  First Street  City'

Here zip() gives you the columns, and re.split() with the regex r'\s{2,}' splits each item on runs of two or more whitespace characters; you can then use set() to keep only the unique items. Note that a set is unordered, so pop() returns an arbitrary element and the exact result can vary.
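To make the individual steps easier to follow, here is a rough, spelled-out sketch of the same idea. The helper name unique_pieces is hypothetical, and unlike the one-liner it strips each piece and discards empty ones, so treat it as an illustration under those assumptions rather than an exact equivalent.

```python
import re

# The sample after splitting each line on commas.
rows = [
    ["1", " First Street", " City", " X13"],
    ["1", " First Street             First Street", " City", "          X13 "],
    ["1                           1", " First Street", " City", " X13  X13"],
]

# zip(*rows) transposes the rows, so each tuple holds one column.
columns = list(zip(*rows))


def unique_pieces(column):
    """Split every cell on runs of 2+ whitespace characters and
    collect the distinct, non-empty, stripped pieces."""
    pieces = set()
    for cell in column:
        for piece in re.split(r"\s{2,}", cell):
            piece = piece.strip()
            if piece:
                pieces.add(piece)
    return pieces


for column in columns:
    print(sorted(unique_pieces(column)))
```

For this sample every column collapses to a single value, which is why picking any one element from the set happens to work.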

Note: if you care about the order, you can use collections.OrderedDict instead of set:

>>> from collections import OrderedDict
>>> 
>>> d = OrderedDict()
>>> ' '.join([next(iter(d.fromkeys([set(re.split(r'\s{2,}',i)).pop() for i in column]))) for column in zip(*splitted_lines)])
'1  First Street  City  X13'
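As a side note, on Python 3.7 and later a plain dict also keeps insertion order, so dict.fromkeys() can be used for the same order-preserving de-duplication. A minimal sketch, with a made-up list of cells for illustration:

```python
# Hypothetical list of already-cleaned cells, with repeats.
cells = ["First Street", "City", "First Street", "X13", "City"]

# dict keys are unique and (from Python 3.7) keep first-seen order.
unique_in_order = list(dict.fromkeys(cells))

print(unique_in_order)
```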
Mazdak answered Nov 08 '22 00:11


If you don't mind losing punctuation:

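from collections import OrderedDict
from string import punctuation

with open("test.txt") as f:
    next(f)  # skip the header row
    # Strip punctuation from each word, then drop repeats, keeping first-seen order.
    words = (word.strip(punctuation) for line in f for word in line.split())
    print(" ".join(OrderedDict.fromkeys(words)))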

1 First Street City X13

If an address legitimately contains repeated words, you won't be able to use this approach, because the repeats are dropped. Based on your input, though, there is no way to know which combinations are possible, unless the second line is always intact, in which case you would just need to pull the second line.
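One possible way around the repeated-word problem is to merge on overlapping comma-separated parts instead of de-duplicating individual words. The sketch below is only an illustration under stated assumptions: the three columns have already been separated into cells, parts within a cell are comma-delimited, and the helper names (split_parts, contains, merge_cells) are hypothetical.

```python
def split_parts(cell):
    """Break a cell into comma-separated parts, dropping blanks."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def contains(parts, candidate):
    """True if candidate occurs as a contiguous run inside parts."""
    n = len(candidate)
    return any(parts[i:i + n] == candidate for i in range(len(parts) - n + 1))


def merge_cells(cells):
    """Merge cells left to right, skipping parts that are already present."""
    merged = []
    for cell in cells:
        parts = split_parts(cell)
        if not parts or contains(merged, parts):
            continue
        # Longest suffix of `merged` that is also a prefix of `parts`.
        overlap = 0
        for k in range(min(len(merged), len(parts)), 0, -1):
            if merged[-k:] == parts[:k]:
                overlap = k
                break
        merged.extend(parts[overlap:])
    return ", ".join(merged)


# The three sample rows from the question, one tuple of cells per row.
rows = [
    ("1, First Street, City, X13", "", ""),
    ("1, First Street", "First Street, City", "X13"),
    ("1", "1, First Street, City, X13", "X13"),
]
for row in rows:
    print(merge_cells(row))
```

All three sample rows come out as 1, First Street, City, X13. Because the comparison is done on whole parts, a repeated word inside a single part is left alone.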

Padraic Cunningham answered Nov 08 '22 01:11