Python struggling with data?

The basis of the programme is to convert postcodes (the UK version of ZIP codes) into co-ordinates. So I have a file with a load of postcodes (and other attached data such as house prices) and another file with all of the UK postcodes and their corresponding co-ordinates.

I turn both of these into lists and then use a for loop inside a for loop to iterate over and compare the postcodes in the two files. If a postcode in file1 == a postcode in file2, then the co-ordinates are taken and appended to the relevant row.

I've got my code up and running as I want it to. All of my tests output exactly what I want, which is great.

The only problem is that it will only work with small batches of data (I've been testing with .csv files holding ~100 rows - creating lists of 100 inner lists).

Now I want to apply my programme to my entire data set. I ran it once, and nothing happened. I went away, watched some TV, and still nothing happened. IDLE wouldn't let me quit the programme or anything. So I restarted and tried again, this time adding in a counter to see if my code was running. I ran the code and the counter started going, until it hit 78902, the size of my dataset. Then it stopped and did nothing. I couldn't do anything, nor could I close the window.

The annoying thing is that it doesn't even get past reading the CSV file, so I haven't been able to manipulate my data whatsoever.

Here is the code where it gets stuck (the very first part of the code):

    #empty variable to put the list into    
    lst = []
    # List function enables use for all files
    def create_list():

        #find the file
        file2 = input('enter filepath:')
        #read the file and iterate over it to append into the list
        with open(file2, 'r') as f:
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                lst.append(row)
        return lst

So does anyone know a way for me to make my data more manageable?

EDIT: for those interested here is my full code:

from tkinter.filedialog import asksaveasfile
import csv

new_file = asksaveasfile()

lst = []
# List function enables use for all files
def create_list():
    #empty variable to put the list into
    #find the file
    file2 = input('enter filepath:')
    #read the file and iterate over it to append into the list
    with open(file2, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            lst.append(row)
    return lst


def remove_space(lst):
    '''(lst)->lst
    Returns the postcode value without any whitespace

    >>> ac45 6nh
    ac456nh
    The above would occur inside a list inside a list
    '''
    filetype = input('Is this a sale or crime?: ')
    num = 0
    #check the filetype to find the position of the postcodes
    if filetype == 'sale':
        num = 3
        #iterate over the postcode to add all characters but the space
    for line in range(len(lst)):        
        pc = ''
        for char in lst[line][num]:
            if char != ' ':
                pc = pc+char
        lst[line][num] = pc

def write_new_file(lst, new_file):
    '''(lst)->.CSV file
    Takes a list and writes it into a .CSV file.
    '''
    writer = csv.writer(new_file, delimiter=',')
    writer.writerows(lst)
    new_file.close()


#conversion function
def find_coord(postcode):

    lst = create_list()
    #create python list for conversion comparison
    print(lst[0])
    #empty variables
    long = 0
    lat = 0
    #iterate over the list of postcodes, when the right postcode is found,
    # return the co-ordinates.
    for row in lst:
        if row[1] == postcode:
            long = row[2]
            lat = row[3]
    return str(long)+' '+str(lat)

def find_all_coord(postcode, file):

    #empty variables
    long = 0
    lat = 0
    #iterate over the list of postcodes, when the right postcode is found,
    # return the co-ordinates.
    for row in file:
        if row[1] == postcode:
            long = row[2]
            lat = row[3]
    return str(long)+' '+str(lat)

def convert_postcodes():
    '''
    take a list of lst = []
    #find the file
    file2 = input('enter filepath:')
    #read the file and iterate over it to append into the list
    with open(file2, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            lst.append(row)
    '''
    #save the files into lists so that they can be used
    postcodes = []
    with open(input('enter postcode key filepath:'), 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            postcodes.append(row)
    print('enter filepath to be converted:')
    file = []
    with open(input('enter filepath to be converted:'), 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            file.append(row)
    #here is the conversion code
    long = 0
    lat = 0
    matches = 0
    for row in range(len(file)):
        for line in range(len(postcodes)):
            if file[row][3] == postcodes[line][1]:
                long = postcodes[line][2]
                lat = postcodes[line][3]
                file[row].append(str(long)+','+str(lat))
                matches = matches+1
                print(matches)
    final_file = asksaveasfile()
    write_new_file(file, final_file)

I call the functions individually from IDLE so I can test them before making the programme run them itself.

asked May 04 '26 02:05 by NDevox


2 Answers

Your problem is that looking up every code against every row of the other file makes a huge number of comparisons.

You could try saving the lookup data in a dict, with the postal code as the key.
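
As a rough illustration of that idea, here is a minimal sketch. It assumes the key file's columns are laid out as in the question (postcode at index 1, longitude at index 2, latitude at index 3); the rows and the name coords_by_postcode are made up for the example.

```python
# Hypothetical rows from the postcode key file: [id, postcode, longitude, latitude].
# The values are placeholders, not real co-ordinates.
key_rows = [
    ["1", "AC456NH", "-1.0", "52.0"],
    ["2", "AC457NJ", "-1.1", "52.1"],
]

# Build the lookup table once: postcode -> (longitude, latitude)
coords_by_postcode = {row[1]: (row[2], row[3]) for row in key_rows}

# Each lookup is now a single dict access instead of a scan over every key row
found = coords_by_postcode.get("AC456NH")
missing = coords_by_postcode.get("ZZ999ZZ")  # None: postcode not in the key file
print(found, missing)
```

Using .get() rather than indexing means a postcode that is absent from the key file gives None instead of raising KeyError.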

answered May 05 '26 16:05 by Martin Ueding


Your main bottleneck is in your convert_postcodes function:

for row in range(len(file)):
    for line in range(len(postcodes)):

If there are N items in file and M items in postcodes then this double-loop requires M*N iterations.
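
To get a feel for how quickly M*N grows, the sketch below counts the inner-loop steps of a double loop of the same shape. The function name is hypothetical, and the 1,000,000-row key file is an assumed size used only for the arithmetic; only the figure 78902 comes from the question.

```python
def count_nested_iterations(n_file_rows, n_postcode_rows):
    """Run a double loop of the same shape and count the inner-loop steps."""
    iterations = 0
    for _ in range(n_file_rows):
        for _ in range(n_postcode_rows):
            iterations += 1
    return iterations

# Small case: 100 rows checked against 1,000 key rows
small = count_nested_iterations(100, 1_000)
print(small)  # 100000

# Larger case, computed directly as a product rather than looped.
# 78902 is the dataset size quoted in the question; 1_000_000 is an
# assumed size for the postcode key file, not a figure from the source.
estimate = 78_902 * 1_000_000
print(estimate)  # 78902000000
```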

Instead, loop over the items in postcodes once and save the data mapping postcodes to longitude/latitude in a dict. Then loop over file once and use this dict to supply the desired data for each item in file. This takes only M+N iterations:


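import csv

def convert_postcodes(postcode_path, file_path, output_path):
    # One pass over the key file: map each postcode to [longitude, latitude]
    postcodes = dict()
    # Python 3's csv module needs text mode with newline='' (not 'rb'/'wb')
    with open(postcode_path, 'r', newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            code, lng, lat = row[1:4]
            postcodes[code] = [lng, lat]
    # One pass over the data file: look each postcode up in the dict
    with open(file_path, 'r', newline='') as fin, \
         open(output_path, 'w', newline='') as fout:
        reader = csv.reader(fin, delimiter=',')
        writer = csv.writer(fout, delimiter=',')
        for row in reader:
            code = row[3]
            # Unknown postcodes get empty cells instead of raising KeyError
            row.extend(postcodes.get(code, ['', '']))
            writer.writerow(row)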
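
The question's remove_space step strips spaces from postcodes before comparing them. With a dict, the same normalisation has to be applied both to the keys and to each postcode being looked up, otherwise 'ac45 6nh' and 'ac456nh' count as different keys. A possible sketch, where normalise_postcode and the sample row are hypothetical names and values introduced for illustration:

```python
def normalise_postcode(raw):
    """Return the postcode with every space removed, e.g. 'ac45 6nh' -> 'ac456nh'."""
    return raw.replace(' ', '')

# Hypothetical key row: [id, postcode, longitude, latitude] (made-up values)
key_rows = [['1', 'ac45 6nh', '-1.0', '52.0']]

# Normalise when building the dict...
lookup = {normalise_postcode(row[1]): [row[2], row[3]] for row in key_rows}

# ...and again when looking a postcode up, so both spellings match
with_space = lookup.get(normalise_postcode('ac45 6nh'))
without_space = lookup.get(normalise_postcode('ac456nh'))
print(with_space == without_space)
```

This keeps the normalisation in one place instead of rewriting the list in place as remove_space does.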

answered May 05 '26 15:05 by unutbu


