 

Python in memory table data structures for analysis (dict, list, combo)

I'm trying to replicate some code that I have working in SQL, but using pure Python instead, with some help from this question: CSV to Python Dictionary with all column names?

I can now read my zipped CSV file into a dict, but I only end up with one line: the last one. (How do I get a sample of lines, or the whole data file?)

I am hoping to have a memory-resident table that I can manipulate much like SQL when I'm done. For example: clean the data by matching bad entries against another table of bad values and their corrections, then sum by type, average by time period, and the like (as sketched below). The total data file is about 500,000 rows. I'm not fussed about getting it all in memory, but I want to solve the general case as best I can, again so I know what can be done without resorting to SQL.
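For concreteness, the kind of group-by aggregation I mean might look something like this with plain dicts. This is only a sketch: the field names come from my sample row below, and rows is assumed to be a list of per-row dicts.

from collections import defaultdict

# Sum and average LIST_PRICE grouped by PROP_TYPE (field names taken
# from the sample row below); rows is assumed to be a list of row dicts
totals = defaultdict(float)
counts = defaultdict(int)
for row in rows:
    if row['LIST_PRICE']:  # skip rows with a blank price
        totals[row['PROP_TYPE']] += float(row['LIST_PRICE'])
        counts[row['PROP_TYPE']] += 1

for prop_type in totals:
    print '%s: total=%.0f, average=%.0f' % (
        prop_type, totals[prop_type], totals[prop_type] / counts[prop_type])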

import csv, sys, zipfile

# Path to the zipped CSV (stored in sys.argv[0] here; a plain variable would be clearer)
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file    = zipfile.ZipFile(sys.argv[0])
items_file  = zip_file.open('AllListing1RES.txt', 'rU')
# Read the file row by row; each row is a dict keyed by the header line
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
My result is then:

>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=YEAR_BUILT_DESC, value=EXIST
key=SUBDIVISION, value=KNOLLWOOD
key=DOM, value=2
key=STREET_NAME, value=ORLEANS RD
key=BEDROOMS, value=3
key=SOLD_PRICE, value=
key=PROP_TYPE, value=SFR
key=BATHS_FULL, value=2
key=PENDING_DATE, value=
key=STREET_NUM, value=3828
key=SOLD_DATE, value=
key=LIST_PRICE, value=324900
key=AREA, value=200
key=STATUS_DATE, value=3/3/2011 11:54:56 PM
key=STATUS, value=A
key=BATHS_HALF, value=0
key=YEAR_BUILT, value=1968
key=ZIP, value=35243
key=COUNTY, value=JEFF
key=MLS_ACCT, value=492859
key=CITY, value=MOUNTAIN BROOK
key=OWNER_NAME, value=SPARKS
key=LIST_DATE, value=3/3/2011
key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM 
key=PARCEL_ID, value=28-15-3-009-001.0000
key=ACREAGE, value=0
key=WITHDRAWN_DATE, value=
>>>

I think I'm barking up a few wrong trees here. One is that I only get one line of my roughly 500,000-line data file. Two is that a dict may not be the right structure, since I don't think I can just load all 500,000 lines and do various operations on them, like summing by group and date. Plus it seems that duplicate keys may cause problems, i.e. the non-unique descriptors like county and subdivision.

I also don't know how to read a specific small subset of lines into memory (like 10 or 100 to test with) before loading the whole thing (which I also haven't figured out). I have read over the Python docs and several reference books, but it just isn't clicking yet.

It seems that most of the answers I can find suggest various SQL solutions for this sort of thing, but I am anxious to learn the basics of achieving similar results with Python. In some cases I think it will be easier and faster, as well as expanding my tool set, but I'm having a hard time finding relevant examples.

One answer that hints at what I'm getting at is:

Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require data that is certainly not present in a CSV, such as how dates are represented and which columns are dates.

An example of getting a column-oriented data structure (however, involving loading the whole file):

import csv
allrows = list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns = dict((x[0], x[1:]) for x in zip(*allrows))

The intermediate steps of going to a list and storing in a variable aren't necessary. The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:

# csv values are strings, so convert before comparing numerically
matchingrows = [rownum for (rownum, value) in enumerate(columns['one']) if int(value) > 2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.

via Yann Vernier
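Applied to the date fields in my sample row, that parsing step might look like this (a sketch, assuming the formats shown above):

from datetime import datetime

# Formats inferred from the sample row: LIST_DATE like '3/3/2011',
# STATUS_DATE like '3/3/2011 11:54:56 PM'
list_date   = datetime.strptime('3/3/2011', '%m/%d/%Y')
status_date = datetime.strptime('3/3/2011 11:54:56 PM', '%m/%d/%Y %I:%M:%S %p')
print list_date, status_date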

Surely there is some good reference for this general topic?

asked Apr 17 '11 by dartdog



2 Answers

You can only read one line at a time from the csv reader, but you can store them all in memory quite easily:

rows = []
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    rows.append(row)

# rows[0]
{'keyA': 13, 'keyB': 'dataB' ... }
# rows[1]
{'keyA': 5, 'keyB': 'dataB' ... }

Then, to do aggregations and calculations:

sum(row['keyA'] for row in rows)  # assumes keyA has already been converted to a number

You may want to transform the data before it goes into rows, or use a friendlier data structure. Iterating over 500,000 rows for each calculation could become quite inefficient.
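For example, one pass that converts types and groups rows up front can save rescanning everything later. This is a sketch only; 'keyA' and 'keyB' are the illustrative field names from above.

from collections import defaultdict

# Convert numeric fields once and bucket rows by a grouping key
by_group = defaultdict(list)
for row in rows:
    row['keyA'] = int(row['keyA'])  # DictReader gives strings, so convert once
    by_group[row['keyB']].append(row)

# Later aggregations only touch the relevant group
print sum(row['keyA'] for row in by_group['dataB'])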

As a commenter mentioned, using an in-memory database could be really beneficial to you. Another question asks exactly how to transfer CSV data into a sqlite database.

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table t (col1 text, col2 float);")

# csv.DictReader uses the first line in the file as column headings by default
# (note: delimiter is an argument to DictReader, not to open)
dr = csv.DictReader(open('data.csv'), delimiter=',')
to_db = [(i['col1'], i['col2']) for i in dr]
c.executemany("insert into t (col1, col2) values (?, ?);", to_db)
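Once the rows are loaded, the SQL-style operations from the question work directly against the in-memory table (a sketch using the hypothetical col1/col2 schema above):

conn.commit()

# Group-by aggregation, just as you would in SQL
for col1, total in c.execute("select col1, sum(col2) from t group by col1"):
    print col1, total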
answered by Josh Smeaton


You say """I can now read my zipped CSV file into a dict, but I only end up with one line: the last one. (How do I get a sample of lines, or the whole data file?)"""

Your code does this:

for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass

I can't imagine why you wrote that, but the effect is to read the whole input file row by row, ignoring each row (pass means "do exactly nothing"). The end result is that row refers to the last row (unless of course the file is empty).

To "get" the whole file, change pass to do_something_useful_with(row).

If you want to read the whole file into memory, simply do this:

rows = list(csv.DictReader(.....))

To get a sample, e.g. every Nth row (N > 0), starting at the Mth row (0 <= M < N), do something like this:

for row_index, row in enumerate(csv.DictReader(.....)):
    if row_index % N != M: continue
    do_something_useful_with(row)
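If you just want the first 10 or 100 rows to test with, itertools.islice also works (a sketch reusing items_file and do_something_useful_with from above):

from itertools import islice

# Take only the first 10 rows, without reading the rest of the file
for row in islice(csv.DictReader(items_file, delimiter='\t'), 10):
    do_something_useful_with(row)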

By the way, you don't need dialect='excel'; that's the default.

answered by John Machin