Reading both raw lines and dictionaries from CSV in Python

Tags: python, csv

My scenario: I am reading a csv file. I want to have access to both a dictionary of the fields generated by each line, and the raw, un-parsed line.

The goal is ultimately to do some processing on the fields, use the result to decide which lines I am interested in, and write those lines only into an output file.

An easy solution, which involves reading the file twice, looks something like this:

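from csv import DictReader

def dict_and_row(filename):
    # First pass: parse every row into a dict
    with open(filename) as f:
        tmp = list(DictReader(f))

    # Second pass: re-read the raw lines and pair them with the dicts by position
    # (the positions only line up if the file has no empty lines and no
    # quoted fields that span several lines)
    with open(filename) as f:
        next(f)    # skip header
        for i, line in enumerate(f):
            if len(line.strip()) > 0:
                yield line.strip(), tmp[i]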

Any better suggestions?

Edit: To be more specific about the usage scenario, I intend to index the lines by some of the data in the dict, and then use this index to find the lines I am interested in. Something like:

d = {}
for raw, parsed in dict_and_row(somefile):
    d[(parsed["SOMEFIELD"], parsed["ANOTHERFIELD"])] = raw

and then later on:

for pair in some_other_source_of_pairs:
    if pair in d:
        output.write(d[pair])

daphshez asked Dec 10 '22 23:12


2 Answers

I ended up wrapping the file in an object that saves the last line read, and then handing this object to the DictReader.

class FileWrapper:
    def __init__(self, f):
        self.f = f
        self.last_line = None

    def __iter__(self):
        return self

    def __next__(self):
        self.last_line = next(self.f)
        return self.last_line

This can then be used like this:

import csv

f = FileWrapper(file_object)   # file_object is an already-open text file
for row in csv.DictReader(f):
    print(row)   # that's the dict
    print(f.last_line)   # that's the line

Or I can implement dict_and_row:

from csv import DictReader

def dict_and_row(filename):
    with open(filename) as f:
        wrapper = FileWrapper(f)
        reader = DictReader(wrapper)
        for row in reader:
            # last_line is the line the reader consumed to build this row
            yield row, wrapper.last_line

The wrapper can also be extended to expose other properties, such as the number of characters read so far.
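
As a rough sketch of that idea, the wrapper could keep a running total of characters next to the last line. The class name CountingFileWrapper and the attribute chars_read are made up for illustration; they are not part of the csv module, and the in-memory sample stands in for a real file.

```python
import csv
import io


class CountingFileWrapper:
    """Hypothetical variant of FileWrapper that also counts characters read."""

    def __init__(self, f):
        self.f = f
        self.last_line = None
        self.chars_read = 0  # total characters handed to the reader so far

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the underlying file propagates and ends iteration
        self.last_line = next(self.f)
        self.chars_read += len(self.last_line)
        return self.last_line


# In-memory sample data standing in for a real file
sample = io.StringIO("a,b\n1,2\n3,4\n")
wrapper = CountingFileWrapper(sample)
results = [
    (row, wrapper.last_line, wrapper.chars_read)
    for row in csv.DictReader(wrapper)
]
```

Note that the count includes the header line, because DictReader pulls it through the wrapper before it returns the first row.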

I'm not sure this is the best solution, but it does have the advantage of retaining access to the strings exactly as they were read from the file.
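
One case where last_line is not the whole raw record: when a quoted field contains a line break, the csv reader pulls several physical lines for a single row, so last_line holds only the final one. A possible variation, sketched here on the assumption that you want the full raw text of each record, buffers every line consumed since the previous row. RecordingFileWrapper, take_record and dict_and_raw are hypothetical names chosen for this example.

```python
import csv
import io


class RecordingFileWrapper:
    """Hypothetical wrapper that buffers every physical line the reader consumes."""

    def __init__(self, f):
        self.f = f
        self._lines = []

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.f)
        self._lines.append(line)
        return line

    def take_record(self):
        """Return the raw text consumed since the last call and reset the buffer."""
        raw = "".join(self._lines)
        self._lines = []
        return raw


def dict_and_raw(f):
    wrapper = RecordingFileWrapper(f)
    reader = csv.DictReader(wrapper)
    _ = reader.fieldnames      # force the header line to be read now
    wrapper.take_record()      # discard the header text from the buffer
    for row in reader:
        yield row, wrapper.take_record()


# In-memory sample with a quoted field that spans two physical lines
sample = 'id,note\n1,"two\nlines"\n2,plain\n'
pairs = list(dict_and_raw(io.StringIO(sample)))
```

Empty lines that DictReader skips also pass through the wrapper, so in this sketch they end up at the start of the next record's raw text; strip them there if that matters for your output.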

daphshez answered Feb 23 '23 16:02


You could use pandas, which is an excellent library for this kind of processing:

import pandas as pd

# read the csv file
data = pd.read_csv('data.csv')

# do some calculation on a column and store it in another column
data['column2'] = data['column1'] * 2

# If you decide that you need only a particular set of rows
# that match some condition of yours
data = data[data['column2'] > 100]

# store only particular columns back
# (index=False keeps pandas' row index out of the output file)
cols = ['column1', 'column2', 'column3']
data[cols].to_csv('data_edited.csv', index=False)

ComputerFellow answered Feb 23 '23 16:02