
Importing csv into Numpy datetime64

Tags:

python

numpy

I am trying out the latest version of numpy 2.0 dev:

np.__version__
Out[44]: '2.0.0.dev-aded70c'

I am trying to import CSV data that looks like this:

date,system,pumping,rgt,agt,sps,eskom_import,temperature,wind,pressure,weather
2007-01-01 00:30,481.9,,,,,481.9,15,SW,1040,Fine
2007-01-01 01:00,471.9,,,,,471.9,15,SW,1040,Fine
2007-01-01 01:30,455.9,,,,,455.9,,,,

etc.

by using the following code:

convertdict = {0: lambda s: np.datetime64(s, 'm'),
               1: lambda s: float(s or 0),
               2: lambda s: float(s or 0),
               3: lambda s: float(s or 0),
               4: lambda s: float(s or 0),
               5: lambda s: float(s or 0),
               6: lambda s: float(s or 0),
               7: lambda s: float(s or 0),
               8: str, 9: str, 10: str}

dt = [('date', np.datetime64), ('system', float), ('pumping', float),
      ('rgt', float), ('agt', float), ('sps', float), ('eskom_import', float),
      ('temperature', float), ('wind', str), ('pressure', float), ('weather', str)]

a = np.recfromcsv(fp, dtype=dt, converters=convertdict, usecols=range(0, 11),
                  names=True)

The dtype it generates for a.date is 'object':

array([2007-01-01T00:30+0200, 2007-01-01T01:00+0200, 2007-01-01T01:30+0200,
       ..., 2007-12-31T23:00+0200, 2007-12-31T23:30+0200,
       2008-01-01T00:00+0200], dtype=object)

But I need it to be datetime64, like in this example (but including hrs and minutes):

array(['2011-07-11', '2011-07-12', '2011-07-13', '2011-07-14',
       '2011-07-15', '2011-07-16', '2011-07-17'], dtype='datetime64[D]')

It seems that the CSV import creates an object dtype for 'date' rather than a datetime64 dtype. Any ideas on how to fix this?

Grové

asked Mar 28 '26 15:03 by grovesteyn


1 Answer

I think the trick to avoiding the generic 'object' type is to avoid the recfromcsv function. Manually reading in your data file and parsing each field yields the requested dtype='datetime64[m]':

import numpy as np

dt = np.dtype([ ('date',        '<M8[m]'),
                ('system',      '<f8'),
                ('pumping',     '<f8'),
                ('rgt',         '<f8'),
                ('agt',         '<f8'),
                ('sps',         '<f8'),
                ('eskom_import','<f8'),
                ('temperature', '<f8'),
                ('wind',        'U8'),   # fixed-width text; the width is an assumption
                ('pressure',    '<f8'),
                ('weather',     'U16') ])  # a bare str dtype would have zero length

def to_float(s):
    # Empty CSV fields become 0.0, as in the question's converters
    return float(s or 0)

# Open in text mode so that split(',') operates on str, and close automatically
with open('data.csv') as fid:
    fieldnames = fid.readline().strip().split(',')  # Header
    lines = [line.strip().split(',') for line in fid if line.strip()]

data = np.zeros(len(lines), dtype=dt)  # one record per data line
for count, parsedline in enumerate(lines):
    data['date'][count]         = np.datetime64(parsedline[0], 'm')
    data['system'][count]       = to_float(parsedline[1])
    data['pumping'][count]      = to_float(parsedline[2])
    data['rgt'][count]          = to_float(parsedline[3])
    data['agt'][count]          = to_float(parsedline[4])
    data['sps'][count]          = to_float(parsedline[5])
    data['eskom_import'][count] = to_float(parsedline[6])
    data['temperature'][count]  = to_float(parsedline[7])
    data['wind'][count]         = parsedline[8]
    data['pressure'][count]     = to_float(parsedline[9])
    data['weather'][count]      = parsedline[10]

>>> data['date']
array(['2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500'], dtype='datetime64[m]')

You could definitely improve upon this code by making use of your "convertdict" and iterating over the parsedline, but the idea is the same.
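As a rough sketch of that improvement, one converter per column can be applied in a loop instead of writing out each assignment. The rows are the sample lines from the question, read from an in-memory string so the snippet runs on its own. The helper names (`to_float`, `to_minutes`, `columns`) and the string widths 'U8' and 'U16' are my own assumptions, not anything from the original code.

```python
import io
import numpy as np

# Sample rows copied from the question; a real script would open the CSV file instead.
csv_text = """date,system,pumping,rgt,agt,sps,eskom_import,temperature,wind,pressure,weather
2007-01-01 00:30,481.9,,,,,481.9,15,SW,1040,Fine
2007-01-01 01:00,471.9,,,,,471.9,15,SW,1040,Fine
2007-01-01 01:30,455.9,,,,,455.9,,,,
"""

def to_float(s):
    # Treat an empty field as 0.0, as the question's converters do.
    return float(s or 0)

def to_minutes(s):
    # Parse the timestamp at minute resolution.
    return np.datetime64(s, 'm')

# (field name, dtype, converter) for each column, in file order.
# The string widths are assumptions chosen to fit the sample data.
columns = [
    ('date',         'M8[m]', to_minutes),
    ('system',       'f8',    to_float),
    ('pumping',      'f8',    to_float),
    ('rgt',          'f8',    to_float),
    ('agt',          'f8',    to_float),
    ('sps',          'f8',    to_float),
    ('eskom_import', 'f8',    to_float),
    ('temperature',  'f8',    to_float),
    ('wind',         'U8',    str),
    ('pressure',     'f8',    to_float),
    ('weather',      'U16',   str),
]

dt = np.dtype([(name, kind) for name, kind, _ in columns])

with io.StringIO(csv_text) as fid:
    fid.readline()  # skip the header line
    rows = [
        tuple(conv(field)
              for (_, _, conv), field in zip(columns, line.strip().split(',')))
        for line in fid if line.strip()
    ]

# A list of tuples converts directly into a structured array.
data = np.array(rows, dtype=dt)
print(data['date'].dtype)
```

Keeping the name, dtype and converter together in one list means a column only has to be described once, so the dtype and the parsing cannot drift apart.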

answered Mar 31 '26 04:03 by Joel Vroom


