I have a txt file with the following format(simplified):
date this that other
2007-05-25 11:00:00 10 20 30
2007-05-25 11:10:00 15 18 30
2007-05-25 11:20:00 10 27 30
2007-05-25 11:30:00 20 35 30
2007-05-25 11:50:00 30 20
2007-05-25 12:00:00 30 13
2007-05-25 12:10:00 30 13
The first raw is strings defining what is the column above them. The first column it is clear that is time. It can be observed also that some values are missing. I do not want to erase the rows that some values are missing. As i want to make some calculations with that data later I thought to use numpy to import that data by using numpy.loadtxt
:
data = numpy.loadtxt('data.txt')
It gives an error ValueError: could not convert string to float: b'date'
due to the first raw. Using:
data = numpy.genfromtxt('data.txt')
gives an error Line #51028 (got 38 columns instead of 37)
for many lines which is because some values are missing. What should i try?
Pandas is a NumPy-based library. Among many other things, it was made to work well with incomplete data.
You should be able to install pandas with a simple:
$ pip install pandas
I saved your example file under http://pastebin.com/NuNaTW9n and replaced the spaces between the columns with tabs.
>>> import pandas as pd
>>> from urllib import urlopen
>>> df = pd.read_csv(urlopen("http://pastebin.com/raw.php?i=NuNaTW9n"), sep='\t')
>>> df
date this that other
0 2007-05-25 11:00:00 10 20 30
1 2007-05-25 11:10:00 15 18 30
2 2007-05-25 11:20:00 10 27 30
3 2007-05-25 11:30:00 20 30 NaN
4 2007-05-25 11:50:00 30 20 NaN
5 2007-05-25 12:00:00 30 13 NaN
6 2007-05-25 12:10:00 30 13 NaN
Once you have a handle on a data frame, you can start to explore your data:
>>> df["this"].sum()
145
>>> df["that"].mean()
20.142857142857142
>>> df[df["that"] < 20]["date"]
1 2007-05-25 11:10:00
5 2007-05-25 12:00:00
6 2007-05-25 12:10:00
By default, pandas will try to guess the best data type for your values (e.g. it will guess that df["that"]
should be an int64
), but you can control this behavior by passing a dtype
argument to read_csv
.
To handle missing values like this using genfromtxt
you could use the delimiter
argument with a list of field widths since your file has fixed-width fields:
In [2]: a = np.genfromtxt('test.txt', delimiter=[19,4,5,5], skip_header=1)
in your example. You'd need to set the dtype
correctly or use a converter function to handle the date/time field, however. For example:
In [3]: a = np.genfromtxt('test.txt', delimiter=[19,4,5,5], skip_header=1,
dtype=np.dtype([('date', 'datetime64[s]'),
('this', int), ('that', int),
('other', int)])
)
In [4]: a
Out[4]: array([(datetime.datetime(2007, 5, 25, 15, 0), 10, 20, 30),
(datetime.datetime(2007, 5, 25, 15, 10), 15, 18, 30),
(datetime.datetime(2007, 5, 25, 15, 20), 10, 27, 30),
(datetime.datetime(2007, 5, 25, 15, 30), 20, 35, 30),
(datetime.datetime(2007, 5, 25, 15, 50), 30, 20, -1),
(datetime.datetime(2007, 5, 25, 16, 0), 30, 13, -1),
(datetime.datetime(2007, 5, 25, 16, 10), 30, 13, -1)],
dtype=[('date', '<M8[s]'), ('this', '<i8'), ('that', '<i8'), ('other', '<i8')])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With