Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which is a good way to open a 'complicated' txt file in python

I have a txt file with the following format(simplified):

date                 this that other
2007-05-25 11:00:00  10   20   30
2007-05-25 11:10:00  15   18   30
2007-05-25 11:20:00  10   27   30
2007-05-25 11:30:00  20   35   30
2007-05-25 11:50:00  30   20   
2007-05-25 12:00:00  30   13   
2007-05-25 12:10:00  30   13   

The first raw is strings defining what is the column above them. The first column it is clear that is time. It can be observed also that some values are missing. I do not want to erase the rows that some values are missing. As i want to make some calculations with that data later I thought to use numpy to import that data by using numpy.loadtxt:

data = numpy.loadtxt('data.txt')

It gives an error ValueError: could not convert string to float: b'date' due to the first raw. Using:

data = numpy.genfromtxt('data.txt')

gives an error Line #51028 (got 38 columns instead of 37) for many lines which is because some values are missing. What should i try?

like image 429
Mpizos Dimitris Avatar asked Feb 11 '23 06:02

Mpizos Dimitris


2 Answers

Pandas is a NumPy-based library. Among many other things, it was made to work well with incomplete data.

You should be able to install pandas with a simple:

$ pip install pandas

I saved your example file under http://pastebin.com/NuNaTW9n and replaced the spaces between the columns with tabs.

>>> import pandas as pd
>>> from urllib import urlopen
>>> df = pd.read_csv(urlopen("http://pastebin.com/raw.php?i=NuNaTW9n"), sep='\t')
>>> df
                  date  this  that  other
0  2007-05-25 11:00:00    10    20     30
1  2007-05-25 11:10:00    15    18     30
2  2007-05-25 11:20:00    10    27     30
3  2007-05-25 11:30:00    20    30    NaN
4  2007-05-25 11:50:00    30    20    NaN
5  2007-05-25 12:00:00    30    13    NaN
6  2007-05-25 12:10:00    30    13    NaN

Once you have a handle on a data frame, you can start to explore your data:

>>> df["this"].sum()
145

>>> df["that"].mean()
20.142857142857142

>>> df[df["that"] < 20]["date"]
1    2007-05-25 11:10:00
5    2007-05-25 12:00:00
6    2007-05-25 12:10:00

By default, pandas will try to guess the best data type for your values (e.g. it will guess that df["that"] should be an int64), but you can control this behavior by passing a dtype argument to read_csv.

like image 114
miku Avatar answered Feb 13 '23 02:02

miku


To handle missing values like this using genfromtxt you could use the delimiter argument with a list of field widths since your file has fixed-width fields:

In [2]: a = np.genfromtxt('test.txt', delimiter=[19,4,5,5], skip_header=1)

in your example. You'd need to set the dtype correctly or use a converter function to handle the date/time field, however. For example:

In [3]: a = np.genfromtxt('test.txt', delimiter=[19,4,5,5], skip_header=1,
                          dtype=np.dtype([('date', 'datetime64[s]'),
                                          ('this', int), ('that', int),
                                          ('other', int)])
                         )

In [4]: a
Out[4]: array([(datetime.datetime(2007, 5, 25, 15, 0), 10, 20, 30),
   (datetime.datetime(2007, 5, 25, 15, 10), 15, 18, 30),
   (datetime.datetime(2007, 5, 25, 15, 20), 10, 27, 30),
   (datetime.datetime(2007, 5, 25, 15, 30), 20, 35, 30),
   (datetime.datetime(2007, 5, 25, 15, 50), 30, 20, -1),
   (datetime.datetime(2007, 5, 25, 16, 0), 30, 13, -1),
   (datetime.datetime(2007, 5, 25, 16, 10), 30, 13, -1)], 
  dtype=[('date', '<M8[s]'), ('this', '<i8'), ('that', '<i8'), ('other', '<i8')])
like image 31
xnx Avatar answered Feb 13 '23 03:02

xnx