
Read Space-separated Data with Pandas [duplicate]

Tags:

python

pandas

I used to read my data with numpy.loadtxt(). However, I recently found out on SO that pandas.read_csv() is much faster.

To read these data I use:

pd.read_csv(filename, sep=' ',header=None)

The problem I am running into is that the separator in my files varies: it can be a single space, several spaces, or even a tab.

Here is what my data can look like:

56.00     101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 100.74 52.10 100.74 56.000000 100.740000 3
56.00 102.96 52.40 102.96 56.000000 102.960000 4
56.00 100.74 55.40 100.74 56.000000 100.740000 5

That leads to results like:

     0       1     2       3     4       5   6       7   8
0   56     NaN   NaN  101.85  52.4  101.85  56  101.85   1
1   56  100.74  50.6  100.74  56.0  100.74   2     NaN NaN
2   56  100.74  52.1  100.74  56.0  100.74   3     NaN NaN
3   56  102.96  52.4  102.96  56.0  102.96   4     NaN NaN
4   56  100.74  55.4  100.74  56.0  100.74   5     NaN NaN

I should mention that my data files are >100 MB, so I cannot preprocess or clean them first. Any ideas how to fix this?

Tengis asked Apr 02 '14 10:04




1 Answer

Your original line:

pd.read_csv(filename, sep=' ',header=None)

specifies the separator as a single space. Because your files can contain multiple spaces or tabs, you can pass a regular expression to the sep parameter instead:

pd.read_csv(filename, sep=r'\s+', header=None)

This defines the separator as one or more whitespace characters, which covers single spaces, runs of spaces, and tabs. There is a handy cheatsheet that lists regular expressions.
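To check this on a small sample before running it on the full file, something like the following sketch should work. It uses the first three rows from the question, held in an in-memory string instead of a file; the tab in the second row is an assumption added here to cover the tab case, and the variable names are only illustrative.

```python
import io

import pandas as pd

# First three rows from the question. Row 0 has a run of spaces;
# the tab in row 1 is an assumption added to exercise the tab case.
raw = (
    "56.00     101.85 52.40 101.85 56.000000 101.850000 1\n"
    "56.00\t100.74 50.60 100.74 56.000000 100.740000 2\n"
    "56.00 100.74 52.10 100.74 56.000000 100.740000 3\n"
)

# A regex separator of one or more whitespace characters
# collapses any mix of spaces and tabs into a single delimiter.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

print(df.shape)
print(df)
```

Each row should come out with the same seven columns and no NaN padding. For the real data, replace the io.StringIO(raw) argument with the filename.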

EdChum answered Oct 13 '22 01:10