I used to read my data with numpy.loadtxt(). However, I recently found out on SO that pandas.read_csv() is much faster.
To read these data I use:
pd.read_csv(filename, sep=' ',header=None)
The problem I encounter right now is that in my case the separator can vary from a single space to several spaces, or even a tab.
Here is what my data could look like:
56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 100.74 52.10 100.74 56.000000 100.740000 3
56.00 102.96 52.40 102.96 56.000000 102.960000 4
56.00 100.74 55.40 100.74 56.000000 100.740000 5
That leads to results like:
0 1 2 3 4 5 6 7 8
0 56 NaN NaN 101.85 52.4 101.85 56 101.85 1
1 56 100.74 50.6 100.74 56.0 100.74 2 NaN NaN
2 56 100.74 52.1 100.74 56.0 100.74 3 NaN NaN
3 56 102.96 52.4 102.96 56.0 102.96 4 NaN NaN
4 56 100.74 55.4 100.74 56.0 100.74 5 NaN NaN
I should mention that my data are >100 MB, so I cannot preprocess or clean them first. Any ideas how to fix this?
Python's str.strip() removes leading and trailing whitespace. To remove only leading or only trailing whitespace, use lstrip() or rstrip() instead.
pandas has extensive support for reading and writing CSV. The read_csv() function reads a CSV file and loads each record as a row in a DataFrame.
pandas.read_csv() is a general function for reading data files separated by commas, spaces, or other common separators. Only the first argument (the file path) is required.
TSV stands for tab-separated values: a text file in which each field is separated by a tab (\t). In pandas, you can read a TSV file into a DataFrame with the read_table() function.
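As a small illustration of the read_table() remark, the sketch below parses tab-separated text from an in-memory string. The sample values and the names raw and df are made up for the example; in practice you would pass a file path instead of a StringIO object.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample (not taken from a real file).
raw = "56.00\t101.85\t1\n56.00\t100.74\t2\n"

# read_table() uses a tab as its default separator.
df = pd.read_table(io.StringIO(raw), header=None)
print(df)
```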
Your original line:
pd.read_csv(filename, sep=' ',header=None)
was specifying the separator as a single space. Because your files can contain multiple spaces or tabs, you can pass a regular expression to the sep param like so:
pd.read_csv(filename, sep=r'\s+', header=None)
This defines the separator as one or more whitespace characters. There is a handy cheatsheet that lists regular expressions.
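To see the effect without touching a large file, here is a minimal sketch that parses a small in-memory sample. The sample rows (with mixed single spaces, multiple spaces, and tabs) and the names raw and df are assumptions made up for illustration; with a real file you would pass the filename directly.

```python
import io

import pandas as pd

# Hypothetical sample: fields separated by one space, several spaces, or tabs.
raw = (
    "56.00     101.85 52.40\t101.85 56.000000 101.850000 1\n"
    "56.00 100.74 50.60 100.74\t\t56.000000   100.740000 2\n"
)

# The regex \s+ treats any run of whitespace as a single delimiter.
# A raw string avoids Python's invalid-escape warning for "\s".
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)
print(df)
```

Each row should come out with the same seven columns and no NaN padding, since consecutive whitespace characters no longer produce empty fields.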