I have a pandas dataframe that looks like this:
In [1]: df
Out[1]:
0 1
0 a A\nB\nC
1 a D\nE\nF
2 b A\nB\nC
When I write it to a csv file then read it back, I expect to have the same dataframe. This is not the case:
In [2]: df.to_csv("out.csv")
In [3]: df = pd.read_csv("out.csv", index_col=0)
In [4]: df
Out[4]:
0 1
0 a A\r\nB\r\nC
1 a D\r\nE\r\nF
2 b A\r\nB\r\nC
A \r character is added before each \n. Writing and reading it again, the same thing happens:
In [5]: df.to_csv("out.csv")
In [6]: df = pd.read_csv("out.csv", index_col=0)
In [7]: df
Out[7]:
0 1
0 a A\r\r\nB\r\r\nC
1 a D\r\r\nE\r\r\nF
2 b A\r\r\nB\r\r\nC
How can I stop pandas from adding a \r character?
Edits:
Yes, I am on Windows.
pd.read_csv(pd.compat.StringIO(df.to_csv(index=False))) gives me the same dataframe, so the problem seems to be with writing to a file.
with open("out.csv", "wb") as file:
    df.to_csv(file)
results in:
TypeError Traceback (most recent call last)
<ipython-input-20-f31d52fb2ce3> in <module>()
1 with open("out.csv", "wb") as file:
----> 2 df.to_csv(file)
3
C:\Program Files\Anaconda3\lib\site-packages\pandas\core\frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal, **kwds)
1342 doublequote=doublequote,
1343 escapechar=escapechar, decimal=decimal)
-> 1344 formatter.save()
1345
1346 if path_or_buf is None:
C:\Program Files\Anaconda3\lib\site-packages\pandas\formats\format.py in save(self)
1549
1550 else:
-> 1551 self._save()
1552
1553 finally:
C:\Program Files\Anaconda3\lib\site-packages\pandas\formats\format.py in _save(self)
1636 def _save(self):
1637
-> 1638 self._save_header()
1639
1640 nrows = len(self.data_index)
C:\Program Files\Anaconda3\lib\site-packages\pandas\formats\format.py in _save_header(self)
1632
1633 # write out the index label line
-> 1634 writer.writerow(encoded_labels)
1635
1636 def _save(self):
TypeError: a bytes-like object is required, not 'str'
In [1]: with open("out.csv", "w") as file:
...:     df.to_csv(file)
...:
In [2]: df = pd.read_csv("out.csv")
In [3]: df
Out[3]:
Unnamed: 0 0 1
0 0 a A\r\nB\r\nC
1 1 a D\r\nE\r\nF
2 2 b A\r\nB\r\nC
Python 3.5.2 :: Anaconda 4.2.0 (64-bit)
The problem seems to lie with pandas.read_csv and not pandas.to_csv:
In [1]: df
Out[1]:
0 1
0 a A\nB\nC
1 a D\nE\nF
2 b A\nB\nC
In [2]: df.to_csv("out.csv")
In [3]: with open("out.csv", "r") as file:
...:     s = file.read()
...:
In [4]: s # Only to_csv has been used, no \r's!
Out[4]: ',0,1\n0,a,"A\nB\nC"\n1,a,"D\nE\nF"\n2,b,"A\nB\nC"\n'
In [5]: pd.read_csv("out.csv") # Now the \r's come in
Out[5]:
Unnamed: 0 0 1
0 0 a A\r\nB\r\nC
1 1 a D\r\nE\r\nF
2 2 b A\r\nB\r\nC
As some have already said in the comments above and in the post you referenced, this is a typical Windows issue with serializing newlines. It has also been reported on the pandas-dev GitHub tracker as issue #17365.
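To see where the \r comes from, here is a minimal sketch that uses only the standard library. It assumes the writer emits plain \n and the file is open in text mode, which appears to be what happens in the pandas version from the question. Passing newline="\r\n" to open() is a stand-in for the translation that Windows text mode applies by default, so the sketch behaves the same on any OS. The file name and row contents are illustrative.

```python
import csv
import os
import tempfile

# One row with a multi-line field, similar to the dataframe in the question.
rows = [["0", "a", "A\nB\nC"]]
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# newline="\r\n" makes the text layer turn every "\n" written into "\r\n",
# including the ones inside the quoted field (assumption: this mimics the
# default text-mode behaviour on Windows).
with open(path, "w", newline="\r\n") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)

# Binary mode shows the bytes that are really on disk.
with open(path, "rb") as f:
    raw = f.read()

# Default text mode (universal newlines) turns "\r\n" back into "\n" on read,
# so the carriage returns are hidden.
with open(path, "r") as f:
    text = f.read()

# A parser that sees the untranslated data keeps the "\r" inside the field.
with open(path, "r", newline="") as f:
    parsed = list(csv.reader(f))

print(repr(raw))
print(repr(text))
print(parsed)
```

This would also explain why reading out.csv back with open("out.csv", "r") shows no \r: the text layer hides it on the way in. Opening the file with "rb" is a more reliable way to check what was written.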
Fortunately, in Python 3 you can specify the newline when opening the file:
with open("out.csv", mode='w', newline='\n') as f:
    df.to_csv(f, sep=",", line_terminator='\n', encoding='utf-8')
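As a quick check of this approach, here is a minimal round-trip sketch. It assumes a reasonably recent pandas and opens the file with newline='', the setting the csv module documentation recommends for file objects, which disables newline translation. The line terminator is left at its default because the keyword is spelled line_terminator in older pandas and lineterminator from 1.5 on. The string column names and the temporary path are illustrative.

```python
import os
import tempfile

import pandas as pd

# Same shape as the dataframe in the question; column names are strings so
# they compare equal after the CSV round trip.
df = pd.DataFrame({
    "0": ["a", "a", "b"],
    "1": ["A\nB\nC", "D\nE\nF", "A\nB\nC"],
})

path = os.path.join(tempfile.mkdtemp(), "out.csv")

# newline="" stops the text layer from rewriting "\n" on its way to disk,
# so the embedded newlines inside quoted fields are written unchanged.
with open(path, "w", newline="", encoding="utf-8") as f:
    df.to_csv(f)

restored = pd.read_csv(path, index_col=0)

# Repeat the write/read cycle to confirm nothing accumulates.
with open(path, "w", newline="", encoding="utf-8") as f:
    restored.to_csv(f)
restored_twice = pd.read_csv(path, index_col=0)

print(restored["1"].tolist())
print(restored_twice["1"].tolist())
```

If the multi-line values still differ after the round trip, inspecting the file in binary mode should show whether a \r was written.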