 

Pandas equivalent of Python's readlines function

Tags:

python

pandas

With Python's readlines() function I can retrieve a list of each line in a file:

with open('dat.csv', 'r') as dat:
    lines = dat.readlines()

I am working on a problem involving a very large file, and this method produces a memory error. Is there a pandas equivalent of Python's readlines() function? The chunksize option of pd.read_csv() seems to prepend numbers to my lines, which is far from ideal.

Minimal example:

In [1]: lines = []

In [2]: for df in pd.read_csv('s.csv', chunksize = 100):
   ...:     lines.append(df)

In [3]: lines
Out[3]: 
[   hello here is a line
 0  here is another line
 1  here is my last line]

In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:     

In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']

In [6]: cat s.csv
hello here is a line
here is another line
here is my last line

kilojoules asked Dec 02 '22 15:12


1 Answer

You should try to use the chunksize option of pd.read_csv(), as mentioned in some of the comments.

This forces pd.read_csv() to read a fixed number of lines at a time instead of trying to read the entire file in one go. It would look like this:

>>> reader = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')

In the example above, the file is read one line per chunk.

Note that, according to the documentation of pandas.read_csv, the object returned when chunksize is set is not a pandas.DataFrame but a TextFileReader:

  • chunksize : int, default None

Return TextFileReader object for iteration. See IO Tools docs for more information on iterator and chunksize.
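
As a minimal sketch of what that means in practice: iterating over the returned object yields ordinary DataFrames, each holding at most chunksize rows. The sample file and its contents below are made up for illustration, and the lines are assumed to contain no commas.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file, created here only so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('first line\nsecond line\nthird line\n')
    path = tmp.name

# With chunksize set, read_csv returns an iterator instead of a DataFrame.
reader = pd.read_csv(path, header=None, chunksize=2)

chunk_types = []
chunk_shapes = []
try:
    for chunk in reader:
        # Each item produced by the iterator is a regular DataFrame.
        chunk_types.append(type(chunk).__name__)
        chunk_shapes.append(chunk.shape)
finally:
    reader.close()
    os.remove(path)

print(chunk_types)   # names of the types yielded by the reader
print(chunk_shapes)  # (rows, columns) of each chunk
```

With three lines and chunksize=2, the first chunk has two rows and the second has the remaining one.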

Therefore, to actually get the lines, you need to iterate over that object in a loop like this:

In [385]: cat data_sample.tsv
This is a new line
This is another line of text
And this is the last line of text in this file

In [386]: lines = []

In [387]: for line in pd.read_csv('./data_sample.tsv', encoding='utf-8', header=None, chunksize=1):
   .....:     lines.append(line.iloc[0,0])
   .....:     

In [388]: print(lines)
['This is a new line', 'This is another line of text', 'And this is the last line of text in this file']
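
The same loop can be wrapped in a small helper that uses a larger chunk size, which is usually faster than chunksize=1. This is only a sketch: read_lines_chunked is a hypothetical name, and it assumes the lines contain no commas or quote characters, since read_csv would otherwise split them into several columns. dtype=str is added so that numeric-looking lines stay as text.

```python
import os
import tempfile

import pandas as pd


def read_lines_chunked(path, chunksize=1000):
    """Collect the first column of a delimiter-free text file, chunk by chunk."""
    lines = []
    reader = pd.read_csv(path, header=None, dtype=str,
                         encoding='utf-8', chunksize=chunksize)
    try:
        for chunk in reader:
            # Each chunk is a DataFrame with up to `chunksize` rows;
            # column 0 holds the line text when no delimiter is present.
            lines.extend(chunk.iloc[:, 0].tolist())
    finally:
        reader.close()
    return lines


# Recreate the sample file from the session above in a temporary location.
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('This is a new line\n'
              'This is another line of text\n'
              'And this is the last line of text in this file\n')
    sample_path = tmp.name

result = read_lines_chunked(sample_path, chunksize=2)
os.remove(sample_path)
print(result)
```

Keep in mind that collecting every line into one list still holds the whole file in memory; for a very large file it is better to process each chunk inside the loop and discard it. Blank lines are also skipped by read_csv by default.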

I hope this helps!

Thanos answered Dec 18 '22 06:12