
Why is the size of npy bigger than csv?


I converted a CSV file to an npy file. The CSV file is 5GB, but the resulting npy file is 13GB. I thought an npy file was more efficient than CSV. Am I misunderstanding this? Why is the npy file bigger than the CSV?

I just used this code:

import numpy as np
import pandas as pd

full = pd.read_csv('data/RGB.csv', header=None).values
np.save('data/RGB.npy', full, allow_pickle=False, fix_imports=False)

and the data is structured like this:

R, G, B, is_skin
2, 5, 1, 0
10, 52, 242, 1
52, 240, 42, 0
... (420,711,257 rows in total)
YeongHwa Jin asked Nov 21 '18 07:11



1 Answer

In your case, each element is an integer between 0 and 255, inclusive. Saved as ASCII text, it needs at most

  • 3 chars for the number
  • 1 char for the comma
  • 1 char for the whitespace

which results in at most 5 bytes (somewhat less on average) per element on disk.

By default, pandas reads this data as an int64 array (see full.dtype), so every element takes 8 bytes. That is why the npy file is bigger than the CSV, even though most of those bytes are zeros!
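The arithmetic can be checked with a few lines of Python. This is only a rough sketch: it takes the row count and the four columns from the question, treats the quoted 5GB as 5e9 bytes (an assumption), and ignores the small npy header. The helper name payload_bytes is made up for illustration.

```python
import numpy as np

ROWS = 420_711_257       # row count quoted in the question
COLS = 4                 # R, G, B, is_skin
CSV_BYTES = 5e9          # assumption: "5GB" taken as 5 * 10**9 bytes

def payload_bytes(rows, cols, dtype):
    """Raw data size in bytes of a (rows, cols) array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize

elements = ROWS * COLS
int64_bytes = payload_bytes(ROWS, COLS, np.int64)

# 8 bytes per element, regardless of how small the stored value is
print(f"int64 npy payload: {int64_bytes / 1e9:.2f} GB")
# upper bound for the CSV: at most 5 bytes per element
print(f"CSV upper bound:   {elements * 5 / 1e9:.2f} GB")
# what the observed CSV size implies on average
print(f"CSV bytes/element: {CSV_BYTES / elements:.2f}")
```

The int64 payload comes out at roughly 13.5 GB, which matches the size of the npy file reported in the question, while the observed CSV averages about 3 bytes per element.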

To store an integer between 0 and 255 we need only one byte, so the size of the npy file can be reduced by a factor of 8 without losing any information. Just tell pandas to interpret the data as unsigned 8-bit integers:

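import numpy as np
import pandas as pd

# read every column as an unsigned 8-bit integer instead of the default int64
full = pd.read_csv(r'e:\data.csv', dtype=np.uint8).values
# or, to get rid of the pandas dependency:
# full = np.genfromtxt(r'e:\data.csv', delimiter=',', dtype=np.uint8, skip_header=1)
np.save(r'e:/RGB.npy', full, allow_pickle=False, fix_imports=False)
# the resulting npy file is 8 times smaller than the int64 version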
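To see the effect without touching a multi-gigabyte file, here is a minimal sketch on synthetic data. The random array merely stands in for the real RGB data, the helper npy_size is hypothetical, and the array is saved to an in-memory buffer instead of disk. It also checks the value range before casting, since casting values outside 0..255 to uint8 would silently wrap around.

```python
import io
import numpy as np

# synthetic stand-in for the real data: values in 0..255, stored as int64
rng = np.random.default_rng(0)
data64 = rng.integers(0, 256, size=(1000, 4), dtype=np.int64)

# make sure the cast is lossless before narrowing the dtype
if data64.min() < 0 or data64.max() > 255:
    raise ValueError("values do not fit into uint8")
data8 = data64.astype(np.uint8)

def npy_size(arr):
    """Size in bytes of the npy representation, written to memory."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return len(buf.getvalue())

print("int64 npy bytes:", npy_size(data64))
print("uint8 npy bytes:", npy_size(data8))
print("same values:", np.array_equal(data64, data8))
```

The raw data shrinks by exactly a factor of 8; the file as a whole shrinks by slightly less because both files carry the same small header.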

Most of the time the npy format needs less space, but there are situations in which the ASCII format produces smaller files.

For example, if the data consist mostly of very small one-digit numbers plus a few very large numbers that really do need 8 bytes:

  • in ASCII format you pay on average about 2 bytes per element (there is no need to write whitespace; a comma alone is good enough as the delimiter).
  • in the npy format you pay 8 bytes for every element.
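The comparison above can be sketched with made-up data: 998 one-digit values plus two values of 2**62, which do not fit into anything smaller than a 64-bit integer. The numbers are invented purely for illustration, and the npy header is ignored.

```python
import numpy as np

# mostly one-digit values, plus two values that really need 8 bytes
values = [i % 10 for i in range(998)] + [2**62, 2**62]

# ASCII: digits plus a single comma as delimiter, no whitespace
text = ",".join(str(v) for v in values)
ascii_bytes = len(text.encode("ascii"))

# binary: every element costs 8 bytes because the large values force int64
binary_bytes = np.array(values, dtype=np.int64).nbytes

print("ASCII bytes: ", ascii_bytes)
print("binary bytes:", binary_bytes)
print(f"ASCII bytes per element: {ascii_bytes / len(values):.2f}")
```

Here the text form costs close to 2 bytes per element, about a quarter of the binary size.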
ead answered Dec 29 '22 06:12
