
Is PHP's hash_file streaming internally?

Tags:

php

I notice that my machine's memory usage goes way up when I call hash_file on a large file (5 GB). The operation completes successfully, but I'm wondering whether the internal implementation of hash_file streams the file contents into the hash algorithm or whether it tries to load the entire file into memory first. Does anyone know?

zjm555 asked Dec 30 '13 18:12


2 Answers

No one has answered this correctly yet, so I'm answering my own question. A quick test shows that hash_file does operate on the file as a stream rather than doing something silly internally like calling file_get_contents. I tested this by running it in an environment whose PHP memory_limit setting was lower than the size of the file being hashed, and the hash still completed. (Should have done that in the first place, sorry for the noise.)
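The test described above can be reproduced with a short script. This is only a sketch under assumed values: the 32 MiB file size, the 16M limit, and the temp-file name prefix are illustrative choices, not details from the original test.

```php
<?php
// Sketch: hash a file that is larger than PHP's memory_limit.
// The 32 MiB size and 16M limit are illustrative assumptions.

$path = tempnam(sys_get_temp_dir(), 'hash_demo_');
if ($path === false) {
    throw new RuntimeException('could not create a temporary file');
}

// Write 32 MiB of filler, 1 MiB at a time, so the script itself stays small.
$chunk = str_repeat('a', 1024 * 1024);
$fh = fopen($path, 'wb');
for ($i = 0; $i < 32; $i++) {
    fwrite($fh, $chunk);
}
fclose($fh);
unset($chunk);

// Cap PHP's own allocations below the file size.
ini_set('memory_limit', '16M');

// If hash_file() loaded the whole file, this call would hit the limit.
$digest = hash_file('sha256', $path);

// Peak memory allocated by PHP itself (the OS page cache is not counted).
$peak = memory_get_peak_usage(true);

printf("sha256: %s\n", $digest);
printf("peak PHP memory: %.1f MiB\n", $peak / 1048576);

unlink($path);
```

Replacing the hash_file call with file_get_contents($path) under the same limit should instead abort with an "Allowed memory size ... exhausted" fatal error, which is the contrast the test relies on.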

zjm555 answered Nov 19 '22 17:11


From the operating system's point of view, the entire file is effectively mapped into virtual memory as it is read. Every byte has to be in physical memory at the moment it is hashed, but the whole file never needs to be resident at once. As always, physical memory holds only portions of what is mapped into virtual memory, and the OS decides which portions in order to make the most efficient use of it.

If you think about it, it is basically unavoidable that all of the file's contents pass through physical memory at some point in order for the hash to be computed. Every byte of the file must be processed by the CPU to produce the hash and thus must be read if not already in memory. And if there's sufficient physical memory, you probably want it to stay there because a common pattern is to read the same file again after you've hashed it.

If you hash a file that's larger than physical memory, then some data will have to be discarded as new data is read in. The OS manages this by adjusting which pages of virtual memory are actually resident in physical memory. Pages that hold clean copies of data still available in the filesystem can be trivially discarded, and the physical memory they occupied reused for newly read data.
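To separate the two kinds of memory discussed here, the following sketch hashes a file in small chunks using PHP's incremental hashing functions. It is not PHP's internal implementation: hash_file_chunked is a hypothetical helper, and the 8192-byte chunk size and 256 KiB sample file are assumptions. PHP only ever holds one chunk at a time, while the OS remains free to keep the file's pages cached.

```php
<?php
// Hypothetical helper: hash a file by feeding it to the algorithm in chunks.
function hash_file_chunked(string $algo, string $path, int $chunkSize = 8192): string
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("cannot open $path");
    }

    $ctx = hash_init($algo);
    while (!feof($fh)) {
        $data = fread($fh, $chunkSize);
        if ($data === false) {
            fclose($fh);
            throw new RuntimeException("read error on $path");
        }
        // Only this chunk lives in PHP's memory; earlier chunks are gone.
        hash_update($ctx, $data);
    }
    fclose($fh);

    return hash_final($ctx);
}

// Usage: compare against the built-in on a small sample file.
$path = tempnam(sys_get_temp_dir(), 'hash_chunk_');
file_put_contents($path, str_repeat('0123456789abcdef', 16384)); // 256 KiB

$chunked = hash_file_chunked('sha256', $path);
$builtin = hash_file('sha256', $path);

echo $chunked === $builtin ? "digests match\n" : "digests differ\n";

unlink($path);
```

The built-in hash_update_stream function offers a similar pattern without the explicit read loop.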

David Schwartz answered Nov 19 '22 17:11