I have code which gets frames from a camera and then saves it to disk. The structure of the code is: multiple threads malloc and copy their frames into new memory, enqueue memory. Finally, another thread removes frames from queue and writes them (using ffmpeg API, raw video no compression) to their files (actually I'm using my own memory pool so malloc is only called when I need more buffers). I can have upto 8 files/cams open at the same time enqueing.
The problem is that for the first 45 sec everything works fine: there's never more than one frame on queue. But after that my queue gets backed up, processing takes just a few ms longer resulting in increased ram usage because I cannot save the frames fast enough so I have to malloc more memory to store them.
I have a 8 core, 16GB RAM Windows 7 64 bit computer (NTFS, lots of free space in second disk drive). The disk is supposed to be able to write upto 6Gbits/sec. To save my data in time I need to be able to write at 50 MB/sec. I tested disk speed with "PassMark PerformanceTest" and I had 8 threads writing files simultaneously exactly like ffmpeg saves files (synchronized, uncached I/O) and it was able to achieve 100MB/sec. So why isn't my writes able to achieve that?
Here's how the ffmpeg writes look from Process monitor logs:
Time of Day Operation File# Result Detail 2:30:32.8759350 PM WriteFile 8 SUCCESS Offset: 749,535,120, Length: 32,768 2:30:32.8759539 PM WriteFile 8 SUCCESS Offset: 749,567,888, Length: 32,768 2:30:32.8759749 PM WriteFile 8 SUCCESS Offset: 749,600,656, Length: 32,768 2:30:32.8759939 PM WriteFile 8 SUCCESS Offset: 749,633,424, Length: 32,768 2:30:32.8760314 PM WriteFile 8 SUCCESS Offset: 749,666,192, Length: 32,768 2:30:32.8760557 PM WriteFile 8 SUCCESS Offset: 749,698,960, Length: 32,768 2:30:32.8760866 PM WriteFile 8 SUCCESS Offset: 749,731,728, Length: 32,768 2:30:32.8761259 PM WriteFile 8 SUCCESS Offset: 749,764,496, Length: 32,768 2:30:32.8761452 PM WriteFile 8 SUCCESS Offset: 749,797,264, Length: 32,768 2:30:32.8761629 PM WriteFile 8 SUCCESS Offset: 749,830,032, Length: 32,768 2:30:32.8761803 PM WriteFile 8 SUCCESS Offset: 749,862,800, Length: 32,768 2:30:32.8761977 PM WriteFile 8 SUCCESS Offset: 749,895,568, Length: 32,768 2:30:32.8762235 PM WriteFile 8 SUCCESS Offset: 749,928,336, Length: 32,768, Priority: Normal 2:30:32.8762973 PM WriteFile 8 SUCCESS Offset: 749,961,104, Length: 32,768 2:30:32.8763160 PM WriteFile 8 SUCCESS Offset: 749,993,872, Length: 32,768 2:30:32.8763352 PM WriteFile 8 SUCCESS Offset: 750,026,640, Length: 32,768 2:30:32.8763502 PM WriteFile 8 SUCCESS Offset: 750,059,408, Length: 32,768 2:30:32.8763649 PM WriteFile 8 SUCCESS Offset: 750,092,176, Length: 32,768 2:30:32.8763790 PM WriteFile 8 SUCCESS Offset: 750,124,944, Length: 32,768 2:30:32.8763955 PM WriteFile 8 SUCCESS Offset: 750,157,712, Length: 32,768 2:30:32.8764072 PM WriteFile 8 SUCCESS Offset: 750,190,480, Length: 4,104 2:30:32.8848241 PM WriteFile 4 SUCCESS Offset: 750,194,584, Length: 32,768 2:30:32.8848481 PM WriteFile 4 SUCCESS Offset: 750,227,352, Length: 32,768 2:30:32.8848749 PM ReadFile 4 END OF FILE Offset: 750,256,128, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal 2:30:32.8848989 PM WriteFile 4 SUCCESS Offset: 750,260,120, Length: 32,768 2:30:32.8849157 PM WriteFile 4 SUCCESS Offset: 750,292,888, Length: 32,768 2:30:32.8849319 PM WriteFile 4 SUCCESS Offset: 750,325,656, Length: 32,768 2:30:32.8849475 PM WriteFile 4 SUCCESS Offset: 750,358,424, Length: 32,768 2:30:32.8849637 PM WriteFile 4 SUCCESS Offset: 750,391,192, Length: 32,768 2:30:32.8849880 PM WriteFile 4 SUCCESS Offset: 750,423,960, Length: 32,768, Priority: Normal 2:30:32.8850400 PM WriteFile 4 SUCCESS Offset: 750,456,728, Length: 32,768 2:30:32.8850727 PM WriteFile 4 SUCCESS Offset: 750,489,496, Length: 32,768, Priority: Normal
This looks very efficient, however, from DiskMon the actual disk writes look ridiculously fragmented back and forth writing which may account for this slow speed. See the graph for the write speed according to this data (~5MB/s).
TIme Write duration Sector Length MB/sec 95.6 0.00208855 1490439632 896 0.409131784 95.6 0.00208855 1488197000 128 0.058447398 95.6 0.00009537 1482323640 128 1.279965529 95.6 0.00009537 1482336312 768 7.679793174 95.6 0.00009537 1482343992 384 3.839896587 95.6 0.00009537 1482350648 768 7.679793174 95.6 0.00039101 1489278984 1152 2.809730729 95.6 0.00039101 1489393672 896 2.185346123 95.6 0.0001812 1482349368 256 1.347354443 95.6 0.0001812 1482358328 896 4.715740549 95.6 0.0001812 1482370616 640 3.368386107 95.6 0.0001812 1482378040 256 1.347354443 95.6 0.00208855 1488197128 384 0.175342193 95.6 0.00208855 1488202512 640 0.292236989 95.6 0.00208855 1488210320 1024 0.467579182 95.6 0.00009537 1482351416 256 2.559931058 95.6 0.00009537 1482360120 896 8.959758703 95.6 0.00009537 1482371896 640 6.399827645 95.6 0.00009537 1482380088 256 2.559931058 95.7 0.00039101 1489394568 1152 2.809730729 95.7 0.00039101 1489396744 352 0.858528834 95.7 0.00039101 1489507944 544 1.326817289 95.7 0.0001812 1482378296 768 4.042063328 95.7 0.0001812 1482392120 768 4.042063328 95.7 0.0001812 1482400568 512 2.694708885 95.7 0.00208855 1488224144 768 0.350684386 95.7 0.00208855 1488232208 384 0.175342193
I'm pretty confident it's not my code, because I timed everything and for example enqueing takes a few us suggesting that threads don't get stuck waiting for each other. It must be the disk writes. So the question is how can I improve my disk writes and what can I do to profile actual disk writes (remember that I rely on FFmpeg dlls to save so I cannot access the low level writing functions directly). If I cannot figure it out, I'll dump all the frames in a single sequential binary file (which should increase I/O speed) and then split it into video files post processing.
I don't know how much my disk I/O is caching (CacheSet only shows disk C cache size), but the following image from the performance monitor taken at 0 and 45 sec into the video (just before my queue starts piling up) looks weird to me. Basically, the modified set and standby set grew from very little to this large value. Is that the data being cached? Is it possible that only at 45 sec data is starting to be written to disk so suddenly everything slows down? (FYI, LabVIEW is the program that loads my dll.)
I'll appreciate any help.
M.
With CreateFile
it looks like you want one or both of these parameters:
http://msdn.microsoft.com/en-us/library/cc644950(v=vs.85).aspx
Your delayed performance hit occurs when the OS starts pushing data to the disk.
6Gb/s is the performance capability of the SATA 2 bus not the actual devices connected or the physical platters or flash ram underneath.
A common problem with AV systems is constantly writing a high stream of data can get periodically interrupted by disk overhead tasks. There used to be special AV disks you can purchase that don't do this, these days you can purchase disks with special high throughput performance firmware explicitly for security video recording.
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=210671&NewLang=en
The problem is with repeated malloc
and free
which puts a load on the system. I suggest creating a buffer pools, i.e allocate N buffers in the initialization stage and reuse them instead of mallocing and freeing the memory. Since you have mentioned ffmpeg, to give an example from multimedia, In gstreamer, buffer management occurs in the form of buffer-pools and in a gstreamer pipeline buffers are usually taken and passed around from buffer pools. Most multimedia systems do this.
Regarding:
The problem is that for the first 45 sec everything works fine: there's never more than one frame on queue. But after that my queue gets backed up, processing takes just a few ms longer resulting in increased ram usage because I cannot save the frames fast enough so I have to malloc more memory to store them.
The application is trashing at this point. Calling malloc
at this point will make matters even worse. I suggest implementing a producer-consumer model, where one of them gets waits depending on the case. In your case, set up a threshold of N buffers. If there are N buffers in the queue, new frames from camera are not enqueued till the existing buffers are processed.
Another idea, Instead of writing raw frames why not write encoded data? assuming you a want video, you can at least write a elementary H264 stream (and ffmpeg comes with a good H264 encoder!) or even better if you have access to a Mpeg-4 muxer, as a mp4 file? This will reduce the memory requirements and the IO load dramatically.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With