Files loading slower on second run of application, with repro code

Application Description

I have an offline data processing tool. This tool loads hundreds of thousands of files; for each one it performs some calculations and, when done, writes a single index file. It is all C++ (all I/O goes through standard library objects/functions) and is compiled with Visual Studio 2013 targeting amd64.

Performance

My test dataset has 115,757 files that need to be processed. The files total 731 MB in size, and the median file size is 6 KB.

  • First run: 12 seconds
  • Second run: ~18 minutes

That's 90x slower! The second-run figure is extrapolated from one minute of run time. Every run after that, in my experience so far, is equally slow.

Surprise!

If I rename the folder with the files in it, and then rename it back to what it originally was, the next time I run the application it will again perform quickly!

It's the same app, machine, and source data; the only difference is that one folder was temporarily renamed.

So far I can reproduce this 100% of the time.

Profiling

Naturally, the next step was to profile. I profiled the quick run and the slow run and compared the hot spots. In the slow version, about 86% of the application's time was spent in a function called NtfsFindPrefix; the quick version spends only about 0.4% of its time there. This is the call stack:

Ntfs.sys!NtfsFindPrefix<itself>
Ntfs.sys!NtfsFindPrefix
Ntfs.sys!NtfsFindStartingNode
Ntfs.sys!NtfsCommonCreate
Ntfs.sys!NtfsCommonCreateCallout
ntoskrnl.exe!KiSwitchKernelStackCallout
ntoskrnl.exe!KiSwitchKernelStackContinue
ntoskrnl.exe!KeExpandKernelStackAndCalloutEx
Ntfs.sys!NtfsCommonCreateOnNewStack
Ntfs.sys!NtfsFsdCreate
fltmgr.sys!FltpLegacyProcessingAfterPreCallbacksCompleted
fltmgr.sys!FltpCreate
ntoskrnl.exe!IopParseDevice
ntoskrnl.exe!ObpLookupObjectName
ntoskrnl.exe!ObOpenObjectByName
ntoskrnl.exe!NtQueryAttributesFile
ntoskrnl.exe!KiSystemServiceCopyEnd
ntdll.dll!NtQueryAttributesFile
KernelBase.dll!GetFileAttributesW
DataGenerator.exe!boost::filesystem::detail::status

The boost call in question is an exists call. It will test for the zipped version of a file, fail to find it, and then test for the unzipped one and find it.
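
For context, here is a minimal sketch of what that per-file check presumably looks like (the hasInputFile name and the ".z" suffix are my own, hypothetical, placeholders; the post doesn't give the real extension):

#include <boost/filesystem.hpp>
namespace fs = boost::filesystem;

// Hypothetical sketch of the per-file check: look for a zipped sibling
// first (this lookup usually misses, ending up in NtQueryAttributesFile),
// then fall back to the plain file (which usually exists).
bool hasInputFile( const fs::path& file )
{
    if( fs::exists( fs::path( file.string() + ".z" ) ) )
        return true;
    return fs::exists( file );
}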

Profiling also showed that the disk did not get hit by either run of the application; however, file I/O was, as expected, high. I believe this indicates that the files were already paged into memory.

File I/O also showed that the average duration of file "Create" events was MUCH higher in the slow version: 26 µs in the fast run vs 11,704 µs in the slow run.

Machine

  • Samsung SSD 830 Series
  • Intel i7 860
  • Windows 7 64-bit
  • NTFS file system
  • 32 GB RAM

Summary

  • On the second run the calls into NtfsFindPrefix take much longer.
  • This is a function in the NTFS driver.
  • The disk didn't get hit in either profile; the files were served from pages in memory.
  • A rename operation seems to be enough to stop the issue from occurring on the next run.

Question

Now that the background info is out of the way: does anyone recognise what is going on, and know how to fix it?

It seems like I could work around it by renaming the folder myself, but that seems... dirty. Plus, I'm not sure why that even works.

Is the rename invalidating the pages in memory and causing them to get updated before the next run? Is this a bug in the NTFS driver?

Thanks for reading!


Update!!

After some more profiling, it looks like the part that is performing slower is the test for whether the non-existent zipped file exists. If I remove this test, everything seems to get quick again.

I have also managed to reproduce this issue in a small C++ app for everyone to see. Note that the sample code will create 100k 6 KB files in the current directory on your machine. Can anyone else repro it?

// using the VS TR2 <filesystem>; could be replaced with boost::filesystem
#include <filesystem>
namespace fs = std::tr2::sys;
//namespace fs = boost::filesystem;

#include <iostream>
#include <string>
#include <chrono>
#include <fstream>

void createFiles( fs::path outDir )
{
    // create 100k 6 KB files; the contents don't matter, and it doesn't matter that they are all the same
    fs::create_directory( outDir );
    char buf[6144] = {};
    for( int i = 0; i < 100000; ++i )
    {
        std::ofstream fout( outDir / fs::path( std::to_string( i ) ), std::ios::binary );
        fout.write( buf, 6144 );
    }

    fs::rename( outDir, fs::path( outDir.string() + "_tmp" ) );
    fs::rename( fs::path( outDir.string() + "_tmp" ), outDir );
}

int main( int argc, const char* argv[] )
{
    fs::path outDir = "out";

    if( !fs::exists( outDir ) )
        createFiles( outDir );

    auto start = std::chrono::high_resolution_clock::now();

    int counter = 0;
    for( fs::recursive_directory_iterator i( outDir ), iEnd; i != iEnd; ++i )
    {
        // test the non-existent file first, then the real one
        if( !fs::exists( fs::path( i->path().string() + "z" ) ) && fs::exists( i->path() ) )
            counter += 1;

        if( counter % 100 == 0 )
            std::cout << counter << std::endl;
    }
    std::cout << counter << std::endl;

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration< double, std::milli > s( end - start );

    std::cout << "Time Passed: " << s.count() << "ms" << std::endl;

    return 0;
}

Update 2

I have logged an issue with Microsoft here. Hopefully they can shed some light on it.

asked Feb 18 '15 by 0xC0DEFACE

2 Answers

My best guess is that the Windows Indexer is taking a lock on the files right after they are modified, as Damon said:

NtfsFindPrefix sure does quite a bit of FCB acquiring and releasing, but unless the code is incorrect (in which case it should fail), why would it be slower on the second run? Windows Explorer or the Indexing Service doing some kind of "smart shit" after you have modified a file in that folder, such as scanning all files to adjust the specialized folder type, and holding a lock while doing so, maybe? Did you test what happens if you wait a few minutes doing nothing before running a second time, or if you kill Explorer and the indexer?

I would suggest creating a folder used specifically to keep data for any program you write:

Example

C:/Users/%yourusername%/DataAnalysis

Use this folder to keep any data files accessed by your original software. After creating the folder, you have to disable indexing on it.

I believe that in the folder's Properties dialog there is an option to disable indexing of the folder. If not, follow the instructions here to disable indexing of that folder and avoid problems.

The page will tell you to open Control Panel and go to Indexing Options. There you select Modify, navigate through your file system, and deselect the folder you just created so its contents aren't indexed.
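
Alternatively, if you prefer to do it programmatically, here is a minimal sketch (my own assumption of how to apply the Win32 SetFileAttributesW call; the NotContentIndexed flag set on a folder is normally inherited by files created in it afterwards):

#include <windows.h>
#include <iostream>

// Sketch: mark a folder so the content indexer skips it.
// Preserves the folder's existing attributes and ORs in the flag.
bool disableContentIndexing( const wchar_t* folder )
{
    DWORD attrs = GetFileAttributesW( folder );
    if( attrs == INVALID_FILE_ATTRIBUTES )
        return false;
    return SetFileAttributesW( folder, attrs | FILE_ATTRIBUTE_NOT_CONTENT_INDEXED ) != 0;
}

int main()
{
    // "yourusername" is a placeholder, as in the path above
    if( !disableContentIndexing( L"C:\\Users\\yourusername\\DataAnalysis" ) )
        std::cout << "Failed, error " << GetLastError() << std::endl;
    return 0;
}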

Let me know if it works!

answered Nov 16 '22 by Torrien


If the bottleneck is in the exists call, then you will probably get far better performance if you read the list of files into an in-memory data structure and test for file existence against that data structure.

Use FindFirstFileEx/FindNextFile to get the names of all files in your working folder. Load the results into a std::vector (or your container of choice). Sort. Use std::binary_search to check if a specific file exists.

I've written a lot of tools that work with large numbers of files in a single folder and it's been my experience that FindFirstFileEx/FindNextFile is the best option for these scenarios.
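
A rough sketch of that approach (untested; it assumes wide-character Win32 calls and ignores NTFS's case-insensitive name matching, which a real version would need to normalise for):

#include <windows.h>
#include <algorithm>
#include <string>
#include <vector>

// Enumerate every file name in 'dir' once, then answer existence queries
// from the sorted vector instead of issuing one NtQueryAttributesFile per check.
std::vector< std::wstring > listFiles( const std::wstring& dir )
{
    std::vector< std::wstring > names;
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileExW( ( dir + L"\\*" ).c_str(), FindExInfoBasic, &fd,
                                 FindExSearchNameMatch, nullptr, 0 );
    if( h == INVALID_HANDLE_VALUE )
        return names;
    do
    {
        if( !( fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY ) )
            names.push_back( fd.cFileName );
    } while( FindNextFileW( h, &fd ) );
    FindClose( h );
    std::sort( names.begin(), names.end() );
    return names;
}

// Note: this compares names case-sensitively, whereas NTFS lookups are
// case-insensitive; normalise case before sorting/searching in real use.
bool fileExists( const std::vector< std::wstring >& names, const std::wstring& name )
{
    return std::binary_search( names.begin(), names.end(), name );
}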

answered Nov 16 '22 by user3518060