Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

incomprehensible time consumed in using memory mapped file

I am writing a routine to compare two files using memory-mapped file. In case files are too big to be mapped at one go. I split the files and map them part by part. For example, to map a 1049MB file, I split it into 512MB + 512MB + 25MB.

Every thing works fine except one thing: it always take much, much longer to compare the remainder (25MB in this example), though the code logic is exactly the same. 3 observations:

  1. it does not matter which is compared first, whether the main part (512MB * N) or the remainder (25MB in this example) comes first, the result remains the same
  2. the extra time in the remainder seems to be spent in the user mode
  3. Profiling in VS2010 beta 1 shows, the time is spent inside t std::_Equal(), but this function is mostly (profiler says 100%) waiting for I/O and other threads.

I tried

  • changing the VIEW_SIZE_FACTOR to another value
  • replacing the lambda functor with a member function
  • changing the file size under test
  • changing the order of execution of the remainder to before/after the loop

The result was quite consistent: it takes a lot more time in the remainder part and in the User Mode.

I suspect it has something to do with the fact that the mapped size is not a multiple of mapping alignment (64K on my system), but not sure how.

Below is the complete code for the routine and a timing measured for a 3G file.

Can anyone please explain it, Thanks?

// using memory-mapped file
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
    bool operator()(const path_type& p1, const path_type& p2)
    {
        using boost::filesystem::exists;
        using boost::filesystem::file_size;

        try
        {
            if(!(exists(p1) && exists(p2))) return false;

            const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;  

            // lanmbda 
            boost::function<bool(size_t, size_t)> segment_compare = 
            [&](size_t seg_size, size_t offset)->bool 
            {
                using boost::iostreams::mapped_file_source;
                boost::chrono::run_timer t;     

                mapped_file_source mf1, mf2;  

                mf1.open(p1, seg_size, offset);
                mf2.open(p2, seg_size, offset);

                if(! (mf1.is_open() && mf2.is_open())) return false;

                if(!equal (mf1.begin(), mf1.end(), mf2.begin())) return false;  

                return true;
            };

            boost::uintmax_t size = file_size(p1);
            size_t round     = size / segment_size;
            size_t remainder = size & ( segment_size - 1 );

            // compare the remainder
            if(remainder > 0)
            {
                cout << "segment size = " 
                     << remainder 
                     << " bytes for the remaining round";
                if(!segment_compare(remainder, segment_size * round)) return false;    
            }   

            //compare the main part.  take much less time, even 
            for(size_t i = 0; i < round; ++i)
            {
                cout << "segment size = " 
                     << segment_size 
                     << " bytes, round #" << i;
                if(!segment_compare(segment_size, segment_size * i))  return false;
            }
        }
        catch(std::exception& e)
        {
            cout << e.what();
            return false;
        }

        return true;                                      
    }
};

typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap;  // 512MB  

output:

segment size = 354410496 bytes for the remaining round

real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s

segment size = 536870912 bytes, round #0

real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s

segment size = 536870912 bytes, round #1

real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s

segment size = 536870912 bytes, round #2

real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s

segment size = 536870912 bytes, round #3

real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s


More observations after the suggestions by responders

Further split the remainder into body and tail(remainder = body + tail), where

  • body = N * alignment(), and tail < 1 * alignment()
  • body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even.
  • body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is exponents of 2.
  • body = N * alignment(), and tail = remainder - body. N is random.

the total time remains unchanged, but I can see that the time does not necessary relate to tail, but to size of body and tail. the bigger part takes more time. The time is USER TIME, which is most incomprehensible to me.

I also look at the pages faults through Procexp.exe. the remainder does NOT take more faults than the main loop.


Updates 2

I've performed some test on other workstations, and it seem the issue is related to the hardware configuration.

Test Code

// compare the remainder, alternative way
if(remainder > 0)
{
    //boost::chrono::run_timer t;       
    cout << "Remainder size = " 
         << remainder 
         << " bytes \n";

    size_t tail = (alignment_size - 1) & remainder;
    size_t body = remainder - tail;

{
    boost::chrono::run_timer t;                               
    cout << "Remainder_tail size = " << tail << " bytes";
    if(!segment_compare(tail, segment_size * round + body)) return false;
}                        
{
    boost::chrono::run_timer t;                               
    cout << "Remainder_body size = " << body << " bytes";
    if(!segment_compare(body, segment_size * round)) return false; 
}                        

}

Observation:

On another 2 PCs with the same h/w configurations with mine, the result is consistent as following:

------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------

Remainder size = 44840960 bytes

Remainder_tail size = 14336 bytes

real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s

Remainder_body size = 44826624 bytes

real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s

segment size = 67108864 bytes, total round# = 19

real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s

However, running the same code on a PC with a different h/w configuration yielded:

------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------ Remainder size = 44840960 bytes

Remainder_tail size = 14336 bytes

real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s

Remainder_body size = 44826624 bytes

real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s

segment size = 67108864 bytes, total round# = 19

real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s

System Info

My workstation yielding imcomprehensible timing:

OS Name: Microsoft Windows XP Professional

OS Version: 5.1.2600 Service Pack 3 Build 2600

OS Manufacturer: Microsoft Corporation

OS Configuration: Member Workstation

OS Build Type: Uniprocessor Free

Original Install Date: 2004-01-27, 23:08

System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds

System Manufacturer: Dell Inc.

System Model: OptiPlex GX520

System type: X86-based PC

Processor(s): 1 Processor(s) Installed.

                       [01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz

BIOS Version: DELL - 7

Windows Directory: C:\WINDOWS

System Directory: C:\WINDOWS\system32

Boot Device: \Device\HarddiskVolume2

System Locale: zh-cn;Chinese (China)

Input Locale: zh-cn;Chinese (China)

Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi

Total Physical Memory: 3,574 MB

Available Physical Memory: 1,986 MB

Virtual Memory: Max Size: 2,048 MB

Virtual Memory: Available: 1,916 MB

Virtual Memory: In Use: 132 MB

Page File Location(s): C:\pagefile.sys

NetWork Card(s): 3 NIC(s) Installed.

       [01]: VMware Virtual Ethernet Adapter for VMnet1

             Connection Name: VMware Network Adapter VMnet1

             DHCP Enabled:    No

             IP address(es)

             [01]: 192.168.75.1

       [02]: VMware Virtual Ethernet Adapter for VMnet8

             Connection Name: VMware Network Adapter VMnet8

             DHCP Enabled:    No

             IP address(es)

             [01]: 192.168.230.1

       [03]: Broadcom NetXtreme Gigabit Ethernet

             Connection Name: Local Area Connection 4

             DHCP Enabled:    Yes

             DHCP Server:     10.8.0.31

             IP address(es)

             [01]: 10.8.8.154

Another workstation yielding "correct" timing: OS Name: Microsoft Windows XP Professional

OS Version: 5.1.2600 Service Pack 3 Build 2600

OS Manufacturer: Microsoft Corporation

OS Configuration: Member Workstation

OS Build Type: Multiprocessor Free

Original Install Date: 5/18/2009, 2:28:18 PM

System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds

System Manufacturer: Dell Inc.

System Model: OptiPlex 755

System type: X86-based PC

Processor(s): 1 Processor(s) Installed.

        [01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz

BIOS Version: DELL - 15

Windows Directory: C:\WINDOWS

System Directory: C:\WINDOWS\system32

Boot Device: \Device\HarddiskVolume1

System Locale: zh-cn;Chinese (China)

Input Locale: en-us;English (United States)

Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi

Total Physical Memory: 3,317 MB

Available Physical Memory: 1,682 MB

Virtual Memory: Max Size: 2,048 MB

Virtual Memory: Available: 2,007 MB

Virtual Memory: In Use: 41 MB

Page File Location(s): C:\pagefile.sys

NetWork Card(s): 3 NIC(s) Installed.

       [01]: Intel(R) 82566DM-2 Gigabit Network Connection

             Connection Name: Local Area Connection

             DHCP Enabled:    Yes

             DHCP Server:     10.8.0.31

             IP address(es)

             [01]: 10.8.0.137

       [02]: VMware Virtual Ethernet Adapter for VMnet1

             Connection Name: VMware Network Adapter VMnet1

             DHCP Enabled:    Yes

             DHCP Server:     192.168.154.254

             IP address(es)

             [01]: 192.168.154.1

       [03]: VMware Virtual Ethernet Adapter for VMnet8

             Connection Name: VMware Network Adapter VMnet8

             DHCP Enabled:    Yes

             DHCP Server:     192.168.2.254

             IP address(es)

             [01]: 192.168.2.1

Any explanation theory? Thanks.

like image 972
t.g. Avatar asked Sep 03 '09 07:09

t.g.


2 Answers

This behavior looks quite illogical. I wonder what would happen if we tried something stupid. Provided the overall file is larger than 512MB you could compare again a full 512MB for the last part instead of the remaining size.

something like:

        if(remainder > 0)
        {
            cout << "segment size = " 
                 << remainder 
                 << " bytes for the remaining round";
                if (size > segment_size){
                    block_size = segment_size;
                    offset = size - segment_size;
                }
                else{
                    block_size = remainder;
                    offset = segment_size * i
                }
            if(!segment_compare(block_size, offset)) return false;    
        }   

It seems a really dumb thing to do because we would be comparing part of the file two times but if your profiling figures are accurate it should be faster.

It won't give us an answer (yet) but if it is indeed faster it means the response we are looking for lies in what your program does for small blocks of data.

like image 154
kriss Avatar answered Nov 15 '22 18:11

kriss


How fragmented is the file you are comparing with? You can use FSCTL_GET_RETRIEVAL_POINTERS to get the ranges that the file maps to on disk. I suspect the last 25 MB will have a lot of small ranges to account for the performance you have measured.

like image 34
MSN Avatar answered Nov 15 '22 19:11

MSN