
Performance gains in re-writing C# code in C/C++

I wrote part of a program that does some heavy work with strings in C#. I initially chose C# not only because it was easier to use .NET's data structures, but also because I need to use this program to analyse some 2-3 million text records in a database, and it is much easier to connect to databases using C#.

One part of the program was slowing down the whole thing, so I decided to rewrite it in C, using pointers to access every character in the string. The part that took some 119 seconds to analyse 10,000,000 strings in C# now takes the C code only 5 seconds! Performance is a priority, so I am considering rewriting the whole program in C, compiling it into a DLL (something I didn't know how to do when I started writing the program) and using DllImport from C# to call its functions on the database strings.

Given that rewriting the whole program will take some time, and since using DllImport to work with C#'s strings requires marshalling and such things, my question is: will the performance gains from the C DLL's faster string handling outweigh the performance hit of having to repeatedly marshal strings to access the C DLL from C#?

Miguel asked Nov 17 '10 11:11


4 Answers

First, profile your code. You might find some real headsmacker that speeds the C# code up greatly.

Second, writing the code in C using pointers is not really a fair comparison. If you are going to use pointers why not write it in assembly language and get real performance? (Not really, just reductio ad absurdum.) A better comparison for native code would be to use std::string. That way you still get a lot of help from the string class and C++ exception-safety.

Given that you have to read 2-3 million records from the DB to do this work, I very much doubt that the time spent cracking the strings is going to outweigh the elapsed time taken to load the data from the DB. So, consider instead how to structure your code so that you can begin string processing while the DB load is in progress.

If you use a SqlDataReader (say) to load the rows sequentially, it should be possible to batch up N rows as fast as possible and hand them off to a separate thread for the post-processing that is your current headache and reason for this question. If you are on .NET 4.0 this is simplest to do using the Task Parallel Library, and System.Collections.Concurrent could also be useful for collation of results between the threads.

This approach should mean that neither the DB latency nor the string processing is a show-stopping bottleneck, because they happen in parallel. This applies even if you are on a single-processor machine, because your app can process strings while it's waiting for the next batch of data to come back from the DB over the network. If you find string processing is the slowest, use more threads (i.e. Tasks) for that. If the DB is the bottleneck, then you have to look at external means to improve its performance: DB hardware or schema, network infrastructure. If you need some results in hand before processing more data, TPL allows dependencies to be created between Tasks and the coordinating thread.

My point is that I doubt it's worth the pain of re-engineering the entire app in native C or whatever. There are lots of ways to skin this cat.

Steve Townsend answered Oct 23 '22 01:10


One option is to rewrite the C code as unsafe C#, which ought to have roughly the same performance and won't incur any interop penalties.

Marcelo Cantos answered Oct 22 '22 23:10


There's no reason to write in C over C++, and C/C++ does not exist.

The performance implications of marshalling are fairly simple. If you have to marshal every string individually, then your performance is gonna suck. If you can marshal all ten million strings in one call, then marshalling isn't gonna make any difference at all. P/Invoke is not the fastest operation in the world but if you only invoke it a few times, it's not really gonna matter.
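To make the "one call" case concrete, here is a minimal sketch of what a batch entry point on the C side could look like. Everything in it is an assumption for illustration: the names `analyse_batch` and `count_vowels`, the back-to-back buffer layout with an offsets array, and the vowel count standing in for whatever the real analysis is. The DLL export decoration (e.g. `__declspec(dllexport)` on Windows) is left out to keep the sketch portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-string work: count the vowels in one record.
   Stands in for whatever analysis the real program performs. */
static int32_t count_vowels(const char *s, size_t len)
{
    int32_t n = 0;
    for (const char *p = s, *end = s + len; p != end; ++p) {
        switch (*p) {
        case 'a': case 'e': case 'i': case 'o': case 'u':
        case 'A': case 'E': case 'I': case 'O': case 'U':
            ++n;
            break;
        default:
            break;
        }
    }
    return n;
}

/* Hypothetical batch entry point: one call handles `count` records.
   `data` holds all records back to back with no terminators;
   record i occupies data[offsets[i]] .. data[offsets[i + 1] - 1],
   so `offsets` must have count + 1 entries.
   One result is written to `results` per record. */
void analyse_batch(const char *data, const int32_t *offsets,
                   int32_t count, int32_t *results)
{
    assert(data != NULL && offsets != NULL && results != NULL);
    for (int32_t i = 0; i < count; ++i) {
        size_t len = (size_t)(offsets[i + 1] - offsets[i]);
        results[i] = count_vowels(data + offsets[i], len);
    }
}
```

From C#, a function shaped like this could be declared with `DllImport` and passed a `byte[]` plus two `int[]` arrays. Arrays of blittable types like these are typically pinned for the duration of the call rather than copied element by element, so the interop cost is paid once per batch instead of once per string. The caller would have to encode the strings into the byte buffer first (for example as ASCII or UTF-8), which is itself a cost worth measuring.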

It might be easier to re-write your core application in C++ and then use C++/CLI to merge it with the C# database end.

Puppy answered Oct 22 '22 23:10


There are some pretty good answers here already, especially @Steve Townsend's.

However, I felt it worth underlining a key point: There is intrinsically no reason why C code "will be faster" than C# code. That idea is a myth. Under the bonnet they both produce machine code that runs on the same CPU. As long as you don't ask the C# to do more work than the C, then it can perform just as well.

By switching to C, you forced yourself to be more frugal (you avoided using high level features like managed strings, bounds-checking, garbage collection, exception handling, etc, and simply treated your strings as blocks of raw bytes). If you applied these low-level techniques to your C# code (i.e. treating your data as raw blocks of bytes as you did in C), you would find much less difference in the speed.
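As an illustration of "treating strings as blocks of raw bytes", the loop below is a sketch in C of the frugal style being described: a single pointer walk with no copies and no allocation. The function name `count_tokens` and the task (counting whitespace-separated tokens) are made up for the example, not taken from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: count whitespace-separated tokens in a raw
   byte buffer of `len` bytes. The buffer is read in place; nothing
   is copied or allocated. */
size_t count_tokens(const char *buf, size_t len)
{
    assert(buf != NULL);
    size_t tokens = 0;
    int in_token = 0;  /* 1 while the previous byte was part of a token */
    for (const char *p = buf, *end = buf + len; p != end; ++p) {
        int is_space = (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r');
        if (!is_space && !in_token)
            ++tokens;  /* first byte of a new token */
        in_token = !is_space;
    }
    return tokens;
}
```

The same shape of loop can be written in C# inside an `unsafe` block, using `fixed` to obtain a `char*` over the string, which is the kind of like-for-like comparison meant here.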

For example: Last week I re-wrote (in C#) a class that a junior had written (also in C#). I achieved a 25x speed improvement over the original code by applying the same approach I would use if I were writing it in C (i.e. thinking about performance). I achieved the same speedup you're claiming without having to change to a different language at all.

Finally, just because an isolated case can be made 24x faster, it does not mean you can make your whole program 24x faster across the board by porting it all to C. As Steve said, profile it to work out where it's slow, and expend your effort only where it'll provide significant benefits. If you blindly convert to C you'll probably find you've spent a lot of time making some already-working-code a lot less maintainable.

(P.S. My viewpoint comes from 29 years' experience writing assembler, C, C++, and C# code, and understanding that the language is just a tool for generating machine code. In the case of C# vs C++ vs C, it is primarily the programmer's skill, not the language used, that determines whether the code will run quickly or slowly. C/C++ programmers tend to be better than C# programmers because they have to be: C# allows you to be lazy and get the code written quickly, while C/C++ make you do more work and the code takes longer to write. But a good programmer can get great performance out of C#, and a poor programmer can wrest abysmal performance out of C/C++.)

Jason Williams answered Oct 23 '22 00:10