Removing useless lines from a C++ file

There are many times when as I am debugging, or reusing some code, the file starts to acquire lines that don't do anything, though they may have done something at one point.

Things like vectors getting declared and filled, and then going unused; classes/structs that are defined but never used; and functions that are declared but never called.

I understand that in many cases, some of these things are not superfluous, as they might be visible from other files, but in my case, there are no other files, just extraneous code in my file.

While I understand that, technically speaking, invoking push_back does something, and therefore the vector is not unused per se, in my case its result goes unused.

So: Is there a way to do this, either using a compiler (clang, gcc, VS, etc) or an external tool?

Example:

#include <vector>
using namespace std;
void test() {
    vector<int> a;
    a.push_back(1);
}
int main() {
    test();
    return 0;
}

Should become: int main() { return 0; }

asked Apr 05 '13 by soandos


1 Answer

Our DMS Software Reengineering Toolkit with its C++11 front end could be used to do this; it presently does not do this off the shelf. DMS is designed to provide custom tool construction for arbitrary source languages, and contains full parsers, name resolvers, and various flow analyzers to support analysis, as well as the ability to apply source-to-source transformations on the code based on analysis results.

In general, you want a static analysis that determines whether every computation result is used or not (an expression may produce several results; consider just "x++"). For each unused computation, in effect you want to remove it and repeat the analysis. For efficiency reasons, you want an analysis that determines all the points of usage of each result just once; this is essentially a data flow analysis. When the usage set of a computation result goes empty, that computation result can be deleted (note that deleting the value result of "x++" may leave "x++" behind, because the increment is still needed!) and the usage sets of the computations on which it depends can be adjusted to remove references from the deleted one, possibly causing more removals.
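
To make the "x++" point concrete, here is a tiny illustration (my own example, not taken from DMS output):

#include <iostream>

int main() {
    int x = 0;
    int y = x++;        // the value result of x++ is captured in y...
    (void)y;            // ...but y is never read, so that result is dead
    x++;                // value result discarded immediately, yet the
                        // increment itself is a live side effect
    std::cout << x;     // x is read here, so both increments must stay
    return 0;
}

A tool with the data flow analysis described above could drop the store into y, but it must keep both increments because x is ultimately read.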

To do this analysis for any language, you have to be able to trace results. For C (and C++) this can be pretty ugly: there are "obvious" uses, where a computation result appears in an expression or is assigned to a local/global variable (which is used somewhere else), and there are indirect assignments through pointers, object field updates, arbitrary casts, etc. To know these effects, your dead code analysis tool has to be able to read the entire software system and compute dataflows across it.
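
A sketch of the kind of indirect use that makes this hard; the function and variable names here are mine, purely for illustration:

#include <cstdio>

static int counter = 0;

// The local's value escapes through the pointer and flows into 'counter',
// so a tool needs points-to/interprocedural analysis to see that it is used.
void bump(int *p) { *p += 1; counter += *p; }

int main() {
    int local = 41;     // looks unused if you only scan main textually...
    bump(&local);       // ...but its value is read inside bump via *p
    std::printf("%d\n", counter);
    return 0;
}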

To be safe, you want that analysis to be conservative, e.g., if the tool does not have proof that a result is not used, then it must assume the result is used; you often have to do this with pointers (or array indexes which are just pointers in disguise) because in general you can't determine precisely where a pointer "points". One can obviously build a "safe" tool by assuming all results are used :-} You will also end up with sometimes very conservative but necessary assumptions for library routines for which you don't have the source. In this case, it is helpful to have a set of precomputed summaries of the library side effects (e.g., "strcmp" has none, "sprintf" overwrites a specific operand, "push_back" modifies its object...). Since libraries can be pretty big, this list can be pretty big.
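
One plausible way a tool might record such library summaries, sketched as a small table. This is a hypothetical format I made up for illustration, not DMS's actual representation:

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical side-effect summary for library routines whose source the
// tool never sees: whether the call is pure, which argument positions it
// may overwrite, and whether it mutates its receiver object.
struct EffectSummary {
    bool pure;                    // no observable side effects
    std::vector<int> writesArgs;  // argument positions it may overwrite
    bool mutatesReceiver;         // member call modifies *this
};

const std::map<std::string, EffectSummary> librarySummaries = {
    {"strcmp",    {true,  {},  false}},  // no side effects
    {"sprintf",   {false, {0}, false}},  // overwrites its first operand
    {"push_back", {false, {},  true}},   // modifies the vector it is called on
};

int main() {
    // A dead-code pass would consult the table before assuming a call
    // can be dropped or that its operands are untouched.
    std::cout << librarySummaries.at("push_back").mutatesReceiver << "\n";
    return 0;
}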

DMS in general can parse an entire source code base, build symbol tables (so it knows which identifiers are local/global and their precise types), do control and local dataflow analysis, build a local side-effects summary per function, build a call graph and global side effects, and do a global points-to analysis, providing this "computation used" information with appropriate conservatism.

DMS has been used to do this computation on C code systems of 26 million lines of code (and yes, that's a really big computation; it takes a 100 GB VM to run). We did not implement the dead code elimination part (the project had another purpose), but that is straightforward once you have this data. DMS has done dead code elimination on large Java codes with a more conservative analysis (e.g., "no mentions of an identifier as a use" means all assignments to that identifier are dead), which causes a surprising amount of code removal in many real codes.

DMS's C++ parser presently builds symbol tables and can do control flow analysis for C++98, with C++11 close at hand. We still need local data flow analysis, which is some effort, but the global analyses already pre-exist in DMS and are available to be used for this purpose. (The "no uses of an identifier" check is easily available from the symbol table data, if you don't mind a more conservative analysis.)
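
A small example of what that cheaper, identifier-level analysis would flag (again my own example): an identifier that is assigned but never read, so every assignment to it is dead.

#include <iostream>

int main() {
    int total = 0;
    int debugCount = 0;            // assigned here and in the loop...
    for (int i = 0; i < 10; ++i) {
        total += i;
        debugCount += 1;           // ...but never read anywhere, so all
                                   // assignments to debugCount are dead
    }
    std::cout << total << "\n";
    return 0;
}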

In practice, you don't want the tool to just silently rip things out; some of it might actually be computations you wish to preserve anyway. What the Java tool does is produce two results: a list of dead computations, which you can inspect to decide whether you believe it, and a dead-code-removed version of the source code. If you believe the dead code report, you keep the dead-code-removed version; if you see a "dead" computation you think shouldn't be dead, you modify the code to make it not dead and run the tool again. With a big code base, inspecting the dead code report itself can be trying: how do you know whether some apparently dead code isn't valued by somebody else on your team? (Version control can be used to recover if you goof!)

A really tricky issue that we do not handle (nor does any tool I know of) is "dead code" in the presence of conditional compilation. (Java does not have this problem; C has it in spades, C++ systems much less.) This can be truly nasty. Imagine a conditional in which one arm has certain side effects and the other arm has different ones, or a case in which one arm is compiled by GCC's C++ compiler and the other by Microsoft's, and the compilers disagree on what the constructs do (yes, C++ compilers do disagree in dark corners). At best we can be very conservative here.
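
A hedged illustration of that trap (the macro name is invented for the example): whether the side effect exists at all depends on which arm the preprocessor keeps, so the dead-code answer differs per build configuration.

#include <cstdio>

int log_calls = 0;

int work(int x) {
#if defined(USE_FAST_PATH)
    return x * 2;          // this arm never touches log_calls
#else
    ++log_calls;           // this arm does; whether this side effect
    return x + x;          // exists depends on the build configuration
#endif
}

int main() {
    int r = work(21);
    std::printf("%d %d\n", r, log_calls);
    return 0;
}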

Clang has some ability to do flow analysis, and some ability to do source transformations, so it might be coerced into doing this. I don't know whether it can do any global flow/points-to analysis. It seems biased towards single compilation units, since its principal use is compiling one compilation unit at a time.

answered Sep 21 '22 by Ira Baxter