Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can one create a data flow graph (DFG/SDFG) for any application from its source code

I have done a lot of research to figure out how a DFG can be created for an application from its source code. There are DFG's available online for certain applications such as MP3 Decoder, JPEG compression and H.263 Decoder.

I haven't been able to figure out how I can create a DFG for an application such as HEVC from its source code? Are there any tools which can instantly generate data flow graphs for such elaborate applications or does it have to be done manually?

Please advise me regarding this matter.

EDIT: I used Doxygen for HEVC and I could see how different functions were interacting with each other. However, every function had many entry and exit points and output of Doxygen became too confusing to follow after a while.

I also looked at StreamIt: http://camlunity.ru/swap/Library/Conflux/Stream%20Programming/streamit-cc_stream_graph_programming_language.pdf

It seemed handy but the graphs it generated for even simpler applications (like MP3 Decoder) were too complex. In order to generate a coherent DFG, will I have to re-write the entire source code?

like image 300
a_sid Avatar asked Jan 04 '17 21:01

a_sid


People also ask

What is data flow graph in embedded system?

The Data Flow Graph (DFG) model translates the data processing requirements into a data flow graph. The Data Flow Graph (DFG) is a data driven model in which the program execution is determined by data. This model emphasizes on the data and operations on the data which transforms the input data to output data.

What is a data flow graph?

A data-flow graph is a collection of arcs and nodes in which the nodes are either places where variables are assigned or used, and the arcs show the relationship between the places where a variable is assigned and where the assigned value is subsequently used.

What is data flow graph in compiler design?

It is the analysis of flow of data in control flow graph, i.e., the analysis that determines the information regarding the definition and use of data in program. With the help of this analysis, optimization can be done.

What is CDFG in embedded system?

Our fundamental model for programs is the control/dataflow graph (CDFG). (We can also model hardware behaviorwiththe CDFG.) As the name implies, the CDFG has constructs that model bothdata operations (arithmetic and other computations) and controloperations (conditionals).


1 Answers

You want to extract data flow graphs from arbitrary languages. You imply you want a single way to do it. This isn't practical by hand... you need a tool.

Such a tool is singularly hard to build.

To do this, for each language you must be able to:

  • Define the language to the tool, in the form you find it in practice (not just the language reference manual version). C++ in the wild is bent in a lot of funny ways compared to the standard.
  • Parse programs in the language as found in the field, perhaps as one file, perhaps as tens of thousands; some programs aren't small.
  • Build structures representing the language elements and their relations to one another (this is often done as an abstract syntax tree)
  • Determine for each literal what its actual value is; "a\xbc" has very different values depending on whether the language thinks it is ascii or unicode text with escape sequences
  • Find all the identifiers in the code, and determine for each one which definitional/type information is associated with it according the language scoping rules
  • Determine the sources of data (literal values, inputs from the outside world, results of expressions) and track where those data values are used in other parts of the program across various control flow constructs
  • Presumably draw some picture of the resulting data flow.

Each of these tasks by themselves are hard, because the languages tend to be complex. Most language tools that can do this at all (mostly compilers) do it only for one dialect of the language.

To do it for more than one language/dialect, you need a tool that can be configured for all the details for each language, and you have to configure for all the languages of interest. [Realistically you cannot "do them all"; there's thousands of computer languages in use right now].

Even limiting yourself to the "everyday" common programming languages, this is an enormous amount of work; it can take a few years to do all this well for a single mainstream language. You won't succeed in doing this by yourself.

My company builds a single, unified tool that is intended to be capable of doing this: the DMS Software Reengineering Toolkit. The simple "secret" is to realize that the machinery needed to accomplish the above tasks is actually very similar across languages, and can be designed to configured for a particular language with relatively modest (that doesn't mean "small") effort.

After 20 linear years of engineering with a team of PhD level engineers, we have parsers (even this is hard) for a surprising variety of languages, with full up data flow analyzers of the type you are talking about for C++ (check this link for examples), C, COBOL and almost Java 8.

I don't know of any other unified tools that are this far down the path toward your ideal. Check my bio before you decide I'm clueless about this. (Rascal/MPL has some ambitions but is a research tool at this point; they don't do C or C++ at all) We're only part way there, with many languages and scale battles to fight remaining.

[The goal of DMS isn't data flow analysis; that's just a stepping stone. It is to do automated code transformation, which requires data flow analysis to do safely and correctly].

Of course, you could just hope to find a separate tool for each language. You wouldn't get consistent quality or consistent style/granularity of data flow graphs from separate tools from different authors, if you can indeed get a full set of such tools at all.

like image 55
Ira Baxter Avatar answered Nov 14 '22 01:11

Ira Baxter