I need to get the function blocks (definition and everything, not just declaration), in order to get a function dependency graph. From the function dependency graph, identify connected components and modularize my insanely huge C codebase, one file at a time.
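(The downstream grouping itself is the easy part; here is a minimal sketch with networkx, assuming the caller/callee edge list has already been extracted. The edges below are hypothetical.)

import networkx as nx

# Hypothetical (caller, callee) edges, as if already extracted from one file.
calls = [("init", "parse_config"),
         ("parse_config", "read_file"),
         ("render", "draw_rect")]

g = nx.Graph(calls)  # undirected on purpose: each component = one candidate module
for i, funcs in enumerate(nx.connected_components(g)):
    print("module %d: %s" % (i, sorted(funcs)))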
Problem: I need a C parser that identifies function blocks, just that, nothing more. We have custom types etc., but every signature follows the shape
storage_class return_type function_name ( comma separated type value pairs )
{
//some content I view as generic stuff
}
Solution that I've come up with: use sly and pycparser, like any sane person would, obviously.
Problem with pycparser: it needs the sources run through the preprocessor, which drags in headers from other files, just to identify the code blocks. In my case the includes go six levels deep. I'm sorry I can't show the actual code.
Attempted code with Sly:
from sly import Lexer, Parser
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

class CLexer(Lexer):
    ignore = ' \t\n'
    tokens = {LEXEME, PREPROP, FUNC_DECL, FUNC_DEF, LBRACE, RBRACE, SYMBOL}
    literals = {'(', ')', ',', '\n', '<', '>', '-', ';', '&', '*', '=', '!'}

    LBRACE = r'\{'
    RBRACE = r'\}'
    FUNC_DECL = r'[a-z]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]*\([a-zA-Z_\* \,\t\n]+\)[ ]*\;'
    FUNC_DEF = r'[a-zA-Z_0-9]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]*\([a-zA-Z_\* \,\t\n]+\)'
    PREPROP = r'#[a-zA-Z_][a-zA-Z0-9_\" .\<\>\/\(\)\-\+]*'
    LEXEME = r'[a-zA-Z0-9]+'
    SYMBOL = r'[-!$%^&*\(\)_+|~=`\[\]\:\"\;\'\<\>\?\,\.\/]'

    def __init__(self):
        self.nesting_level = 0
        self.lineno = 0

    @_(r'\n+')
    def newline(self, t):
        self.lineno += t.value.count('\n')

    @_(r'[-!$%^&*\(\)_+|~=`\[\]\:\"\;\'\<\>\?\,\.\/]')
    def symbol(self, t):
        t.type = 'symbol'
        return t

    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

class CParser(Parser):
    # Get the token list from the lexer (required)
    tokens = CLexer.tokens

    @_('PREPROP')
    def expr(self, p):
        return p.PREPROP

    @_('FUNC_DECL')
    def expr(self, p):
        return p.FUNC_DECL

    @_('func')
    def expr(self, p):
        return p.func

    # Grammar rules and actions
    @_('FUNC_DEF LBRACE stmt RBRACE')
    def func(self, p):
        return p.func_def + p.lbrace + p.stmt + p.rbrace

    @_('LEXEME stmt')
    def stmt(self, p):
        return p.LEXEME

    @_('SYMBOL stmt')
    def stmt(self, p):
        return p.SYMBOL

    @_('empty')
    def stmt(self, p):
        return p.empty

    @_('')
    def empty(self, p):
        pass

with open('inputfile.c') as f:
    data = "".join(f.readlines())

data = comment_remover(data)
lexer = CLexer()
parser = CParser()
while True:
    try:
        result = parser.parse(lexer.tokenize(data))
        print(result)
    except EOFError:
        break
Error:
None
None
None
...
None
None
yacc: Syntax error at line 1, token=PREPROP
yacc: Syntax error at line 1, token=LBRACE
yacc: Syntax error at line 1, token=PREPROP
yacc: Syntax error at line 1, token=LBRACE
yacc: Syntax error at line 1, token=PREPROP
...
INPUT:
#include <mycustomheader1.h> //defines type T1
#include <somedir/mycustomheader2.h> //defines type T2
#include <someotherdir/somefile.c>
MACRO_THINGY_DEFINED_IN_SOMEFILE(M1,M2)
static T1 function_name_thats_way_too_long_than_usual(int *a, float* b, T2* c)
{
//some code I don't even care about at this point
}
extern T2 function_name_thats_way_too_long_than_usual(int *a, char* b, T1* c)
{
//some code I don't even care about at this point
}
DESIRED OUTPUT:
function1:
static T1 function_name_thats_way_too_long_than_usual(int *a, float* b, T2* c)
{
//some code I don't even care about at this point
}
function2:
extern T2 function_name_thats_way_too_long_than_usual(int *a, char* b, T1* c)
{
//some code I don't even care about at this point
}
pycparser has a func_defs example that does exactly what you need, but IIUC you're having issues with the preprocessing?
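For reference, the core of that example is just an AST visitor over the parsed file; a condensed sketch (the filename is a placeholder):

from pycparser import c_ast, parse_file

class FuncDefVisitor(c_ast.NodeVisitor):
    def visit_FuncDef(self, node):
        # node.decl.name is the function's name, node.decl.coord its file:line,
        # and node.body is the block you want to extract.
        print('%s defined at %s' % (node.decl.name, node.decl.coord))

ast = parse_file('inputfile.c', use_cpp=True)  # preprocessing setup below
FuncDefVisitor().visit(ast)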
This post describes in some detail why pycparser needs preprocessed files, and how to set it up. If you control the build system it's actually pretty easy. Once you have preprocessed files, the example mentioned above should work.
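Concretely, the setup amounts to pointing parse_file at a real preprocessor; a minimal sketch, assuming gcc is on PATH and using pycparser's bundled fake_libc_include stubs (the extra include path is a placeholder for your own headers):

from pycparser import parse_file

ast = parse_file(
    'inputfile.c',
    use_cpp=True,       # run the preprocessor before parsing
    cpp_path='gcc',     # or 'cpp'/'clang', whatever your toolchain provides
    cpp_args=['-E',     # gcc/clang need -E to behave as a preprocessor
              r'-Iutils/fake_libc_include',   # pycparser's stub system headers
              r'-Ipath/to/your/headers'],     # placeholder: your include dirs
)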
I will also note that statically finding function dependencies is not an easy problem, because of function pointers. You also won't be able to do this accurately with a single file - this needs multi-file analysis.