
How to tokenize Perl source code?

Tags: perl, tokenize

I have some reasonable (not obfuscated) Perl source files, and I need a tokenizer that will split them into tokens and return the token type of each, e.g. for the script

print "Hello, World!\n";

it would return something like this:

  • keyword 5 bytes
  • whitespace 1 byte
  • double-quoted-string 17 bytes
  • semicolon 1 byte
  • whitespace 1 byte

Which is the best library (preferably written in Perl) for this? It has to be reasonably correct, i.e. it should be able to parse syntactic constructs like qq{{\}}}, but it doesn't have to know about special parsers like Lingua::Romana::Perligata. I know that parsing Perl is Turing-complete, and that only Perl itself can do it right, but I don't need absolute correctness: the tokenizer may fail, be incompatible, or assume some default in very rare corner cases, as long as it works correctly most of the time. It must be better than the syntax highlighting built into an average text editor.

FYI, I tried the PerlLexer in Pygments, which works reasonably well for most constructs, except that it cannot find the second print keyword in this one:

print length(<<"END"); print "\n";
String
END
asked Aug 19 '10 by pts



2 Answers

PPI
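
A minimal sketch (assuming PPI is installed; the output format is just an illustration, not part of the answer) of how PPI's low-level tokenizer could produce a token-class-plus-length listing like the one asked for:

use strict;
use warnings;
use PPI::Tokenizer;

# The example line from the question; single quotes keep \n as two literal bytes.
my $source = 'print "Hello, World!\n";' . "\n";

# PPI::Tokenizer->new accepts a reference to a string of Perl source code.
my $tokenizer = PPI::Tokenizer->new(\$source)
    or die "Could not create tokenizer";

# all_tokens returns a reference to an array of PPI::Token objects.
for my $token (@{ $tokenizer->all_tokens }) {
    # ref($token) is the token class, e.g. PPI::Token::Word,
    # PPI::Token::Whitespace, PPI::Token::Quote::Double, ...
    printf "%-30s %d bytes\n", ref($token), length($token->content);
}

For the example line this should print something close to the listing in the question: PPI::Token::Word (5 bytes), PPI::Token::Whitespace (1 byte), PPI::Token::Quote::Double (17 bytes), and so on.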

answered Oct 21 '22 by daxim


use PPI;

Yes, only perl can parse Perl; however, PPI is the 95%-correct solution.
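
A quick way (just a sketch, using the question's snippet for illustration) to check how PPI handles a given piece of code, such as the heredoc example that tripped up the Pygments PerlLexer, is to dump its element tree:

use strict;
use warnings;
use PPI;
use PPI::Dumper;

# The heredoc snippet from the question.
my $code = <<'PERL';
print length(<<"END"); print "\n";
String
END
PERL

# PPI::Document->new also accepts a reference to a string of source code.
my $document = PPI::Document->new(\$code)
    or die "Could not parse snippet";

# Print the element tree; if the heredoc is handled correctly, both
# print statements appear as PPI::Token::Word elements.
PPI::Dumper->new($document)->print;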

answered Oct 21 '22 by szbalint