
How to tokenize Perl source code?

Tags: perl, tokenize

I have some reasonable (not obfuscated) Perl source files, and I need a tokenizer that will split them into tokens and return the token type of each, e.g. for the script

print "Hello, World!\n";

it would return something like this:

  • keyword 5 bytes
  • whitespace 1 byte
  • double-quoted-string 17 bytes
  • semicolon 1 byte
  • whitespace 1 byte

Which is the best library (preferably written in Perl) for this? It has to be reasonably correct, i.e. it should be able to parse syntactic constructs like qq{{\}}}, but it doesn't have to know about special parsers like Lingua::Romana::Perligata. I know that parsing Perl is Turing-complete, and that only Perl itself can do it right, but I don't need absolute correctness: the tokenizer may fail, be incompatible, or assume some default in very rare corner cases, as long as it works correctly most of the time. It must be better than the syntax highlighting built into an average text editor.

FYI, I tried the PerlLexer in Pygments, which works reasonably well for most constructs, except that it cannot find the second print keyword in this one:

print length(<<"END"); print "\n";
String
END
asked Aug 19 '10 by pts



2 Answers

PPI
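
A minimal sketch (assuming PPI is installed; the output format is just an illustration, not part of the answer) of how PPI's low-level tokenizer could produce a token-class-plus-length listing like the one asked for:

use strict;
use warnings;
use PPI::Tokenizer;

# The example line from the question; single quotes keep \n as two literal bytes.
my $source = 'print "Hello, World!\n";' . "\n";

# PPI::Tokenizer->new accepts a reference to a string of Perl source code.
my $tokenizer = PPI::Tokenizer->new(\$source)
    or die "Could not create tokenizer";

# all_tokens returns a reference to an array of PPI::Token objects.
for my $token (@{ $tokenizer->all_tokens }) {
    # ref($token) is the token class, e.g. PPI::Token::Word,
    # PPI::Token::Whitespace, PPI::Token::Quote::Double, ...
    printf "%-30s %d bytes\n", ref($token), length($token->content);
}

For the example line this should print something close to the listing in the question: PPI::Token::Word (5 bytes), PPI::Token::Whitespace (1 byte), PPI::Token::Quote::Double (17 bytes), and so on.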

answered Oct 21 '22 by daxim


use PPI;

Yes, only perl can parse Perl; however, PPI is the 95%-correct solution.
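
A quick way (just a sketch, using the question's snippet for illustration) to check how PPI handles a given piece of code, such as the heredoc example that tripped up the Pygments PerlLexer, is to dump its element tree:

use strict;
use warnings;
use PPI;
use PPI::Dumper;

# The heredoc snippet from the question.
my $code = <<'PERL';
print length(<<"END"); print "\n";
String
END
PERL

# PPI::Document->new also accepts a reference to a string of source code.
my $document = PPI::Document->new(\$code)
    or die "Could not parse snippet";

# Print the element tree; if the heredoc is handled correctly, both
# print statements appear as PPI::Token::Word elements.
PPI::Dumper->new($document)->print;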

answered Oct 21 '22 by szbalint