
What characters are allowed in Perl identifiers?

I'm working on regular expressions homework where one question is:

Using language reference manuals online determine the regular expressions for integer numeric constants and identifiers for Java, Python, Perl, and C.

I don't need help on the regular expression, I just have no idea what identifiers look like in Perl. I found pages describing valid identifiers for C, Python and Java, but I can't find anything about Perl.

EDIT: To clarify, finding the documentation was meant to be easy (like doing a Google search for python identifiers). I'm not taking a class in "doing Google searches".

Brendan Long asked Jan 26 '11 00:01


1 Answer

Perl Integer Constants

Integer constants in Perl can be

  • in base 16 if they start with ^0x
  • in base 2 if they start with ^0b
  • in base 8 if they start with 0
  • otherwise they are in base 10.

Following that leader is any number of valid digits in that base and also optional underscores.

Note that digit does not mean \p{POSIX_Digit}; it means \p{Decimal_Number}, which is really quite different, you know.
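
If it helps to see those rules as patterns, here is a minimal sketch in Python (picked only so the patterns stay separate from the language they describe). It assumes ASCII digits, which is narrower than the \p{Decimal_Number} reading above, and it covers only the leaders listed here. The names perl_int_base and perl_int_value are made up for illustration.

```python
import re

# (base, pattern) pairs, tried in order. Assumption: ASCII digits only,
# and only the leaders named in the list above (0x, 0b, 0).
_PATTERNS = [
    (16, re.compile(r"0x_*[0-9a-fA-F][0-9a-fA-F_]*")),
    (2, re.compile(r"0b_*[01][01_]*")),
    (8, re.compile(r"0[0-7_]*")),
    (10, re.compile(r"[1-9][0-9_]*")),
]


def perl_int_base(token):
    """Return the base the token would be read in, or None if no pattern fits."""
    for base, pattern in _PATTERNS:
        if pattern.fullmatch(token):
            return base
    return None


def perl_int_value(token):
    """Return the integer value of the token, ignoring the optional underscores."""
    base = perl_int_base(token)
    if base is None:
        raise ValueError(f"not an integer constant under these rules: {token!r}")
    # int() accepts the 0x / 0b prefix when the matching base is given.
    return int(token.replace("_", ""), base)
```

Under these assumptions perl_int_value("0755") reads the token as octal, and perl_int_base("-3") returns None, because the sign is not part of the constant.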

Please note that any leading minus sign is not part of the integer constant, which is easily proven by:

$ perl -MO=Concise,-exec -le '$x = -3**$y'
1  <0> enter 
2  <;> nextstate(main 1 -e:1) v:{
3  <$> const(IV 3) s
4  <$> gvsv(*y) s
5  <2> pow[t1] sK/2
6  <1> negate[t2] sK/1
7  <$> gvsv(*x) s
8  <2> sassign vKS/2
9  <@> leave[1 ref] vKP/REFC
-e syntax OK

See the 3 const, and much later on the negate op-code? That tells you a bunch, including a curiosity of precedence.

Perl Identifiers

Identifiers specified via symbolic dereferencing have absolutely no restriction whatsoever on their names.

  • For example, 100->(200) calls the function named 100 with the argument list (200).
  • For another, ${"What’s up, doc?"} refers to the scalar package variable by that name in the current package.
  • On the other hand, ${"What's up, doc?"} refers to the scalar package variable whose name is "s up, doc?" and which is not in the current package, but rather in the What package. Well, unless the current package is the What package, of course. Similarly, $Who's is the $s variable in the Who package.

One can also have identifiers of the form ${^identifier}; these are not considered symbolic dereferences into the symbol table.

A single-character identifier can be a punctuation character, as in $$ or %!.

Identifiers can also be of the form $^C, which is either a control character or a circumflex followed by a non-control character.

If none of those things is true, a (non–fully qualified) identifier follows the Unicode rules related to characters with the properties ID_Start followed by those with the property ID_Continue. However, it overrules this in allowing all-digit identifiers and identifiers that start with (and perhaps have nothing else beyond) an underscore. You can generally pretend (but it’s really only pretending) that that is like saying \w+, where \w is as described in Annex C of UTS#18. That is, anything that has any of these:

  • the Alphabetic property — which includes far more than just Letters; it also contains various combining characters and the Letter_Number code points, plus the circled letters
  • the Decimal_Number property, which is rather more than merely [0-9]
  • Any and all characters with the Mark property, not just those marks that are deemed Other_Alphabetic
  • Any characters with the Connector_Punctuation property, of which underscore is just one such.

So either ^\d+$ or else

^[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+$

ought to do it for the really simple ones if you don’t care to explore the intricacies of the Unicode ID_Start and ID_Continue properties. That’s how it’s really done, but I bet your instructor doesn’t know that. Perhaps one shan’t tell him, eh?
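
For a rough feel of that simple class, here is a sketch in Python. Its re module has no \p{...} escapes, so the sketch approximates the four properties through general categories from unicodedata. Assumption, and a real one: Alphabetic is treated as the L* categories plus Nl, which misses the Other_Alphabetic code points that fall outside M*, such as the circled letters. The function names are hypothetical.

```python
import unicodedata


def is_wordish(ch):
    """Approximate Alphabetic, Decimal_Number, Mark and Connector_Punctuation.

    Assumption: general categories stand in for the properties, so this is
    narrower than the real character class.
    """
    cat = unicodedata.category(ch)
    return cat[0] in "LM" or cat in ("Nd", "Nl", "Pc")


def looks_like_simple_identifier(name):
    """True if name is non-empty and every character passes is_wordish."""
    return bool(name) and all(is_wordish(ch) for ch in name)
```

This accepts all-digit names such as "123" and names that are nothing but an underscore, in line with the exceptions described above, and rejects anything containing a hyphen or a space.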

But you should cover the nonsimple ones I described earlier.

And we haven’t talked about packages yet.

Perl Packages in Identifiers

Beyond those simple rules, you must also consider that identifiers may be qualified with a package name, and package names themselves follow the rules of identifiers.

The package separator is either :: or ' at your whim.

You do not have to specify a package if it is the first component in a fully qualified identifier, in which case it means the package main. That means things like $::foo and $'foo are equivalent to $main::foo, and isn't_it() is equivalent to isn::t_it().

Finally, as a special case, a trailing double-colon (but not a single-quote) at the end of a hash name is permitted, and this then refers to the symbol table of that name.

Thus %main:: is the main symbol table, and because you can omit main, so too is %::.

Meanwhile %foo:: is the foo symbol table, as is %main::foo:: and also %::foo:: just for perversity’s sake.
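
The qualification rules above can be sketched as a small splitter, again in Python and again only as an illustration. It takes a name with the sigil already stripped, treats both :: and ' as separators, and reads an empty leading package as main. The helper name split_qualified is made up, and it does no validation of the individual components.

```python
import re


def split_qualified(name):
    """Split a sigil-less name into (package, basename).

    Returns (None, name) for an unqualified name. An empty leading
    component, as in "::foo" or "'foo", is read as the package main.
    """
    parts = re.split(r"::|'", name)
    if len(parts) == 1:
        return (None, name)
    if parts[0] == "":
        parts[0] = "main"
    *package_parts, basename = parts
    return ("::".join(package_parts), basename)
```

With this sketch, split_qualified("isn't_it") gives ("isn", "t_it"), and a trailing double-colon such as "::foo::" comes back with an empty basename and the package "main::foo".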

Summary

It’s nice to see instructors giving people non-trivial assignments. The question is whether the instructor realized it was non-trivial. Probably not.

And it’s hardly just Perl, either. Regarding the Java identifiers, did you figure out yet that the textbooks lie? Here’s the demo:

$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: ^\033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[

Yes, it’s true. It is also true for many other code points, especially if you use -encoding UTF-8 on the compile line. Your job is to find the pattern that describes these startlingly unforbidden Java identifiers. Hint: make sure to include code point U+0000.

There, aren’t you glad you asked? Hope this helps. Or something. ☺

tchrist answered Oct 12 '22 15:10