I'm working on regular expressions homework where one question is:
Using language reference manuals online determine the regular expressions for integer numeric constants and identifiers for Java, Python, Perl, and C.
I don't need help on the regular expression, I just have no idea what identifiers look like in Perl. I found pages describing valid identifiers for C, Python and Java, but I can't find anything about Perl.
EDIT: To clarify, finding the documentation was meant to be easy (like doing a Google search for python identifiers). I'm not taking a class in "doing Google searches".
Identifiers must start with a letter or underscore ( _ ). Identifiers may contain Unicode letter characters, decimal digit characters, Unicode connecting characters, Unicode combining characters, or Unicode formatting characters. For more information on Unicode categories, see the Unicode Category Database.
A Perl identifier is a name used to identify a variable, function, class, module, or other objects. A Perl variable name starts with either $, @ or % followed by zero or more letters, underscores, and digits (0 to 9). Perl does not allow punctuation characters such as @, $, and % within identifiers.
The first character of an identifier must be a letter or an underscore, and it can then be followed by any number of letters, digits, or underscores. Special characters such as '*', '#', '@', and '$' are not allowed within an identifier.
A valid identifier may contain only the letters [A-Z] and [a-z], the digits [0-9], the underscore (_), and the dollar sign ($). For example, @javatpoint is not a valid identifier because it contains the special character @. An identifier must not contain spaces; for example, java tpoint is an invalid identifier.
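Integer constants in Perl can be in base 16 if they start with ^0x, in base 2 if they start with ^0b, in base 8 if they start with 0, and in base 10 otherwise.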
Following that leader is any number of valid digits in that base and also optional underscores.
Note that digit does not mean \p{POSIX_Digit}; it means \p{Decimal_Number}, which is really quite different, you know.
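Taken literally, the leader-plus-digits description above can be sketched as a single pattern. This is only an approximation under stated assumptions: it sticks to ASCII digits for simplicity, it allows zero digits after the leader because the text says "any number", and the name $int_const is made up for the example. It does not claim to mirror perl's tokenizer exactly.

```perl
use strict;
use warnings;

# Leader first, then any number of digits valid in that base, plus underscores.
# Assumption: ASCII digits only, which is narrower than \p{Decimal_Number}.
my $int_const = qr{
    \A
    (?:
        0x [0-9a-fA-F_]*    # base 16
      | 0b [01_]*           # base 2
      | 0  [0-7_]*          # base 8, which also covers a plain 0
      | [1-9] [0-9_]*       # base 10
    )
    \z
}x;

for my $candidate (qw(0x1F 0b1010 0777 1_000_000 0 089 -3)) {
    printf "%-10s %s\n", $candidate,
        $candidate =~ $int_const ? "integer constant" : "no match";
}
```

The last two candidates are rejected: 089 has digits that are not valid in base 8, and -3 carries a sign that is not part of the constant.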
Please note that any leading minus sign is not part of the integer constant, which is easily proven by:
$ perl -MO=Concise,-exec -le '$x = -3**$y'
1 <0> enter
2 <;> nextstate(main 1 -e:1) v:{
3 <$> const(IV 3) s
4 <$> gvsv(*y) s
5 <2> pow[t1] sK/2
6 <1> negate[t2] sK/1
7 <$> gvsv(*x) s
8 <2> sassign vKS/2
9 <@> leave[1 ref] vKP/REFC
-e syntax OK
See the 3 const, and much later on the negate op-code? That tells you a bunch, including a curiosity of precedence.
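The same precedence curiosity can be seen from the values themselves. A minimal sketch, with $y set to 2 purely for illustration:

```perl
use strict;
use warnings;

my $y = 2;
my $x      = -3**$y;      # parsed as -(3**$y): the minus is a separate negate op
my $forced = (-3)**$y;    # parentheses force the other reading

print "$x $forced\n";     # prints "-9 9"
```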
Identifiers specified via symbolic dereferencing have absolutely no restriction whatsoever on their names.
100->(200) calls the function named 100 with the argument list (200).
${"What’s up, doc?"} refers to the scalar package variable by that name in the current package.
${"What's up, doc?"} refers to the scalar package variable whose name is ${"s up, doc?"} and which is not in the current package, but rather in the What package. Well, unless the current package is the What package, of course. Similarly, $Who's is the $s variable in the Who package.
One can also have identifiers of the form ${^identifier}; these are not considered symbolic dereferences into the symbol table.
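If you want to poke at the symbolic-dereference cases yourself, something like the following should do. It is a sketch only: the variable name "not a normal name!" and the body of the function named 100 are invented for the example, and it deliberately avoids the apostrophe forms.

```perl
use strict;
use warnings;

my ($value, $reply);
{
    no strict 'refs';    # symbolic references are refused under strict refs

    # Any string at all can name a package variable this way.
    ${"not a normal name!"} = 42;
    $value = ${"not a normal name!"};

    # Install and then call a function whose name is all digits.
    *{"100"} = sub { return "hello from 100" };
    $reply = &{"100"}();
}

print "$value, $reply\n";
print exists $main::{"not a normal name!"}
    ? "found in the main symbol table\n"
    : "missing\n";
```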
Identifiers consisting of a single character can be a punctuation character, including $$ or %!.
Identifiers can also be of the form $^C, which is either a control character or a circumflex followed by a non-control character.
If none of those things is true, a (non–fully qualified) identifier follows the Unicode rules related to characters with the property ID_Start followed by those with the property ID_Continue. However, it overrules this in allowing all-digit identifiers and identifiers that start with (and perhaps have nothing else beyond) an underscore. You can generally pretend (but it’s really only pretending) that that is like saying \w+, where \w is as described in Annex C of UTS#18. That is, anything that has any of these properties: Alphabetic, Decimal_Number (which is rather more than merely [0-9]), Mark, or Connector_Punctuation.
So either ^\d+$ or else
^[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+$
ought to do it for the really simple ones if you don’t care to explore the intricacies of the Unicode ID_Start and ID_Continue properties. That’s how it’s really done, but I bet your instructor doesn’t know that. Perhaps one shan’t tell him, eh?
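For the really simple case, the two patterns above can be folded into one and tried on a few names. This is a sketch, not the real rule: the helper is_simple_ident is a made-up name, and the combined pattern is looser than perl itself (it would, for instance, accept a name that starts with a digit and continues with letters).

```perl
use strict;
use warnings;

# Either all digits, or one or more characters carrying the four properties.
my $simple_ident = qr/\A(?:\d+|[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+)\z/;

# Hypothetical helper: true if the name passes the simplified test.
sub is_simple_ident {
    my ($name) = @_;
    return $name =~ $simple_ident ? 1 : 0;
}

for my $name ("foo_bar", "_", "123", "foo-bar", "two words") {
    printf "%-10s %s\n", $name, is_simple_ident($name) ? "ok" : "rejected";
}

# Non-ASCII letters qualify too: Greek alpha and beta, written as escapes.
print "Greek: ", (is_simple_ident("\x{3b1}\x{3b2}") ? "ok" : "rejected"), "\n";
```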
But you should cover the nonsimple ones I described earlier.
And we haven’t talked about packages yet.
Beyond those simple rules, you must also consider that identifiers may be qualified with a package name, and package names themselves follow the rules of identifiers.
The package separator is either :: or ' at your whim.
You do not have to specify a package if it is the first component in a fully qualified identifier, in which case it means the package main. That means things like $::foo and $'foo are equivalent to $main::foo, and isn't_it() is equivalent to isn::t_it().
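The omitted-package rule is easy to check by comparing references. A small sketch that sticks to the :: form only; the variable $foo and its value are just for illustration:

```perl
use strict;
use warnings;

package main;

our $foo = "package variable in main";

# An empty leading package name means main, so both spellings
# refer to one and the same variable.
my $same_variable = \$::foo == \$main::foo;

print "same variable\n" if $same_variable;
print $::foo, "\n";
```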
Finally, as a special case, a trailing double-colon (but not a single-quote) at the end of a hash is permitted, and this then refers to the symbol table of that name.
Thus %main:: is the main symbol table, and because you can omit main, so too is %::.
Meanwhile %foo:: is the foo symbol table, as is %main::foo:: and also %::foo:: just for perversity’s sake.
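Those symbol-table spellings can be compared the same way. A minimal sketch, assuming a throwaway package foo with one variable $bar that exists only for this demonstration:

```perl
use strict;
use warnings;

package foo;
our $bar = 1;    # creates an entry named "bar" in %foo::

package main;

my $same_table = \%main:: == \%::;            # both are the main symbol table
my $has_bar    = exists $foo::{bar};          # look a name up in %foo::
my $nested     = \%foo:: == \%main::foo::;    # foo's table, reached through main

print "same_table=$same_table has_bar=$has_bar nested=$nested\n";
```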
It’s nice to see instructors giving people non-trivial assignments. The question is whether the instructor realized it was non-trivial. Probably not.
And it’s hardly just Perl, either. Regarding the Java identifiers, did you figure out yet that the textbooks lie? Here’s the demo:
$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: ^\033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[
Yes, it’s true. It is also true for many other code points, especially if you use -encoding UTF-8 on the compile line. Your job is to find the pattern that describes these startlingly unforbidden Java identifiers. Hint: make sure to include code point U+0000.
There, aren’t you glad you asked? Hope this helps. Or something. ☺