I have a few-million-line PHP code base without true separation of display and logic, and I am trying to extract all of the strings in the code for the purposes of localization. Separation of display and logic is a long-term goal, but for now I just want to be able to localize.
In the code, strings are represented in every possible format for PHP, so I need a theoretical (or practical) way to parse our entire source and at the very least LOCATE where each string lives. Ideally, of course, I'd replace every string with a function call, for example
"this is a string"
would be replaced with
_("this is a string")
Of course I'd need to support both single- and double-quoted strings. I'm not too concerned about the other formats; they appear so infrequently that I can change them manually.
Also, I wouldn't want to localize array indexes of course. So strings like
$arr["value"]
should not become
$arr[_("value")]
Can anyone help me get started with this?
You could use token_get_all() to get all the tokens from a PHP file, e.g.:
<?php
$fileStr = file_get_contents('file.php');
foreach (token_get_all($fileStr) as $token) {
    // single-character tokens such as ';' come back as plain strings, so skip them
    if (is_array($token) && $token[0] === T_CONSTANT_ENCAPSED_STRING) {
        echo "found string {$token[1]}\r\n";
        // $token[2] is the line number of the string
    }
}
You could do a really dirty check that it isn't being used as an array index by something like:
$fileLines = file('file.php');
// inside the loop, within the if block:
$line = $fileLines[$token[2] - 1];
if (false === strpos($line, "[{$token[1]}]")) {
    // not an array index
}
but you will really struggle to do this properly, because someone might have written something you are not expecting, e.g.:
$str = 'string that is not immediately an array index';
doSomething($array[$str]);
Edit
As Ant P says, you would probably be better off looking for [ and ] in the surrounding tokens for the second part of this answer rather than my strpos hack, something like this:
$tokens = token_get_all(file_get_contents('file.php'));
$num = count($tokens);
for ($i = 0; $i < $num; $i++) {
    $token = $tokens[$i];
    // single-character tokens are plain strings, so check is_array() first
    if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
        // not a string, ignore
        continue;
    }
    // ?? null avoids undefined-offset notices at the first and last token
    if (($tokens[$i - 1] ?? null) === '[' && ($tokens[$i + 1] ?? null) === ']') {
        // immediately used as an array index, ignore
        continue;
    }
    echo "found string {$token[1]}\r\n";
    // $token[2] is the line number of the string
}
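If you want to go one step further and actually rewrite the source, the same token loop can be used to rebuild the file, since concatenating the tokens from token_get_all() reproduces the original text. The following is only a minimal sketch under that assumption: wrapStringLiterals() is a hypothetical helper name, and it reuses the same naive "[ before, ] after" check as above. It does not handle array literal keys such as ['key' => 'value'], interpolated double-quoted strings (which are not reported as T_CONSTANT_ENCAPSED_STRING), or places where PHP does not allow a function call, such as constant and property default values.

```php
<?php
// Hypothetical helper: rebuilds the source from its tokens, wrapping each
// plain string literal in _() unless it is used directly as an array index.
function wrapStringLiterals(string $source): string
{
    $tokens = token_get_all($source);
    $num = count($tokens);
    $out = '';
    for ($i = 0; $i < $num; $i++) {
        $token = $tokens[$i];
        if (!is_array($token)) {
            // single-character tokens such as '[' or ';' are plain strings
            $out .= $token;
            continue;
        }
        $prev = $tokens[$i - 1] ?? null;
        $next = $tokens[$i + 1] ?? null;
        $isString = $token[0] === T_CONSTANT_ENCAPSED_STRING;
        $isIndex = $prev === '[' && $next === ']';
        // $token[1] is the literal exactly as written, quotes included
        $out .= ($isString && !$isIndex) ? '_(' . $token[1] . ')' : $token[1];
    }
    return $out;
}

echo wrapStringLiterals('<?php echo "hi"; $a = $arr["key"];');
// prints: <?php echo _("hi"); $a = $arr["key"];
```

Run it on a copy of the code under version control and review the diff, rather than trusting the output blindly.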
Besides associative arrays, there are other situations likely to exist in the code base that an automatic search and replace will utterly break.
SQL queries:
$myname = "steve";
$sql = "SELECT foo FROM bar WHERE name = " . $myname;
Indirect variable references:
$bar = "Hello, World"; // a string that needs localization
$foo = "bar"; // a string that should not be localized
echo($$foo);
SQL string manipulation:
$sql = "SELECT CONCAT('Greetings, ', firstname) as greeting from users where id = ?";
There is no automatic way to filter for all possibilities. Perhaps the solution would be to write an application that creates a "moderation" queue of possible strings and displays each one highlighted and in context of several lines of code. You could then glance at the code to determine if it is a string that needs localization or not and hit a single key to localize or ignore the string.
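As a rough starting point for such a queue, the candidates can be collected together with a few lines of surrounding code. This is only a sketch: collectCandidates() and the shape of the returned entries are made up for illustration, and the review interface itself (displaying each entry and recording the localize/ignore decision) is left out.

```php
<?php
// Hypothetical helper: returns one review entry per plain string literal,
// with up to $context lines of surrounding code on each side.
function collectCandidates(string $source, int $context = 2): array
{
    $lines = preg_split('/\R/', $source);
    $queue = [];
    foreach (token_get_all($source) as $token) {
        if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
            continue;
        }
        $lineNo = $token[2]; // 1-based line on which the literal starts
        $start = max(1, $lineNo - $context);
        $length = ($lineNo - $start) + 1 + $context;
        $queue[] = [
            'string'  => $token[1],
            'line'    => $lineNo,
            'context' => array_slice($lines, $start - 1, $length),
        ];
    }
    return $queue;
}

$source = "<?php\n\$name = 'steve';\necho 'Hello';\n";
foreach (collectCandidates($source, 1) as $entry) {
    echo "line {$entry['line']}: {$entry['string']}" . PHP_EOL;
}
```

Each entry carries enough information to show the string in context and, once a reviewer approves it, to locate it again for replacement.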