
Finding all strings in a PHP code base

Tags: string, php

I have a few-million-line PHP code base without true separation of display and logic, and I am trying to extract all of the strings that are represented in the code for the purposes of localization. Separation of display and logic is a long term goal, but for now I just want to be able to localize.

In the code, strings are represented in every possible format for PHP, so I need a theoretical (or practical) way to parse our entire source and at the very least LOCATE where each string lives. Ideally, of course, I'd replace every string with a function call, for example

"this is a string"

would be replaced with

_("this is a string")

Of course I'd need to support both single- and double-quoted strings. I'm not too concerned about the other formats; they appear so infrequently that I can change them manually.

Also, I wouldn't want to localize array indexes of course. So strings like

$arr["value"]

should not become

$arr[_("value")]

Can anyone help me get started on this?

Ray asked Dec 13 '22 05:12



2 Answers

You could use token_get_all() to get all the tokens from a PHP file, e.g.:

<?php

$fileStr = file_get_contents('file.php');

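foreach (token_get_all($fileStr) as $token) {
    // single-character tokens such as ';' are plain strings; only array tokens carry an ID
    if (is_array($token) && $token[0] === T_CONSTANT_ENCAPSED_STRING) {
        echo "found string {$token[1]}\r\n";
        // $token[2] is the line number of the string
    }
}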

You could do a really dirty check that it isn't being used as an array index with something like:

$fileLines = file('file.php');

//inside the loop and if
$line = $fileLines[$token[2] - 1];
if (false === strpos($line, "[{$token[1]}]")) {
    //not an array index
}

but you will really struggle to do this properly because someone might have written something you might not be expecting e.g.:

$str = 'string that is not immediately an array index';
doSomething($array[$str]);

Edit: As Ant P says, for the second part of this answer you would probably be better off looking for [ and ] in the surrounding tokens rather than using my strpos hack, something like this:

$tokens = token_get_all(file_get_contents('file.php'));
$num = count($tokens);
for ($i = 0; $i < $num; $i++) {
    $token = $tokens[$i];

    // single-character tokens are plain strings, so check is_array() first
    if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
        // not a string literal, ignore
        continue;
    }

    // ?? null guards against reading past either end of the token list
    if (($tokens[$i - 1] ?? null) === '[' && ($tokens[$i + 1] ?? null) === ']') {
        // immediately used as an array index, ignore
        continue;
    }

    echo "found string {$token[1]}\r\n";
    // $token[2] is the line number of the string
}
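
If you want to go one step further and actually wrap the strings, here is a minimal sketch of the same idea that rebuilds the source with each literal wrapped in _(). The helper names neighbourToken() and wrapStrings() are made up for this example. It skips whitespace when looking for [ and ], so $arr[ 'k' ] is also treated as an index. Double-quoted strings that contain variables are not tokenized as T_CONSTANT_ENCAPSED_STRING, so they are left untouched.

```php
<?php

// Hypothetical helper: returns the nearest non-whitespace token before or
// after position $i ($step is -1 or +1), or null if there is none.
function neighbourToken(array $tokens, int $i, int $step)
{
    for ($j = $i + $step; isset($tokens[$j]); $j += $step) {
        if (!is_array($tokens[$j]) || $tokens[$j][0] !== T_WHITESPACE) {
            return $tokens[$j];
        }
    }
    return null;
}

// Hypothetical helper: wraps plain string literals in _() and leaves
// literals used directly as an array index untouched.
function wrapStrings(string $source): string
{
    $tokens = token_get_all($source);
    $out = '';
    foreach ($tokens as $i => $token) {
        if (!is_array($token)) {
            $out .= $token; // single-character token such as ';' or '['
            continue;
        }
        $isIndex = neighbourToken($tokens, $i, -1) === '['
            && neighbourToken($tokens, $i, 1) === ']';
        if ($token[0] === T_CONSTANT_ENCAPSED_STRING && !$isIndex) {
            $out .= '_(' . $token[1] . ')';
        } else {
            $out .= $token[1]; // every other token is copied verbatim
        }
    }
    return $out;
}

echo wrapStrings('<?php echo "hi" . $arr[ \'k\' ];');
// prints: <?php echo _("hi") . $arr[ 'k' ];
```

Treat the output as a draft to review, not something to commit blindly. A one-element short array such as ['a'] looks exactly like an index to this check and is skipped, keys in 'key' => 'value' pairs are wrapped, and literals in places where PHP does not allow a function call (constant declarations, default parameter values) are wrapped too and will no longer compile.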
Tom Haigh answered Dec 23 '22 11:12



Besides associative array keys, there are other situations likely to exist in the code base that an automatic search and replace will utterly break.

SQL queries:

$myname = "steve";
$sql = "SELECT foo FROM bar WHERE name = " . $myname;

Indirect variable references:

$bar = "Hello, World"; // a string that needs localization
$foo = "bar"; // a string that should not be localized
echo($$foo);

SQL string manipulation:

$sql = "SELECT CONCAT('Greetings, ', firstname) as greeting from users where id = ?";
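
A rough heuristic can at least flag the SQL cases for closer review, even though it cannot decide them. The sketch below is only an illustration: looksLikeSql() is a made-up name, and the keyword pattern is my own assumption and far from complete. It takes the literal as it appears in the source, quotes included.

```php
<?php

// Hypothetical heuristic: true when a string literal starts like a SQL
// statement. The keyword pattern is an assumption and far from complete.
function looksLikeSql(string $literal): bool
{
    // Drop the surrounding quote characters before matching.
    $body = substr($literal, 1, -1);
    $pattern = '/^\s*(SELECT\b.+\bFROM|INSERT\s+INTO|UPDATE\b.+\bSET|DELETE\s+FROM)\b/is';
    return preg_match($pattern, $body) === 1;
}

var_dump(looksLikeSql('"SELECT foo FROM bar WHERE name = "')); // bool(true)
var_dump(looksLikeSql('"Hello, World"'));                       // bool(false)
```

It will produce false positives (an English sentence such as "Select an option from the list" matches) and it misses SQL fragments that do not start a statement, so it is only good for sorting candidates, not for skipping them.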

There is no automatic way to filter for all possibilities. Perhaps the solution would be to write an application that creates a "moderation" queue of possible strings and displays each one highlighted, in the context of several surrounding lines of code. You could then glance at the code to determine whether the string needs localization and hit a single key to localize or ignore it.
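
The data-gathering half of such a queue could be sketched roughly like this, using token_get_all(). The function name collectCandidates() and the shape of the returned array are made up for this example, and the review interface itself is left out.

```php
<?php

// Hypothetical helper: lists every plain string literal in $source together
// with $context lines of code on each side, for manual review.
function collectCandidates(string $source, int $context = 2): array
{
    $lines = explode("\n", $source);
    $queue = [];
    foreach (token_get_all($source) as $token) {
        if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
            continue;
        }
        $line = $token[2];                  // 1-based line where the literal starts
        $first = max(1, $line - $context);  // first context line, clamped to the file start
        $queue[] = [
            'line' => $line,
            'string' => $token[1],
            // array_slice() takes a 0-based offset, hence $first - 1
            'context' => array_slice($lines, $first - 1, $line + $context - $first + 1),
        ];
    }
    return $queue;
}

$source = "<?php\n\$name = 'steve';\n\$sql = \"SELECT foo FROM bar\";\necho 'Hello';\n";
foreach (collectCandidates($source, 1) as $item) {
    echo "line {$item['line']}: {$item['string']}\n";
}
```

Each entry carries enough to show the string in context and to record a localize-or-ignore decision against its line number.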

postfuturist answered Dec 23 '22 12:12
