
Sed regular expression works differently through web browser

Basic issue

Encoding

Since this issue might look encoding-related: everything (text file, bash script file, terminal, web page serving the PHP script, and the PHP script itself) is encoded in UTF-8.

Script

I have a fairly long bash script that performs a series of operations on a text file. For the purpose of this issue, only one sed command matters:

#!/bin/bash
sed -r 's: ([”]):\1:g' -i $1

What it is supposed to do is remove the space before a closing smart quote. The square brackets and parentheses are there because I was using a longer regular expression with more characters and wanted to capture the match for the replacement.

Sample text file to recreate issue:

Lorem ipsum “dolor sit amet,” consectetur adipisicing elit. Numquam eos quos veniam iste.

Command line and web browser

I am using this bash script in two ways:

1) I execute it from the command line on Ubuntu 13.10 by typing ./script.sh text-file

2) I execute it through a web browser (Apache+PHP), using the following code to process a web form and run the script:

<?php

$file = "text-file";

move_uploaded_file($_FILES["file"]["tmp_name"], $file); 
shell_exec("./script.sh $file > /dev/null");
rename("$file", "output.html");
header('Content-Disposition: attachment; filename=output.html');
readfile('output.html');

The problem is this: the script gives one result when executed from the command line (1) and a different one when executed through the web browser (2).

When executed from the command line (1), it changes nothing (since there is nothing to change in this case), so the result is the same as the input, which is the output I want here:

Lorem ipsum “dolor sit amet,” consectetur adipisicing elit. Numquam eos quos veniam iste.

But when it is executed by PHP (2), it removes the space before the opening smart quote, which, according to the regular expression used, should not happen:

Lorem ipsum“dolor sit amet,” consectetur adipisicing elit. Numquam eos quos veniam iste.

After many tests, I have figured out that instead of using:

#!/bin/bash
sed -r 's: ([”]):\1:g' -i $1

I should use:

#!/bin/bash
sed -r 's: ”:”:g' -i $1

which works as expected both from the command line and through PHP.

However, even though I solved my problem and it now works the way I want, I still do not know why PHP changed the way my script worked.

Question

So the question is: why does PHP change how my script (sed) works? Am I doing something wrong? The capture group seems to be part of the problem, but I do not understand why the problem does not occur when the script is simply executed from the command line.


Discoveries

While trying to understand what was causing the problem, I discovered a few more interesting and surprising things about capture groups in sed and perl one-liners.

All examples below were used inside a bash script:

#!/bin/bash
example code

The starting point was:

sed -r 's: ([”]):\1:g' -i $1

which (as described above) worked as expected from the command line (1), but malfunctioned (removed the space) when used through PHP (2).

I then used the same regular expression in a perl one-liner to see whether the issue was sed-specific or broader (that is, something related to regexps or PHP):

perl -i -pe 's| ([”])|\1|smg' $1

What I found was that it misbehaves (removes the space) both from the command line (1) and through PHP (2).

After that, I removed the capture group and left only the square brackets in the sed expression:

sed -r 's: [”]:”:g' -i $1

which works fine from the command line (1), but produces gibberish in the text through PHP (2). When the same regexp was tested with perl:

perl -i -pe 's| [”]|”|smg' $1

it resulted in gibberish in the output both from the command line (1) and through PHP (2).

So it seems that the general issue (removing the space before the opening smart quote) is caused by the combination of a capture group (parentheses) and square brackets. The issue is present with the perl one-liner (both from the command line and through PHP) and with sed (only through PHP).

Even though I know how to get rid of the issue (by removing the capturing parentheses and the brackets), I am still curious why it behaves this way and what is actually causing it (PHP, Apache, or the combination of PHP/Apache and the bash script).
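
One way to look closer at this is to dump the raw bytes of the two quote characters. The sketch below assumes the script file itself is saved as UTF-8 and that od and tr are available; the variable names are only for illustration.

```bash
#!/bin/bash
# Dump the UTF-8 bytes of each smart quote as hex (illustrative variable names).
opening_hex=$(printf '%s' '“' | od -An -tx1 | tr -d ' \n')
closing_hex=$(printf '%s' '”' | od -An -tx1 | tr -d ' \n')

echo "opening quote bytes: $opening_hex"
echo "closing quote bytes: $closing_hex"
```

Both quotes are three bytes long and differ only in the last byte, so a tool that works byte by byte can confuse part of one quote with part of the other.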

Rafal asked Nov 11 '22 09:11


1 Answer

For perl at least, without utf8 enabled in the script source, it sees the smart quote as several separate bytes rather than one character, and ends up splitting it into pieces. What you have used could be written as:

s/ [\xe2\x80\x9d]/\xe2\x80\x9d/g

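This matches a space followed by any single byte of the closing quote (\xe2, \x80 or \x9d), replaces them with a complete closing quote, and leaves the remaining bytes behind as unprintable garbage. The opening quote is \xe2\x80\x9c, so it shares its first byte with the closing quote, which is why the space before it gets matched.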
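
The byte-wise matching described above can be reproduced from the command line by forcing the C locale. This is only a sketch, assuming GNU sed (as on Ubuntu); the shortened sample string and the variable names are illustrative, and the second run assumes the en_US.UTF-8 locale is installed.

```bash
#!/bin/bash
# Shortened sample line from the question.
input='Lorem ipsum “dolor sit amet,” consectetur'

# In the C locale every byte is its own character, so [”] means
# "one of the bytes e2, 80, 9d". The opening quote starts with e2,
# so " “" matches and the space in front of it is dropped.
c_result=$(printf '%s\n' "$input" | LC_ALL=C sed -r 's: ([”]):\1:g')

# Assumption: en_US.UTF-8 is available. In that case [”] is a single
# character and this run should leave the line unchanged.
utf8_result=$(printf '%s\n' "$input" | LC_ALL=en_US.UTF-8 sed -r 's: ([”]):\1:g')

printf 'C locale:     %s\n' "$c_result"
printf 'UTF-8 locale: %s\n' "$utf8_result"
```

If the C locale run shows the same missing space as the PHP run, that supports the idea that the script is executed under a non-UTF-8 locale when started by the web server.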

In perl this is solved by adding use utf8 at the top of your script. For the sed example, I would expect the LANG environment variable to differ between Apache and your shell, which would have a similar effect. This could be fixed by setting LANG explicitly for that command:

LANG="en_US.UTF-8" sed -r 's: ([”]):\1:g' -i $1
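
Note that LANG is only the lowest-priority setting: under POSIX rules LC_ALL overrides LC_CTYPE, which in turn overrides LANG. If the command still misbehaves with LANG set, it may help to log which value actually wins in the environment the script runs in. A minimal sketch; effective_ctype is a hypothetical helper, not part of any tool:

```bash
#!/bin/bash
# Hypothetical helper: report which value decides character handling.
# POSIX precedence: LC_ALL, then LC_CTYPE, then LANG; empty counts as unset,
# and with nothing set the C locale applies.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Log to stderr so the value can be compared between the terminal and Apache.
echo "effective LC_CTYPE: $(effective_ctype)" >&2
```

Calling this at the top of script.sh from both the terminal and PHP would show whether the two environments really differ. If LC_ALL turns out to be set, prefixing the sed command with LC_ALL rather than LANG is the variant that takes effect.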

AKHolland answered Nov 15 '22 07:11