
Wikipedia weird encoding answer for file_get_contents

<?php

ini_set('user_agent', 'Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');

// No stream context is defined in this script, so the extra arguments are omitted.
echo file_get_contents('http://fr.wikipedia.org/wiki/Brazil');

//echo file_get_contents('http://fr.wikipedia.org/wiki/Argentina');

//echo file_get_contents('http://fr.wikipedia.org/wiki/France');

Wikipedia's response looks like an encoding problem (I can't post it because of StackOverflow's posting rules, but you can see it if you run the script).

(etc.)

That happens for Brazil and Argentina, but other pages (like France) work fine. Any idea what's happening? The pages work fine in a browser, by the way.

hhaamm asked Jan 24 '26 10:01


2 Answers

Finally, I found the problem: I was receiving gzip-compressed HTML. I solved it by using a PHP function to decompress the HTML when the string appears to be binary.
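
The answer does not show the function it used, so the following is only a minimal sketch of one way to do it. It assumes the zlib extension is available, and the helper name maybe_gunzip is hypothetical. Instead of guessing whether the string "looks binary", it checks for the two magic bytes that start every gzip stream.

```php
<?php
// Hypothetical helper (not from the original post): returns the decoded body
// when $body starts with the gzip magic bytes, otherwise returns it unchanged.
function maybe_gunzip(string $body): string
{
    // Every gzip stream begins with the two bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }

    // gzdecode() needs the zlib extension and returns false on corrupt input.
    $decoded = @gzdecode($body);

    // Fall back to the raw body if the data could not be decoded.
    return $decoded === false ? $body : $decoded;
}

// Usage with the request from the question (commented out to avoid a network call):
// echo maybe_gunzip(file_get_contents('http://fr.wikipedia.org/wiki/Brazil'));
```

Pages that arrive uncompressed (like the France page in the question) pass through untouched, so the same call can wrap every request.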

I thought cURL would handle the compression transparently for the developer, but I had the same problem. I think it's probably a Wikipedia issue.
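
One possible explanation for the cURL result, assuming the request was made with default options: libcurl only decompresses a response when CURLOPT_ENCODING has been set. The sketch below shows that option in use. The helper name fetch_decoded and the user agent string are placeholders, not taken from the original post.

```php
<?php
// Hypothetical helper (not from the original post): fetches $url with cURL and
// asks libcurl to negotiate and decode compressed responses itself.
function fetch_decoded(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING       => '',   // empty string: accept and decode every encoding libcurl supports
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-script)', // placeholder value
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    return $body;
}

// Usage (commented out to avoid a network call):
// echo fetch_decoded('http://fr.wikipedia.org/wiki/Brazil');
```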

hhaamm answered Jan 26 '26 01:01


If you're running this from the console, make sure the terminal uses UTF-8 (this should be the case on Linux; I'm not sure whether it's possible at all on Windows).

If you're viewing the output in a browser, add header('Content-Type: text/html; charset=UTF-8'); at the beginning of your script, before any output, to tell the browser the correct encoding.

MaxSem answered Jan 25 '26 23:01