I am scraping a web page, and extracting a specific section from it. That section includes inverted commas (’, character 146). I'm trying to print my extracted data to a text file, but it's giving me ’ instead of the inverted comma. I have tried the following:
$content =~ s/’/'/g;
my $invComma = chr 146;
$content =~ s/$invComma/'/g;
$content =~ s/\x{0092}/'/g;

None of it has worked. I can't decode('UTF-8', $content) because it has wide characters. When I try to encode('UTF-8', $content) the ’ changes to ’ instead. I have already tried use utf8 as well, to no effect.
I know that my text file viewer can display inverted commas, because I printed one to a test file and opened it. The problem is therefore in my script.
What am I doing wrong, and how do I fix it?
UPDATE: I am able to do $content =~ s/’/'/g to replace it with a simple apostrophe, but I still don't know why nothing else works. I'd also like a fix that actually solves the problem, instead of just solving one of the symptoms.
UPDATE 2: I have been informed by hobbs that the character is actually U+2019 RIGHT SINGLE QUOTATION MARK and changed my regex to use chr 0x2019 which now works.
The character you're trying to replace is only 0x92 / 146 in the Windows-1252 encoding. Perl uses Unicode, where that character is U+2019 RIGHT SINGLE QUOTATION MARK, aka "\x{2019}", chr(0x2019), or chr(8217).
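As a minimal sketch of that point (not the asker's actual script): the helper name `straighten_quotes` and the sample strings below are made up for illustration, and it assumes the scraped content arrives as raw bytes in a known encoding. Once the bytes are decoded into Perl characters, the apostrophe is U+2019 regardless of whether the source was Windows-1252 (byte 0x92) or UTF-8 (bytes E2 80 99), so a single substitution covers both.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw bytes plus the name of their encoding and
# returns a character string with U+2019 replaced by a plain apostrophe.
sub straighten_quotes {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);   # bytes -> Perl characters
    $text =~ s/\x{2019}/'/g;                # RIGHT SINGLE QUOTATION MARK -> '
    return $text;
}

# In Windows-1252 the character is the single byte 0x92 (decimal 146).
my $from_cp1252 = straighten_quotes("it\x92s", 'cp1252');

# In UTF-8 the same character is the three bytes E2 80 99.
my $from_utf8 = straighten_quotes("it\xE2\x80\x99s", 'UTF-8');

print "$from_cp1252\n";
print "$from_utf8\n";
```

If the curly apostrophe should be kept rather than replaced, the usual approach is to leave the decoded string alone and write it through an encoding layer, for example by opening the output file with `'>:encoding(UTF-8)'`, so Perl encodes the characters on output.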