Encode module and inverted commas

Question

I am scraping a web page, and extracting a specific section from it. That section includes inverted commas (’, character 146). I'm trying to print my extracted data to a text file, but it's giving me â€™ instead of the inverted comma. I have tried the following:

$content =~ s/’/'/g;
my $invComma = chr 146; $content =~ s/$invComma/'/g;
$content =~ s/\x{0092}/'/g;

None of it has worked. I can't decode('UTF-8', $content) because it has wide characters. When I try to encode('UTF-8', $content) the â€™ changes to Ã¢Â€Â™ instead. I have already tried use utf8 as well, to no effect.

I know that my text file viewer can display inverted commas, because I printed one to a test file and opened it. The problem is therefore in my script.

What am I doing wrong, and how do I fix it?

UPDATE: I am able to do $content =~ s/â€™/'/g to replace it with a simple apostrophe, but I still don't know why nothing else works. I'd also like a fix that actually solves the problem, instead of just solving one of the symptoms.

UPDATE 2: I have been informed by hobbs that the character is actually U+2019 RIGHT SINGLE QUOTATION MARK and changed my regex to use chr 0x2019 which now works.

hobbs · Accepted Answer

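The character you're trying to replace is 0x92 (146) only in the Windows-1252 encoding. Perl character strings use Unicode code points, where that character is U+2019 RIGHT SINGLE QUOTATION MARK, which you can write as "\x{2019}", chr(0x2019), or chr(8217). That is why matching chr 146 or \x{0092} finds nothing: in Unicode, U+0092 is an unrelated control character, not a quotation mark.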
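To make that concrete, here is a minimal sketch of the decode, replace, encode round trip. It assumes the scraped page arrives as raw UTF-8 bytes; the sample string and the variable names ($bytes, $ascii, $cp1252, $mojibake) are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw UTF-8 bytes, as they might arrive from an HTTP
# response. U+2019 is the three bytes E2 80 99 in UTF-8.
my $bytes = "It\xE2\x80\x99s here";

# Decode the bytes into Perl characters; U+2019 becomes a single character.
my $content = decode('UTF-8', $bytes);

# Replace the right single quotation mark with a plain apostrophe.
(my $ascii = $content) =~ s/\x{2019}/'/g;

# Only in Windows-1252 is this character the single byte 0x92 (146).
my $cp1252 = encode('cp1252', $content);

# Reading the UTF-8 bytes as Windows-1252 yields the three characters
# U+00E2, U+20AC, U+2122, which is the mojibake seen in the output file.
my $mojibake = decode('cp1252', $bytes);

# Encode on the way out so wide characters are written as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "$ascii\n";
```

If the data has already been decoded by the fetching library (which the "wide characters" error suggests), skip the decode step and only apply the substitution and the output encoding layer.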

Tags:

encoding

utf-8

perl

Lilith