Encode module and inverted commas

I am scraping a web page and extracting a specific section from it. That section includes inverted commas (character 146). I'm trying to print my extracted data to a text file, but it's giving me ’ instead of the inverted comma. I have tried the following:

  • $content =~ s/’/'/g;
  • my $invComma = chr 146; $content =~ s/$invComma/'/g;
  • $content =~ s/\x{0092}/'/g;

None of it has worked. I can't decode('UTF-8', $content) because it has wide characters. When I try to encode('UTF-8', $content) the ’ changes to ’ instead. I have already tried use utf8 as well, to no effect.

I know that my text file viewer can display inverted commas, because I printed one to a test file and opened it. The problem is therefore in my script.

What am I doing wrong, and how do I fix it?

UPDATE: I am able to do $content =~ s/’/'/g to replace it with a simple apostrophe, but I still don't know why nothing else works. I'd also like a fix that actually solves the problem, instead of just solving one of the symptoms.

UPDATE 2: I have been informed by hobbs that the character is actually U+2019 RIGHT SINGLE QUOTATION MARK and changed my regex to use chr 0x2019 which now works.

Lilith asked Jun 27 '26 04:06


1 Answer

The character you're trying to replace is only 0x92 / 146 in the Windows-1252 encoding. Perl uses Unicode, where that character is U+2019 RIGHT SINGLE QUOTATION MARK, aka "\x{2019}", chr(0x2019), or chr(8217).
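
As a minimal sketch of the point above (the sample string and variable names are made up for illustration, and it assumes the scraped bytes really are Windows-1252), the following shows that byte 0x92 decodes to U+2019, that the substitution then has to target \x{2019}, and what the character looks like once encoded as UTF-8. In Unicode, chr(146) is U+0092, a C1 control character, which would explain why matching on chr 146 or \x{0092} found nothing in an already-decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw Windows-1252 bytes, where 0x92 (146) is the
# curly apostrophe.
my $bytes = "It\x92s";

# Decoding turns the byte 0x92 into the character U+2019.
my $text = decode('cp1252', $bytes);
my $code = sprintf 'U+%04X', ord(substr($text, 2, 1));

# Option 1: replace the curly apostrophe with a plain ASCII one.
(my $plain = $text) =~ s/\x{2019}/'/g;

# Option 2: keep the character and encode to UTF-8 bytes for output.
my $utf8 = encode('UTF-8', $text);
my $hex  = join ' ', map { sprintf '%02X', ord } split //, $utf8;

print "$code\n";   # U+2019
print "$plain\n";  # It's
print "$hex\n";    # 49 74 E2 80 99 73
```

If the goal is to keep the curly apostrophe in the output file rather than replace it, one common approach is to open the output handle with an encoding layer, e.g. open my $fh, '>:encoding(UTF-8)', $path, so that Perl encodes the characters on write. Whether that suits this script depends on how $content was obtained and whether it has already been decoded.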

hobbs answered Jun 29 '26 05:06
