
Apostrophes are printing out as â\x80\x99

import requests
from bs4 import BeautifulSoup
import re

source_url = requests.get('http://www.nytimes.com/pages/business/index.html')
div_classes = {'class' :['ledeStory' , 'story']}
title_tags = ['h2','h3','h4','h5','h6']

source_text = source_url.text
soup = BeautifulSoup(source_text, 'html.parser')


stories = soup.find_all("div", div_classes)

h = []; h2 = []; h3 = []; h4 =[]

for x in range(len(stories)):

    for x2 in range(len(title_tags)):
        hold = []; hold2 = []
        hold = stories[x].find(title_tags[x2])

        if hold is not None:
            hold2 = hold.find('a')

            if hold2 is not None:
                hh = (((hold.text.strip('a'))).strip())
                h.append(hh)
                #h.append(re.sub(r'[^\x00-\x7f]',r'', ((hold.text.strip('a'))).strip()))
                #h2.append(hold2.get('href'))

    hold = []
    hold = stories[x].find('p')

    if hold is not None:
        h3.append(re.sub(r'[^\x00-\x7f]',r'',((hold.text.strip('p')).strip())))

    else:
        h3.append('None')


h4.append(h)
h4.append(h2)
h4.append(h3)
print(h4)

Hey everyone. I have been wanting to scrape some data, and I had almost completed my scraper when I noticed that the printed output was replacing (') with (â\x80\x99). For example, the title containing "China's" was coming out as "Chinaâ\x80\x99s". I did some research and tried to use decode/encode (utf-8) to no avail; it would just tell me that you cannot run decode on a str(). I tried using re.sub(), which would let me delete (â\x80\x99) but would not let me replace it with ('). Since I want to use natural language processing to interpret the data, I fear that not having apostrophes is going to greatly change the meaning. Help would be greatly appreciated; I feel like I have hit a block with this one.

muraaby asked Oct 17 '22 08:10



1 Answer

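In ISO 8859-1 and related code sets (there are many of them), â has code point 0xE2. When you interpret the three bytes 0xE2, 0x80, 0x99 as a UTF-8 encoding, the character is U+2019, RIGHT SINGLE QUOTATION MARK (which is ’, as distinct from the ASCII apostrophe '; you may or may not be able to spot the difference). In other words, the output you are seeing is what UTF-8 data looks like when it has been decoded as ISO 8859-1: the first byte shows up as â and the other two as the escapes \x80 and \x99.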
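The byte-level relationship described above can be reproduced with a minimal sketch using only the standard library. The variable names are illustrative, and the last step assumes the garbled string really did come from UTF-8 bytes decoded as ISO 8859-1:

```python
# U+2019 RIGHT SINGLE QUOTATION MARK and its UTF-8 encoding
quote = "\u2019"
utf8_bytes = quote.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x99'

# Decoding those three bytes with the wrong codec gives the garbled text:
# 0xE2 becomes 'â', while 0x80 and 0x99 become unprintable control characters
garbled = utf8_bytes.decode("iso-8859-1")
print(ascii(garbled))

# Undoing the wrong decode and redoing it as UTF-8 recovers the original
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == quote)  # True
```

This also suggests why re.sub() could delete the sequence but a plain decode call failed: the value is already a str, so it has to be encoded back to bytes before it can be decoded again.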

I see a few possible sources of your trouble, any one or more of which could apply:

  1. Your terminal is not set up to interpret UTF-8.
  2. Your source code should use ' (U+0027, APOSTROPHE).
  3. You're using Python 2.x rather than Python 3.x and it is having issues because of the use of Unicode (UTF-8). Against this (as Cory Madden pointed out), the code ends with print(h4) which is Python 3, so it probably isn't the issue.
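The first and third possibilities in the list can be checked from inside the script. This is only a rough diagnostic sketch; `describe_environment` is a hypothetical helper name, not part of any library:

```python
import sys

def describe_environment():
    """Report the Python major version and the encoding used for stdout."""
    return {
        "python_major": sys.version_info[0],
        # May be None when stdout is redirected to an object without an encoding
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }

info = describe_environment()
print(info)
```

If the major version is 3 and the stdout encoding is a UTF-8 variant, the terminal and the interpreter are unlikely to be the cause, and the decoding of the downloaded page is the next thing to look at.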

It may be simplest to change the quotation mark into an ASCII apostrophe.
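A minimal sketch of that replacement, assuming the text has already been decoded correctly so that it contains U+2019 rather than the garbled three-character sequence. The helper name `asciify_quotes` and the sample title are made up for illustration:

```python
# Map common typographic quotes to their ASCII equivalents
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})

def asciify_quotes(text):
    """Return text with curly quotes replaced by plain ASCII quotes."""
    return text.translate(QUOTE_MAP)

title = "China\u2019s economy"
print(asciify_quotes(title))  # China's economy
```

Unlike stripping every non-ASCII character with re.sub(), this keeps the apostrophes, which matters for later natural language processing.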

On the other hand, if you are analyzing HTML from elsewhere, you may have to consider how your script is going to handle UTF-8. Using quote marks from the Unicode U+20xx range is a very common choice; maybe your scraper needs to handle it?
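One way a scraper might handle it is to decode the downloaded bytes explicitly as UTF-8 instead of relying on a guessed charset. The sketch below uses a made-up byte string in place of a real download, and `decode_page` is a hypothetical helper. With requests, the analogous step would be to set `source_url.encoding = 'utf-8'` before reading `source_url.text`, on the assumption that the page really is UTF-8:

```python
def decode_page(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes with an explicit encoding.

    errors="replace" keeps the scraper running if a stray byte is invalid.
    """
    return raw_bytes.decode(encoding, errors="replace")

# Made-up sample standing in for the bytes of a downloaded page
raw = "<h2><a href='#'>China\u2019s Growth Slows</a></h2>".encode("utf-8")

html = decode_page(raw)
print("\u2019" in html)  # True: the quotation mark survived as one character
```

Once the text is decoded this way, the quotation mark arrives as a single U+2019 character that can be kept or mapped to an ASCII apostrophe as needed.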

Jonathan Leffler answered Oct 29 '22 21:10
