Unable to retrieve Chinese texts while scraping

Question

I have written a script that scrapes the website 1688.com. The site is in Chinese, so whenever I try to retrieve the text I get a bunch of Unicode escape sequences, and when I export to a CSV file, the file is empty. My code:

# -*- coding: utf-8 -*-
import csv
from urllib import urlopen
from bs4 import BeautifulSoup as BS

csv_content = open('content.csv', 'w+')
writer_content = csv.writer(csv_content)

url = urlopen('https://fuzhuang.1688.com/nvzhuang?spm=a260k.635.1998214976.1.7eqUGT')
html = BS(url, 'lxml')
container = html.find('ul', {'class' : 'ch-box fd-clr'})
offers = container.find_all('div', {'class' : 'ch-offer-body'})
lst = []

for offer in offers:
    offer_box = offer.find('div', {'component-name' : '@alife/ocms-component-1688-pc-ch-offer-pic'})
    images = offer_box.find('img')['src']
    title = offer.find('div', {'class' : 'ocms-component-1688-pc-ch-offer-title-0-1-11'}).text
    price = offer.find('div', {'class' : 'ocms-component-1688-pc-ch-offer-price-0-1-14'}).text
    lst.append(price)

for item in lst: writer_content.writerow([item])

print lst

The output is

[u'\n\n\n\xa5\n109.00\n\n\n\u6210\u4ea4\n329\n\u4ef6\n\n\n',
 u'\n\n\n\xa5\n56.00\n\n\n\u6210\u4ea4\n195\n\u4ef6\n\n\n',
 u'\n\n\n\xa5\n83.00\n\n\n\u6210\u4ea4\n109\n\u4ef6\n\n\n',
 u'\n\n\n\xa5\n69.00\n\n\n\u6210\u4ea4\n208\n\u4ef6\n\n\n',
 u'\n\n\n\xa5\n46.00\n\n\n\u6210\u4ea4\n204\n\u4ef6\n\n\n',
 u'\n\n\n\xa5\n45.00\n\n\n\u6210\u4ea4\n54\n\u4ef6\n\n\n',
 u'\n\n\n\xa5\n82.00\n\n\n\u6210\u4ea4\n38\n\u4ef6\n\n\n',
 u'\n\n\n\xa5\n48.90\n\n\n\u6210\u4ea4\n318\n\u4ef6\n\n\n']

I have already tried encoding and decoding with UTF-8. I would really appreciate it if you could show me how to solve this problem.

Dmitriy Fialkovskiy · Accepted Answer

This code will save the Chinese characters to a text file.

For Python 3:

...
(all your code above)
for i in range(len(lst)):
    lst[i] = lst[i].replace('\n', '')  # getting rid of the '\n' newlines
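As a variation on the newline cleanup above, here is a minimal Python 3 sketch that collapses every run of whitespace into a single space, so the price, the sales count, and the unit stay separated instead of being glued together. The helper name `normalize` is hypothetical, and the sample string is copied from the output shown in the question.

```python
# Sample value copied from the question's output.
raw = "\n\n\n\xa5\n109.00\n\n\n\u6210\u4ea4\n329\n\u4ef6\n\n\n"


def normalize(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace and drops empties.
    return " ".join(text.split())


cleaned = normalize(raw)
print(cleaned)
```

Here `cleaned` is `¥ 109.00 成交 329 件`. The `\xa5` and `\uXXXX` sequences in the printed list are only the escaped representation of the strings; the underlying text is already correct Unicode.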

Writing to txt:

with open(r'C:\Users\Username\list.txt','w',newline='',encoding='utf-8-sig') as f:
    for i in lst:
        f.write(i+'\t')

For Python 2:

import unicodecsv as ucsv
with open(r'C:\Users\Username\list1.txt','wb') as f:
    w = ucsv.writer(f,encoding='utf-8-sig')
    for i in lst:
        w.writerow([i+'\t'])
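Since the question asks for a CSV file rather than a text file, the same idea can be sketched with Python 3's built-in `csv` module. This is only an illustration under stated assumptions: the two sample rows are taken from the question's output after whitespace cleanup, and the temporary output path is hypothetical. The `utf-8-sig` encoding writes a byte order mark, which helps spreadsheet programs such as Excel detect UTF-8.

```python
import csv
import os
import tempfile

# Sample values based on the question's output, already whitespace-normalized.
rows = [
    "\xa5 109.00 \u6210\u4ea4 329 \u4ef6",
    "\xa5 56.00 \u6210\u4ea4 195 \u4ef6",
]

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "content.csv")

# newline='' lets the csv module control line endings;
# utf-8-sig prepends a BOM so spreadsheet tools can detect the encoding.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([row])

# Read the file back to confirm the characters survived the round trip.
with open(path, newline="", encoding="utf-8-sig") as f:
    readback = [record[0] for record in csv.reader(f)]

print(readback[0])
```

The `with` block closes the file when it ends, which flushes the buffered rows to disk. The script in the question never closes `csv_content`, which may be one reason the file looked empty, although that is an assumption rather than something the question confirms.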

Tags: python, beautifulsoup, web-scraping

V.Anh
