 

Remove u' from a web scrape output

Hi, I'm using BeautifulSoup to parse a website and get a name as output. But after running the script, I get [u'word1', u'word2', u'word3'] as output. What I'm looking for is 'word1 word2 word3'. How do I get rid of this u' and make the result a single string?

from bs4 import BeautifulSoup
import urllib2
import re

myfile = open("base/dogs.txt","w+")
myfile.close()

url="http://trackinfo.com/entries-race.jsp?raceid=GBR$20140302A01"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=soup.findAll('a',{'href':re.compile("dog")})
myfile = open("base/dogs.txt","w+")
for eachname in names:
    d = (str(eachname.string.split()))+"\n"
    print [x.encode('ascii') for x in d]
    myfile.write(d)

myfile.close()
user3319895 asked Mar 01 '26 12:03



2 Answers

BeautifulSoup and Unicode, Dammit!

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("Sacré bleu!")
<html><body><p>Sacré bleu!</p></body></html>

Isn't that great? When making the soup, the document is converted to Unicode and HTML entities are converted to Unicode characters. So you get Unicode objects as results, as intended. Nothing wrong with that.

So your question is about Unicode. And Unicode is explained in this video. Don't like videos? Read an Introduction to Unicode.

The u is short for 'the following string is Unicode'. Instead of the 128 ASCII characters, you can now use all Unicode characters, more than 110,000 at the moment. The u isn't saved to a file or database. It is visual feedback so you can see that you're dealing with a Unicode string. Use it like a normal string, because it is a normal string.
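To see that the u belongs to the printed representation and not to the data, here is a minimal sketch. The list `words` is a made-up stand-in for the scraped result, and the code is written to run on both Python 2 and Python 3, though only Python 2 displays the u prefix.

```python
# Hypothetical stand-in for what eachname.string.split() returns.
words = [u'word1', u'word2', u'word3']

# str() of a list uses repr() on each element, which is where the
# quotes (and, on Python 2, the u prefix) come from.
as_list_text = str(words)

# Joining the elements yields the text itself, with no quotes or prefix.
joined = u' '.join(words)

print(as_list_text)  # Python 2: [u'word1', u'word2', u'word3']
                     # Python 3: ['word1', 'word2', 'word3']
print(joined)        # word1 word2 word3
```

The prefix only shows up when Python displays the repr of a string, for example as an element of a printed list. Printing or writing the string itself produces just the characters.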

Moral of this story:

☺ when you see a u'…'

allcaps answered Mar 03 '26 00:03



The answers here using .encode() give you what you asked for, but probably not what you need. You can keep the strings as Unicode and simply print them in a way that doesn't show their type. The values are then still [u'word1', u'word2', u'word3'], which avoids breaking support for languages that can't be represented in ASCII, but they print as word1 word2 word3.

Just do:

for eachname in names:
    d = ' '.join(eachname.string.split()) + '\n'
    print d
    myfile.write(d)
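If a name contains non-ASCII characters, writing a Unicode string to a file opened with plain open() can raise a UnicodeEncodeError on Python 2. One possible approach, sketched here under assumptions, is to open the file through io.open with an explicit encoding. The name `raw_name` and the temporary path are made up for illustration and stand in for eachname.string and base/dogs.txt.

```python
import io
import os
import tempfile

# Hypothetical scraped text; the real value would come from eachname.string.
raw_name = u'  Sacr\xe9   Bleu \n'

# split() drops surrounding and repeated whitespace; join() rebuilds one string.
line = u' '.join(raw_name.split()) + u'\n'

# Temporary path used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'dogs.txt')

# io.open encodes Unicode text on write and decodes it again on read.
with io.open(path, 'w', encoding='utf-8') as out:
    out.write(line)

with io.open(path, encoding='utf-8') as src:
    saved = src.read()

print(repr(saved))
```

UTF-8 is an assumption here; any encoding that covers the scraped characters would do, as long as the file is read back with the same one.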
Charles Duffy answered Mar 03 '26 02:03



