New to Python... I'm trying to get the parser's output decoded properly and stored in a SQLite database, but it just won't work :(
# coding: utf8
from pysqlite2 import dbapi2 as sqlite3
import urllib2
from bs4 import BeautifulSoup
from string import *

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# create a table
def createTable():
    cursor.execute("""CREATE TABLE characters
                      (rank INTEGER PRIMARY KEY, word TEXT, definition TEXT)
                   """)

def insertChar(rank,word,definition):
    cursor.execute("""INSERT INTO characters (rank,word,definition)
                      VALUES (?,?,?)""",(rank,word,definition))

def main():
    createTable()
    # u = unicode("辣", "utf-8")
    # insertChar(1,u,"123123123")
    soup = BeautifulSoup(urllib2.urlopen('http://www.zein.se/patrick/3000char.html').read())
    # print (html_doc.prettify())
    tables = soup.blockquote.table
    # print tables
    rows = tables.find_all('tr')
    result=[]
    for tr in rows:
        cols = tr.find_all('td')
        character = []
        x = cols[0].string
        y = cols[1].string
        z = cols[2].string
        xx = unicode(x, "utf-8")
        yy = unicode(y , "utf-8")
        zz = unicode(z , "utf-8")
        insertChar(xx,yy,zz)
    conn.commit()

main()
I keep getting the following error:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Traceback (most recent call last):
  File "sqlitetestbed.py", line 64, in <module>
    main()
  File "sqlitetestbed.py", line 48, in main
    xx = unicode(x, "utf-8")
TypeError: decoding Unicode is not supported

Traceback (most recent call last):
  File "sqlitetestbed.py", line 52, in <module>
    main()
  File "sqlitetestbed.py", line 48, in main
    insertChar(x,y,z)
  File "sqlitetestbed.py", line 20, in insertChar
    VALUES (?,?,?)""",(rank,word,definition))
pysqlite2.dbapi2.IntegrityError: datatype mismatch
I'm probably doing something that's really stupid... :( Please tell me what I'm doing wrong... Thanks!
The Python "UnicodeDecodeError: 'ascii' codec can't decode byte in position" error occurs when the ascii codec is used to decode bytes that were encoded with a different codec. To solve the error, specify the correct encoding, e.g. utf-8.
The UnicodeDecodeError normally happens when decoding a str byte string (Python 2) with a particular codec. Since each codec maps only a limited set of byte sequences to Unicode characters, an illegal byte sequence causes that codec's decode() to fail.
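As a rough illustration of the two failure modes just described (wrong codec, illegal byte sequence), here is a minimal sketch. It assumes Python 3, where `bytes` plays the role of Python 2's `str` and `str` the role of `unicode`; the helper names `decode_text` and `can_decode` are made up for this example.

```python
# "辣" encoded as UTF-8, standing in for bytes read from a file or socket.
raw = "辣".encode("utf-8")


def decode_text(data, encoding="utf-8"):
    """Decode bytes with an explicitly named codec."""
    return data.decode(encoding)


def can_decode(data, encoding):
    """Return True if `data` is a legal byte sequence for `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Correct codec: decoding succeeds.
text = decode_text(raw)

# Wrong codec: bytes >= 0x80 are not legal ASCII, so decoding fails.
ascii_ok = can_decode(raw, "ascii")

# Right codec, but the multi-byte sequence is cut short, so it is illegal too.
truncated_ok = can_decode(raw[:-1], "utf-8")
```

Naming the codec explicitly at the point where bytes enter the program tends to keep this class of error in one place.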
Unicode is a standard that assigns a number to characters from almost all of the world's writing systems. Every Unicode character is identified by a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
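The code point idea can be poked at directly. The following is a small sketch, again assuming Python 3 (in Python 2 the equivalents would be `unichr` and `unicode` objects); `code_points` and `is_valid_code_point` are illustrative helper names, not library functions.

```python
import sys


def code_points(text):
    """Return the integer code point of each character in `text`."""
    return [ord(ch) for ch in text]


def is_valid_code_point(n):
    """Return True if `n` lies inside the range chr() accepts."""
    try:
        chr(n)
    except ValueError:
        return False
    return True


# A string is just a sequence of code points, so the mapping round-trips.
points = code_points("A辣")
round_trip = "".join(chr(p) for p in points)

# The largest code point the interpreter supports.
max_code_point = sys.maxunicode
```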
x is already unicode, as the cols[0].string field contains unicode (just as documented), so there is nothing left to decode and unicode(x, "utf-8") raises the TypeError.
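A possible shape for the insert loop, then, is to pass the cell text straight through and only convert the rank. This is a sketch under several assumptions: it uses Python 3 and the standard-library `sqlite3` module instead of `pysqlite2`; `sample_rows` is hypothetical data standing in for the values of `cols[i].string`; and it guesses that the "datatype mismatch" comes from a non-numeric rank (for instance a header row) reaching the `INTEGER PRIMARY KEY` column, which the question does not confirm.

```python
import sqlite3


def create_table(conn):
    conn.execute(
        "CREATE TABLE characters "
        "(rank INTEGER PRIMARY KEY, word TEXT, definition TEXT)"
    )


def insert_rows(conn, rows):
    """Insert (rank, word, definition) cells that are already text.

    Returns the number of rows actually inserted.
    """
    inserted = 0
    for rank, word, definition in rows:
        # No decode step here: the cells are already text, not bytes.
        try:
            rank_number = int(rank)
        except (TypeError, ValueError):
            # Assumption: rows whose rank is missing or not numeric
            # (e.g. a header row) are skipped rather than inserted.
            continue
        conn.execute(
            "INSERT INTO characters (rank, word, definition) VALUES (?, ?, ?)",
            (rank_number, word, definition),
        )
        inserted += 1
    conn.commit()
    return inserted


# Hypothetical cells, standing in for what the parser would return.
sample_rows = [
    ("Rank", "Character", "Meaning"),
    ("1", "的", "(possessive particle)"),
    ("2", "一", "one"),
]

conn = sqlite3.connect(":memory:")
create_table(conn)
count = insert_rows(conn, sample_rows)
stored = conn.execute(
    "SELECT rank, word FROM characters ORDER BY rank"
).fetchall()
```

Skipping unparseable ranks is only one option; logging them or raising early may suit better if every row is expected to be valid.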