Some Basic Python Questions

Question

I'm a total Python noob so please bear with me. I want to have Python scan a page of HTML and replace instances of Microsoft Word entities with something UTF-8 compatible.

My question is, how do you do that in Python (I've Googled this but haven't found a clear answer so far)? I want to dip my toe in the Python waters so I figure something simple like this is a good place to start. It seems that I would need to:

load text pasted from MS Word into a variable
run some sort of replace function on the contents
output it

In PHP I would do it like this:

$test = $_POST['pasted_from_Word']; //for example “Going Mobile”

function defangWord($string) 
{
    $search = array(
        (chr(0xe2) . chr(0x80) . chr(0x98)),
        (chr(0xe2) . chr(0x80) . chr(0x99)),
        (chr(0xe2) . chr(0x80) . chr(0x9c)), 
        (chr(0xe2) . chr(0x80) . chr(0x9d)), 
        (chr(0xe2) . chr(0x80) . chr(0x93)),
        (chr(0xe2) . chr(0x80) . chr(0x94)), 
        (chr(0x2d))
    ); 

    $replace = array(
        "&lsquo;",
        "&rsquo;",
        "&ldquo;",
        "&rdquo;",
        "&ndash;",
        "&mdash;",
        "&ndash;"
    );

    return str_replace($search, $replace, $string); 
} 

echo defangWord($test);

How would you do it in Python?

EDIT: Hmmm, ok ignore my confusion about UTF-8 and entities for the moment. The input contains text pasted from MS Word. Things like curly quotes are showing up as odd symbols. Various PHP functions I used to try and fix it were not giving me the results I wanted. By viewing those odd symbols in a hex editor I saw that they corresponded to the symbols I used above (0xe2, 0x80 etc.). So I simply swapped out the oddball characters with HTML entities. So if the bit I have above already IS UTF-8, what is being pasted in from MS Word that is causing the odd symbols?

EDIT2: So I set out to learn a bit about Python and found I don't really understand encoding. The problem I was trying to solve can be handled simply by having consistent encoding from end to end. If the input form is UTF-8, the database that stores the input is UTF-8 and the page that outputs it is UTF-8... pasting from Word works fine. No special functions needed. Now, about learning a little Python...

Miles · Accepted Answer

First of all, those aren't Microsoft Word entities—they are UTF-8. You're converting them to HTML entities.

The Pythonic way to write something like:

chr(0xe2) . chr(0x80) . chr(0x98)

would be:

'\xe2\x80\x98'

But Python already has built-in functionality for the type of conversion you want to do:

def defang(string):
    return string.decode('utf-8').encode('ascii', 'xmlcharrefreplace')

This replaces the UTF-8 byte sequences in a string for characters like ‘ with numeric entities like &#8216;.
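
For readers on Python 3, where `str` no longer has a `decode` method, a rough equivalent might look like the sketch below. It assumes the input arrives as UTF-8 encoded bytes, and the name `defang_py3` is made up for illustration.

```python
def defang_py3(data):
    """Turn UTF-8 bytes into ASCII-only text with numeric character references."""
    text = data.decode('utf-8')                               # bytes -> str
    ascii_bytes = text.encode('ascii', 'xmlcharrefreplace')   # non-ASCII -> &#NNNN;
    return ascii_bytes.decode('ascii')                        # back to str for printing


print(defang_py3(b'\xe2\x80\x98quoted\xe2\x80\x99'))  # &#8216;quoted&#8217;
```

The work is still done by the `xmlcharrefreplace` error handler; the extra `decode('ascii')` only turns the result back into a `str`.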

If you want to replace those numeric entities with named ones where possible:

import re
from htmlentitydefs import codepoint2name

def convert_match_to_named(match):
    num = int(match.group(1))
    if num in codepoint2name:
        return "&%s;" % codepoint2name[num]
    else:
        return match.group(0)

def defang_named(string):
    return re.sub(r'&#(\d+);', convert_match_to_named, defang(string))

And use it like so:

>>> defang_named('\xe2\x80\x9cHello, world!\xe2\x80\x9d')
'&ldquo;Hello, world!&rdquo;'
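
On Python 3 the `htmlentitydefs` module lives at `html.entities`, so the same idea could be sketched roughly as follows. The function names are hypothetical, and the input is again assumed to be UTF-8 encoded bytes.

```python
import re
from html.entities import codepoint2name  # Python 3 location of htmlentitydefs


def to_named(match):
    """Swap a numeric reference for its named form when one exists."""
    num = int(match.group(1))
    name = codepoint2name.get(num)
    return "&%s;" % name if name else match.group(0)


def defang_named_py3(data):
    # Assumes `data` is UTF-8 encoded bytes.
    text = data.decode('utf-8')
    numeric = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')
    return re.sub(r'&#(\d+);', to_named, numeric)


print(defang_named_py3(b'\xe2\x80\x9cHello, world!\xe2\x80\x9d'))
# &ldquo;Hello, world!&rdquo;
```

Characters that have no named entity keep their numeric reference.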

To complete the answer, the equivalent code to your example to process a file would look something like this:

# in Python, it's common to operate a line at a time on a file instead of
# reading the entire thing into memory

my_file = open("test100.html")
for line in my_file:
    print defang_named(line),  # trailing comma: each line already ends with a newline
my_file.close()

Note that this answer is targeted at Python 2.5; the Unicode situation is dramatically different for Python 3+.
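
As an illustration of how different, here is a hedged Python 3 sketch of the file loop. In Python 3, `open()` can decode the file for you, so each line is already a `str` and only the encode step remains. The helper name `defang_file_lines` and the throwaway file are assumptions for the demo, not part of the original answer.

```python
import os
import tempfile


def defang_file_lines(path):
    """Yield each line of a UTF-8 file as ASCII text with numeric entities."""
    # encoding='utf-8' makes open() decode the bytes into str for us
    with open(path, encoding='utf-8') as my_file:
        for line in my_file:
            yield line.encode('ascii', 'xmlcharrefreplace').decode('ascii')


# Small demonstration using a throwaway file instead of test100.html
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.html',
                                 delete=False) as tmp:
    tmp.write('<p>\u201cGoing Mobile\u201d</p>\n')

for line in defang_file_lines(tmp.name):
    print(line, end='')  # the line keeps its own newline

os.remove(tmp.name)  # clean up the demo file
```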

I also agree with bobince's comment below: if you can just keep the text in UTF-8 format and send it with the correct content-type and charset, do that; if you need it to be in ASCII, then stick with the numeric entities—there's really no need to use the named ones.
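
To make the "keep it in UTF-8" option concrete, here is a minimal sketch of a WSGI application that sends curly quotes untouched and declares the charset in the header. It is written for Python 3; the app and its sample markup are invented for illustration and say nothing about how the asker's page is actually served.

```python
def app(environ, start_response):
    """Minimal WSGI app that serves curly quotes as plain UTF-8."""
    body = '<p>\u201cGoing Mobile\u201d</p>'.encode('utf-8')
    headers = [
        # The charset tells the browser how to decode the bytes
        ('Content-Type', 'text/html; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ]
    start_response('200 OK', headers)
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server('', 8000, app).serve_forever()` should be enough.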

S.Lott · Answer

The Python code has the same outline.

Just replace all of the PHP-isms with Python-isms.

Start by creating a File object. The result of a file.read() is a string object. Strings have a "replace" operation.
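
A literal translation along those lines might look like the sketch below. It is written for Python 3 and assumes the file is UTF-8; the names `REPLACEMENTS`, `defang_word` and `defang_word_file` are made up. The plain-hyphen entry from the PHP table is deliberately left out, which is a judgment call rather than part of the original.

```python
# Same pairs as the PHP $search/$replace arrays, written as Unicode characters.
# The PHP version's plain-hyphen (0x2d) entry is intentionally omitted here.
REPLACEMENTS = {
    '\u2018': '&lsquo;',
    '\u2019': '&rsquo;',
    '\u201c': '&ldquo;',
    '\u201d': '&rdquo;',
    '\u2013': '&ndash;',
    '\u2014': '&mdash;',
}


def defang_word(text):
    """Apply str.replace once per search/replace pair."""
    for char, entity in REPLACEMENTS.items():
        text = text.replace(char, entity)
    return text


def defang_word_file(path):
    """Read the whole file into a string, then run the replacements."""
    with open(path, encoding='utf-8') as f:
        return defang_word(f.read())


print(defang_word('\u201cGoing Mobile\u201d'))  # &ldquo;Going Mobile&rdquo;
```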

Tags: python, replace, php, unicode, html-entities

rg88
