
jinja + form + unicode control characters + xml/docx integration

I am creating Word documents based on a user's input in a form. However, when the input contains a Unicode control character and I try to build a Word file from it with the python-docx package, this error occurs:

File "src\lxml\apihelpers.pxi", line 1439, in lxml.etree._utf8
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I managed to work around this by checking the form for invalid XML characters before each request (I have many forms where this problem might occur) and removing them from the fields. I then build a new ImmutableMultiDict and fill it with the cleaned text.

from docx import Document
from flask import Flask, render_template_string, request
from werkzeug.datastructures import ImmutableMultiDict

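def valid_xml_char_ordinal(c):
    # True if c is allowed by the XML 1.0 "Char" production: tab, LF, CR, and
    # everything from U+0020 up, minus surrogates and U+FFFE/U+FFFF.
    codepoint = ord(c)
    return (0x20 <= codepoint <= 0xD7FF or codepoint in (0x9, 0xA, 0xD) or
            0xE000 <= codepoint <= 0xFFFD or 0x10000 <= codepoint <= 0x10FFFF)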

app = Flask(__name__)

@app.before_request
def before_request():
    if 'check_form_xml_validity' in request.form:
        tuple_list = []
        for field_name in request.form:
            all_field_values = request.form.getlist(field_name)
            for field_value in all_field_values:
                cleaned_field_value = ''.join(c for c in field_value if valid_xml_char_ordinal(c))
                tuple_list.append((field_name, cleaned_field_value))
        request.form = ImmutableMultiDict(tuple_list)

@app.route('/', methods=['GET', 'POST'])
def form_test():
    if request.method == 'GET':
        x = '\x0c'  # looks empty when rendered, but is a control character (form feed, U+000C)
        return render_template_string(
            """<form action="{{ url_for('form_test') }}" method="post">
                <input name="some_field" value="{{x}}"><br>
                check the xml validity of this form? <br>
                <input type="checkbox" checked name="check_form_xml_validity"><br>
                <button>submit</button>
            </form>""",
            x=x)
    else:
        doc = Document()
        p = doc.add_paragraph(request.form['some_field'])
        return 'yay'

This method works. However, it seems very unlikely that I'm the only one with this problem, yet I couldn't find any clean solution. So the question is: should I really be solving the problem this way? It's pretty tedious, and it feels like I'm overlooking some Flask or python-docx setting or argument that would solve this.

The example is fully functional. If the checkbox is checked, the before_request function cleans the form values. If it is not checked, the form is left untouched and the server error shown above is raised.


The control character is: U+000C : <control-000C> (FORM FEED [FF])
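
To confirm that this character is the culprit independently of python-docx, here is a minimal sketch using only the standard library. It relies on the parser behind xml.etree.ElementTree enforcing the XML 1.0 character rules, so U+000C is rejected while a tab is accepted. The helper name is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if `text` parses as the content of an XML 1.0 element."""
    # escape() handles &, < and >, so only the characters themselves are tested.
    try:
        ET.fromstring("<p>{}</p>".format(escape(text)))
    except ET.ParseError:
        return False
    return True


form_feed_ok = is_well_formed("before\x0cafter")  # form feed, U+000C
tab_ok = is_well_formed("before\tafter")          # tab, U+0009
```

Here form_feed_ok ends up False and tab_ok ends up True, which matches what valid_xml_char_ordinal reports for the same two characters.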

Joost asked Aug 19 '18 20:08


1 Answer

Unicode contains a large number of control characters. Basically, you need to remove every character whose Unicode general category starts with "C" (control, format, surrogate, private use and unassigned). For that I recommend unicodedata.category from the unicodedata module.

See the code below:

import unicodedata


def remove_control_chars(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
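
One caveat, sketched under the assumption that line breaks and tabs in the form input should survive: major category C also covers tab, line feed and carriage return (all Cc), so the function above strips them as well, even though XML 1.0 allows them. A variant that keeps those three might look like this; remove_control_chars_keep_whitespace and XML_SAFE_CONTROLS are hypothetical names, not part of any library.

```python
import unicodedata

# Control characters that XML 1.0 explicitly permits: tab, line feed, carriage return.
XML_SAFE_CONTROLS = {"\t", "\n", "\r"}


def remove_control_chars_keep_whitespace(s):
    """Drop characters in Unicode major category C, except tab, LF and CR."""
    return "".join(
        ch for ch in s
        if ch in XML_SAFE_CONTROLS or unicodedata.category(ch)[0] != "C"
    )


# The form feed is removed; the newline and the tab are kept.
cleaned = remove_control_chars_keep_whitespace("line one\x0c\nline\ttwo")
```

Like the original, this still removes format characters (category Cf) such as the zero-width joiner, which may or may not be what you want for user-supplied text.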

Andriy Ivaneyko answered Oct 24 '22 07:10