BeautifulSoup Prettify custom new line option

Tags: python, xml, beautifulsoup, prettify

I'm using BeautifulSoup to build XML files.

It seems my two options are 1) no formatting, i.e.

<root><level1><level2><field1>val1</field1><field2>val2</field2><field3>val3</field3></level2></level1></root>

or 2) with prettify(), i.e.

<root>
 <level1>
  <level2>
   <field1>
    val1
   </field1>
   <field2>
    val2
   </field2>
   <field3>
    val3
   </field3>
  </level2>
 </level1>
</root>

But I would really prefer it to look like this:

<root>
    <level1>
        <level2>
            <field1>val1</field1>
            <field2>val2</field2>
            <field3>val3</field3>
        </level2>
    </level1>
</root>

I realise I could hack bs4 to achieve this result, but I would like to hear whether any other options exist.

I'm less bothered about the 4-space indent (although that would be nice) and more bothered about getting a newline after every closing tag and between two consecutive opening tags. I'm also curious whether there is a name for this way of formatting, as it seems the most sensible one to me.

asked Oct 16 '18 09:10

teebagz

1 Answer

You can write a small html.parser.HTMLParser subclass to achieve what you want:

from bs4 import BeautifulSoup
from html import escape
from html.parser import HTMLParser

data = '''<root><level1><level2><field1>val1</field1><field2>val2</field2><field3>val3</field3></level2></level1></root>'''

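class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.__t = 0               # current nesting depth
        self.lines = []            # finished output lines
        self.__current_line = ''   # line being built: opening tag plus any text
        self.__current_tag = ''    # tag opened on the current line, '' if none

    @staticmethod
    def __attr_str(attrs):
        # Render (name, value) pairs as name="value", escaping the values.
        return ' '.join('{}="{}"'.format(name, escape(value)) for (name, value) in attrs)

    def handle_starttag(self, tag, attrs):
        # Always flush the pending line; empty lines are dropped in
        # get_parsed_string(), so nested tags with the same name are kept.
        self.lines += [self.__current_line]

        self.__current_line = '\t' * self.__t + '<{}>'.format(tag + (' ' + self.__attr_str(attrs) if attrs else ''))
        self.__current_tag = tag
        self.__t += 1

    def handle_endtag(self, tag):
        self.__t -= 1
        if tag != self.__current_tag:
            # The element had child elements: close it on its own indented line.
            self.lines += [self.__current_line]
            self.lines += ['\t' * self.__t + '</{}>'.format(tag)]
        else:
            # The element held only text: keep open tag, text and close tag together.
            self.lines += [self.__current_line + '</{}>'.format(tag)]

        self.__current_line = ''
        # Reset so an enclosing element with the same name is not mistaken for a leaf.
        self.__current_tag = ''

    def handle_data(self, data):
        # Text is appended to the line of the tag that was opened last.
        self.__current_line += data

    def get_parsed_string(self):
        # Join the collected lines, skipping the empty placeholders.
        return '\n'.join(l for l in self.lines if l)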


parser = MyHTMLParser()

soup = BeautifulSoup(data, 'lxml')
print('BeautifulSoup prettify():')
print('*' * 80)
print(soup.root.prettify())

print('custom html parser:')
print('*' * 80)
parser.feed(str(soup.root))
print(parser.get_parsed_string())

Prints:

BeautifulSoup prettify():
********************************************************************************
<root>
 <level1>
  <level2>
   <field1>
    val1
   </field1>
   <field2>
    val2
   </field2>
   <field3>
    val3
   </field3>
  </level2>
 </level1>
</root>
custom html parser:
********************************************************************************
<root>
    <level1>
        <level2>
            <field1>val1</field1>
            <field2>val2</field2>
            <field3>val3</field3>
        </level2>
    </level1>
</root>
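
As a possible alternative that is not part of the answer above: if you are on Python 3.9 or later and the markup is well-formed XML, the standard library's xml.etree.ElementTree.indent() appears to produce the layout asked for, including the 4-space indent. The following is only a minimal sketch under those assumptions; the helper name pretty_xml is hypothetical and chosen just for illustration.

```python
import xml.etree.ElementTree as ET


def pretty_xml(xml_text, space="    "):
    """Return xml_text re-indented, keeping text-only elements on one line.

    Hypothetical helper; assumes Python 3.9+ (for ET.indent) and
    well-formed XML input.
    """
    root = ET.fromstring(xml_text)
    # indent() only rewrites whitespace-only text/tail between elements,
    # so <field1>val1</field1> stays on a single line.
    ET.indent(root, space=space)
    return ET.tostring(root, encoding="unicode")


data = ('<root><level1><level2><field1>val1</field1><field2>val2</field2>'
        '<field3>val3</field3></level2></level1></root>')
print(pretty_xml(data))
```

To use it on a tree built with BeautifulSoup, the serialized markup can be passed in, e.g. pretty_xml(str(soup.root)). Unlike html.parser.HTMLParser, ElementTree does not lowercase tag names, which may matter for case-sensitive XML.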

answered Oct 19 '22 08:10

Andrej Kesely
