
Python Regex - Identifying the first and last items in a list

Tags: python, regex

I need to transform some text files into HTML code. I'm stuck on transforming a list into an HTML unordered list. Example source:

some text in the document
* item 1
* item 2
* item 3
some other text

The output should be:

some text in the document
<ul>
    <li>item 1</li>
    <li>item 2</li>
    <li>item 3</li>
</ul>
some other text

Currently, I have this:

import re

r = re.compile(r'\*(.*)\n')
r.sub(r'<li>\1</li>', the_text_document)  # raw string, so \1 refers to the captured group

which creates an HTML list without <ul> tags.
How can I identify the first and last items and surround them with <ul> tags?

user1102018 asked Jul 08 '12 14:07


2 Answers

Or use BeautifulSoup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

edit

Apparently I have to give you some hints on how to read the documentation.

  • Open the link
  • On the left there is a big menu (teal color)
  • If you look carefully you will notice that the documentation is divided into multiple sections
    • Stuffs
    • Navigation in the tree
    • Searching the tree
    • Modifying the tree (got it)
    • Output (got it!)

And many more things

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Don't stop reading after the first sentence. The last one is pretty important, and so is what's in the middle.

In other words, you can create an empty document, let's say:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")  # name the parser explicitly
document = soup.div

Then you read each line of your text, and whenever a line is plain text you append it:

document.append(line)

If the line starts with a `*`, create a list and make it the current container:

ul = soup.new_tag('ul')  # new_tag() is a method of the BeautifulSoup object, not of a tag
document.append(ul)
document = ul

Then push all the li elements onto the document, and once you stop reading lines that start with *, pop back to the parent so that document points at the div again. Keep doing that for the rest of the text. You can even do it recursively to insert ul elements into other ul elements.

Once you have parsed everything, you can do

str(document)

or

document.prettify()
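
Putting the steps above together, here is a minimal sketch of the whole loop. It assumes the bs4 package is installed and uses Python's built-in "html.parser"; the function name text_to_html and the variable current are made up for this illustration, and nested lists are not handled.

```python
from bs4 import BeautifulSoup


def text_to_html(text):
    """Build a <div> tree from plain text; lines starting with '*' become <li> items."""
    soup = BeautifulSoup("<div></div>", "html.parser")
    div = soup.div
    current = div  # the container we are appending to (the div or an open <ul>)

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):
            if current is div:
                # First item of a run: open a new <ul> and descend into it.
                ul = soup.new_tag("ul")
                div.append(ul)
                current = ul
            li = soup.new_tag("li")
            li.string = stripped[1:].strip()  # drop only the leading '*'
            current.append(li)
        else:
            # Any other line ends the list: pop back to the div.
            current = div
            div.append(stripped + "\n")

    # Serialize the children of the div, without the wrapper itself.
    return div.decode_contents()


if __name__ == "__main__":
    sample = "some text in the document\n* item 1\n* item 2\n* item 3\nsome other text\n"
    print(text_to_html(sample))
```

Setting li.string rather than concatenating markup by hand lets BeautifulSoup escape characters such as < in the item text when the tree is serialized.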

Edit

I just realized that you aren't editing HTML but unformatted text. You could try using Markdown then.

http://daringfireball.net/projects/markdown/

Loïc Faure-Lacroix answered Sep 19 '22 16:09


You could just process your data line by line. The quick and dirty solution below could probably be tidied up, but for your data it does the trick.

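with open('data.txt') as inf:
    in_list = False  # True while a <ul> is open
    for line in inf:
        line = line.strip()

        if not line.startswith('*'):
            if in_list:
                print('</ul>')  # the first non-item line closes the list
                in_list = False
            print(line)
        else:
            if not in_list:
                print('<ul>')  # the first item opens the list
                in_list = True
            # drop only the leading '*', so items may contain '*' themselves
            print('  <li>%s</li>' % line[1:].strip())

    if in_list:
        print('</ul>')  # close a list that runs to the end of the file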

yields:

some text in the document
<ul>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 3</li>
</ul>
some other text

Depending on how complex your data is (nested lists, several lists in one file, and so on), you may need to modify this starter code to fit your needs or look for a more general solution. Only you can decide.
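
If you would rather stay with regular expressions, as in the question, one possible sketch is a two-pass substitution: first match each run of consecutive lines that start with *, then rewrite the items inside that run and wrap the result in <ul> tags. The names wrap_lists, ITEM_BLOCK and ITEM are invented for this example, and it assumes items start at the beginning of a line, are not nested, and need no HTML escaping.

```python
import re

# One or more consecutive lines that begin with '*'.
ITEM_BLOCK = re.compile(r'(?:^\*.*\n?)+', re.MULTILINE)
# A single item line; group 1 is the text after the '*'.
ITEM = re.compile(r'^\*[ \t]*(.*)$', re.MULTILINE)


def _wrap(match):
    """Turn one run of '*' lines into a complete <ul> element."""
    block = match.group(0)
    items = ITEM.sub(r'    <li>\1</li>', block).rstrip('\n')
    # Keep the trailing newline only if the original run had one.
    trailing = '\n' if block.endswith('\n') else ''
    return '<ul>\n' + items + '\n</ul>' + trailing


def wrap_lists(text):
    """Replace every run of '*' lines in text with an HTML unordered list."""
    return ITEM_BLOCK.sub(_wrap, text)


if __name__ == '__main__':
    sample = (
        'some text in the document\n'
        '* item 1\n'
        '* item 2\n'
        '* item 3\n'
        'some other text\n'
    )
    print(wrap_lists(sample), end='')
```

Because the outer pattern matches a whole run at once, the first and last items never have to be identified separately: the <ul> tags go around whatever the run contains.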

Update:

Edited the <li> .. </li> print line to get rid of the * characters that were previously left in.

Levon answered Sep 19 '22 16:09