
Reading in parts of file, stopping and starting with certain words

I'm using Python 2.7, and I have set myself an assignment (self-directed, I wrote these instructions) to write a small static HTML generator. I would like help finding beginner-oriented Python resources on reading a file a portion at a time. If someone provides code answers, that's great, but I want to understand why and how Python works. I can buy books, but not expensive ones; I can afford to put thirty, maybe forty dollars into this specific research at the moment.

The way this program is supposed to work is that there is a template.html file, a message.txt file, an image file, an archive.html file, and an output.html file. This is more information than you need, but the basic idea I had was "go back and forth reading from template and message, putting their contents in output and then writing in archive that output exists". But I haven't got there yet, and I'm not asking you to solve this entire problem, as I detail below:

The program reads in HTML from template.html, stopping at the opening <title> tag, then reads in what the title of the page is going to be from message.txt. That's where I am now. It works! I was so happy... hours ago, when I realized that was not the final boss.

#doctype to title
copyLine = False
for line in template.readlines():
    if not '<title>' in line:
        copyLine = True
        if copyLine:
            outputhtml.write(line)
            copyLine = False
    else:
        templateSeek = template.tell()
        break

#read name of message
titleOut = message.readline()
print titleOut, " is the title of the new page"
#--------
##5. Put the title from the message file in the head>title tag of the output file
#--------
titleOut = str(titleOut)
titleTag = "<title>"+titleOut+"|Circuit Salsa</title>"
outputhtml.write(titleTag)

My problem is this: I don't understand regular expressions, and when I try various forms of for...in codes, I get all of the template, none of the template, some combination of the parts of the template I didn't want... anyway, how do I go back and forth reading these files and pick up where I left off? Any assistance finding easier-to-understand resources is greatly appreciated, I've spent about five hours researching this and I'm getting a headache, because I keep getting resources aimed at more advanced audiences and I don't understand them.

These are the last two methods I tried (with no success):

block = ""
found = False
print "0"
for line in template:
    if found:
        print "1"
        block += line
        if line.strip() == "<h1>": break
    else:
        if line.strip() == "</title>":
            print "2"
            found = True
            block = "</title>"

print block + "3"

Only points 0 and 3 got printed. I put the numbered print statements there because I couldn't figure out why my output file was unchanged.

template.seek(templateSeek)
copyLine = False
for line in template.readlines():
    if not '<a>' in line:
        copyLine = True
        if copyLine:
            outputhtml.write(line)
            copyLine = False
    else:
        templateSeek = template.tell()
        break 

With the other one, I'm pretty sure I'm just doing it all wrong.

NMacKenzie asked Apr 19 '15 23:04



1 Answer

I would use BeautifulSoup for this. An alternative is to use regular expressions, which are good to know anyway. I know they look quite intimidating, but they're actually not that difficult to learn (it took me an hour or so). For example, to get all of the link tags, you can do something like this:

from re import findall, DOTALL

html = '''
<!DOCTYPE html>
<html>

<head>
    <title>My awesome web page!</title>
</head>

<body>
    <h2>Sites I like</h2>
    <ul>
        <li><a href="https://www.google.com/">Google</a></li>
        <li><a href="https://www.facebook.com">Facebook</a></li>
        <li><a href="http://www.amazon.com">Amazon</a></li>
    </ul>

    <h2>My favorite foods</h2>
    <ol>
        <li>Pizza</li>
        <li>French Fries</li>
    </ol>
</body>

</html>
'''

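def find_tag(src, tag):
    # \b stops 'a' from also matching tags such as <abbr>; [^>]* skips any attributes.
    # DOTALL lets .*? span newlines, so multi-line elements such as <body> match too.
    return findall(r'<{0}\b[^>]*>.*?</{0}>'.format(tag), src, DOTALL)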

print find_tag(html, 'a')
# ['<a href="https://www.google.com/">Google</a>', '<a href="https://www.facebook.com">Facebook</a>', '<a href="http://www.amazon.com">Amazon</a>']
print find_tag(html, 'li')
# ['<li><a href="https://www.google.com/">Google</a></li>', '<li><a href="https://www.facebook.com">Facebook</a></li>', '<li><a href="http://www.amazon.com">Amazon</a></li>', '<li>Pizza</li>', '<li>French Fries</li>']
print find_tag(html, 'body')
# ['<body>\n    <h2>Sites I like</h2>\n    <ul>\n        <li><a href="https://www.google.com/">Google</a></li>\n        <li><a href="https://www.facebook.com">Facebook</a></li>\n        <li><a href="http://www.amazon.com">Amazon</a></li>\n    </ul>\n\n    <h2>My favorite foods</h2>\n    <ol>\n        <li>Pizza</li>\n        <li>French Fries</li>\n    </ol>\n</body>']
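
The same kind of pattern can be pointed at the title step from the question. This is only a minimal sketch under some assumptions: the template is read into one string, it contains a single <title>...</title> pair, and the names fill_title and template_text (and the sample title) are made up for illustration. It is written so that it runs on Python 2.7 and Python 3.

```python
import re

# Hypothetical stand-in for the contents of template.html.
template_text = """<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
</body>
</html>
"""

def fill_title(src, title, site="Circuit Salsa"):
    """Replace the first <title>...</title> element in src with the new title."""
    # strip() drops the trailing newline that readline() leaves on the title.
    new_tag = "<title>{0} | {1}</title>".format(title.strip(), site)
    # A function is used as the replacement so that backslashes in the
    # title are inserted literally instead of being read as escapes.
    return re.sub(r"<title>.*?</title>", lambda match: new_tag, src,
                  count=1, flags=re.DOTALL)

page = fill_title(template_text, "My first post\n")
print(page)
```

With this approach there is no need to stop and restart reading: the whole template is held in memory and only the part that matters is swapped out.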

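For the narrower question of picking up where you left off, regular expressions are not strictly needed. In the question's code, readlines() reads the whole file before the loop begins, so template.tell() reports the end of the file rather than the line where the loop stopped, and any later loop over template finds nothing left to read. A file object is also an iterator that remembers its position, so one option is to keep pulling lines from the same iterator. The sketch below assumes each marker tag sits on its own line of the template; copy_until is a hypothetical helper, and io.StringIO objects stand in for the real files so the example runs on its own (Python 2.7 or 3).

```python
import io

def copy_until(lines, marker, out):
    """Copy lines to out until one contains marker.

    Returns the line that contained the marker (it is not written),
    or None if the iterator ran out first.
    """
    for line in lines:
        if marker in line:
            return line
        out.write(line)
    return None

# Stand-ins for template.html and output.html (assumed contents).
template = io.StringIO(
    u"<html>\n<head>\n<title></title>\n</head>\n"
    u"<body>\n<h1></h1>\n</body>\n</html>\n"
)
output = io.StringIO()

lines = iter(template)  # one iterator, shared by every step below

copy_until(lines, u"<title>", output)   # copies <html> and <head>
output.write(u"<title>My page | Circuit Salsa</title>\n")

copy_until(lines, u"<h1>", output)      # resumes right after the <title> line
output.write(u"<h1>My page</h1>\n")

for line in lines:                      # copy whatever is left
    output.write(line)

print(output.getvalue())
```

With real files, the object returned by open("template.html") can be passed in place of the StringIO stand-in; the important part is that every step reads from the same iterator instead of calling readlines() again.
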
I hope that you find at least some of this useful. If you have any follow up questions, please comment on my answer. Good luck!

pzp answered Oct 23 '22 10:10
