
Parse each file in a directory with BeautifulSoup/Python, save out as new file

New to Python & BeautifulSoup. I have a Python program that opens a file called "example.html", runs a BeautifulSoup action on it, then runs a Bleach action on it, then saves the result as file "example-cleaned.html". So far it is working for all contents of "example.html".

I need to modify it so that it opens each file in folder "/posts/", runs the program on it, then saves it out as "/posts-cleaned/X-cleaned.html" where X is the original filename.

Here's my code, minimised:

from bs4 import BeautifulSoup
import bleach

tag_black_list = ['iframe', 'script']
tag_white_list = ['p', 'div']
attr_white_list = {'*': ['title']}

with open("posts/example.html", encoding="utf-8") as fin:
    text = BeautifulSoup(fin, "html.parser")

# Step one, with BeautifulSoup: remove tags in tag_black_list, destroying their contents.
for s in text(tag_black_list):
    s.decompose()
pretty = text.prettify()

# Step two, with Bleach: remove tags and attributes not in the whitelists, keeping tag contents.
cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

with open("posts/example-cleaned.html", "w", encoding="utf-8") as fout:
    fout.write(cleaned)
print("Done")

Assistance & pointers to existing solutions gladly received!

Ila asked Oct 22 '12



2 Answers

You can use os.listdir() to get a list of all files in a directory. If you want to recurse all the way down the directory tree, you'll need os.walk().
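If you do need the recursive case, here is a minimal os.walk() sketch (the "posts" directory name and the .html filter are assumptions carried over from the question):

```python
import os

def find_html_files(root):
    """Yield the full path of every .html file under root, recursively."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".html"):
                yield os.path.join(dirpath, name)

for path in find_html_files("posts"):
    print(path)
```

os.walk() visits every subdirectory for you, so each yielded path is already joined with the directory it was found in.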

I would move all the code that handles a single file into a function, and then write a second function that handles the whole directory. Something like this:

def clean_dir(directory):
    for filename in os.listdir(directory):
        clean_file(os.path.join(directory, filename))

def clean_file(filename):
    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p', 'div']
    attr_white_list = {'*': ['title']}

    with open(filename, encoding="utf-8") as fhandle:
        text = BeautifulSoup(fhandle, "html.parser")

    # Step one, with BeautifulSoup: remove tags in tag_black_list, destroying their contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()

    # Step two, with Bleach: remove tags and attributes not in the whitelists, keeping tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

    # This inserts -cleaned before the extension;
    # it relies on the filename containing a '.'
    dot_pos = filename.rfind('.')
    cleaned_filename = '{0}-cleaned{1}'.format(filename[:dot_pos], filename[dot_pos:])

    with open(cleaned_filename, 'w', encoding="utf-8") as fout:
        fout.write(cleaned)

    print("Done")

Then you just call clean_dir('/posts') or what not.

I'm appending "-cleaned" to the files, but I think I like your idea of using a whole new directory better. That way you won't have to handle conflicts if -cleaned already exists for some file, etc.
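If you do go with a whole new directory, a variant of clean_dir could create it and write each cleaned file there under its original name (the directory names and the clean callable here are illustrative, not from the original code):

```python
import os

def clean_dir(src_dir, dest_dir, clean):
    """Run clean(html) on every file in src_dir, writing results to dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)  # create e.g. posts-cleaned/ if missing
    for filename in os.listdir(src_dir):
        with open(os.path.join(src_dir, filename), encoding="utf-8") as fin:
            cleaned = clean(fin.read())
        with open(os.path.join(dest_dir, filename), "w", encoding="utf-8") as fout:
            fout.write(cleaned)
```

Passing the cleaning step in as a callable keeps the directory plumbing separate from the BeautifulSoup/Bleach logic.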

I'm also using the with statement to open files here, since it closes them automatically, even if an exception is raised along the way.
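For reference, a with block over a file is roughly equivalent to this try/finally pattern (the temp file here is just scaffolding to make the snippet self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.html")
with open(path, "w") as f:
    f.write("<p>hello</p>")

# What `with open(path) as fhandle:` does under the hood, roughly:
fhandle = open(path)
try:
    data = fhandle.read()
finally:
    fhandle.close()  # runs even if read() raises
```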

NullUserException answered Nov 14 '22

Answer to my own question, for others who might find the Python docs for os.listdir a bit unhelpful:

from bs4 import BeautifulSoup
import bleach
import os

tag_black_list = ['iframe', 'script']
tag_white_list = ['p', 'div']
attr_white_list = {'*': ['title']}

postlist = os.listdir("posts/")

for post in postlist:
    # HERE: you need to specify the directory again; the value of "post" is just the filename:
    with open("posts/" + post, encoding="utf-8") as fin:
        text = BeautifulSoup(fin, "html.parser")

    # Step one, with BeautifulSoup: remove tags in tag_black_list, destroying their contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()

    # Step two, with Bleach: remove tags and attributes not in the whitelists, keeping tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

    with open("posts-cleaned/" + post, "w", encoding="utf-8") as fout:
        fout.write(cleaned)

I cheated and made a separate folder called "posts-cleaned/", because saving files there was easier than splitting the filename, adding "cleaned", and re-joining it, although if anyone wants to show me a good way to do that, that would be even better.
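For what it's worth, os.path.splitext makes that split/insert/re-join straightforward (the "-cleaned" suffix matches the naming from the question):

```python
import os.path

def cleaned_name(filename):
    """Insert '-cleaned' before the extension: 'post.html' -> 'post-cleaned.html'."""
    root, ext = os.path.splitext(filename)
    return root + "-cleaned" + ext
```

Unlike the rfind('.') approach, splitext also behaves sensibly for filenames with no extension at all (ext is simply the empty string).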

Ila answered Nov 14 '22