
Parse each file in a directory with BeautifulSoup/Python, save out as new file

New to Python & BeautifulSoup. I have a Python program that opens a file called "example.html", runs a BeautifulSoup action on it, then runs a Bleach action on it, then saves the result as file "example-cleaned.html". So far it is working for all contents of "example.html".

I need to modify it so that it opens each file in folder "/posts/", runs the program on it, then saves it out as "/posts-cleaned/X-cleaned.html" where X is the original filename.

Here's my code, minimised:

from bs4 import BeautifulSoup
import bleach

tag_black_list = ['iframe', 'script']
tag_white_list = ['p', 'div']
attr_white_list = {'*': ['title']}

with open("posts/example.html", encoding="utf-8") as fin:
    text = BeautifulSoup(fin, "html.parser")

# Step one, with BeautifulSoup: remove tags in tag_black_list, destroying their contents.
for s in text(tag_black_list):
    s.decompose()
pretty = text.prettify()

# Step two, with Bleach: remove tags and attributes not in the whitelists, keeping tag contents.
cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

with open("posts/example-cleaned.html", "w", encoding="utf-8") as fout:
    fout.write(cleaned)
print("Done")

Assistance & pointers to existing solutions gladly received!

Ila asked Oct 22 '12



2 Answers

You can use os.listdir() to get a list of all files in a directory. If you want to recurse all the way down the directory tree, you'll need os.walk().
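If you do need the recursive case, here is a minimal os.walk() sketch (the "posts" directory name and the .html filter are assumptions carried over from the question):

```python
import os

def find_html_files(root):
    """Yield the full path of every .html file under root, recursively."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".html"):
                yield os.path.join(dirpath, name)

for path in find_html_files("posts"):
    print(path)
```

os.walk() visits every subdirectory for you, so each yielded path is already joined with the directory it was found in.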

I would move all the code that handles a single file into a function, and then write a second function that handles the whole directory. Something like this:

def clean_dir(directory):
    for filename in os.listdir(directory):
        clean_file(os.path.join(directory, filename))

def clean_file(filename):
    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p', 'div']
    attr_white_list = {'*': ['title']}

    with open(filename, encoding="utf-8") as fhandle:
        text = BeautifulSoup(fhandle, "html.parser")

    # Step one, with BeautifulSoup: remove tags in tag_black_list, destroying their contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()

    # Step two, with Bleach: remove tags and attributes not in the whitelists, keeping tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

    # This inserts -cleaned before the extension;
    # it relies on the filename containing a '.'
    dot_pos = filename.rfind('.')
    cleaned_filename = '{0}-cleaned{1}'.format(filename[:dot_pos], filename[dot_pos:])

    with open(cleaned_filename, 'w', encoding="utf-8") as fout:
        fout.write(cleaned)

    print("Done")

Then you just call clean_dir('/posts') or what not.

I'm appending "-cleaned" to the files, but I think I like your idea of using a whole new directory better. That way you won't have to handle conflicts if -cleaned already exists for some file, etc.
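If you do go with a whole new directory, a variant of clean_dir could create it and write each cleaned file there under its original name (the directory names and the clean callable here are illustrative, not from the original code):

```python
import os

def clean_dir(src_dir, dest_dir, clean):
    """Run clean(html) on every file in src_dir, writing results to dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)  # create e.g. posts-cleaned/ if missing
    for filename in os.listdir(src_dir):
        with open(os.path.join(src_dir, filename), encoding="utf-8") as fin:
            cleaned = clean(fin.read())
        with open(os.path.join(dest_dir, filename), "w", encoding="utf-8") as fout:
            fout.write(cleaned)
```

Passing the cleaning step in as a callable keeps the directory plumbing separate from the BeautifulSoup/Bleach logic.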

I'm also using the with statement to open files here, since it closes them automatically, even if an exception is raised along the way.
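For reference, a with block over a file is roughly equivalent to this try/finally pattern (the temp file here is just scaffolding to make the snippet self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.html")
with open(path, "w") as f:
    f.write("<p>hello</p>")

# What `with open(path) as fhandle:` does under the hood, roughly:
fhandle = open(path)
try:
    data = fhandle.read()
finally:
    fhandle.close()  # runs even if read() raises
```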

NullUserException answered Nov 14 '22

Answer to my own question, for others who might find the Python docs for os.listdir a bit unhelpful:

from bs4 import BeautifulSoup
import bleach
import os

tag_black_list = ['iframe', 'script']
tag_white_list = ['p', 'div']
attr_white_list = {'*': ['title']}

postlist = os.listdir("posts/")

for post in postlist:
    # HERE: you need to specify the directory again; the value of "post" is just the filename:
    with open("posts/" + post, encoding="utf-8") as fin:
        text = BeautifulSoup(fin, "html.parser")

    # Step one, with BeautifulSoup: remove tags in tag_black_list, destroying their contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()

    # Step two, with Bleach: remove tags and attributes not in the whitelists, keeping tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

    with open("posts-cleaned/" + post, "w", encoding="utf-8") as fout:
        fout.write(cleaned)

I cheated and made a separate folder called "posts-cleaned/", because saving files there was easier than splitting the filename, adding "cleaned", and re-joining it, although if anyone wants to show me a good way to do that, that would be even better.
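For what it's worth, os.path.splitext makes that split/insert/re-join straightforward (the "-cleaned" suffix matches the naming from the question):

```python
import os.path

def cleaned_name(filename):
    """Insert '-cleaned' before the extension: 'post.html' -> 'post-cleaned.html'."""
    root, ext = os.path.splitext(filename)
    return root + "-cleaned" + ext
```

Unlike the rfind('.') approach, splitext also behaves sensibly for filenames with no extension at all (ext is simply the empty string).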

Ila answered Nov 14 '22