 

Delete first child node using BeautifulSoup

import os
from bs4 import BeautifulSoup
do = dir_with_original_files = r'C:\FOLDER'  # raw string so the backslash is taken literally
dm = dir_with_modified_files = r'C:\FOLDER'
for root, dirs, files in os.walk(do):
    for f in files:
        print f.title()
        if f.endswith('~'): #you don't want to process backups
            continue
        original_file = os.path.join(root, f)
        mf = f.split('.')
        mf = ''.join(mf[:-1])+'_mod.'+mf[-1] # you can keep the same name 
                                             # if you omit the last two lines.
                                             # They are in separate directories
                                             # anyway. In that case, mf = f
        modified_file = os.path.join(dm, mf)
        with open(original_file, 'r') as orig_f, \
             open(modified_file, 'w') as modi_f:
            soup = BeautifulSoup(orig_f.read())

            for t in soup.find_all('table'):
                for child in t.find_all("table"):#*****this is fine for now, but how would I restrict it to find only the first element?
                    child.REMOVE() #******PROBLEM HERE********

            # This is where you create your new modified file.
            modi_f.write(soup.prettify().encode(soup.original_encoding)) 

Hi all,

I am parsing files with BeautifulSoup to clean them up slightly. I want to delete the first table that is nested anywhere within another table, e.g.:

<table>
  <tr>
    <td></td>
  </tr>
  <tr>
    <td><table></table> <--- this will be deleted</td>
  </tr>
  <tr>
    <td><table></table> --- this will remain here.</td>
  </tr>
</table>

At the moment, my code finds all tables within a table, and I have made up a .REMOVE() method to show what I wish to accomplish. How can I actually remove this element?

TL;DR:

  • How can I adapt my code to find only the first nested table in a file?

  • How can I remove this table?

Simon Kiely asked Dec 05 '14 15:12



1 Answer

Find the first table inside the table and call extract() on it:

inner_table = soup.find('table').find('table')  # or just soup.table.table
inner_table.extract() 
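
The two lines above assume that both an outer and a nested table exist; if either find() call returns None, the chained call raises an AttributeError. A slightly more defensive sketch is shown below. The helper name remove_first_nested_table and the sample markup (including the id attributes) are made up for illustration, and the built-in "html.parser" is chosen only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the structure in the question.
html = """
<table>
  <tr><td>no nested table</td></tr>
  <tr><td><table id="first"></table></td></tr>
  <tr><td><table id="second"></table></td></tr>
</table>
"""


def remove_first_nested_table(soup):
    """Detach the first <table> nested inside the first <table>.

    Returns the removed tag, or None if there is nothing to remove.
    """
    outer = soup.find("table")
    if outer is None:
        return None
    inner = outer.find("table")  # find() searches all descendants
    if inner is None:
        return None
    return inner.extract()  # removes the tag from the tree and returns it


soup = BeautifulSoup(html, "html.parser")
removed = remove_first_nested_table(soup)

# Only the outer table (no id) and the second nested table are left.
remaining_ids = [t.get("id") for t in soup.find_all("table")]
```

extract() hands back the removed tag, which is handy if it should be inspected or reinserted elsewhere. If the tag is not needed afterwards, decompose() removes it and destroys it instead.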
alecxe answered Sep 21 '22 21:09
