Python: BeautifulSoup UnboundLocalError

I am trying to strip the HTML tags from some documents stored as .txt files. On a few of them BeautifulSoup fails, and as far as I can tell the error comes from inside bs4 and the parser it calls. This is the traceback:

 Traceback (most recent call last):
  File "E:/Google Drive1/Thesis stuff/Python/database/get_missing_10ks.py", line 13, in <module>
    text = BeautifulSoup(file_read, "html.parser")
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 282, in __init__
    self._feed()
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 343, in _feed
    self.builder.feed(self.markup)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\builder\_htmlparser.py", line 247, in feed
    parser.feed(markup)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\_markupbase.py", line 160, in parse_marked_section
    if not match:
UnboundLocalError: local variable 'match' referenced before assignment

And the code that I am using is the following:

import os
from bs4 import BeautifulSoup

path_to_10k = "D:/10ks/list_missing_10k/"

path_to_saved_10k = "D:/10ks/list_missing_10kp/"

for name in os.listdir(path_to_10k):
    # Read the raw filing; the files are .txt but contain HTML markup.
    with open(path_to_10k + name, "r", encoding="utf-8") as file:
        file_read = file.read()
    # This is the call that raises UnboundLocalError on a few files.
    text = BeautifulSoup(file_read, "html.parser")
    text = text.get_text("\n")
    # Save the tag-free text under the same file name.
    with open(path_to_saved_10k + name, "w", encoding="utf-8") as file2:
        file2.write(text)

I have used this method on 51320 documents and it worked fine, but there are a few documents it fails on. When I open those HTML documents, they look the same as the others to me. Any indication of what the problem could be and how to fix it would be great. Thank you!

EXAMPLE OF FILE: https://files.fm/u/2s45uafp

Asked by Adrian, Nov 03 '18 06:11


2 Answers

You can use w3lib:

https://github.com/scrapy/w3lib
https://w3lib.readthedocs.io/en/latest/

Install it:

pip install w3lib

Import remove_tags:

from w3lib.html import remove_tags

Then remove_tags(data) returns the text with the tags removed.

Answered by Serhii, Oct 21 '22 10:10


Here is a solution that uses a regular expression to remove the HTML tags.

import re

TAG_RE = re.compile(r'<[^>]+>')

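def remove_Htmltags(text):
    # Replace every <...> span with an empty string.
    return TAG_RE.sub('', text)

# Raw string so the backslashes in the Windows path are not treated as escapes.
with open(r"C:\Temp\Data.txt", "r") as f:
    strHtml = f.read()

strClearText = remove_Htmltags(strHtml)
print(strClearText)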
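
For the batch case in the question, the same regex could be applied to a whole folder roughly as follows. This is only a sketch: strip_tags and strip_tags_in_dir are made-up names, and the UTF-8 encoding is taken from the question's code. A regex like this also keeps the contents of script and style elements and leaves entities such as &amp; undecoded. It never calls html.parser, so it should not reach the code path shown in the traceback.

```python
import re
from pathlib import Path

# Same pattern as in the answer: anything between "<" and ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Return text with every <...> span removed."""
    return TAG_RE.sub("", text)


def strip_tags_in_dir(src_dir, dst_dir):
    """Write a tag-free copy of each file in src_dir to dst_dir.

    Hypothetical helper; returns the names of the files it wrote.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue  # skip sub-folders
        html = path.read_text(encoding="utf-8")
        (dst / path.name).write_text(strip_tags(html), encoding="utf-8")
        written.append(path.name)
    return written
```

With the folders from the question this would be called as strip_tags_in_dir("D:/10ks/list_missing_10k/", "D:/10ks/list_missing_10kp/").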

Answered by Chikku Jacob, Oct 21 '22 09:10