I am trying to remove the HTML tags from some documents stored as .txt files. However, as far as I understand, there seems to be an error in bs4. The error I am getting is the following:
Traceback (most recent call last):
  File "E:/Google Drive1/Thesis stuff/Python/database/get_missing_10ks.py", line 13, in <module>
    text = BeautifulSoup(file_read, "html.parser")
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 282, in __init__
    self._feed()
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 343, in _feed
    self.builder.feed(self.markup)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\builder\_htmlparser.py", line 247, in feed
    parser.feed(markup)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\_markupbase.py", line 160, in parse_marked_section
    if not match:
UnboundLocalError: local variable 'match' referenced before assignment
And the code that I am using is the following:
import os
from bs4 import BeautifulSoup
path_to_10k = "D:/10ks/list_missing_10k/"
path_to_saved_10k = "D:/10ks/list_missing_10kp/"
list_txt = os.listdir(path_to_10k)
for name in list_txt:
    file = open(path_to_10k + name, "r+", encoding="utf-8")
    file_read = file.read()
    text = BeautifulSoup(file_read, "html.parser")
    text = text.get_text("\n")
    file2 = open(path_to_saved_10k + name, "w+", encoding="utf-8")
    file2.write(str(text))
    file2.close()
    file.close()
The thing is that I have used this method on 51,320 documents and it worked just fine; however, there are a few documents it cannot handle. When I open those HTML documents, they look the same to me. If anyone has any indication of what the problem could be and how to fix it, that would be great. Thank you!
EXAMPLE OF FILE: https://files.fm/u/2s45uafp
It is very irritating when code that ran smoothly minutes ago gets stuck on a small mistake and raises an error that is common among Python developers: UnboundLocalError.
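As a minimal illustration (a toy function, not the actual standard-library code), an UnboundLocalError appears when a local variable is assigned only in some branches and is then read on a path where none of those branches ran. The last frame of the traceback above, which fails at "if not match:", has this general shape. The names find_section_end and describe, and the section names used here, are hypothetical:

```python
import re


# Hypothetical toy function: `match` is only assigned for known section names.
def find_section_end(name, data):
    if name in {"cdata", "include"}:
        match = re.search(r"\]\]>", data)
    elif name in {"if", "endif"}:
        match = re.search(r"\]>", data)
    # No else branch assigns `match`, so an unknown name leaves it unbound
    # and the next line raises UnboundLocalError.
    if not match:
        return -1
    return match.end()


def describe(name, data):
    # Report the error as text instead of letting it propagate.
    try:
        return find_section_end(name, data)
    except UnboundLocalError as exc:
        return "UnboundLocalError: " + str(exc)
```

Calling describe("cdata", "<![cdata[x]]>") returns an integer offset, while describe("unknown", "x") returns a string starting with "UnboundLocalError:". Under that assumption, a document containing a "<![" section with an unexpected name could be what separates the few failing files from the rest.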
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. The latest version of Beautiful Soup was v4.9.3 at the time of writing.
An AttributeError is raised when code accesses an attribute that the object does not have. With BeautifulSoup this typically happens when a tag is not present in the page: find() returns None, and accessing an attribute on None raises AttributeError.
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
You can use w3lib:
https://github.com/scrapy/w3lib
https://w3lib.readthedocs.io/en/latest/
Install it with
pip install w3lib
and import the function:
from w3lib.html import remove_tags
Then remove_tags(data) returns the data with the tags removed.
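Here is a solution that uses a regular expression to remove the HTML tags.
import re
# Matches anything that looks like a tag: "<", one or more non-">" characters, ">".
TAG_RE = re.compile(r'<[^>]+>')
def remove_Htmltags(text):
    return TAG_RE.sub('', text)
# Raw string, so the backslashes in the Windows path are not read as escapes.
with open(r"C:\Temp\Data.txt", "r") as f:
    strHtml = f.read()
strClearText = remove_Htmltags(strHtml)
print(strClearText)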
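The regular-expression approach can also be applied to a whole folder, as in the question. The following is only a sketch under stated assumptions: strip_tags and strip_directory are hypothetical helper names, the files are assumed to be UTF-8 text, and html.unescape is added on top of the original answer so that entities such as &amp;amp; become plain characters.

```python
import html
import re
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    # Drop anything that looks like a tag, then decode entities such as &amp;.
    return html.unescape(TAG_RE.sub("", text))


def strip_directory(src_dir, dst_dir):
    # Hypothetical helper: writes a tag-free copy of every file in src_dir
    # into dst_dir and returns the names of the files it wrote.
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps one badly encoded file from stopping the run.
        raw = path.read_text(encoding="utf-8", errors="replace")
        (dst / path.name).write_text(strip_tags(raw), encoding="utf-8")
        written.append(path.name)
    return written
```

With the paths from the question this would be called as strip_directory("D:/10ks/list_missing_10k/", "D:/10ks/list_missing_10kp/"). Because the pattern does no parsing, it also removes "<![...]>" sections without complaint, but it keeps the contents of script and style elements and can be confused by a ">" inside an attribute value, so the output is worth spot-checking.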