Python: BeautifulSoup UnboundLocalError

I am trying to strip the HTML tags from a set of documents stored as .txt files. As far as I can tell, the failure happens inside bs4. This is the error I get:

 Traceback (most recent call last):
  File "E:/Google Drive1/Thesis stuff/Python/database/get_missing_10ks.py", line 13, in <module>
    text = BeautifulSoup(file_read, "html.parser")
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 282, in __init__
    self._feed()
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 343, in _feed
    self.builder.feed(self.markup)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\builder\_htmlparser.py", line 247, in feed
    parser.feed(markup)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\_markupbase.py", line 160, in parse_marked_section
    if not match:
UnboundLocalError: local variable 'match' referenced before assignment

And the code that I am using is the following:

import os
from bs4 import BeautifulSoup

path_to_10k = "D:/10ks/list_missing_10k/"

path_to_saved_10k = "D:/10ks/list_missing_10kp/"

list_txt = os.listdir(path_to_10k)

for name in list_txt:
    file = open(path_to_10k + name, "r+", encoding="utf-8")
    file_read = file.read()
    text = BeautifulSoup(file_read, "html.parser")
    text = text.get_text("\n")
    file2 = open(path_to_saved_10k + name, "w+", encoding="utf-8")
    file2.write(str(text))
    file2.close()
    file.close()

I have used this method on 51320 documents and it worked fine, but a few documents fail with the error above. When I open those documents, their HTML looks the same to me as in the ones that work. Any indication of what the problem could be and how to fix it would be great. Thank you!

EXAMPLE OF FILE: https://files.fm/u/2s45uafp

asked Nov 03 '18 06:11

Adrian

2 Answers

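The w3lib library provides a function for stripping tags:

https://github.com/scrapy/w3lib
https://w3lib.readthedocs.io/en/latest/

Install it with:

pip install w3lib

Then import the function:

from w3lib.html import remove_tags

Calling remove_tags(data) returns the text with the HTML tags removed.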
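
To apply a tag-stripping function to the whole folder from the question without one bad document stopping the run, something along these lines may help. This is only a sketch: convert_folder and its clean parameter are hypothetical names, and it assumes the files are UTF-8 as in the question.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, clean):
    """Apply clean(text) to every file in src_dir and write results to dst_dir.

    Returns a list of (file name, error) pairs for the files that failed,
    so the problematic documents can be inspected afterwards.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    failed = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        try:
            # Assumption: the documents are UTF-8, as in the question's code.
            text = path.read_text(encoding="utf-8")
            (dst / path.name).write_text(clean(text), encoding="utf-8")
        except Exception as exc:
            # Record the failure and carry on with the remaining files.
            failed.append((path.name, repr(exc)))
    return failed
```

With w3lib installed, it could be called as convert_folder("D:/10ks/list_missing_10k/", "D:/10ks/list_missing_10kp/", remove_tags). Printing the returned list shows which files raised an error and what the error was.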

answered Oct 21 '22 10:10

Serhii

Here is a solution that uses a regular expression to remove the HTML tags.

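import re

# Matches anything that looks like an HTML tag, e.g. <p> or </div>
TAG_RE = re.compile(r'<[^>]+>')

def remove_html_tags(text):
    """Return text with every HTML tag removed."""
    return TAG_RE.sub('', text)

# Raw string, so the backslashes in the Windows path are not read as escapes.
# The encoding is an assumption taken from the question's code.
with open(r"C:\Temp\Data.txt", "r", encoding="utf-8") as f:
    html_text = f.read()

clean_text = remove_html_tags(html_text)
print(clean_text)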
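
Note that a single tag pattern like the one above leaves the contents of <script> and <style> elements in the output and does not decode entities such as &amp;. A slightly fuller sketch, using only the standard library, could look like this. The name strip_html is hypothetical, and replacing each tag with a newline is an assumption chosen to mimic get_text("\n") from the question; regular expressions are not a full HTML parser, so unusual markup may still slip through.

```python
import html
import re

# Remove <script> and <style> elements together with their contents.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Remove HTML comments.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Remove any remaining tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Return the visible text of an HTML string, one fragment per line."""
    text = SCRIPT_STYLE_RE.sub("", text)
    text = COMMENT_RE.sub("", text)
    # Replace tags with newlines so adjacent text fragments stay separated.
    text = TAG_RE.sub("\n", text)
    # Decode entities last, so an escaped "&lt;b&gt;" survives as literal text.
    text = html.unescape(text)
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "<html><style>p{}</style><p>A &amp; B</p><!-- note --><p>C</p></html>"
print(strip_html(sample))
```

Because every tag becomes a line break, inline markup such as <b> also splits the text onto separate lines; substituting a space instead of a newline in the TAG_RE step would keep such text on one line.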

answered Oct 21 '22 09:10

Chikku Jacob

Tags: python, html, parsing, text-files, beautifulsoup
