I am using the pandas library on Python 3.5.1. How can I remove HTML tags from field values? Here is my code, which returned an error:
import pandas as pd
code = [1, 2, 3]
overview = ['<p>Environments subject.</p>',
            '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
            '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">']
# '<p class="SSPBodyText" style="padding: 0cm; text-align: justify;">The subject.</p>']
df = pd.DataFrame(overview, code)
df.columns = ['overview']
df['overview_copy'] = df['overview']
# print(df)
tags_list = ['<p>', '</p>', '<p*>',
             '<ul>', '</ul>',
             '<li>', '</li>',
             '<br>',
             '<strong>', '</strong>',
             '<span*>', '</span>',
             '<a href*>', '</a>',
             '<em>', '</em>']
for tag in tags_list:
    # df['overview_copy'] = df['overview_copy'].str.replace(tag, '')
    df['overview_copy'].replace(to_replace=tag, value='', regex=True, inplace=True)
print(df)
Use the re.sub() method to remove the HTML tags from a string, e.g. result = re.sub(r'<.…
replace('<[^<]+?>', '')  # Use a regex to remove HTML tags.
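As a minimal sketch of how that regex could be applied to the DataFrame column from the question: the snippet below uses Series.str.replace with the pattern quoted above. The column names and sample values come from the question; passing regex=True explicitly is my assumption about what is needed, since the default for that parameter differs between pandas versions.

```python
import pandas as pd

# Sample data copied from the question.
overview = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]
df = pd.DataFrame({'overview': overview}, index=[1, 2, 3])

# Replace anything that looks like a tag ("<", some non-"<" characters, ">")
# with an empty string. regex=True is given explicitly (assumption: the
# installed pandas version accepts this keyword).
df['overview_copy'] = df['overview'].str.replace(r'<[^<]+?>', '', regex=True)

print(df['overview_copy'].tolist())
```

Unlike the list of literal tags in the question, a single pattern also catches tags with attributes such as the `<p class="MsoNormal" ...>` row. It only removes the tags themselves, so leftover whitespace may still need a `.str.strip()`.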
Remove HTML tags from a string in Python using the lxml module
The fromstring() method takes the original string as input and returns the parsed HTML as an element. From that element, we can extract the text using the text_content() method, leaving the HTML tags behind. The text_content() method returns an object of lxml.etree.
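A rough sketch of the lxml approach described above, assuming the third-party lxml package is installed. The helper name strip_tags and the empty-string guard are my own additions for illustration, not part of lxml.

```python
from lxml import html  # third-party package (assumed installed: pip install lxml)


def strip_tags(fragment):
    """Return only the text of an HTML fragment (hypothetical helper)."""
    # Guard against empty input, which the parser may reject.
    if not fragment.strip():
        return ''
    element = html.fromstring(fragment)
    # Convert to a plain str in case a string subclass is returned.
    return str(element.text_content())


# Sample values copied from the question.
samples = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]
cleaned = [strip_tags(s) for s in samples]
print(cleaned)
```

With pandas, something like `df['overview'].map(strip_tags)` should apply the same helper to every row, at the cost of parsing each value individually.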
Pandas DataFrame: drop() function
The drop() function is used to drop specified labels from rows or columns. Rows or columns can be removed by specifying label names and the corresponding axis, or by specifying index or column names directly.
Like so: re.sub('<[^<]+?>', '', text)
You can find a detailed answer there.
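For plain strings outside pandas, the re.sub call above might be wrapped like this. It uses only the standard library; the names TAG_RE and remove_tags are made up for the example.

```python
import re

# Same pattern as in the answer: "<", one or more non-"<" characters, then ">".
TAG_RE = re.compile(r'<[^<]+?>')


def remove_tags(text):
    """Delete everything that looks like an HTML tag (hypothetical helper)."""
    return TAG_RE.sub('', text)


print(remove_tags('<p>Environments subject.</p>'))
print(remove_tags('<ul><li> property ;</li></ul>'))
```

Keep in mind that a regex is not an HTML parser: it does not decode entities such as `&amp;`, and unusual markup (for example a `>` inside an attribute value) can trip it up, so a real parser may be the safer choice for messy input.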