I am scraping a page with BeautifulSoup, and sometimes the contents of a <td> tag include a <br> tag.
So sometimes it looks like this:
<td class="xyz">
text 1
<br>
text 2
</td>
and sometimes it looks like this:
<td class="xyz">
text 1
</td>
I am looping through this and adding to an output_row list that I eventually add to a list of lists. Whether I see the former format or the latter, I want the text to be in one cell.
I've found a way to detect the <br> tag: td.string comes back as None in that case, and I also know that text 2 always has 'ABC' in it. So:
elif td.string is None:
    if 'ABC' in td.contents[2]:
        new_string = td.contents[0] + ' ' + td.contents[2]
        output_row.append(new_string)
        print(new_string)
else:
#this is for another situation and it works fine
When I print this in a Jupyter Notebook, it shows up as "text 1 text 2" on one line. But when I open my CSV, it is in two different columns. So when td.string has contents (meaning no <br> tag), text 1 shows up in one column, but when I get to the cells that have a <br> tag, all my data gets shifted.
I'm not sure why it shows up as two different strings (two columns) when I concatenate them before appending them to the list.
I'm writing to file like this:
with open('C:/location/file.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    # writer.writerow(headers)
    for row in output_rows:
        writer.writerow(row)
# the with block closes the file on exit, so no explicit close() call is needed
You can handle both cases using get_text() with the "strip" and "separator" arguments:
from bs4 import BeautifulSoup
dat="""
<table>
<tr>
<td class="xyz">
text 1
<br>
text 2
</td>
<td class="xyz">
text 1
</td>
</tr>
</table>
"""
soup = BeautifulSoup(dat, 'html.parser')
for td in soup.select("table > tr > td.xyz"):
    print(td.get_text(separator=" ", strip=True))
Prints:
text 1 text 2
text 1
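To connect this back to the CSV output in the question, here is a minimal sketch of how the same get_text() call could feed the row-building loop. It assumes each <tr> becomes one CSV row and each <td> one column; the in-memory buffer and the names html, buffer and csv_text are only illustrative, so swap in your own file path and loop structure as needed.

```python
import csv
import io

from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td class="xyz">
text 1
<br>
text 2
</td>
<td class="xyz">
text 1
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

output_rows = []
for tr in soup.find_all("tr"):
    # One list entry per <td>, so every cell maps to exactly one CSV column,
    # whether or not it contains a <br>.
    output_row = [
        td.get_text(separator=" ", strip=True) for td in tr.find_all("td")
    ]
    output_rows.append(output_row)

# An in-memory buffer stands in for the real file (an assumption for this
# sketch); use open(path, "w", newline="") to write to disk instead.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")
writer.writerows(output_rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Because strip=True removes the surrounding whitespace and newlines from each piece of text before joining, the cell with the <br> ends up as a single clean string, and the special case for td.string being None is no longer needed for this situation.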