Python to CSV is splitting string into two columns when I want one

I am scraping a page with BeautifulSoup, and the contents of a <td> tag sometimes include a <br>.

So sometimes it looks like this:

<td class="xyz">
    text 1
    <br>
    text 2
</td>

and sometimes it looks like this:

<td class="xyz">
    text 1
</td>

I am looping through this and adding to an output_row list that I eventually add to a list of lists. Whether I see the former format or the latter, I want the text to be in one cell.

I've found a way to detect the <br> case: td.string is None when the tag is present, and I also know that text 2 always contains 'ABC'. So:

    elif td.string is None:
        if 'ABC' in td.contents[2]:
            # contents[0] is "text 1", contents[1] is the <br>, contents[2] is "text 2"
            new_string = td.contents[0] + ' ' + td.contents[2]
            output_row.append(new_string)
            print(new_string)
        else:
            # this is for another situation and it works fine
            pass

When I print this in a Jupyter Notebook, it shows up as "text 1 text 2" on one line. But when I open my CSV, it is in two different columns. So when td.string has contents (meaning there is no <br> tag), text 1 shows up in one column, but for the cells that have a <br> tag, all my data gets shifted.

I'm not sure why it ends up as two different strings (two columns) when I concatenate them before appending them to the list.

I'm writing to file like this:

import csv

# The with block closes the file on exit, so no explicit close() call is needed.
with open('C:/location/file.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    #writer.writerow(headers)
    for row in output_rows:
        writer.writerow(row)
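
One way to narrow this down is to check what the csv module does with a single list element that contains commas or newlines. The sketch below uses a made-up row (the values and the in-memory buffer are assumptions for illustration, not data from the real page) and reads the written data back to count the fields:

```python
import csv
import io

# Hypothetical row: the middle element holds embedded newlines and a comma,
# similar to what concatenating raw td.contents pieces might produce.
row = ["id-1", "\n    text 1\n     ABC, text 2\n", "last"]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter=",").writerow(row)

# Read the data back to count the fields the csv module actually wrote.
buffer.seek(0)
parsed = next(csv.reader(buffer))
print(len(parsed))
print(parsed == row)
```

Since csv.writer quotes any field that contains the delimiter or a line break, one list element is written as one field. If the count matches here but the real file still looks shifted, printing repr(row) for an affected row may show whether the list itself holds an extra element or stray whitespace.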
asked Oct 30 '22 19:10 by strahanstoothgap


1 Answer

You can handle both cases with get_text(), using its strip and separator arguments:

from bs4 import BeautifulSoup

dat="""
<table>
    <tr>
        <td class="xyz">
            text 1
            <br>
            text 2
        </td>

        <td class="xyz">
            text 1
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(dat, 'html.parser')
for td in soup.select("table > tr > td.xyz"):
    print(td.get_text(separator=" ", strip=True))

Prints:

text 1 text 2
text 1
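
A minimal sketch of how this could slot into the row-building loop from the question. The table layout, the extra first column, and the in-memory buffer standing in for the CSV file are assumptions for illustration; only output_row and output_rows are names taken from the question:

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical table: one row has a <br> inside the cell, the other does not.
html = """
<table>
    <tr>
        <td>A</td>
        <td class="xyz">
            text 1
            <br>
            text 2 ABC
        </td>
    </tr>
    <tr>
        <td>B</td>
        <td class="xyz">
            text 1
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

output_rows = []
for tr in soup.select("tr"):
    # get_text() strips each text fragment and joins them with a space,
    # so no special case is needed for cells that contain a <br>.
    output_row = [td.get_text(separator=" ", strip=True) for td in tr.find_all("td")]
    output_rows.append(output_row)

# Write to an in-memory buffer here; a real file opened with newline='' works the same way.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(output_rows)
print(buffer.getvalue())
```

Each cell comes out as one clean string, so the check on td.string and the manual concatenation of td.contents are no longer needed for this case.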
answered Nov 15 '22 07:11 by alecxe