python lxml xpath returning escape characters in list with text

Tags: python, xpath, lxml

Before last week, my experience with Python was limited to working with large database files on our network. Now I have suddenly been thrust into the world of extracting information from HTML tables.

After a lot of reading, I chose to use lxml and XPath with Python 2.7 to retrieve the data in question. I retrieved one field using the following XPath expression:

xpath = "//table[@id='resultsTbl1']/tr[position()>1]/td[@id='row_0_partNumber']/child::text()" 

which produced the following list:

['\r\n\t\tBAR18FILM/BKN', '\r\n\t\t\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t\r\n\t\t']

I recognize the CR/LF and tab escape characters. How can I avoid them?

plg asked Jun 21 '26 21:06


1 Answer

Those characters are part of the XML document itself (the whitespace around and between the elements), which is why they are being returned. You can't avoid them, but you can strip them out by calling the .strip() method on each item returned:

results = [x.strip() for x in results]

This strips leading and trailing whitespace from each string. Without seeing your actual code and data, it is hard to give a more specific answer.

For example, given this script:

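#!/usr/bin/python

from lxml import etree

# Open in binary mode so lxml can detect the encoding itself.
with open('data.xml', 'rb') as fd:
    doc = etree.parse(fd)

# Select the text nodes of every cell in every row after the header row.
results = doc.xpath(
    "//table[@id='results']/tr[position()>1]/td/child::text()")

# print(...) with a single argument works on Python 2.7 and Python 3.
print('Before stripping')
print(repr(results))

print('After stripping')
results = [x.strip() for x in results]
print(repr(results))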

And this data:

<doc>
  <table id="results">
    <tr>
      <th>ID</th><th>Name</th><th>Description</th>
    </tr>

    <tr>
      <td>
      1
      </td>
      <td>
      Bob
      </td>
      <td>
      A person
      </td>
    </tr>
    <tr>
      <td>
      2
      </td>
      <td>
      Alice
      </td>
      <td>
      Another person
      </td>
    </tr>
  </table>
</doc>

We get these results:

Before stripping
['\n\t\t\t1\n\t\t\t', '\n\t\t\tBob\n\t\t\t', '\n\t\t\tA person\n\t\t\t', '\n\t\t\t2\n\t\t\t', '\n\t\t\tAlice\n\t\t\t', '\n\t\t\tAnother person\n\t\t\t']
After stripping
['1', 'Bob', 'A person', '2', 'Alice', 'Another person']
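
Applied to the list from the question, stripping alone still leaves empty strings wherever a text node contained only whitespace. Here is a minimal sketch, assuming you only want the non-empty values. The list is copied from the question, and the names raw, stripped and cleaned are just illustrative:

```python
# List copied from the question: one part number plus whitespace-only text nodes.
raw = ['\r\n\t\tBAR18FILM/BKN', '\r\n\t\t\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t',
       '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t\r\n\t\t']

# Strip leading and trailing whitespace from each item.
stripped = [x.strip() for x in raw]

# Keep only the items that still contain text after stripping.
cleaned = [x for x in stripped if x]

print(repr(stripped))
print(repr(cleaned))
```

Whether dropping the empty entries is appropriate depends on your data: if you need to keep positions aligned with table cells, you may prefer to keep the empty strings.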
larsks answered Jun 24 '26 15:06