I want to look for © in an HTML document and get the entity the copyright is attributed to.
The copyright line shows up a couple of different ways:
<p class="bg-copy">© 2011 The New York Times Company</p>
or
<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">
© 2011</a>
<a href="http://www.nytco.com/">The New York Times Company</a>
or
<br>Published since 1996<br>Copyright © CounterPunch<br>
All rights reserved.<br>
I want to ignore the dates and intervening tags and just get "The New York Times Company" or "CounterPunch".
I haven't been able to find much on using regex with JavaScript or jQuery, though I get the impression that it can lead to major headaches. If there is a better approach, let me know.
For a robust solution, you will probably need a combination of DOM navigation and some heuristics. Your examples are solvable with regex, but there are so many more scenarios possible...
©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)
works for your three samples, but ONLY for them and similar cases.
See it on Rubular.
Explanation:
© // copyright symbol
[\s\d]* // followed by any number of whitespace characters or digits
(?:<\/.+?>[^>]*>)? // maybe followed by a closing tag and another opening one
([^<]*) // then capture anything up to the next tag
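See this answer on how to use it in JavaScript with jQuery. Basically, you can call match(/regex/) on the string; the name ends up in the first capture group:
var result = string.match(/©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/);
var holder = result ? result[1] : null; // match() returns null when nothing matches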
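To check the pattern against the three samples from the question, here is a minimal sketch. The function name extractCopyrightHolder and the samples array are my own, made up for illustration. It assumes the page contains the literal © character; if the markup uses the &copy; entity instead, the pattern will not match as written.

```javascript
// Hypothetical helper (not part of any library): returns the text captured
// after the copyright symbol, or null when the pattern does not match.
function extractCopyrightHolder(html) {
  var match = html.match(/©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/);
  if (!match) {
    return null;
  }
  // Group 1 is everything up to the next tag; trim stray whitespace.
  var holder = match[1].trim();
  return holder === "" ? null : holder;
}

// The three snippets from the question, as plain strings.
var samples = [
  '<p class="bg-copy">© 2011 The New York Times Company</p>',
  '<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">\n' +
    '© 2011</a>\n' +
    '<a href="http://www.nytco.com/">The New York Times Company</a>',
  '<br>Published since 1996<br>Copyright © CounterPunch<br>\n' +
    'All rights reserved.<br>'
];

samples.forEach(function (html) {
  console.log(extractCopyrightHolder(html));
});
// The New York Times Company
// The New York Times Company
// CounterPunch
```

With jQuery you could feed it something like $('body').html(), but keep in mind that any layout not resembling these three will probably need its own heuristic.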