For example:
string = "This is a link http://www.google.com"
How could I extract 'http://www.google.com'?
(Each link will be of the same format, i.e. 'http://')
URLs can be extracted from a text file with a regular expression. The expression returns the text wherever it matches the pattern. Only the re module is needed for this.
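As a minimal sketch of the text-file approach described above: the function name `extract_urls_from_file` and the UTF-8 encoding are assumptions made for illustration, and the pattern is the same simple one used elsewhere in this thread, not a full URL grammar.

```python
import re

# Simple pattern: "http://" or "https://" followed by any run of non-whitespace.
URL_PATTERN = re.compile(r"https?://[^\s]+")


def extract_urls_from_file(path):
    """Return every http:// or https:// URL found in the text file at `path`.

    Hypothetical helper; assumes the file is UTF-8 encoded text.
    """
    with open(path, encoding="utf-8") as handle:
        return URL_PATTERN.findall(handle.read())
```

Calling `extract_urls_from_file("notes.txt")` (a hypothetical file name) would return a list of the matched URLs in the order they appear, or an empty list if there are none.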
For a PDF, open the file in binary mode so that the URL pattern can be matched against its contents. Define a function to extract the links for a particular page, then iterate over all the pages and extract the text using the extractText() function. Extracting hyperlinks from a PDF in Python generally relies on this kind of pattern matching.
There may be a few ways to do this, but the cleanest would be to use a regex:
>>> import re
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com
If there can be multiple links, you can use something like the following:
>>> myString = "These are the links http://www.google.com and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
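One caveat with the pattern above is that it keeps any punctuation that directly follows a link, such as a trailing comma or full stop. A rough sketch of one way to handle that is below; the helper name `find_urls` and the set of characters stripped are assumptions, and a URL that legitimately ends in one of those characters (a closing parenthesis, for instance) would be trimmed too.

```python
import re


def find_urls(text):
    """Return the URLs in `text`, dropping punctuation that trails a URL.

    Hypothetical helper: the stripped character set is a guess at what
    usually belongs to the sentence rather than to the link.
    """
    matches = re.findall(r"https?://\S+", text)
    # rstrip removes any run of the listed characters from the right end only.
    return [match.rstrip(".,;:!?)\"'") for match in matches]


if __name__ == "__main__":
    sample = "See http://www.google.com, then try http://stackoverflow.com."
    print(find_urls(sample))
```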