Remove urls from strings

Tags: url, replace, r, gsub

I have the following string, stored in the object sentence:

sentence <- "aazdlubtirol: RT @tradeDayTrades: sister articles \"$AAPL Dancing in a Burning Room\" January 2013  http://t.co/tkuCRfLy  \" $AAPL vs $AAPL \"  August 2011 http://t.co/863HkVjn"

I am trying to use gsub to remove the URLs beginning with http:

sentence <- gsub('http.*','',sentence)

However, it removes everything from the first http to the end of the string:

aazdlubtirol: RT @tradeDayTrades: sister articles \"$AAPL Dancing in a Burning Room\" January 2013

What I want is:

aazdlubtirol: RT @tradeDayTrades: sister articles \"$AAPL Dancing in a Burning Room\" January 2013 \" $AAPL vs $AAPL \" August 2011

I am trying to clean up the URLs: if a string contains http, I want to remove only the URL itself and keep the rest of the text. I found some solutions, but none of them worked for me.

asked Feb 05 '14 21:02 by Aniks


1 Answer

Add a space to your pattern:

gsub('http.* *', '', sentence)

Or use \\s, the regex class for whitespace:

gsub('http.*\\s*', '', sentence)

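As pointed out in the comments, .* matches any character and regex quantifiers are greedy, so both patterns above still delete everything from the first http to the end of the string. Instead, match http followed by one or more non-whitespace characters (\\S+) and then zero or more whitespace characters (\\s*):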

gsub('http\\S+\\s*', '', sentence)
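
To see the difference between the greedy pattern and the \\S+ version, here is a minimal sketch run against the string from the question. The variable names greedy, no_urls and clean are only illustrative, and the final whitespace clean-up is an extra step assumed here to reproduce the spacing of the desired output; it is not part of the pattern above.

```r
sentence <- "aazdlubtirol: RT @tradeDayTrades: sister articles \"$AAPL Dancing in a Burning Room\" January 2013  http://t.co/tkuCRfLy  \" $AAPL vs $AAPL \"  August 2011 http://t.co/863HkVjn"

# Greedy: .* runs to the end of the string, so everything
# after the first "http" is removed along with the URL.
greedy <- gsub("http.*", "", sentence)

# \\S+ stops at the first whitespace character, so each URL
# is matched and removed on its own.
no_urls <- gsub("http\\S+\\s*", "", sentence)

# Optional tidy-up (an assumption, not part of the answer):
# collapse runs of whitespace and trim both ends.
clean <- trimws(gsub("\\s+", " ", no_urls))

print(greedy)
print(clean)
```

If the text may also contain words that merely start with http, a stricter prefix such as 'https?://\\S+\\s*' is one possible refinement.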

answered Sep 30 '22 18:09 by Justin