
Pattern replace in R

Tags: regex, r, twitter

I'm working on a Twitter dataset in R and I'm finding it difficult to remove usernames from tweets.

This is an example of the tweets in the tweet column of my dataset:

[1] "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."         
[2] "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"

I want to remove/replace all words starting with "@" to get this output:

[1] "2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."         
[2] "Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"

This gsub call removes only the "@" symbol itself:

gsub("@", "", tweetdata$tweets)

What I want to express is: remove the "@" symbol and the characters following it, up to the next space or punctuation mark.

I started by trying to handle just the space, but to no avail:

gsub("@.*[:space:]$", "", tweetdata$tweets)

This removes the second tweet entirely.

gsub("@.*[:blank:]$", "", tweetdata$tweets)

This doesn't change the output.

I will be grateful for your help.

user3722736 asked Jun 09 '14 15:06



1 Answer

You can use the following. The @ matches the literal symbol, \S+ matches one or more non-whitespace characters (the username), and \s matches the single whitespace character that follows it.

gsub('@\\S+\\s', '', tweetdata$tweets)
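
As a quick check, here is a minimal sketch that applies this pattern to the two sample tweets from the question. The vector name tweets is only a stand-in for the tweet column and is not part of the original data.

```r
# Stand-in for the tweet column; the two strings are the samples from the question.
tweets <- c(
  "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said.",
  "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
)

# "@" is literal, "\\S+" is one or more non-whitespace characters,
# "\\s" is the single whitespace character right after the username.
cleaned <- gsub("@\\S+\\s", "", tweets)
print(cleaned)
```

Because the pattern requires a whitespace character after the username, a mention at the very end of a tweet (with nothing after it) would be left in place.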


EDIT: A negated character class would also work (this one treats only the space character as the delimiter):

gsub('@[^ ]+ ', '', tweetdata$tweets)
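
As for why the attempts in the question misbehave: a POSIX class such as [:space:] only has its special meaning inside a bracket expression, i.e. [[:space:]]. Written bare, it appears to be read as an ordinary set of the characters ":", "s", "p", "a", "c" and "e", and the trailing $ forces the match to run to the end of the string. That would explain why the second tweet, which ends in "s", is removed entirely. Below is a sketch of a variant that uses the POSIX class correctly and also catches a mention at the end of a tweet. The helper name strip_mentions is made up for illustration.

```r
# Hypothetical helper: drop every "@username" token plus any whitespace after it.
strip_mentions <- function(x) {
  # "[^[:space:]]+" = one or more non-whitespace characters (POSIX class inside brackets)
  # "[[:space:]]*"  = zero or more whitespace characters, so a trailing mention still matches
  out <- gsub("@[^[:space:]]+[[:space:]]*", "", x)
  # Remove whitespace left behind at either end of the string.
  trimws(out)
}

strip_mentions(c("@FreeMktMonkey @drleegross Want to build HSA", "thanks @drleegross"))
```

Note that any of these patterns will also cut into text such as e-mail addresses, since they remove everything from an "@" up to the next whitespace.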
hwnd answered Oct 04 '22 17:10
