sed to remove URLs from a file

Tags: sed

I am trying to write a sed expression that removes URLs from a file.

Example:

http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor @kdpartak :)   

But I can't get it to work:

sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile  

FIXED! This handles almost all cases, even malformed URLs:

sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more
daydreamer asked Nov 26 '10 07:11


2 Answers

The accepted answer provides the approach that I used to remove URLs, etc. from my files. However, it left "blank" lines. Here is a solution.

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file

perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file

The GNU sed flags and expressions used are:

-i    Edit in-place
-e    [-e script] --expression=script : add the commands in script
      (expression) to the set of commands to be run while processing the input
 ^    Match start of line
 $    Match end of line
\?    Match zero or one of the preceding regular expression (GNU extension)
{2,}  Match 2 or more of the preceding regular expression
\S*   Zero or more non-whitespace characters; alternative to: [^[:space:]]*

However,

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'

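leaves an empty line wherever a line contained only a URL: the URL is deleted, but the line's newline (\n) remains. Standard sed-based approaches for trimming leading and trailing tabs and spaces, e.g.

sed -i 's/^[ \t]*//; s/[ \t]*$//'

do not help here, because they edit text within a line rather than removing the line. sed reads input one line at a time and strips the trailing newline before running its commands, so a substitution cannot match a newline unless you first join lines, e.g. with N and a "branch label".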
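For illustration, the branch-label idea mentioned above can be sketched as follows. This is a minimal sketch, not the approach used in this answer: the function name squeeze_newlines is made up, and it assumes GNU sed (where \n in the replacement produces a newline). It joins the whole input into the pattern space and only then collapses runs of newlines.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative, not from the answer).
# :a      define a label named "a"
# N       append the next input line to the pattern space
# $!ba    if this is not the last line, branch back to label "a"
# s/...   once everything is joined, collapse 2+ newlines into one
# Assumes GNU sed for \n in the replacement text.
squeeze_newlines() {
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\{2,\}/\n/g' "$@"
}

# Usage: reads the named files, or standard input if none are given.
printf 'Some text ...\n\n\n\nSome more text.\n' | squeeze_newlines
```

Because the whole input is held in memory at once, this is probably best suited to small files.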

The solution is to use the following perl expression:

perl -i -pe 's/^'`echo "\012"`'${2,}//g'

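which uses a shell command substitution, '`echo "\012"`', to insert the octal value \012 (i.e., a newline, \n) into the pattern. The {2,} is intended to match 2 or more occurrences (otherwise we would unwrap all lines), and the empty replacement, //, replaces the match with nothing. Note that perl -p processes one line at a time, so in practice the effect is that lines consisting only of a newline are deleted.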

[The second reference below provides a wonderful table of these values!]

The perl flags used are:

-p  Places a printing loop around your command,
    so that it acts on each line of standard input

-i  Edit in-place

-e  Allows you to provide the program as an argument,
    rather than in a file

References:

  • perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
  • ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
  • remove URLs: sed to remove URLs from a file
  • branch labels: How can I replace a newline (\n) using sed?
  • GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
  • quick regex guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html

Example:

$ cat url_test_input.txt

Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.

$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a

$ cat a

Some text ...










Some more text.

$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a

$ cat a

Some text ...
Some more text.
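The two steps in this example could probably also be done in a single sed pass. The sketch below is an assumption-laden alternative, not what this answer ran: the function name remove_urls is made up, and it uses sed -E (extended regular expressions) with the t and b branch commands so that only lines emptied by the URL removal are deleted, while lines that were already blank are kept.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative, not from the answer).
# s#...##g     remove http(s)://, ftp: and www. URLs up to the next whitespace
# t changed    if a substitution happened on this line, jump to "changed"
# b            otherwise end processing of this line (it is printed as-is)
# :changed     label for lines that had a URL removed
# /^...$/d     delete such a line if nothing but whitespace is left
remove_urls() {
    sed -E \
        -e 's#(https?://|ftp:|www\.)[^[:space:]]*##g' \
        -e 't changed' \
        -e 'b' \
        -e ':changed' \
        -e '/^[[:space:]]*$/d' "$@"
}

# Usage: reads the named files, or standard input if none are given.
printf 'Some text ...\nhttps://example.com/a?b=c\nftp://ftp.example.com/\nSome more text.\n' |
    remove_urls
```

For example, remove_urls url_test_input.txt > a would stand in for the sed and perl steps shown above.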
Victoria Stuart answered Sep 25 '22 05:09


The following removes http:// or https:// and everything up until the next space:

sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile  
 updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N  Thx to HMB Contributor @kdpartak :)

Edit:

I should have used:

sed -e 's!http[s]\?://\S*!!g' posFile

"[s]\?" is a far more readable way of writing "an optional s" than "\(s\)\{0,1\}".

"\S*" is a more readable version of "any non-space characters" than "[^[:space:]]*".

I must have been using the sed that came installed with my Mac at the time I wrote this answer; \? and \S are GNU extensions (brew install gnu-sed FTW).
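For anyone who needs one command that runs under both the Mac's sed and GNU sed, a possible middle ground is extended regular expressions via -E, where ? is a standard operator. This is only a sketch; the function name strip_http_urls is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative, not from the answer).
# -E selects extended regular expressions, so "s?" means "an optional s"
# without relying on the GNU-only \? and \S.
strip_http_urls() {
    sed -E 's!https?://[^[:space:]]*!!g' "$@"
}

# Usage: reads the named files, or standard input if none are given.
echo 'B&N https://hollywoodmomblog.com/?p=2442 Thx' | strip_http_urls
```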


There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?
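As a rough illustration of covering other schemes, the pattern could be loosened to accept any scheme name followed by ://. This is a sketch, not a complete URL matcher: the function name strip_scheme_urls is made up, the scheme pattern (a letter followed by letters, digits, +, . or -) is an assumption about what counts as a scheme, and URLs without // (such as mailto: addresses) are not handled.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative, not from the answer).
# [[:alpha:]][[:alnum:]+.-]*   a scheme name such as http, ftp or git+ssh
# ://                          the separator that follows the scheme
# [^[:space:]]*                everything up to the next whitespace
strip_scheme_urls() {
    sed -E 's![[:alpha:]][[:alnum:]+.-]*://[^[:space:]]*!!g' "$@"
}

# Usage: reads the named files, or standard input if none are given.
echo 'get ftp://ftp.ncbi.nlm.nih.gov/ and http://scikit-learn.org/stable/ now' |
    strip_scheme_urls
```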

Johnsyweb answered Sep 21 '22 05:09