sed to remove URLs from a file

Question

I am trying to write a sed expression that can remove URLs from a file.

Example:

http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor @kdpartak :)

But I don't get it:

sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile

FIXED!!!!!

handles almost all cases, even malformed URLs

sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. / ; % " \W]*/ /g' positiveTweets | grep "http" | more

Victoria Stuart · Accepted Answer

The accepted answer provides the approach that I used to remove URLs, etc. from my files. However, it left "blank" lines. Here is a solution.

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file

perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file

The GNU sed flags, expressions used are:

-i    Edit in-place
-e    [-e script] --expression=script : basically, add the commands in script
      (expression) to the set of commands to be run while processing the input
 ^    Match start of line
 $    Match end of line


 \?   Match zero or one of preceding regular expression (GNU extension)
{2,}  Match 2 or more of preceding regular expression
\S*   Zero or more non-whitespace characters; alternative to: [^[:space:]]*
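
A quick way to check what \? and \S* match is to run the first substitution on a single line. This is only a sketch: strip_http is a hypothetical helper name, the example hostnames are made up, and GNU sed is assumed (other seds may not support \? or \S).

```shell
# strip_http: hypothetical helper; removes http:// and https:// URLs (GNU sed).
strip_http() {
  sed 's/https\?:\/\/\S*//g'
}

# \? makes the "s" optional (zero or one), so "httpss://" is NOT matched.
printf '%s\n' 'http://a.example/x https://b.example/y httpss://c.example/z' | strip_http
# prints two spaces followed by: httpss://c.example/z
```

The two separating spaces survive because \S* stops at the first whitespace character.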

However,

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'

leaves nonprinting character(s), presumably \n (newlines). Standard sed-based approaches to remove "blank" lines, tabs and spaces, e.g.

sed -i 's/^[ 	]*//; s/[ 	]*$//'

do not work here: sed reads its input one line at a time, so unless you use a "branch label" to join lines, a substitution cannot match or replace the newlines themselves.

The solution is to use the following perl expression:

perl -i -pe 's/^'`echo "\012"`'${2,}//g'

which uses a shell substitution,

'`echo "\012"`'

to replace an octal value

\012

(i.e., a newline, \n), that occurs 2 or more times,

{2,}

(otherwise we would unwrap all lines), with something else; here:

//

i.e., nothing.

[The second reference below provides a wonderful table of these values!]
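
To confirm that octal 012 really is a newline, printf can be used instead of echo. This is a small sketch, not part of the original command: printf interprets octal escapes in its format string, whereas echo's handling of backslash escapes differs between shells.

```shell
# \012 in a printf format string is the newline character.
printf 'first\012second\012'           # prints "first" and "second" on separate lines
printf 'first\012second\012' | wc -l   # counts the two newline characters
```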

The perl flags used are:

-p  Places a printing loop around your command,
    so that it acts on each line of standard input

-i  Edit in-place

-e  Allows you to provide the program as an argument,
    rather than in a file

References:

perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
remove URLs: sed to remove URLs from a file
branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
quick regex guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html

Example:

$ cat url_test_input.txt

Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.

$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a

$ cat a

Some text ...










Some more text.

$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a

Some text ...
Some more text.

$
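
For comparison, here is a sketch of a sed-only variant that deletes whole lines instead of trying to replace newlines. It assumes GNU sed; strip_urls_and_blanks is a hypothetical helper name and the sample URLs are placeholders. The d command drops any line that is empty or whitespace-only after the substitutions, so lines that were already blank in the input are removed as well.

```shell
# strip_urls_and_blanks: hypothetical helper (GNU sed for \? and \S).
# 1) remove http(s), www. and ftp: URLs; 2) delete lines left empty or whitespace-only.
strip_urls_and_blanks() {
  sed -e 's/https\?:\/\/\S*//g' \
      -e 's/www\.\S*//g' \
      -e 's/ftp:\S*//g' \
      -e '/^[[:space:]]*$/d'
}

printf '%s\n' 'Some text ...' 'https://example.com/page' 'ftp://ftp.example.org/' 'Some more text.' \
  | strip_urls_and_blanks
# prints:
# Some text ...
# Some more text.
```

If blank lines in the original text must be preserved, this shortcut is not suitable as written.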

Johnsyweb · Answer

The following removes http:// or https:// and everything up until the next space:

sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile
 updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N  Thx to HMB Contributor @kdpartak :)

Edit:

I should have used:

sed -e 's!http[s]\?://\S*!!g' posFile

"[s]\?" is a far more readable way of writing "an optional s" compared to "\(s\)\{0,1\}"

"\S*" is a more readable version of "any non-space characters" than "[^[:space:]]*"

I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).

There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?
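
As one possible extension (a sketch, not a complete URL grammar), the pattern can cover ftp as well by using extended regular expressions. The -E option is accepted by both GNU sed and the BSD sed shipped with macOS, and [^[:space:]] avoids the GNU-only \S. strip_scheme_urls is a hypothetical helper name and the hosts are placeholders.

```shell
# strip_scheme_urls: hypothetical helper; removes http://, https:// and ftp:// URLs.
strip_scheme_urls() {
  sed -E 's!(https?|ftp)://[^[:space:]]*!!g'
}

printf '%s\n' 'get ftp://ftp.example.org/x or http://example.com/ here' | strip_scheme_urls
# prints: get  or  here
```

Using ! as the delimiter, as in the commands above, avoids having to escape the slashes in ://.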

sed to remove URLs from a file

Tags:

sed

daydreamer

2 Answers

Victoria Stuart

Johnsyweb
