I want to perform the action named in the title from the Linux command line (several commands or a bash script will also do). The command I tried is:
sed 's/href="([^"])"/$1/g' page.html > list.lst
but obviously it failed.
To be precise, here is my input:
<link rel="stylesheet" type="text/css" href="style/css/colors.css" />
<link rel="stylesheet" type="text/css" href="style/css/global.css" />
<link rel="stylesheet" type="text/css" href="style/css/icons.css" />
The output I want is a comma-separated or space-separated list of all matches in the input file:
style/css/colors.css,style/css/global.css,style/css/icons.css
I think I have the right expression: href="([^"]*)"
but I have no clue how to apply it. sed does a search/replace, which is not exactly what I want. On the contrary, I only need to keep the matches and throw the rest away, not replace them.
Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses ( and ) metacharacters. Any subpattern inside a pair of parentheses will be captured as a group.
To get access to the text matched by each regex group in Python's re module, pass the group's number to the match object's group(group_number) method. The first group is number 1, the second is number 2, and so on. This works for every group, as long as the pattern matched.
The groups() method of a match object returns a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern.
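The description above is in terms of Python's re module, while the question is about sed. As a rough sketch of the same capture-group idea in sed, assuming basic regular expressions: groups are written \( and \), and the first group is referenced as \1 rather than $1. The attempt in the question also uses [^"] without a *, so it would match only a single character. The variable names line and href below are purely illustrative.

```shell
#!/bin/sh
# One sample line taken from the question's input.
line='<link rel="stylesheet" type="text/css" href="style/css/colors.css" />'

# \( ... \) captures the attribute value and \1 refers back to it.
# -n suppresses the default output, and the p flag prints only lines
# where the substitution succeeded, so non-matching lines are dropped.
href=$(printf '%s\n' "$line" | sed -n 's/.*href="\([^"]*\)".*/\1/p')

echo "$href"
```

With -n and p, sed keeps the matches and throws the rest away, which is the behavior the question asks for.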
grep href page.html | sed 's/^.*href="\([^"]*\)".*$/\1/' | xargs | sed 's/ /,/g'
This extracts every line that contains href and keeps only one href per line: because the leading .* is greedy, that is the last href on the line. Also, refer to this post about parsing HTML with regular expressions.
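If a line may carry more than one href, a variant built on grep -o might be worth trying. This is only a sketch: it assumes a grep that supports the -o option (GNU and BSD grep do, though it is not in POSIX), and it recreates the question's input in a temporary file so it can run on its own. The variable names page and list are illustrative.

```shell
#!/bin/sh
# Sample input from the question, written to a temporary file.
page=$(mktemp)
cat > "$page" <<'EOF'
<link rel="stylesheet" type="text/css" href="style/css/colors.css" />
<link rel="stylesheet" type="text/css" href="style/css/global.css" />
<link rel="stylesheet" type="text/css" href="style/css/icons.css" />
EOF

# grep -o prints each match on its own line, so several hrefs on one
# input line are all kept. sed strips the href="..." wrapper, and
# paste joins the resulting lines with commas.
list=$(grep -o 'href="[^"]*"' "$page" | sed 's/^href="//; s/"$//' | paste -sd, -)

echo "$list"
rm -f "$page"
```

For the sample input this should print the comma-separated list requested in the question. Replacing the comma in the paste delimiter with a space would give a space-separated list instead.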