
Putting line break using sed in bash, problems with regular expressions

Tags: regex, bash, sed, awk

Hi everyone, my data looks like this:

  samplename 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 ...
  samplename2 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 ...

and I want it to look like this:

  >samplename
  0 1 1 1 1 1 1 1 1 1 
  1 0 0 0 0 0 0 0 0 ...
  >samplename2 
  0 0 0 0 0 1 1 1 1 1 
  1 1 1 1 1 1 0 0 0 ...

[note - showing a line break after every 10 digits; I actually want it after every 200, but I realize that showing a line like that would not be very helpful].

I could do it with a regular expression in a text editor, but I want to use the sed command in bash because I have to do this several times and I need 200 numbers per row.

I tried this but got an error:

sed -e "s/\(>\w+\)\s\([0-9]+\)/\1\n\2" < myfile > myfile2

sed: 1: "s/(>\w+)\s([0-9]+)/ ...": unescaped newline inside substitute pattern

One more note - I am doing this on a Mac; I know that sed on the Mac is a little bit different from GNU sed. If you are able to give me a solution that works on a Mac, that would be great.

Thanks in advance.

JM88 asked Jan 29 '26 08:01


2 Answers

With your added request for a line break after 200 numbers, you are much better off using awk.

echo "hello 1 2 3 4" | awk '{print ">"$1; for(i=2; i<=NF; i++) {printf("%d ",$i); if((i+1)%2 == 0) printf("\n");}}'

prints out

>hello
1 2 
3 4 

If you want this to work only on lines that start with hello, you can modify it as follows:

echo "hello 1 2 3 4" | awk '/^hello / {print ">"$1; for(i=2; i<=NF; i++) {printf("%d ",$i); if((i+1)%2 == 0) printf("\n");}}'

(the regular expression between the / / says "only do this on lines that match this expression").

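You can change the test if((i+1)%2 == 0) to if((i-1)%100 == 0) to get a newline after every 100 numbers (or 200, as in the question). Note the switch from i+1 to i-1: the numbers start at field 2, so i-1 is the count of numbers printed so far, and with i+1 the first row would come up two numbers short. For 2 the two tests happen to give the same result; I just showed it for 2 because the printout is more readable.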
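To avoid editing the number inside the awk program each time, the row length can be passed in with -v. This is only a sketch of that idea: wrap_fields is a hypothetical helper name, the default of 200 comes from the question, and %s is used instead of %d so that non-numeric fields are printed unchanged.

```shell
# wrap_fields N: hypothetical helper. Reads "name v v v ..." lines on stdin,
# prints ">name" and then the remaining fields, N per row.
wrap_fields() {
  awk -v n="${1:-200}" '{
    print ">" $1
    for (i = 2; i <= NF; i++) {
      # i-1 is the 1-based position of this value among the numbers.
      # End the row after every n-th value and after the last value.
      sep = ((i - 1) % n == 0 || i == NF) ? "\n" : " "
      printf "%s%s", $i, sep
    }
  }'
}
```

It would be used as wrap_fields 200 < myfile > myfile2. Because the last value of each input line always ends its row, no extra blank line is printed between samples.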

Update: to make this all much cleaner, do the following.

Create a file called breakIt with the following contents (leave out the /^hello/ if you don't want to select only lines starting with "hello", but keep the {} around the code, it matters):

/^hello/ { print ">"$1;
   for(i=2; i<=NF; i++)
   {
      printf("%d ",$i);
      if((i-1)%100 == 0) printf("\n");
   }
   print "";
}

Now you can issue the command

awk -f breakIt inputFile > outputFile

This says "use the contents of breakIt as the commands to process inputFile and put the results in outputFile".

Should do the trick nicely for you.

Edit: just in case you really do want a sed solution, here is a nice one (well, I think so). Copy the following into a file called sedSplit:

s/^([^ ]+) />\1\
/g
s/(([0-9] ){10})/\1\
/g
s/$/\
/g

This has three consecutive sed commands. Each is a single command, but because the replacement text contains a newline (written as a backslash at the end of the line), each one spans two lines, so the script takes six lines.

s/^                  - substitute, starting from the beginning of the line
([^ ]+) /            - match the first word (any run of non-space characters, so names like samplename2 work) and the space after it, replacing with
>\1\
/g                   - the literal '>', then the captured word, then a newline

s/(([0-9] ){10})/    - match 10 repetitions of [digit followed by space]
\1\
/g                   - replace with itself, followed by a newline, as often as needed (g)

s/$/\
/g                   - append a newline at the end of the line

To get 200 numbers per row, as in the question, change {10} to {200}.

You invoke this sed script like this:

sed -E -f sedSplit < inputFile > outputFile

This uses the

-E flag (use extended regular expressions, so parentheses and braces need no escaping)

-f flag ('get instructions from this file')

It makes the whole thing much cleaner, and gives you the output you asked for on a Mac (even with an extra blank line to separate the groups; if you don't want that, leave out the last two lines).
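The same substitutions can also be run inline, without a script file. The sketch below assumes single-digit values separated by single spaces. split_rows is a hypothetical helper name, and the row length is spliced into the interval expression from the first argument. The backslash-newline form of the replacement is the portable way to insert a newline, so this should behave the same with the sed on a Mac and with GNU sed.

```shell
# split_rows N: hypothetical helper. Reads "name d d d ..." lines on stdin.
split_rows() {
  n="${1:-200}"
  # 1st -e: put ">" before the name and turn the space after it into a newline.
  # 2nd -e: insert a newline after every N "digit space" pairs.
  # The "/" lines start in column 0 on purpose: any indentation there would
  # become part of the replacement text.
  sed -E -e 's/^([^ ]+) />\1\
/' -e 's/(([0-9] ){'"$n"'})/\1\
/g'
}
```

It would be used as split_rows 200 < myfile > myfile2. Each full row keeps its trailing space, since the space is part of the matched group.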

Floris answered Jan 31 '26 00:01


$ awk '{print ">" $1; for (i=2;i<=NF;i++) printf "%s%s", $i, (((i-1)%10 && i<NF) ? FS : RS)}' file
>samplename
0 1 1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 0 0 ...
>samplename2
0 0 0 0 0 1 1 1 1 1
1 1 1 1 1 1 0 0 0 ...
Ed Morton answered Jan 30 '26 23:01