Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Print substrings every ith character

Tags:

bash

substr

awk

I have files that I would like to divide into substrings in a "sliding window" manner, in increments of 1 character. The files have only one line each, and I can print the substrings like this:

input="file.txt"
awk '{print substr($1,1,21)}' $input


awk '{print substr($1,2,21)}' $input

which give me the following output, respectively.

AATAAGGTGCCTGATTAAA-G   
ATAAGGTGCCTGATTAAA-GG

The input file contains about 17k characters and I managed to try and do a for loop to count the characters and try the above command within the for loop, like this:

count=`wc -c ${input} |cut -d' ' -f1`
for num in `seq ${count}`
   do
awk '{print substr($1,$num,21)}' $input
   done

But this returns empty outputs. I also wanted to run it as a bash scripts with the input and the size of the substrings and output file specified in the command line like:

script.sh input_file.txt 21 output.txt

And I tried this, but it also didn't work.

  input=$1
  kmer=$2
  output=$3
  count=`wc -c ${input} |cut -d' ' -f1`
  for num in `seq ${count}`
    do
 awk '{print substr($1,$num,$kmer)}' $input > $output
  done

Any tips on what I am doing wrong? I am pretty new to awk...

like image 924
Silvia Justi Avatar asked Jun 26 '18 19:06

Silvia Justi


People also ask

Is there a way to print all substrings of a string in O n time?

Is there a way to print all substrings of a string in O(N) time? No there isn't. It is mathematically impossible. There are O(N^2) substrings of a string of length N .

How do you find all substrings in length K?

If the length of a string is N, then there can be N – K + 1 substring of length K. Generating these substrings will require O(N) complexity, and checking each substring requires O(K) complexity, hence making the overall complexity like O(N*K). Efficient approach: The idea is to use Window Sliding Technique.

How many substrings can be formed from a character string?

The total number of substrings formed by string of length N is (N*(N+1))/2, initialise count as (N*(N+1))/2.


2 Answers

#!/usr/bin/env bash 

input=$1
kmer=$2
output=$3

data=$(<"$input")

for ((i=0;i<${#data};i++)); do
    echo "${data:i:kmer}"
done > "$output"

It uses only substring expansion, quoting from manual:

${parameter:offset:length}

This is referred to as Substring Expansion. It expands to up to length characters of the value of parameter starting at the character specified by offset.


Using gawk:

awk -v num="$kmer" '{for(i=1;i<=length($0);i++) print substr($0,i,num)}' "$input" > "$output"

This is a much faster solution. The speed difference is significant: Tested on 17k characters and a 30-char window: ~10s for the first solution, ~0.01s for the second solution.

like image 139
PesaThe Avatar answered Sep 30 '22 05:09

PesaThe


You can also do this with the GNU sed, as follows:

echo -n "123456789" | sed -r ':loop h;s/.//3g;p;x; s/.//; t loop'
12
23 
34
45
56
67
78
89 
9

3g is the "sliding window" size + 1.

to process data in a file instead of STDIN, just specify it after the sed command:

sed -r ':loop h;s/.//3g;p;x; s/.//; t loop' myfile
like image 23
zeppelin Avatar answered Sep 30 '22 06:09

zeppelin