I have files that I would like to divide into substrings in a "sliding window" manner, in increments of 1 character. The files have only one line each, and I can print the substrings like this:
input="file.txt"
awk '{print substr($1,1,21)}' $input
awk '{print substr($1,2,21)}' $input
which give me the following output, respectively.
AATAAGGTGCCTGATTAAA-G
ATAAGGTGCCTGATTAAA-GG
The input file contains about 17k characters and I managed to try and do a for loop to count the characters and try the above command within the for loop, like this:
count=`wc -c ${input} |cut -d' ' -f1`
for num in `seq ${count}`
do
awk '{print substr($1,$num,21)}' $input
done
But this returns empty outputs. I also wanted to run it as a bash scripts with the input and the size of the substrings and output file specified in the command line like:
script.sh input_file.txt 21 output.txt
And I tried this, but it also didn't work.
input=$1
kmer=$2
output=$3
count=`wc -c ${input} |cut -d' ' -f1`
for num in `seq ${count}`
do
awk '{print substr($1,$num,$kmer)}' $input > $output
done
Any tips on what I am doing wrong? I am pretty new to awk...
Is there a way to print all substrings of a string in O(N) time? No there isn't. It is mathematically impossible. There are O(N^2) substrings of a string of length N .
If the length of a string is N, then there can be N – K + 1 substring of length K. Generating these substrings will require O(N) complexity, and checking each substring requires O(K) complexity, hence making the overall complexity like O(N*K). Efficient approach: The idea is to use Window Sliding Technique.
The total number of substrings formed by string of length N is (N*(N+1))/2, initialise count as (N*(N+1))/2.
#!/usr/bin/env bash
input=$1
kmer=$2
output=$3
data=$(<"$input")
for ((i=0;i<${#data};i++)); do
echo "${data:i:kmer}"
done > "$output"
It uses only substring expansion, quoting from manual:
${parameter:offset:length}
This is referred to as Substring Expansion. It expands to up to
length
characters of the value ofparameter
starting at the character specified byoffset
.
Using gawk
:
awk -v num="$kmer" '{for(i=1;i<=length($0);i++) print substr($0,i,num)}' "$input" > "$output"
This is a much faster solution. The speed difference is significant: Tested on 17k characters and a 30-char window: ~10s for the first solution, ~0.01s for the second solution.
You can also do this with the GNU sed, as follows:
echo -n "123456789" | sed -r ':loop h;s/.//3g;p;x; s/.//; t loop'
12
23
34
45
56
67
78
89
9
3g
is the "sliding window" size + 1.
to process data in a file instead of STDIN, just specify it after the sed command:
sed -r ':loop h;s/.//3g;p;x; s/.//; t loop' myfile
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With