
Print substrings every ith character

Tags:

bash

substr

awk

I have files that I would like to divide into substrings in a "sliding window" manner, in increments of 1 character. The files have only one line each, and I can print the substrings like this:

input="file.txt"
awk '{print substr($1,1,21)}' $input


awk '{print substr($1,2,21)}' $input

These give me the following output, respectively:

AATAAGGTGCCTGATTAAA-G
ATAAGGTGCCTGATTAAA-GG

The input file contains about 17k characters, so I tried counting the characters and running the above command inside a for loop, like this:

count=`wc -c ${input} |cut -d' ' -f1`
for num in `seq ${count}`
do
    awk '{print substr($1,$num,21)}' $input
done

But this returns empty output. I also wanted to run it as a bash script, with the input file, the substring size, and the output file given on the command line, like:

script.sh input_file.txt 21 output.txt

And I tried this, but it also didn't work.

input=$1
kmer=$2
output=$3
count=`wc -c ${input} |cut -d' ' -f1`
for num in `seq ${count}`
do
    awk '{print substr($1,$num,$kmer)}' $input > $output
done

Any tips on what I am doing wrong? I am pretty new to awk...


asked Jun 26 '18 19:06

Silvia Justi

2 Answers

#!/usr/bin/env bash 

input=$1
kmer=$2
output=$3

data=$(<"$input")

for ((i=0;i<${#data};i++)); do
    echo "${data:i:kmer}"
done > "$output"

This uses only Bash substring expansion. Quoting from the Bash manual:

${parameter:offset:length}

This is referred to as Substring Expansion. It expands to up to length characters of the value of parameter starting at the character specified by offset.
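A small sketch of how the expansion behaves, assuming Bash and a made-up sample string. The offset is zero-based, and near the end of the string the result is shorter than the requested length, which is why the loop above also prints shortened trailing windows. The helper `full_windows` is a hypothetical addition that stops once fewer than `k` characters remain; it is not part of the original answer.

```shell
#!/usr/bin/env bash
data="ABCDE"   # made-up sample string
kmer=3

echo "${data:0:kmer}"   # offset 0 is the first character
echo "${data:3:kmer}"   # only two characters remain, so fewer than kmer are returned

# Hypothetical helper: print only the windows that are exactly k characters long.
full_windows() {
    local s=$1 k=$2 i
    for ((i = 0; i + k <= ${#s}; i++)); do
        printf '%s\n' "${s:i:k}"
    done
}

full_windows "$data" "$kmer"
```

Whether the shortened trailing windows are wanted depends on the use case; for fixed-length k-mers the stricter loop bound is probably the one to use.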

Using gawk (the command uses only standard awk features, so other awk implementations should work too):

awk -v num="$kmer" '{for(i=1;i<=length($0);i++) print substr($0,i,num)}' "$input" > "$output"

This is much faster. Tested on 17k characters with a 30-character window, the Bash loop took about 10 s and the awk version about 0.01 s.
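As for what went wrong in the question: `$num` and `$kmer` sit inside single quotes, so the shell never expands them, and awk reads `$num` as a field reference through its own unset variable `num`. The second script also redirects with `>` inside the loop, which truncates the output file on every iteration. The sketch below wraps the awk command in a function taking the three arguments from the question; the name `kmer_windows` and the sample data are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: kmer_windows INPUT_FILE WINDOW_SIZE OUTPUT_FILE
kmer_windows() {
    local input=$1 kmer=$2 output=$3
    # -v copies the shell value into the awk variable k. Inside the
    # single-quoted program a bare k is that variable, while $k would
    # mean "field number k".
    awk -v k="$kmer" '{
        for (i = 1; i <= length($0); i++) print substr($0, i, k)
    }' "$input" > "$output"   # one redirection for the whole run
}

# Demo on a made-up one-line file.
tmpdir=$(mktemp -d)
printf '%s\n' "ABCDE" > "$tmpdir/in.txt"
kmer_windows "$tmpdir/in.txt" 3 "$tmpdir/out.txt"
cat "$tmpdir/out.txt"
```

Saved as a script with `kmer_windows "$1" "$2" "$3"` as its last line instead of the demo, this could be called as `script.sh input_file.txt 21 output.txt`.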


answered Sep 30 '22 05:09

PesaThe

You can also do this with GNU sed, as follows:

echo -n "123456789" | sed -r ':loop h;s/.//3g;p;x; s/.//; t loop'
12
23
34
45
56
67
78
89
9

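The number in 3g is the "sliding window" size + 1 (here the window is 2 characters, so everything from the 3rd character onward is deleted before printing).

To process data in a file instead of STDIN, just specify it after the sed command: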

sed -r ':loop h;s/.//3g;p;x; s/.//; t loop' myfile
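To make the window size a parameter, the number in front of `g` can be computed by the shell. This is only a sketch, assuming GNU sed (combining a number with the `g` flag is a GNU extension); `sed_windows` is a hypothetical name, and the second `sed` is an addition that drops the empty lines the loop emits once the input is used up.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: read one line on stdin, print its sliding windows of size k.
sed_windows() {
    local k=$1
    # Double quotes let the shell insert k+1 into the sed script:
    # s/.//<k+1>g deletes everything from character k+1 onward,
    # leaving a window of k characters to print.
    sed -r ":loop h;s/.//$((k + 1))g;p;x; s/.//; t loop" | sed '/^$/d'
}

printf '%s\n' "123456789" | sed_windows 3
```

As with the other solutions, the last few windows are shorter than `k`.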

answered Sep 30 '22 06:09

zeppelin
