I am trying to filter a file, keeping only the rows where the value in the 8th column is >= 10. I am using awk, but for some reason it doesn't work. Am I doing something wrong? What am I missing?
head df_TPM.csv
LQNS02136402.1_14821_3p,12680.71611,11346.42368,11686.28693,9067.797819,7429.467928,5551.660333,3246.956281
LQNS02000137.1_325_3p,8342.540984,5905.726173,4503.363041,3616.191278,3142.965662,3678.829299,6288.621969
LQNS02278148.1_40791_3p,4921.502758,2461.882836,429.824973,261.273116,132.0239748,68.6191655,70.8815385
LQNS02278089.1_34112_3p,4246.71324,4584.529009,8687.922574,7570.83746,5801.384953,2870.020801,734.3131465
LQNS02278075.1_32377_5p,4143.547577,4093.91803,10804.12323,10062.99269,7925.240969,4712.484455,1080.915573
LQNS02138569.1_14892_3p,2668.27957,2160.173542,837.2584183,233.2310273,84.62362925,64.6037895,23.456714
LQNS02278075.1_32324_5p,2331.608924,491.8868983,1527.312199,881.8683105,747.1474225,347.397634,74.07259175
LQNS02278075.1_32382_3p,2140.686095,2439.122353,10837.38169,12569.95295,9385.530878,6022.323737,1705.900969
LQNS02000138.1_777_5p,1819.275149,1762.009649,8565.396754,33280.90019,32176.07604,15849.37306,11872.99383
LQNS02278186.1_47223_3p,1687.843418,728.4288968,1328.048172,1306.424238,2102.27342,14.78892225,9.92647375
#Extract column 1 and 8 and print if $8>=10
cat df_TPM.csv |awk -F"," '{print $1, $8}' | grep -E "^LQN" | awk -F " " '$2>= 10'
LQNS02276925.1_23356_5p 5.352369
LQNS02277221.1_25158_5p 2.82778125
LQNS02277812.1_29775_3p 11.1090745
LQNS02278074.1_32154_3p 6.124789
LQNS02278139.1_39525_5p 22.6656355
#As you can see lots of numbers shouldn't be there (ex: 2.82778125 < 10)
Based on OP's comment: if you don't want the lines that start with LQN and only want to check whether the 8th column is greater than or equal to 10, try the following (to match the lines that do start with LQN instead, remove the ! from the commands below).
awk -F"," '$8+0 >= 10 && !/^LQN/{print $1, $8}' df_TPM.csv
OR, to get the total number of matching lines, the counting can be done within a single awk:
awk -F"," '$8+0 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
Explanation: a commented version of the above.
awk -F"," '               ##Starting awk program from here.
$8+0 >= 10 && !/^LQN/{    ##Checking condition if 8th field is greater than or equal to 10 and line does NOT start with LQN.
  count++                 ##Increasing count with 1 here.
}
END{                      ##Starting END block of this awk program from here.
  print count             ##Printing count value here.
}
' df_TPM.csv              ##Mentioning Input_file name here.
If your Input_file contains Control-M (carriage return) characters and you want the awk code itself to handle them, strip them before testing the field:
awk -F"," '{gsub(/\r/,"")} $8 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
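Before stripping anything, it may help to confirm that the file really contains carriage returns. A minimal sketch, assuming a made-up two-line sample (sample.csv and its values are invented here and only stand in for df_TPM.csv):

```shell
# Build a tiny sample with Windows (CRLF) line endings; the data is invented.
tmp=$(mktemp -d)
printf 'id_a,1,2,3,4,5,6,5.35\r\nid_b,1,2,3,4,5,6,22.6\r\n' > "$tmp/sample.csv"

# Count the lines whose last character is a carriage return.
cr_lines=$(awk '/\r$/ {n++} END {print n+0}' "$tmp/sample.csv")
echo "lines ending in CR: $cr_lines"

# Clean up the temporary directory.
rm -r "$tmp"
```

A non-zero count on the real file would suggest that the last field carries a trailing carriage return.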
You need to tell awk to coerce $8 into a number by computing $8+0. It is also recommended to make sure you have GNU awk installed to avoid issues. You may also want to run dos2unix on the files before working on them to normalize the line endings.
The whole command can be written as
awk -F"," '/^LQN/ && $8+0 >= 10 {print $1, $8}' df_TPM.csv
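To see the effect of the +0, here is a small reproduction on invented data with CRLF line endings (the ids and values are made up, not taken from df_TPM.csv). It is only a sketch: how the uncoerced comparison behaves can differ between awk implementations, so only the coerced result is stated as expected.

```shell
# Two invented records with Windows line endings; field 8 is 5.35 and 22.6.
sample() { printf 'LQN_a,1,2,3,4,5,6,5.35\r\nLQN_b,1,2,3,4,5,6,22.6\r\n'; }

# Without +0: some awk implementations treat "5.35\r" as a string here,
# in which case both lines match; others still compare it as a number.
as_string=$(sample | awk -F, '$8 >= 10 {n++} END {print n+0}')

# With +0: the field is forced to a number, so only 22.6 satisfies >= 10.
as_number=$(sample | awk -F, '$8+0 >= 10 {n++} END {print n+0}')

echo "without +0: $as_string, with +0: $as_number"
```

The count with +0 should be 1 regardless of the awk in use, which is why the coerced form is the safer one to rely on.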
NOTE: To only count these lines, use
awk -F, '/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
To find the lines that do not start with LQN, just add the negation operator ! before /^LQN/:
awk -F, '!/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
Details
-F"," (= -F,) - set the field separator to a comma
/^LQN/ && $8+0 >= 10 - if the current line starts with LQN and the eighth field is greater than or equal to 10
!/^LQN/ && $8+0 >= 10 - if the current line does not start with LQN and the eighth field is greater than or equal to 10
{print $1, $8} - print fields 1 and 8
{cnt++} - increment the cnt variable
END{print cnt} - print the cnt variable once awk finishes processing the lines.