Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to print row(s) if they meet a certain range

Tags:

python

pandas

awk

I have two mega files that look like below:

f1:

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

f2:

chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025

What I want to do is to print the rows of file 2 in a separate column in file 1 IF the range of start and end column of file2 is inside the ranges in file1 (columns 2 and 3) and chr is the same. So based on the dummy example files I provided - the desired output should be (only the range of linc1320 is in the first row of the file1):

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

I am not a professional coder but I have been using this code to manually change the ranges based on the file2:

awk -F ',' '$2<=3073300,$3>=3074300, {print $1,$2,$3,$4,$5,$6,$7}' f1.csv

I do not have a particular preference for using a specific programming language - both Python and awk would be very helpful. Any help is appreciated thank you.

like image 706
Apex Avatar asked Apr 27 '21 09:04

Apex


People also ask

How do I print a specific row in pandas?

You can use the df. loc[[2]] to print a specific row of a pandas dataframe.

How do I print certain rows and columns?

On the Ribbon, click the Page Layout tab. In the Sheet Options group, under Headings, select the Print check box. , and then under Print, select the Row and column headings check box .

How do I print columns in Libreoffice?

The quickest way is to copy the cells, paste into Writer and click Format > Page Style > Columns and select 2, 3 or 4 columns as needed or will fit.

How do I print a selected range of cells?

This first option is the quickest method to print a selected range of cells. Select and highlight the range of cells you want to print. Next, click File > Print or press Ctrl+P to view the print settings. Click the list arrow for the print area settings and then select the “Print Selection” option.

How to copy rows based on column criteria in Excel?

1. Select the range that you will copy rows based on column criteria, and click the Enterprise > Split Data. 2. In the opening Split Data into Multiple Worksheets dialog box, you need to: (1) Check the Specific column option, and click the below drop-down list and specify the column that you will copy rows based on;

How to extract rows that meet criteria with kutools for Excel?

Extract rows that meet criteria with Kutools for Excel 1. Select the column you extract rows based on, and click Kutools > Select > Select Specific Cells. See screenshot: 2. Then in the Select Specific Cells dialog, check Entire row option in Selection type, and specify your criteria in the... 3. ...

How do I extract all rows from a range in Excel?

Extract all rows from a range that meet criteria in one column [Array formula] The array formula in cell B20 extracts records where column E equals either "South" or "East". To enter an array formula, type the formula in a cell then press and hold CTRL + SHIFT simultaneously, now press Enter once.


2 Answers

You may use this awk:

awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

A more readable form:

awk '
BEGIN { FS = OFS = "," }
FNR == NR {
   if (FNR > 1) {
      chr[++n] = $1
      id[n] = $2
      r1[n] = $3
      r2[n] = $4
   }
   next
}
{
   for (i=1; i<=n; ++i)
      if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
         $0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
         break
      }
} 1' file2 file1
like image 129
anubhava Avatar answered Nov 05 '22 20:11

anubhava


Let us try approaching the problem pandas way, first read the csv files into a pandas dataframe

f1 = pd.read_csv('file1.csv', header=None)
f2 = pd.read_csv('file2.csv')

>>> f1

      0        1        2        3                     4              5                     6
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1


>>> f2

    chr      name    start      end
0  chr1  linc1320  3073300  3074300
1  chr3  linc2245  3077270  3078250
2  chr1  linc8956  4410501  4406025

Now we can merge and filter the rows which satisfy the given condition of interval inclusion then we can join the filtered rows with the file f1

m = f1.reset_index()\
      .merge(f2, left_on=0, right_on='chr')\
      .where(lambda x: x[1].le(x['start']) & x[2].ge(x['end']))\
      .set_index('index')[['name', 'start', 'end']]

f3 = f1.join(m)

>>> f3

      0        1        2        3                     4              5                     6      name      start        end
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC  linc1320  3073300.0  3074300.0
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA       NaN        NaN        NaN
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1       NaN        NaN        NaN

PS: You can also save the resulting dataframe f3 to a csv file using f3.to_csv('file3.csv')

like image 31
Shubham Sharma Avatar answered Nov 05 '22 20:11

Shubham Sharma