How to print row(s) if they meet a certain range

Tags:

I have two mega files that look like below:

f1:

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

f2:

chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025

What I want to do is to print the rows of file 2 in a separate column in file 1 IF the range of start and end column of file2 is inside the ranges in file1 (columns 2 and 3) and chr is the same. So based on the dummy example files I provided - the desired output should be (only the range of linc1320 is in the first row of the file1):

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

I am not a professional coder but I have been using this code to manually change the ranges based on the file2:

awk -F ',' '$2<=3073300,$3>=3074300, {print $1,$2,$3,$4,$5,$6,$7}' f1.csv

I do not have a particular preference for using a specific programming language - both Python and awk would be very helpful. Any help is appreciated thank you.

706

asked Apr 27 '21 09:04

Apex

2 Answers

You may use this awk:

awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

A more readable form:

awk '
BEGIN { FS = OFS = "," }
FNR == NR {
   if (FNR > 1) {
      chr[++n] = $1
      id[n] = $2
      r1[n] = $3
      r2[n] = $4
   }
   next
}
{
   for (i=1; i<=n; ++i)
      if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
         $0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
         break
      }
} 1' file2 file1

129

answered Nov 05 '22 20:11

anubhava

Let us try approaching the problem pandas way, first read the csv files into a pandas dataframe

f1 = pd.read_csv('file1.csv', header=None)
f2 = pd.read_csv('file2.csv')

>>> f1

      0        1        2        3                     4              5                     6
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1


>>> f2

    chr      name    start      end
0  chr1  linc1320  3073300  3074300
1  chr3  linc2245  3077270  3078250
2  chr1  linc8956  4410501  4406025

Now we can merge and filter the rows which satisfy the given condition of interval inclusion then we can join the filtered rows with the file f1

m = f1.reset_index()\
      .merge(f2, left_on=0, right_on='chr')\
      .where(lambda x: x[1].le(x['start']) & x[2].ge(x['end']))\
      .set_index('index')[['name', 'start', 'end']]

f3 = f1.join(m)

>>> f3

      0        1        2        3                     4              5                     6      name      start        end
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC  linc1320  3073300.0  3074300.0
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA       NaN        NaN        NaN
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1       NaN        NaN        NaN

PS: You can also save the resulting dataframe f3 to a csv file using f3.to_csv('file3.csv')

answered Nov 05 '22 20:11

Shubham Sharma

Related questions
                            
                                Expand rows by date range having start and end in Pandas
                            
                                Google Sheets API appending extra apostrophe
                            
                                Fastest way to check if a string contains a string from a list
                            
                                zeep - disable warning "Forcing soap:address location to HTTPS"
                            
                                Pillow ImageDraw text coordinates to center
                            
                                What do I do with the fundamental matrix?
                            
                                Format variable in f-string with a variable number of decimal places
                            
                                Python3 mock replace function with another function
                            
                                Spyder 4 - deactivate automatic highlighting of last word after few seconds
                            
                                Missing userloader or request_loader error
                            
                                How to implement ajax request using Python Request
                            
                                Python FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) [duplicate]
                            
                                Error :cannot use a string pattern on a bytes-like object
                            
                                MemoryError: Unable to allocate MiB for an array with shape and data type, when using anymodel.fit() in sklearn
                            
                                VS Code not registering that ipykernel is installed
                            
                                Export Conda Environment with minimized requirements
                            
                                Binance Future api : APIError(code=-2015): Invalid API-key, IP, or permissions for action, request ip
                            
                                How to sample from DataFrame based on percentile of a column?
                            
                                Spyder on MacOS. Typing is very laggy
                            
                                Discord malware

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to print row(s) if they meet a certain range

Tags:

python

pandas

awk

Apex

People also ask

2 Answers

anubhava

Shubham Sharma

Recent Activity

Donate For Us