I have two mega files that look like below:
f1:
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
f2:
chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025
What I want to do is to print the rows of file 2 in a separate column in file 1 IF the range of start
and end
column of file2 is inside the ranges in file1 (columns 2 and 3) and chr
is the same. So based on the dummy example files I provided - the desired output should be (only the range of linc1320
is in the first row of the file1):
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
I am not a professional coder but I have been using this code to manually change the ranges based on the file2:
awk -F ',' '$2<=3073300,$3>=3074300, {print $1,$2,$3,$4,$5,$6,$7}' f1.csv
I do not have a particular preference for using a specific programming language - both Python
and awk
would be very helpful. Any help is appreciated thank you.
You can use the df. loc[[2]] to print a specific row of a pandas dataframe.
On the Ribbon, click the Page Layout tab. In the Sheet Options group, under Headings, select the Print check box. , and then under Print, select the Row and column headings check box .
The quickest way is to copy the cells, paste into Writer and click Format > Page Style > Columns and select 2, 3 or 4 columns as needed or will fit.
This first option is the quickest method to print a selected range of cells. Select and highlight the range of cells you want to print. Next, click File > Print or press Ctrl+P to view the print settings. Click the list arrow for the print area settings and then select the “Print Selection” option.
1. Select the range that you will copy rows based on column criteria, and click the Enterprise > Split Data. 2. In the opening Split Data into Multiple Worksheets dialog box, you need to: (1) Check the Specific column option, and click the below drop-down list and specify the column that you will copy rows based on;
Extract rows that meet criteria with Kutools for Excel 1. Select the column you extract rows based on, and click Kutools > Select > Select Specific Cells. See screenshot: 2. Then in the Select Specific Cells dialog, check Entire row option in Selection type, and specify your criteria in the... 3. ...
Extract all rows from a range that meet criteria in one column [Array formula] The array formula in cell B20 extracts records where column E equals either "South" or "East". To enter an array formula, type the formula in a cell then press and hold CTRL + SHIFT simultaneously, now press Enter once.
You may use this awk
:
awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
A more readable form:
awk '
BEGIN { FS = OFS = "," }
FNR == NR {
if (FNR > 1) {
chr[++n] = $1
id[n] = $2
r1[n] = $3
r2[n] = $4
}
next
}
{
for (i=1; i<=n; ++i)
if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
break
}
} 1' file2 file1
Let us try approaching the problem pandas
way, first read the csv
files into a pandas
dataframe
f1 = pd.read_csv('file1.csv', header=None)
f2 = pd.read_csv('file2.csv')
>>> f1
0 1 2 3 4 5 6
0 chr1 3073253 3074322 gene_id ENSMUSG00000102693.1 gene_type TEC
1 chr1 3074253 3075322 gene_id ENSMUSG00000102693.1 transcript_id ENSMUST00000193812.1
2 chr1 3077253 3078322 gene_id ENSMUSG00000102693.1 transcript_id ENSMUST00000193812.1
3 chr1 3102916 3103025 gene_id ENSMUSG00000064842.1 gene_type snRNA
4 chr1 3105016 3106025 gene_id ENSMUSG00000064842.1 transcript_id ENSMUST00000082908.1
>>> f2
chr name start end
0 chr1 linc1320 3073300 3074300
1 chr3 linc2245 3077270 3078250
2 chr1 linc8956 4410501 4406025
Now we can merge
and filter
the rows which satisfy the given condition of interval inclusion then we can join
the filtered rows with the file f1
m = f1.reset_index()\
.merge(f2, left_on=0, right_on='chr')\
.where(lambda x: x[1].le(x['start']) & x[2].ge(x['end']))\
.set_index('index')[['name', 'start', 'end']]
f3 = f1.join(m)
>>> f3
0 1 2 3 4 5 6 name start end
0 chr1 3073253 3074322 gene_id ENSMUSG00000102693.1 gene_type TEC linc1320 3073300.0 3074300.0
1 chr1 3074253 3075322 gene_id ENSMUSG00000102693.1 transcript_id ENSMUST00000193812.1 NaN NaN NaN
2 chr1 3077253 3078322 gene_id ENSMUSG00000102693.1 transcript_id ENSMUST00000193812.1 NaN NaN NaN
3 chr1 3102916 3103025 gene_id ENSMUSG00000064842.1 gene_type snRNA NaN NaN NaN
4 chr1 3105016 3106025 gene_id ENSMUSG00000064842.1 transcript_id ENSMUST00000082908.1 NaN NaN NaN
PS: You can also save the resulting dataframe f3
to a csv file using f3.to_csv('file3.csv')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With