Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get unique lines based ONLY on 2 Columns

I have some large (50k line) files formatted like this

chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    35720   36081   NR_026820_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028325_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028327_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028325_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028327_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143

The 2nd and 3rd columns are a start and end position. I want to delete all rows that have the same start AND end positions (rest of line doesn't matter) and keep only the first time it appears. For example, I'd want to keep lines 14 and 15 in the sample data because even though the start position is the same, the end is not. 15 and 16 have identical start and end, so I'd want to delete 16.I'm trying to do it in Python, but I'm really at a loss for how to handle the two column uniqueness requirement.

Any ideas on the best/easiest way to do this?

like image 340
Ethan Avatar asked Jun 09 '26 20:06

Ethan


1 Answers

You can use pandas to load your file and then delete rows based on the 2 columns as you want like so (a simple example):

import pandas as pd

raw_data = {'firstcolumn': ['chr1', 'chr1', 'chr1'],
            'start_position': [35276, 35720, 35720],
            'end_position': [35481, 36081, 36081],
            'A': [4, 4, 31],
            'B': [25, 25, 57]}
df = pd.DataFrame(raw_data, columns = ['firstcolumn', 'start_position','end_position', 'A', 'B'])

df.drop_duplicates(['start_position','end_position']) #drop duplicate rows based on these 2 columns
like image 196
Mir Ilias Avatar answered Jun 11 '26 10:06

Mir Ilias