
Merge partially matched strings from two files

I am struggling to combine partially matched strings from two files.

File 1 contains a list of unique strings. Each of these strings partially matches a number of strings in File 2. How do I merge the rows in File 1 with the rows in File 2 for every match?

File1

mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660

File2

mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC

Desired output

mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC

I have tried using pmatch() in R, but couldn't get it to work. It looks like something Perl would handle?

Maybe something like this:

perl -ne'exec q;perl;, "-ne", q $print (/\Q$.$1.q;/?"$. YES":$. .q\; NO\;);, "file2" if m;^(.*)_pat1;' file1
asked Jun 15 '14 11:06 by user3741035


2 Answers

This is a brief Perl solution. It stores the data from file1 in a hash, keyed by the text before the first underscore, and then looks each entry up as file2 is scanned.

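use strict;
use warnings;
use autodie;

my @files = qw/ file1.txt file2.txt /;

# Build a hash from file1: the text before the first underscore is the key,
# the text after it (e.g. the MIMAT accession) is the value.
my %file1 = do {
  open my $fh, '<', $files[0];
  map /([^_]+)_(\S+)/, <$fh>;
};

# For each file2 line, look up its prefix and print the rebuilt file1 entry
# left-justified in a 32-character column, followed by the file2 line.
open my $fh, '<', $files[1];
while (<$fh>) {
  my ($key) = /([^_]+)/;
  printf "%-32s%s", "${key}_$file1{$key}", $_;
}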

output

mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
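
The script above assumes that every line in file2 has a partner in file1. A line without one would print an incomplete left-hand column and trigger an "uninitialized value" warning. The following is a minimal sketch of one way to guard against that, not a definitive implementation. The helper name merge_by_prefix and the mmu-miR-999 line are made up for illustration, and the input lines are assumed to be already chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each file2 line with the file1 entry that shares
# the same prefix before the first underscore. Returns two array references:
# the matched [file1_line, file2_line] pairs, and the file2 lines left over.
sub merge_by_prefix {
    my ($file1_lines, $file2_lines) = @_;

    # Map prefix -> full file1 line; lines without an underscore are skipped.
    my %by_prefix;
    for my $line (@$file1_lines) {
        my ($key) = $line =~ /^([^_]+)_/ or next;
        $by_prefix{$key} = $line;
    }

    my (@merged, @unmatched);
    for my $line (@$file2_lines) {
        my ($key) = $line =~ /^([^_]+)_/;
        if (defined $key and exists $by_prefix{$key}) {
            push @merged, [ $by_prefix{$key}, $line ];
        }
        else {
            push @unmatched, $line;
        }
    }
    return (\@merged, \@unmatched);
}

my @file1 = ('mmu-miR-677-5p_MIMAT0017239', 'mmu-miR-181a-1-3p_MIMAT0000660');
my @file2 = (
    'mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA',
    'mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC',
    'mmu-miR-999_ACGT',    # made-up line with no partner in file1
);

my ($merged, $unmatched) = merge_by_prefix(\@file1, \@file2);
printf "%-32s%s\n", @$_ for @$merged;              # same column layout as above
warn "no match in file1: $_\n" for @$unmatched;    # reported on STDERR
```

Keeping the unmatched lines in a separate list makes it easy to check whether file1 is missing any identifiers instead of silently printing half-empty rows.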
answered Sep 17 '22 19:09 by Borodin


You can certainly do this in R. Calling pmatch() on the whole strings won't give you the desired result, though; you have to match on the appropriate substrings, here the part of each string before the underscore.

I assume that in file 1 the first identifier is 677 and not 667; otherwise it's hard to guess the matching scheme. (I also assume your example is only part of a bigger database.)

file1 <- readLines(textConnection('mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660'))

file2 <- readLines(textConnection('mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC'))

library(stringi)
file1_id <- stri_extract_first_regex(file1, "^.*?(?=_)")
file2_id <- stri_extract_first_regex(file2, "^.*?(?=_)")

cbind(file1=file1[match(file2_id, file1_id)], file2=file2)
##      file1                            file2                                     
## [1,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA"  
## [2,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT"
## [3,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT" 
## [4,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC" 
## [5,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC"
answered Sep 19 '22 19:09 by gagolews