I am struggling to combine partially matched strings from two files.
File 1 contains a list of unique strings. Each of these strings partially matches a number of strings in File 2. How do I merge the rows in File 1 with the rows in File 2 for every matched case?
File1
mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660
File2
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
Desired output
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
I have tried using pmatch() in R, but can't get it right. It looks like something Perl would handle?
Maybe something like this:
perl -ne'exec q;perl;, "-ne", q $print (/\Q$.$1.q;/?"$. YES":$. .q\; NO\;);, "file2" if m;^(.*)_pat1;' file1
This is a brief Perl solution, which saves all the data from file1 in a hash and then retrieves it as file2 is scanned.
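use strict;
use warnings;
use autodie;

my @files = qw/ file1.txt file2.txt /;

# Build a hash from file1: the text before the first underscore is the key,
# the rest of the line (up to the next whitespace) is the value
my %file1 = do {
  open my $fh, '<', $files[0];
  map /([^_]+)_(\S+)/, <$fh>;
};

# Scan file2 and print each line prefixed with the matching file1 entry
open my $fh, '<', $files[1];
while (<$fh>) {
  my ($key) = /([^_]+)/;
  printf "%-32s%s", "${key}_$file1{$key}", $_;
}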
output
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
Of course you may do it in R. Indeed, applying pmatch to whole strings won't give you the desired result; you've got to match the appropriate substrings.
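To illustrate why pmatch falls short here, the following is a minimal sketch in base R. The variable names (whole, prefix, table2) are made up for this illustration, and the data is a cut-down copy of the example in the question. It assumes R 3.3.0 or later for startsWith().

```r
# Hypothetical miniature of the two files (names are illustrative only)
whole  <- "mmu-miR-677-5p_MIMAT0017239"
prefix <- "mmu-miR-677-5p_"
table2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
            "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT",
            "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC")

# The whole string is not a prefix of any element, so nothing matches
pmatch(whole, table2)    # NA

# The shared prefix matches two elements; pmatch treats that as ambiguous
pmatch(prefix, table2)   # NA

# startsWith() flags every element that begins with the prefix
hits <- startsWith(table2, prefix)
table2[hits]
```

So pmatch only helps when the partial string identifies exactly one element, which is not the case when one identifier maps to several sequences.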
I assume that in file 1 the first identifier is 677 and not 667, otherwise it's hard to guess the matching scheme (I assume your example is only a part of a bigger database).
file1 <- readLines(textConnection('mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660'))
file2 <- readLines(textConnection('mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC'))
library(stringi)
file1_id <- stri_extract_first_regex(file1, "^.*?(?=_)")
file2_id <- stri_extract_first_regex(file2, "^.*?(?=_)")
cbind(file1=file1[match(file2_id, file1_id)], file2=file2)
## file1 file2
## [1,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA"
## [2,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT"
## [3,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT"
## [4,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC"
## [5,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC"
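If installing stringi is not an option, the same idea can be sketched in base R. This is only an alternative under a few assumptions: the key is everything before the first underscore, each key occurs once in file 1, and rows of file 2 with no partner in file 1 should be dropped. The helper key_of and the names idx, keep and result are hypothetical.

```r
file1 <- c("mmu-miR-677-5p_MIMAT0017239",
           "mmu-miR-181a-1-3p_MIMAT0000660")
file2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
           "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC")

# Hypothetical helper: strip everything from the first underscore onwards
key_of <- function(x) sub("_.*$", "", x)

# For each row of file2, find the position of its key in file1 (NA if absent)
idx  <- match(key_of(file2), key_of(file1))
keep <- !is.na(idx)

# Pair each matched file2 row with its file1 row, separated by a space
result <- paste(file1[idx[keep]], file2[keep])
writeLines(result)
```

With real files, the two vectors would come from readLines("file1.txt") and readLines("file2.txt") instead of being typed in, assuming those are the file names.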