
Extract specific data from a text file

I have a text file that looks like this in Notepad++:

/a/apple 1
/b/bat 10
/c/cat 22
/d/dog 33
/h/human/female 34

Now I want to extract everything after the second slash and before the number at the end of each line. So the output I want is:

out = {'apple'; 'bat'; 'cat'; 'dog'; 'human/female'}

I wrote this code:

file = fopen('file.txt');
out = textscan(file, '%s', 'Delimiter', '\n');
fclose(file);

It gives:

out =
   {365×1 cell}

out{1} = 

    '/a/apple 1'
    '/b/bat 10'
    '/c/cat 22'
    '/d/dog 33'
    '/h/human/female 34'

How can I get the required output from the text file (directly, if possible)? If that is not possible, is there a regular expression that would do it?

Likeunknown asked Jan 29 '23 19:01


2 Answers

You can get the desired output directly from textscan, without any further processing needed:

file = fopen('file.txt');
out = textscan(file, '/%c/%s %d');
fclose(file);
out = out{2}

out =

  5×1 cell array

    'apple'
    'bat'
    'cat'
    'dog'
    'human/female'

Note that the two slashes in the format specifier are treated as literal text and dropped from the output. Any additional slashes are captured by the string field (%s). There is also no need to specify a delimiter, since the default delimiter is whitespace, so the trailing number is captured as a separate numeric value (%d). Because %c reads exactly one character, this format assumes the first path segment is always a single character, as it is in the sample data.
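To experiment with the format specifier without touching the file, textscan can also read from a character vector. The following is only a sketch: the variable names (txt, C, names, numbers) are made up for illustration, and the sample text is copied from the question.

```matlab
% Sample text built from three of the question's lines (no file needed).
txt = sprintf('/a/apple 1\n/b/bat 10\n/h/human/female 34');

% Same format as above: literal '/', one character, literal '/',
% a whitespace-delimited string, then an integer.
C = textscan(txt, '/%c/%s %d');

names   = C{2};   % cell array with the text after the second slash
numbers = C{3};   % the trailing numbers, returned as int32
```

The other fields are still available if you need them: C{1} holds the single characters and C{3} the numbers.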

gnovice answered Feb 01 '23 08:02


Another alternative is to apply a regular expression to the cell array of strings you already have, pulling out the part you need from each string based on a search pattern. MATLAB's regexp function does that:

% Your code
file = fopen('file.txt');
out = textscan(file, '%s', 'Delimiter', '\n');
fclose(file);

% Proposed changes
out = regexp(out{1}, '/\w*/(.+)\s', 'tokens', 'once');
out = [out{:}].';

Recall that textscan returns a one-element cell array, so you need to unpack it by accessing the first element before calling regexp. For each string in the cell array, the proposed code searches for the following sequence:

  1. / - First looks for a beginning forward slash

  2. \w*/ - Then looks for zero or more word characters (letters, digits, or underscore) followed by another slash. The benefit of this is that you are not restricted to just one character after the first slash. Use \w+ instead if at least one character must be present.

  3. (.+) - Specifies a group that collects all of the characters after the second slash, up to the whitespace matched in the next point. We match any character, rather than just alphanumerics, because more slashes may follow. Since .+ is greedy, the group extends to the last whitespace character on the line.

  4. \s - Look for a whitespace character

The pattern as a whole matches everything from the first slash up to and including the space, and the group keeps only the text between the second slash and that space. Note that the \s after the group (.+) is required. Without it, the group would return the rest of the line after the second slash, trailing number included.
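The role of the trailing \s, and the greedy behaviour of .+, can be seen on a line that contains more than one space. This is a small sketch only; the input line is hypothetical and does not come from the question's file.

```matlab
% Hypothetical line with a space inside the name (not from the question).
s = '/x/foo bar 7';

% Greedy .+ runs up to the LAST whitespace on the line.
greedy = regexp(s, '/\w*/(.+)\s', 'tokens', 'once');    % greedy{1} is 'foo bar'

% \S+ stops at the FIRST whitespace instead.
strict = regexp(s, '/\w*/(\S+)\s', 'tokens', 'once');   % strict{1} is 'foo'
```

For the data in the question both patterns give the same result, because each line contains exactly one space.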

The () in point 3 is important, because the 'tokens' option in regexp returns the text matched by groups rather than the whole match. Using 'once' returns only the first match in each string. Note that the output is a nested cell array, where each cell is itself a cell holding the group's match. We can unpack the cells by expanding them into a comma-separated list and concatenating them into a single cell array. We transpose so that we keep the column-shaped vector that you want.
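The unpacking step can be tried on its own with a few literal strings. This is a sketch under the assumption that every line matches the pattern; sampleLines, tok, namesA and namesB are illustrative names, and the vertcat variant is an alternative that is not part of the code above.

```matlab
% Sample lines copied from the question.
sampleLines = {'/a/apple 1'; '/b/bat 10'; '/h/human/female 34'};

% With 'tokens' and 'once', each element of tok is itself a 1x1 cell.
tok = regexp(sampleLines, '/\w*/(.+)\s', 'tokens', 'once');

% Horizontal concatenation of the comma-separated list, then transpose.
namesA = [tok{:}].';

% Vertical concatenation gives the column directly, without a transpose.
namesB = vertcat(tok{:});
```

Both forms produce the same 3x1 cell array of character vectors here.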

Doing this gives the following:

>> out

out =

  5×1 cell array

    'apple'
    'bat'
    'cat'
    'dog'
    'human/female'

However, I think you are more interested in the content than in the shape of the data, so you can drop the transpose if you like. The benefit of this approach is that there is no need for cellfun, since regexp loops over the cell array implicitly.

rayryeng answered Feb 01 '23 09:02