What do the parameters of the CsvIterator mean in Mallet?

I am using the Mallet topic modelling sample code, and though it runs fine, I would like to know what the parameters of this statement actually mean.

instances.addThruPipe(new CsvIterator(new FileReader(dataFile),
                                      "(\\w+)\\s+(\\w+)\\s+(.*)",
                                      3, 2, 1)  // (data, target, name) field indices                    
                     );
London guy asked Jan 13 '15 17:01

1 Answer

From the documentation:

This iterator, perhaps more properly called a Line Pattern Iterator, reads through a file and returns one instance per line, based on a regular expression.

If you have data of the form

[name] [label] [data]

The call you are interested in is

CsvIterator(java.io.Reader input, java.lang.String lineRegex, 
            int dataGroup, int targetGroup, int uriGroup) 

The first parameter is the source the data is read from, such as a FileReader or a StringReader. The second parameter is the regex used to extract the fields from each line read from that reader. In your example it is (\\w+)\\s+(\\w+)\\s+(.*) (the backslashes are doubled because it is written as a Java string literal), which translates to:

  • 1 or more word characters (letters, digits or underscore; capture group 1, the name of the instance), followed by
  • 1 or more whitespace characters (tab, space, ..), followed by
  • 1 or more word characters (capture group 2, the label/target), followed by
  • 1 or more whitespace characters (tab, space, ..), followed by
  • 0 or more characters of any kind (capture group 3, the data)
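
As a rough illustration of the breakdown above, the following sketch applies the same pattern with plain java.util.regex and picks the three groups by index. It does not use Mallet at all: the class name LinePatternDemo and the split helper are made up for this example, and matching the whole line with matches() is an assumption of the sketch, not a statement about how CsvIterator applies the pattern internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinePatternDemo {
    // Same pattern as in the question; backslashes are doubled for the Java string literal.
    static final Pattern LINE = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(.*)");

    /**
     * Hypothetical helper: returns {data, target, name} for one line,
     * or null if the line does not fit the pattern.
     */
    static String[] split(String line, int dataGroup, int targetGroup, int uriGroup) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(dataGroup), m.group(targetGroup), m.group(uriGroup) };
    }

    public static void main(String[] args) {
        // Group indices in the same order as the CsvIterator call: data, target, name.
        String[] fields = split("test1 spam Wanna buy viagra?", 3, 2, 1);
        System.out.println("data   = " + fields[0]);
        System.out.println("target = " + fields[1]);
        System.out.println("name   = " + fields[2]);
    }
}
```

Passing 3, 2, 1 here mirrors the argument order of the constructor, so the first array element is the text captured by group 3, the second by group 2, and the third by group 1.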

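The numbers 3, 2, 1 are capture-group indices, given in the order the constructor expects them (dataGroup, targetGroup, uriGroup): the data is group 3, the target is group 2, and the name is group 1. The regex basically ensures the format of each line is as stated in the documentation. One caveat: \w does not match a hyphen, so a label like not-spam in the second line below is not captured whole by (\\w+); the label group would need something like (\\S+) instead. Example lines: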

test1 spam Wanna buy viagra?
test2 not-spam Hello, are you busy on Sunday?
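
To check sample lines like these against a pattern before handing a file to Mallet, a small stand-alone test can help. The sketch below uses only java.util.regex; the class name SampleLineCheck, the fits helper and the second pattern with (\\S+) for the label are assumptions made for illustration, and the whole-line match is a choice of this sketch, not necessarily what CsvIterator does.

```java
import java.util.regex.Pattern;

final class SampleLineCheck {
    // Pattern from the question: the label group only accepts word characters.
    static final Pattern WORD_LABEL = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(.*)");
    // Hypothetical alternative: the label group accepts any run of non-whitespace.
    static final Pattern ANY_LABEL = Pattern.compile("(\\w+)\\s+(\\S+)\\s+(.*)");

    /** True if the whole line fits the given pattern. */
    static boolean fits(Pattern pattern, String line) {
        return pattern.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] lines = {
            "test1 spam Wanna buy viagra?",
            "test2 not-spam Hello, are you busy on Sunday?"
        };
        for (String line : lines) {
            System.out.println("word-label=" + fits(WORD_LABEL, line)
                    + " any-label=" + fits(ANY_LABEL, line)
                    + " : " + line);
        }
    }
}
```

The first line fits both patterns. The second fits only the (\\S+) variant, because the hyphen in not-spam is not a word character.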

CsvIterator is a terrible name, because the class does not actually parse comma-separated values as such. It splits each line using the regex you supply, and in this example that regex separates the fields by whitespace (space, tab, ...).
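
Since the field separator lives entirely in the regex, a comma-based layout would simply need a different pattern. The following is a minimal sketch of such a pattern, again tested with plain java.util.regex and without Mallet; the pattern, the class name CommaLineDemo and the name,label,data layout are assumptions for illustration only, and quoted fields or escaped commas are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommaLineDemo {
    // Hypothetical pattern for lines of the form name,label,data.
    // The first two fields may not contain commas; the data field may.
    static final Pattern COMMA_LINE = Pattern.compile("([^,]+),([^,]+),(.*)");

    /** Returns {name, label, data}, or null if the line does not fit the pattern. */
    static String[] split(String line) {
        Matcher m = COMMA_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = split("test2,not-spam,Hello, are you busy on Sunday?");
        System.out.println("name  = " + fields[0]);
        System.out.println("label = " + fields[1]);
        System.out.println("data  = " + fields[2]);
    }
}
```

With this layout the group numbers stay the same as in the whitespace version (name 1, label 2, data 3), so the index arguments 3, 2, 1 would not need to change.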

mbatchkarov answered Nov 10 '22 16:11
