Converting CSV to SequenceFile

Question

I have a CSV file that I would like to convert to a SequenceFile, which I will ultimately use to create NamedVectors for a clustering job. I've been using the seqdirectory command to create a SequenceFile and then feeding that output into seq2sparse with the -nv option to create NamedVectors. This seems to produce one big vector as output, but I want each line of my CSV to become its own NamedVector. Where am I going wrong?

Julian Ortega · Accepted Answer

The seqdirectory command treats every file as a single document, so in reality you have only one document, and hence you get only one vector. To make it work, you would have to turn each line of your CSV file into a separate file, where the key of the document is the file name and the value is its content. However, this is quite impractical if your corpus is large, as disk reading and writing can become painfully slow.
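
For a small CSV, the one-file-per-line approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name split_csv_into_documents, the row_NNNNNNNN.txt naming scheme, and joining the fields with spaces are all choices made for this example, not anything Mahout requires.

```python
import csv
from pathlib import Path


def split_csv_into_documents(csv_path, out_dir, has_header=False):
    """Write each non-empty CSV row to its own text file.

    Returns the number of files written. The file name (hypothetical
    scheme: row_00000000.txt, row_00000001.txt, ...) is what would end
    up as the document key once seqdirectory reads the directory.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # skip the header row, if any
        for row in reader:
            if not row:
                continue  # blank lines produce empty rows; skip them
            doc_path = out / f"row_{count:08d}.txt"
            # Assumption: fields are joined with spaces so each row
            # reads as a short text document.
            doc_path.write_text(" ".join(row), encoding="utf-8")
            count += 1
    return count


if __name__ == "__main__":
    import sys

    written = split_csv_into_documents(sys.argv[1], sys.argv[2])
    print(f"wrote {written} files")
```

Pointing seqdirectory at the output directory should then give one document per CSV line, which seq2sparse with -nv can turn into one NamedVector each. As noted, this creates one small file per row, so it is only reasonable for modest inputs.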

In practice you are better off following the links I share in this comment

Tags: hadoop, mahout, sequencefile

Alison

1 Answers
