i have algorithm that will go through a large data set read some text files and search for specific terms in those lines. I have it implemented in Java, but I didnt want to post code so that it doesnt look i am searching for someone to implement it for me, but it is true i really need a lot of help!!! This was not planned for my project, but data set turned out to be huge, so teacher told me I have to do it like this.
EDIT(i did not clarified i previos version)The data set I have is on a Hadoop cluster, and I should make its MapReduce implementation
I was reading about MapReduce and thaught that i first do the standard implementation and then it will be more/less easier to do it with mapreduce. But didnt happen, since algorithm is quite stupid and nothing special, and map reduce...i cant wrap my mind around it.
So here is shortly pseudo code of my algorithm
LIST termList (there is method that creates this list from lucene index)
FOLDER topFolder
INPUT topFolder
IF it is folder and not empty
list files (there are 30 sub folders inside)
FOR EACH sub folder
GET file "CheckedFile.txt"
analyze(CheckedFile)
ENDFOR
END IF
Method ANALYZE(CheckedFile)
read CheckedFile
WHILE CheckedFile has next line
GET line
FOR(loops through termList)
GET third word from line
IF third word = term from list
append whole line to string buffer
ENDIF
ENDFOR
END WHILE
OUTPUT string buffer to file
Also, as you can see, each time when "analyze" is called, new file has to be created, i understood that map reduce is difficult to write to many outputs???
I understand mapreduce intuition, and my example seems perfectly suited for mapreduce, but when it comes to do this, obviously I do not know enough and i am STUCK!
Please please help.
You can just use an empty reducer, and partition your job to run a single mapper per file. Each mapper will create its own output file in your output folder.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With