Recently I have been working on a project where we need to create a Solr index out of billions of records. I browsed around looking for good articles on Solr indexing with Hadoop MapReduce, and eventually found one by Dan (likethecolor.com), who had written a very helpful article on the subject. We discussed it at length, and my problem now seems to be almost solved.
Our team used Hadoop MapReduce to create an index out of those billions of records. With Hadoop MapReduce, we were able to distribute parts of those records across many machines and build a Solr index on each machine using mappers. After a map task completed successfully on a particular node, the index directory generated by that map task was uploaded to HDFS in the task's cleanup step. All the other tasks running on the other machines did the same.
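To make the map side easier to picture, here is a minimal toy sketch in plain Python. It is not the actual job code (which is not shown in this post) and it uses neither Hadoop nor Solr: local directories stand in for HDFS, a small JSON inverted index stands in for a Solr index directory, and the names `split_records`, `map_task`, and `index-<task_id>` are hypothetical, chosen only for illustration.

```python
import json
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path


def split_records(records, num_splits):
    """Deal records round-robin into splits (a stand-in for Hadoop input splits)."""
    splits = [[] for _ in range(num_splits)]
    for i, record in enumerate(records):
        splits[i % num_splits].append(record)
    return splits


def map_task(task_id, split, local_root, shared_root):
    """Build one index shard on local disk, then copy it to shared storage."""
    # Map phase: build a tiny inverted index (term -> list of document ids).
    index = defaultdict(list)
    for doc_id, text in split:
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)

    # The shard lives on the task's local disk while it is being built.
    local_dir = Path(local_root) / f"index-{task_id}"
    local_dir.mkdir(parents=True)
    (local_dir / "index.json").write_text(json.dumps(index))

    # Cleanup phase: upload the finished shard (assumed stand-in for HDFS).
    target = Path(shared_root) / local_dir.name
    shutil.copytree(local_dir, target)
    return target


if __name__ == "__main__":
    records = [(1, "solr index"), (2, "hadoop map reduce"), (3, "solr on hadoop")]
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as shared:
        for task_id, split in enumerate(split_records(records, 2)):
            print(map_task(task_id, split, local, shared).name)
```

Each call to `map_task` plays the role of one mapper: it only ever sees its own split, and the shared directory ends up with one shard directory per task.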
In the reduce part (we had set the number of reduce tasks to 1 beforehand, so that only one reduce task exists in the whole cluster), we copy all the index directories created by the map tasks to one node's local file system, and in the cleanup step we merge them into a single Solr index directory. Below is my architecture diagram for the whole process.
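As a rough companion to the reduce step described above, here is a toy sketch in plain Python. It is an illustration under stated assumptions, not the real job: a local directory stands in for HDFS, each shard is assumed to be a directory named `index-*` holding a JSON inverted index in `index.json`, and `reduce_merge` and `merged-index` are hypothetical names.

```python
import json
import shutil
import tempfile
from pathlib import Path


def reduce_merge(shared_root, local_root):
    """Single 'reducer': pull every shard to local disk, then merge them."""
    local_root = Path(local_root)

    # Reduce phase: copy each shard directory from shared storage to local disk.
    local_shards = []
    for shard in sorted(Path(shared_root).glob("index-*")):
        dest = local_root / shard.name
        shutil.copytree(shard, dest)
        local_shards.append(dest)

    # Cleanup phase: merge the posting lists of all shards into one index.
    merged = {}
    for shard in local_shards:
        postings = json.loads((shard / "index.json").read_text())
        for term, doc_ids in postings.items():
            merged.setdefault(term, []).extend(doc_ids)
    for doc_ids in merged.values():
        doc_ids.sort()

    # Write the single merged index directory.
    out_dir = local_root / "merged-index"
    out_dir.mkdir()
    (out_dir / "index.json").write_text(json.dumps(merged, sort_keys=True))
    return out_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as local:
        for name, postings in [("index-0", {"solr": [1, 3]}), ("index-1", {"solr": [2]})]:
            shard_dir = Path(shared) / name
            shard_dir.mkdir()
            (shard_dir / "index.json").write_text(json.dumps(postings))
        print(reduce_merge(shared, local))
```

In a real job the merge would operate on Lucene index directories rather than JSON files; one possible route, offered here as an assumption rather than what our job necessarily does, is Lucene's `IndexWriter.addIndexes`.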

Thank you very much!
The diagram is not displayed (broken link). Pls fix.
I have tested this blog on many computers, so it might be an issue on your side. Which browser are you using? Could you kindly give me some details?
Thanks
Hi Sunayan,
It would be great if you could share the code implementation for the above somewhere!
Thanks
Tim