TR-IT-0155

TR-IT-0155: March 5, 1996

Detlef Koll, Eiichiro Sumita, Hitoshi Iida

Massively Parallel Document Retrieval in Clustered Databases

Abstract: Our goal has been to develop an effective and efficient document retrieval system for very large databases, based on the vector space model. To this end, we (1) implemented a massively parallel retrieval kernel on a SIMD machine and (2) devised a fast non-hierarchical clustering method that restricts the kernel's search scope without hurting retrieval effectiveness. This paper discusses the score-computing algorithm and load-balancing method of the parallel kernel, the document clustering method, and how these two parts combine into a large-scale retrieval system. Evidence for the efficiency and effectiveness of this approach is given for standard test suites: (i) the Virginia collections; (ii) the Tipster collection, with some gigabytes of text.