The effects of size on the function of an information retrieval document collection
A feature of research into Information Retrieval has been the continued use of small test collections in experiments. The assumption that any results will remain valid when the system is used to interrogate a large operational database is examined critically, particularly with regard to the difference in size of the collections involved and the reasons for this. Experiments investigating the MEDLARS database with reference to several sub-collections containing varying numbers of documents are described. These include analyses of single-term and two-term combination behaviour and actual retrieval searches. The effect on the clustering structure of different small sub-collections is also studied. The results obtained for MEDLARS are examined in the context of some well-known test collections, namely Cranfield 2 and INSPEC. Results for MEDLARS data indicate that very large collections (> 20,000 documents) may be necessary in order to ensure that the experimental data is indeed representative and may therefore be used to accurately predict the performance of a particular system in the operational environment.