Automatic classification and metadata generation for world-wide web resources
This project investigates the feasibility and potential of automatically classifying Web documents according to a traditional library classification scheme, and the extent to which automatic classification can support automatic metadata generation on the Web. The Wolverhampton Web Library (WWLib) is a search engine that classifies UK Web pages according to the Dewey Decimal Classification (DDC). It is introduced as an example application that would benefit from an automatic classification component such as the one described in the thesis. Different approaches to information resource discovery and resource description on the Web are reviewed, as are traditional Information Retrieval (IR) techniques relevant to resource discovery on the Web. The design, implementation and evaluation of an automatic classifier that classifies Web pages according to DDC are documented. The evaluation shows that automatic classification is possible and could be used to improve the performance of a search engine. The classifier is then extended to perform automatic metadata generation using the Resource Description Framework (RDF) and Dublin Core. A proposed RDF data model, a schema and automatically generated RDF syntax are documented, and automatically generated RDF metadata describing a range of automatically classified documents is shown. The research shows that automatic classification could potentially be used to enable context-sensitive browsing in automated Web search engines. The classifications could also be used to generate context-sensitive metadata tailored specifically to the search engine domain.