An index or topic hierarchy of full-text documents can organize a domain and
speed information retrieval. The automatic generation of such an index is an ideal
application for unsupervised learning, where the learner creates an integrated summary
of a domain. Traditional indexes, such as the Library of Congress system or the
Dewey Decimal system, are generated by hand, updated infrequently, and applied
inconsistently. With machine learning, they can be generated automatically,
updated as new documents arrive, and applied consistently. Despite the appeal of
automatic indexing, organizing natural-language documents requires a difficult balance
between what we want to do and what we can do. For optimal performance, the
machine learner must know or acquire all that a human library patron knows about
natural language. Such knowledge will remain beyond the capabilities of machine learning for
many years to come. For the foreseeable future, we will have to apply approximate
solutions to the problem and do whatever data engineering is necessary to yield
good performance. This paper describes an application of clustering to full-text
databases, presents a new clustering method, and discusses the data engineering
necessary to use clustering for this application. In particular, the paper addresses
engineering the feature set to permit learning and otherwise engineering the data to
match the assumptions underlying the learning algorithm.