Supporting the Construction and Evolution of Component Repositories

(Research Paper)

Scott Henninger
Department of Computer Science & Engineering
115 Ferguson Hall, CC 0115
University of Nebraska-Lincoln
Lincoln, NE 68588-0115
(402) 472-8394 FAX: (402) 472-7767
scotth@cse.unl.edu

Abstract


Repositories must be designed to meet the evolving and dynamic needs of software development organizations. Current software repository methods rely heavily on classification, which exacerbates acquisition and evolution problems by requiring costly classification and domain analysis efforts before a repository can be used effectively. This paper outlines an approach in which minimal initial structure is used to effectively find relevant software components while methods are employed to incrementally improve repository structures. The approach is demonstrated through PEEL, a tool to semi-automatically identify reusable components, and CodeFinder, a retrieval system that compensates for the lack of explicit knowledge structures through spreading activation retrieval and allows component representations to be incrementally improved while users are searching for information. The combination of these techniques yields a flexible software repository that minimizes up-front costs and improves its retrieval effectiveness as developers use it to find reusable software artifacts.

Table of Contents

1. Introduction
2. Repository Structure and Retrieval Effectiveness
3. Evolutionary Design of Repositories
3.1 The CodeFinder-PEEL Repository
4. Seeding the Repository
5. Compensating for Incomplete and Inconsistent Indexing
5.1 Retrieval with Spreading Activation
5.2 Defining Relationships Through Content Induced Structure
5.3 Retrieval by Reformulation
6. Supporting Incremental Refinement of Component Repositories
6.1 Adapting the Repository
6.2 Adjusting Link Weights
7. Evaluation and Future Directions
8. Conclusions

References

1. Introduction

As libraries of reusable software components continue to grow, the issue of retrieving components from software component repositories has captured the attention of the software reuse community [5, 8, 12, 13, 21, 25]. Methods using these repositories face an inherent dilemma: in order for the approach to be useful, the repository must contain enough components to support developers, but when many examples are available, finding and choosing an appropriate one becomes troublesome. Retrieval techniques as diverse as enumerated classification [1], facets [25], frame-based classification [23], free-text indexing [13], and relational databases [5], have been employed to address the problem of finding relevant components. But issues involving how effective repositories are built, populated, and evolved to meet the changing needs of developers have received considerably less attention.

Most retrieval algorithms require a value-added approach to building software repositories that require pre-defined structure before the repository can be effectively searched. At a minimum, components must be categorized before they are placed in the repository. This creates a barrier of real and intellectual capital investment that many organizations cannot overcome [4]. Methods are needed that can provide adequate retrieval effectiveness with minimal indexing and structuring efforts, allowing organizations to take advantage of valuable assets that have accumulated from previous development efforts without a large up-front investment.

In addition to being costly, the structures required for retrieval are often static and unable to adapt to dynamic development contexts. Once the structures are created, they are set in stone, changeable only through complex procedures and often involving re-designing the entire repository. This is a significant problem not only because it has proven exceedingly difficult to get the structure right the first time, but also because the domain is constantly evolving under the pressures of changing technology, project dynamics, and the fluid nature of customer needs.

This paper presents a set of tools designed to support an incremental refinement of component repositories. Initial component representations are created with minimal up-front structuring efforts. Retrieval techniques are then employed to compensate for minimal retrieval structure and help developers find reusable components. As developers use the repository, adaptive indexing techniques are employed to enhance retrieval structures and evolve the repository toward a representation of the ways its users reason about development problems. The following sections begin with a brief overview of current research in retrieval methods for software components. Then the evolutionary approach to constructing repositories is introduced followed by sections describing the repository seeding, retrieval, and adaptive indexing capabilities of CodeFinder, an experimental prototype designed to investigate issues in retrieving software components. A discussion of an empirical study of CodeFinder concludes the paper.

2. Repository Structure and Retrieval Effectiveness

The structure of a repository is generally regarded as key to obtaining good retrieval results. No matter how "intelligent" the matching algorithm, if components are indexed or otherwise structured poorly, it will be difficult to achieve good retrieval performance. The intuitive and widely held assumption is that up-front investments in structuring a repository result in a proportional increase in the ease with which components can be reused [2; p. 16]. While it may be true that good structure is crucial in a retrieval system's effectiveness, sophisticated information retrieval techniques have been created that finesse the relationship between repository structure and retrieval methods.

Retrieval methods used for software repositories can be divided into three categories: enumerated classification, faceted, and free-text indexing [12]. Hypertext systems have also been used, although some form of retrieval is necessary to make hypertext effective in large repositories [16, 29]. Enumerated classification is a well-known retrieval method used by the Dewy Decimal system and the Computing Reviews Classification System [1]. The appeal of a classification scheme is the ability to iteratively divide an information space into smaller pieces that reduces the amount of information that needs to be perused. Faceted classification avoids enumerating component definitions in a hierarchy by defining attribute classes that can be instantiated with different terms [25]. Users search for components by specifying terms for each of the facets. Free-text indexing (AKA automatic indexing) methods use the text from a document for indexing [21]. Document text is applied to a "stop list" to remove frequently occurring words such as 'and' and 'the'. The remaining text is used as an index to the document. Users specify a query using key words that are applied to the indices to find matching documents.

These methods define a continuum from enumerated classification that requires extensive structure, to faceted classification requiring less structure, to free-text indexing that requires no structure. Classification schemes are computationally simple to retrieve from (although usability problems exist), but difficult to build. Designers must get the taxonomy "right", and the definition of "right" is elusive, situation-specific, and dependent on the needs of individuals [17]. Faceted classification and attribute-value methods suffer from a reduced form of the same problem-one must get the attributes right, then it becomes a simple matter of choosing terms within the facets. Free-text indexing involves little up-front costs, but is most effective when a large corpus of linguistic text is available.

The strength of methods that use sophisticated information structures is that people can use the knowledge contained in the structures to lead them to relevant information. The weakness is that no support is possible if the information is not structured properly. Creating specialized information is a labor intensive and costly process that is prohibitory where resources are strained or where large information spaces are concerned. In what follows, a middle ground is achieved by utilizing retrieval methods that need very little structure, yet yield effective retrieval performance. The structures are then augmented in the context of use. As people find information, their interaction with the system is traced and used to improve the existing structure. The result is a flexible integration of retrieval and repository construction methods that improve the underlying structures as the repository is used.



Figure 1:
Evolutionary Construction of Repositories.

3. Evolutionary Design of Repositories

Figure 1 shows how repositories can be constructed with minimal up-front effort and then incrementally improved as the repository is used. In the initial stages of constructing a repository of software components it is important to populate the repository with components [6]. A repository is initially "seeded" [10] with a low-cost repository construction method that semi-automatically indexes components with terms and phrases. The system described in this paper, called PEEL (Parse and Extract Emacs Lisp), extracts components from text files and indexes them through a combination of automatic extraction and interactive user support [20].

The resulting repository, having been created with minimal retrieval structures, may suffer from incomplete and inconsistent indexing, making it difficult for keyword matching algorithms to retrieve relevant information. A combination of soft-matching and query reformulation techniques is employed to compensate for the minimal effort in designing the repository, making it easier to find information. The CodeFinder system is a retrieval tool that uses innovative retrieval techniques to help users find reusable software [18].

Even with innovative and intelligent retrieval tools, a repository's effectiveness is directly proportional to the quality of indexing and structures for retrieval. Relevance feedback [27] and adaptive indexing [14] are two methods that can be used to improve repository structures to reflect the mental models used by developers when searching for components. Since the developers on a team or in an organization will have similar concerns and use similar vocabulary, the evolving structure will over time become more effective, while naturally adapting to changes in an organization's development environment.

The overall advantage of this approach is that costs are incurred incrementally on an as-needed basis instead of requiring a repository design effort before the repository becomes useful. We initially rely on whatever certification process a component went through to become part of a production system. Then, as the component becomes widely re-used, additional certification procedures can be applied when necessary. This allows a process of natural selection to take place, and avoids the pure overhead cost of pre-certifying components that may never be re-used.

3.1 The CodeFinder-PEEL Repository

CodeFinder and PEEL were designed to investigate the cost/benefit tradeoffs associated with building and using a repository to support the design process. These issues have been investigated in the domain of Emacs Lisp customizations for the GNU Emacs text editor. Components are Emacs Lisp functions, variables, and constants that define e-mail and network news reader applications that can be executed from the Emacs text editor environment. Emacs Lisp is a variant of Lisp that has specialized constructs and primitives specially suited for editing and displaying text on windows. The repository was built from files downloaded from FTP sites and news groups using PEEL to extract components and translate them into a CodeFinder representation [20]. Source code was extracted from various newsgroups devoted to Emacs and Emacs LISP issues, such as comp.emacs, and gnu.emacs.gnus. A modest effort with PEEL has resulted in the creation of a repository consisting of more than 1800 Emacs LISP functions, variables, macros, and constants, and about 2900 distinct terms.

4. Seeding the Repository

Seeding a repository is a matter of creating component representations by indexing them with key terms and phrases. Components can take on any size or form, depending on the needs of repository users. PEEL is a re-engineering tool that translates Emacs Lisp files into individual, reusable, components in a frame-based knowledge representation language named Kandor [8, 24] that is used by CodeFinder to index components and create a frame-based hierarchy of retrieval concepts [18]. PEEL extracts source code definitions of functions, variables, constants, and macros from a source code file.

Information is extracted from each of the components and translated into Kandor objects. Kandor enables the expression of constraints on members of a defined frame. Frame definitions are organized hierarchically and defined through restrictions on super-frames. Part of the hierarchy for the Emacs Lisp repository is shown in Figure 3. CodeFinder [18], as well as LaSSIE [8], Argon [24] and Helgon [11], have used the inferencing capabilities of inheritance and classification to retrieve information with Kandor representations. Kandor's inferencing capabilities are described in detail elsewhere [24]. For our purposes, Kandor representations can be viewed as a set of attribute/value slots that contain information about a given component. PEEL is used to capture that information and place it in Kandor slots.



Figure 2: Highlighting and Other Indexing Operations.

A knowledge representation system is only as good as the knowledge it contains. Automatic extraction is certainly cost-effective, but there are some important slots that demand human attention. PEEL therefore allows the user to augment the TERMS, AUTHORS, and DESCRIPTION slots. As shown in Figure 2, for each component, PEEL will display the source code (window on left-hand side of Figure 2) and allow the user to edit part of the representation (window on the right-hand side).

Given the use of non-words and esoteric abbreviations that are so common in source code, terms are especially important for retrieving software components. For example, one of the mail systems parsed by PEEL uses the abbreviation 'mp' for a "message pointer" throughout the system's source code and documentation. PEEL uses a three-step procedure to process terms similar to the process of extracting components in Basili's software factory [6]. The first step is fully automated. Terms are extracted from the name of the function and any strings or comments within or immediately preceding the definition. A common stop list is applied to the extracted terms to remove common words with little semantic meaning such as 'and', 'of', 'the' and others. PEEL users can also augment this stop list or add terms to a nonsense word list.

The second step is to display the terms to the user. Users can remove the terms, truncate the list (a feature that proved to be useful for many components in which only the name was descriptive), and highlight the term in the source code window to show how and where the term has been used (see the pop-up window in the lower part of Figure 2). The third step allows users to add their own terms, which can include phrases (words separated by spaces) and any other punctuation.

PEEL uses a rudimentary algorithm for identifying components that simply parses the source file and finds objects with 'defun', 'defvar', etc. as the first element in a list. PEEL users then have the option of including the identified component in the repository. More sophisticated component identification methods have been designed, although some human intervention is also needed in these approaches [9].


5. Compensating for Incomplete and Inconsistent Indexing

Users trying to find relevant objects using a retrieval system are faced with two interrelated problems. The first is the nature of design problems, in which people begin the process with an ill-defined notion of what is needed and how they should go about solving the problem [18]. The second problem is that document indexing is often inconsistent and incomplete. Studies have shown that human indexers will disagree on terms used to index documents and the same indexer will use different terms to index the same document at different times [28]. Other studies have shown that people use a surprisingly diverse set of descriptors to describe the same thing [15].

Therefore, retrieval systems that are built on the assumption that the quality of the indexing is the paramount factor in the efficacy of an information retrieval service will fall short of supporting its users. The search process is more than finding an exact match [4]. It includes finding components that provide a partial or analogous solution to the problem. Inconsistent indexing will always be a problem, making it necessary to use retrieval methods and algorithms that go beyond simplistic keyword matching schemes. The downfall of the traditional matching strategy is that queries are viewed as precise specifications of user needs and document representations are precise descriptions of repository objects. An alternative strategy is to assume a degree of uncertainty in document and query representations [22].


Figure 3: CodeFinder User Interface

In the CodeFinder user interface, the Category Hierarchy window displays a graphical hierarchy of a repository. The Query pane shows the current query. The top part of the query specifies two categories ('thing' and 'E-mail') to search within. The bottom part specifies key terms and related components. The Retrieved Items pane shows all components matching the current query, by order of relevance to the query. The Example of the Retrieved Items pane shows the full entry for the top matching component or a component chosen by the user. The Bookmarks pane holds a history of viewed objects. The Related Terms pane shows terms retrieved by the query.
CodeFinder (see Figure 3) has been designed to address these issues through the employment of intelligent retrieval methods and support for query construction. "Intelligent" retrieval is provided by an associative spreading activation retrieval algorithm [22] that transcends matching algorithms to retrieve items that are associated with a query. Query construction is supported through retrieval by reformulation [32], a technique that allows users to incrementally construct queries as they explore the information space. This combination of techniques yields a flexible retrieval mechanism that achieves good performance in the face of indexing problems and ill-defined information needs. It is designed to work effectively with the minimally structured repository created by PEEL.

5.1 Retrieval with Spreading Activation

CodeFinder uses the KANDOR representation shown in Figure 2 to construct an associative network with the TERMS, PARAMATERS and DESCRIPTION fields (see Figure 4). An associative spreading activation process, which is based on a connectionist relaxation procedure for retrieval [22], is then applied to the network to retrieve information. This algorithm is specified elsewhere [20], and works in the following manner: Users specify a query consisting of either or both term and component nodes. The query nodes are given an activation value of 1.0, which remains constant during the spreading activation process. If a node has a positive activation value, the value is passed through each of its links, otherwise no activation is passed. Each node computes the sum of incoming activation values, modified by link weight. The sum of received activation values is then modulated by fan-in, decay and other parameters. The resulting value is fed into a squashing function to keep activation values between boundaries of -.2 and 1 (the asymmetrical values are needed to encourage the flow of positive activation).



Figure 4: An Associative Network
An associative network in CodeFinder consists of two layers of nodes: terms (represented by circles) and components (rectangles). Boxed lines are connections between components and terms, as defined by the Terms field of the component. Link weights are shown in link boxes.

For example, a query specifying the term remove in Figure 4 will spread a value of .499 to the component nodes 'vm-quit' and 'vm-delete-message'. On the next cycle, 'vm-quit' and 'vm-delete-message' will propagate their computed activation values to all nodes they are connected to. For instance, 'vm-delete-message' will activate message and delete. On the third cycle, all of the previously activated nodes will continue to propagate their values. Note that message and delete will work together to activate 'mail-delete-forward' and 'gnus-kill', retrieving these components even though they are not directly indexed by terms used in the query. Keywords are dynamically related through the items they index, compensating for inconsistent indexing through co-occurrence relationships.

5.2 Defining Relationships Through Content Induced Structure

The power of the combination of PEEL and CodeFinder is that indexing does not need to be exhaustive for CodeFinder to effectively locate software objects. A minimal indexing with PEEL combined with the spreading activation and browsing tools of CodeFinder adequately supports the process of locating relevant source code. The advantage of the spreading activation process used by CodeFinder is that local decisions on indexing a component can lead to effective retrieval performance. This is in contrast with classification approaches that require more global knowledge of the repository to effectively classify and retrieve components.


Figure 5: Capturing Word Relationships Through Content Induced Structure.

It is the structure of the repository, and not pre-defined term or component relationships, that spreading activation uses to retrieve information and induce relationships between components. A particularly salient way to demonstrate this effect is to show the related keywords that are retrieved by single-word queries. Figure 5 shows that the query delete will retrieve a number of synonyms. The semantic relationship between delete and the components retrieved has not been programmed into the system. Given only the structure of the repository in which the terms delete, remove, and kill co-occur between components, spreading activation is able to induce the semantic relationship between these words. This is referred to as "content induced structure" [16]. Figure 5c shows that when the term list is specified, terms related to the underlying Lisp language such as sequence, cl, set, alist, member, and others are found. This accurately reflects the contents of the repository, which are all written in Emacs Lisp. Other repositories using more conventional definitions of 'list' would find different associated terms. The screen images shown in Figure 5 were retrieved directly from the Emacs Lisp repository created by PEEL, and are only a sampling of the kinds of strong semantic relationships that can be derived by spreading activation.

Not only is spreading activation able to find semantically related words, it captures idiosyncratic term usage in the repository. For example, the Emacs Lisp community have over time created their own vernacular to describe objects and programming methods. In Figure 5, the relationship between delete and kill captures a programming idiom in which the term kill often refers to deleting an object. In addition, one of the programs, View Mail by Kyle Jones, uses the term delete when deleting messages, and kill when referring to removing files. Therefore, as shown in the figure, the delete query retrieved terms related to messages while the kill query found a number of words related to files. This kind of idiosyncratic term usage will be missed by traditional thesaurus methods based on equivalence classes for words, but are important to domains that generate idiomatic terminology to describe complex concepts, such as software development [18], medical fields, and others. Associative spreading activation uses the structure of the repository to find idiomatic word relationships with no additional effort or labor-intensive efforts to pre-define word relationships.

5.3 Retrieval by Reformulation

Beyond soft matching algorithms, retrieval systems can reduce the effects of poor repository structure by supporting the process of query construction. Retrieval by reformulation is a method that supports incremental query formation by building on query results [32]. Each time a user specifies a query, the system responds with query reformulation cues that give users an indication of how the repository is structured and what terms are used to index objects. Users can then incrementally improve a query by critiquing the results of previous queries. Rabbit [32] and Helgon [11] are examples of retrieval systems based on the retrieval by reformulation paradigm.

CodeFinder supports retrieval by reformulation by providing a number of retrieval cues in its interface. The most important of these is the example components appearing in the Example of the Retrieved Items pane (see Figure 3). Query reformulation is accomplished through direct manipulation techniques which allow users to simply point and click on components to be included in the query. For example, if a user wanted to add the term 'subject' shown in the "Terms" attribute of the component 'vm-kill-subject' in Figure 3 would simply click on the term. The term would be placed in the Terms field of the Query pane and a new retrieval set can be retrieved by clicking on the "Retrieve" button.

CodeFinder enhances the retrieval by reformulation paradigm in three ways. First, items are automatically placed in the appropriate part of the query. As opposed to query languages, where users are responsible for both the structure and content of queries, users are spared the cognitive overhead of deciding where to put an attribute, category or term. Secondly, CodeFinder imposes a ranking criteria on the retrieval set, automatically displaying the highest rated item for the query. This is important because the example provides cues that characterize what kinds of items are retrieved by the query and how it can be improved. The better the example, the easier it is to converge on a satisfactory solution. Selecting a good example has not been addressed by previous systems. Helgon, for example, organized the retrieval set by alphabetic order and displayed the first example in this ordering [11]. Third, CodeFinder displays terms in the Related Terms pane (shown in Figures 3 and 5) that provides cues for query reformulation and gives users an indication of the terminology and concepts supported by the repository.

6. Supporting Incremental Refinement of Component Repositories

Once components have been captured with minimal structure, it is necessary to evolve the structure to meet the needs of a development organization. This is necessary not only to improve the indexing and structure of the repository, but also to keep information current as knowledge in the organization evolves [19].

Curtis notes that empirical studies have shown that as programmers gain experience, their understanding of the knowledge in a particular domain gravitates toward a common structure [7]. In other words, while it is difficult to anticipate all design needs or uses for design artifacts in advance, a group of users working in a moderately homogeneous environment will tend to approach problems in the same way as they work together. Studies have shown that a great deal of effort in a project is directed at creating a mutual understanding of the problem among the developers [30]. The best opportunity to capture typical use situations is to provide the means to easily modify the repository during use.

6.1 Adapting the Repository

Adaptive techniques to improve the repository are applied by recording user actions and incrementally changing representations based on the actions. As users interact with the system, the repository migrates toward an indexing scheme using the kinds of goals typically encountered by users. The system gradually builds a consensual representation of its user's knowledge [26; p. 3] that changes as knowledge in an organization evolves.

Adaptive indexing collects term usage data interactively, in the process of real information seeking sessions, by adding terms to an object representation when they are used in a search that was satisfied by the document [14]. This allows items in the repository to evolve toward an indexing structure that supports typical usage situations.

CodeFinder supports unlimited aliasing and adaptive indexing by keeping track of the terms used during a retrieval session and displaying them to the user when they have successfully retrieved an item. During the process of constructing a query and browsing the information space, a user might remove a number of terms from the query and entered terms that were not in the system lexicon. There may also be some terms in the current query that are not in the item's current representation. Adding all of these terms to a component's representation may enhance the future retrievability of the item, but we must guard against situations in which misspellings or undesired terms are used in the retrieval session.


Figure 6: Adaptive Indexing in CodeFinder.

CodeFinder addresses this issue by displaying the information to the user for inspection and possible modification. These terms are presented to the user when they click on the "Choose This" button in the Example of the Retrieved Items" pane (see Figure 3). An adaptive indexing session is then invoked, as shown in Figure 6. Users are given the option to add these items to the current representation of the current component and are allowed to enter any other terms they wish.
Repository users can also modify the repository structure used by the Kandor-based part of CodeFinder through pull-down menus such as the one shown in the middle-top of Figure 3. This kind of modification is perhaps best handled by a repository librarian, but is an important facility that enables evolution of the repository.

6.2 Adjusting Link Weights

In addition to assigning new terms to a component representation, CodeFinder uses relevance feedback to adjust the term-component link weight. Relevance feedback is typically used to reformulate a query, usually by adding a document's representation to the query [27]. CodeFinder accomplishes this by allowing users to click on the circles on the right-hand side of the Retrieved Items pane (see Figure 3). Each item chosen is placed in the query (see the Related Items portion of the Query pane), where it is used as part of the spreading activation query.

CodeFinder also uses this action as an indication that the components chosen are relevant to the terms shown in the Terms portion of the Query pane. Strengthening the link weight between these components and terms will increase the probability that similar queries in the future will get better retrieval results. A connectionist learning paradigm is used which is referred to as localized reinforcement [3]. The association we wish to strengthen is between terms in the query and components chosen in relevance feedback. Therefore, if a component chosen in relevance feedback is indexed by any of the terms in the query, then each of the weights are adjusted with the following learning rule [26; p. 21]:

where is the activation value of the document node j, is the feedback given to node i by the user (a binary yes/no value in CodeFinder), and is the learning rate, usually a value between 0 and 1.0. The values are adjusted by an arctan() squashing function to keep values between -1.0 and 1.0. Since "the knowledge is in the link weights", this is an important contribution to the effectiveness of spreading activation. Empirical studies have shown that retrieval performance using this algorithm does indeed improve over time, especially when used in groups with homogeneous interests [3].

7. Evaluation and Future Directions

An experiment was designed to compare CodeFinder's effectiveness at supporting retrieval with varying levels of task definition, and to investigate how the systems fared when user and system terminology mismatched [18, 20]. The study compared CodeFinder to Helgon (a predecessor of CodeFinder) and a subset of the Document Examiner, the retrieval system delivered with the Symbolics Lisp Machine. Subjects were given well-defined, directed, and ill-defined tasks that were adapted from observations of people developing software in the Emacs Lisp environment. For example, a well-defined task would state "Find the source code that counts new messages in the Rmail system," while an ill-defined task would state "You are writing a mail system in which users can answer messages. Find whatever information you can that could potentially help you design code to answer messages." The tasks were also manipulated by vocabulary match. Some tasks referred to items in the repository that were indexed by terms that matched the wording of the tasks given to subjects and the others did not match task wording. The repository used for all three systems was the Emacs Lisp repository created by PEEL that was manipulated for the vocabulary match/mismatch tasks.

The results indicate that subjects had problems with systems using straightforward matching algorithms when problems were ill-defined or the vocabulary mismatched [18, 20]. Differences between the systems were more pronounced when the tasks were in the ill-defined or vocabulary mismatch categories. This serves to validate the effectiveness of spreading activation with minimally structured repositories. The greater degree of success with CodeFinder indicates that soft-matching retrieval algorithms, such as spreading activation, coupled with query construction techniques are necessary to adequately support the kinds of tasks typically encountered by developers searching for information about reusable components.

There are also a number of improvements needed in the PEEL/CodeFinder tool set before it can become a viable alternative for development organizations. At this time, PEEL makes no attempt to assess the reusability of components. If the repository becomes overwhelmed with large numbers of components, we may wish to extract only components that score well on various measures such as Halstead and McCabe indicators [6] or others. Work is also under way to apply the methods to larger-gain software components and non-source code artifacts. More studies are also needed to compare CodeFinder to other repository retrieval methods such as faceted classification [25], although the up-front repository construction costs of these methods makes such comparisons difficult.

8. Conclusions

The work described in this paper is rooted in the need for software development tools that support the process of finding components for reuse. Although it is true that component repositories that are structured and indexed perfectly will make it easy to find relevant components, the reality is that pragmatic and inherent circumstances prevent the creation of such repositories. It is both difficult and costly to anticipate all the needs a given component in a repository will be used for, and even if perfection could be achieved, the fact that people use diverse vocabulary to describe their information needs will prevent users from always finding the proper components. Methods are needed that can provide adequate retrieval effectiveness with minimal indexing and structuring efforts, allowing organizations to take advantage of valuable assets that have accumulated from previous development efforts without a large up-front investment.

This paper presents an approach to supporting a kind of "lifecycle" for repositories. Repositories are initially seeded with structure and index terms with PEEL, a tool that extracts repository information from Emacs Lisp Source code. CodeFinder, which supports the retrieval process through retrieval by reformulation and spreading activation, is then used to help users find components in the face of less-than-perfect repository structures and indexing. While using CodeFinder to find components, users are given the opportunity to add structure and index terms to the repository, improving the repository to reflect the kinds of problems repository users typically encounter. This process has the advantage of avoiding costly repository populating efforts while capturing information that may be useful to people encountering similar information needs in the future. Empirical studies have shown that CodeFinder is able to adequately support the process of finding relevant software components, even with the minimal structuring and indexing performed by PEEL.

In order for software reuse methodologies to be feasible and effective, methods are needed to exploit valuable assets that exist in development organizations in the form of existing code modules that have been developed and saved with no thought of re-use. Tools for building a repository must be developed that extract these modules and represent them in a repository in a form that is amenable to a specific set of retrieval tools. The work presented here takes some steps toward addressing these issues. Future work is needed to refine the ideas presented here and subject them to rigorous empirical studies to assess their effectiveness and learn more about the nature of component repositories built to support software reuse techniques.

Acknowledgments. I thank Gerhard Fischer and colleagues at the University of Colorado, especially Peter Foltz, Walter Kintsch, David Redmiles, and Curt Stevens, with whom many discussions were shared about the ideas in this paper. Generous donations of software by Helga Nieper-Lemke contributed to the design of CodeFinder. Grant No. MDA903-86-C0143 from the Army Research Institute partially supported this research.

References

[1] ACM, "The Full Computing Reviews Classification System," : ACM, New York, 1992.

[2] B. H. Barnes, T. B. Bollinger, "Making Reuse Cost-Effective," IEEE Software, 8(1), 1991, pp. 13-24.

[3] R. K. Belew, Adaptive Information Retrieval: Machine Learning in Associative Networks, Cognitive Science and Machine Intelligence Laboratory, University of Michigan, Ann Arbor, MI,1987.

[4] T. J. Biggerstaff, C. Richter, "Reusability Framework, Assessment, and Directions," IEEE Software, 4(2), 1987, pp. 41-49.

[5] B. A. Burton, R. W. Aragon, S. A. Bailey, K. D. Koehler, L. A. Mayes, "The Reusable Software Library," IEEE Software, 4(4), 1987, pp. 25-33.

[6] G. Caldiera, V. R. Basili, "Identifying and Qualifying Reusable Software Components," Computer, 24(2), 1991, pp. 61-70.

[7] B. Curtis, "Cognitive Issues in Reusing Software Artifacts," in Software Reusability; Volume II: Applications and Experience, T. J. Biggerstaff and A. J. Perlis, Eds.: Addison-Wesley, Reading, M, 1989, pp. 269-287.

[8] P. Devanbu, R. J. Brachman, P. G. Selfridge, B. W. Ballard, "LaSSIE: A Knowledge-Based Software Information System," Communications of the ACM, 34(5), 1991, pp. 34-49.

[9] J. C. Esteva, "Automatic Identification of Reusable Components," IEEE Seventh International Workshop on Computer-Aided Software Engineering - CASE '95, Toronto, Ca, 1995, pp. 80-87.

[10] G. Fischer, R. McCall, J. Ostwald, B. Reeves, F. Shipman, "Seeding, Evolutionary Growth and Reseeding: Supporting the Incremental Development of Design Environments," Proceeding of the Conference on Computer-Human Interaction (CHI '94), Boston, MA, 1994, pp. 292-298.

[11] G. Fischer, H. Nieper-Lemke, "HELGON: Extending the Retrieval by Reformulation Paradigm," Human Factors in Computing Systems, CHI '89 Conference Proceedings, 1989, pp. 333-352.

[12] W. B. Frakes, P. B. Gandel, "Representing Reusable Software," Information and Software Technology, 32(10), 1990, pp. 653-664.

[13] W. B. Frakes, B. A. Nejmeh, "Software Reuse Through Information Retrieval," The 20th Hawaii International Conference on System Sciences, Hawaii, 1987, pp. 530-535.

[14] G. W. Furnas, "Experience with an Adaptive Indexing Scheme," Human Factors in Computing Systems, CHI '85 Conference Proceedings, 1985, pp. 131-135.

[15] G. W. Furnas, T. K. Landauer, L. M. Gomez, S. T. Dumais, "The Vocabulary Problem in Human-System Communication," Communications of the ACM, 30(11), 1987, pp. 964-971.

[16] F. G. Halasz, "Reflections on Notecards: Seven Issues for the Next Generation of Hypermedia Systems," Communications of the ACM, 31(7), 1988, pp. 836-852.

[17] S. P. Harter, "Psychological Relevance and Information Science," Journal of the American Society for Information Science, 43(9), 1992, pp. 602-615.

[18] S. Henninger, "Using Iterative Refinement to Find Reusable Software," IEEE Software, 11(5), 1994.

[19] S. Henninger, K. Lappala, A. Raghavendran, "An Organizational Learning Approach to Domain Analysis," Seventeenth International Conference on Software Engineering, Seattle, WA, 1995, pp. 95-104.

[20] S. R. Henninger, Locating Relevant Examples for Example-Based Software Design, Depart. of Computer Science, University of Colorado, Boulder, CO,1993.

[21] Y. S. Maarek, D. M. Berry, G. E. Kaiser, "An Information Retrieval Approach for Automatically Constructing Software Libraries," IEEE Transactions on Software Engineering, 17(8), 1991, pp. 800-813.

[22] M. C. Mozer, Inductive Information Retrieval Using Parallel Distributed Computation, Institute for Cognitive Science, University of California San Diego, La Jolla, CA, ICS Report 8406, 1984.

[23] E. Ostertag, J. Hendler, R. Prieto-Diaz, C. Braun, "Computing Similarity in a Reuse Library System: An AI-Based Approach," ACM Transactions on Software Engineering and Methodology, 1(3), 1992, pp. 205-228.

[24] P. F. Patel-Schneider, R. J. Brachman, H. J. Levesque, "ARGON: Knowledge Representation Meets Information Retrieval," Proceedings of The First Conference on Artificial Intelligence Applications (CAIA '84), 1984, pp. 280-286.

[25] R. Prieto-Diaz, P. Freeman, "Classifying Software for Reusability," IEEE Software, 4(1), 1987, pp. 6-16.

[26] D. E. Rose, R. K. Belew, "A Connectionist and Symbolic Hybrid for Improving Legal Research," International Journal of Man-Machine Studies, 35(1), 1991, pp. 1-33.

[27] G. Salton, C. Buckley, "Improving Retrieval Performance by Relevance Feedback," Journal of the American Society for Information Science, 41(4), 1990, pp. 288-297.

[28] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval: McGraw Hill, New York, 1983.

[29] R. H. Thompson, W. B. Croft, "Support for Browsing in an Intelligent Text Retrieval System," International Journal of Man-Machine Studies, 30, 1989, pp. 639-668.

[30] D. B. Walz, J. J. Elam, B. Curtis, "Inside A Software Design Team: Knowledge Acquisition, Sharing, and Integration," Communications of the ACM, 36(10), 1993, pp. 62-77.

[31] G. Wiederhold, P. Wegner, S. Ceri, "Toward Megaprogramming," Communications of the ACM, 35(11), 1992, pp. 89-99.

[32] M. D. Williams, "What Makes RABBIT Run?," International Journal of Man-Machine Studies, 21, 1984, pp. 333-35.