Supporting the Construction and Evolution of Component Repositories
(Research Paper)
Scott Henninger
Department of Computer Science & Engineering
115 Ferguson Hall, CC 0115
University of Nebraska-Lincoln
Lincoln, NE 68588-0115
(402) 472-8394 FAX: (402) 472-7767
scotth@cse.unl.edu
Abstract
Repositories must be designed to meet
the evolving and dynamic needs of software development
organizations. Current software repository methods rely
heavily on classification, which exacerbates acquisition
and evolution problems by requiring costly classification
and domain analysis efforts before a repository
can be used effectively. This paper outlines an
approach in which minimal initial structure is used
to effectively find relevant software components while
methods are employed to incrementally improve repository
structures. The approach is demonstrated through PEEL,
a tool to semi-automatically identify reusable components,
and CodeFinder, a retrieval system that compensates
for the lack of explicit knowledge structures through
spreading activation retrieval and allows component
representations to be incrementally improved while
users are searching for information. The combination
of these techniques yields a flexible software
repository that minimizes up-front costs and improves
its retrieval effectiveness as developers use it
to find reusable software artifacts.
Table of Contents
1. Introduction
2. Repository Structure and Retrieval Effectiveness
3. Evolutionary Design of Repositories
3.1 The CodeFinder-PEEL Repository
4. Seeding the Repository
5. Compensating for Incomplete and Inconsistent
Indexing
5.1 Retrieval with Spreading Activation
5.2 Defining Relationships Through Content Induced
Structure
5.3 Retrieval by Reformulation
6. Supporting Incremental Refinement of
Component Repositories
6.1 Adapting the Repository
6.2 Adjusting Link Weights
7. Evaluation and Future Directions
8. Conclusions
References
1. Introduction
As libraries of reusable software components continue to grow, the issue
of retrieving components from software component repositories has captured
the attention of the software reuse community [5, 8, 12, 13, 21, 25]. Methods
using these repositories face an inherent dilemma: in order for the approach
to be useful, the repository must contain enough components to support developers,
but when many examples are available, finding and choosing an appropriate
one becomes troublesome. Retrieval techniques as diverse as enumerated classification
[1], facets [25], frame-based classification [23], free-text indexing [13],
and relational databases [5], have been employed to address the problem
of finding relevant components. But issues involving how effective repositories
are built, populated, and evolved to meet the changing needs of developers
have received considerably less attention.
Most retrieval algorithms require a value-added approach to building software
repositories that require pre-defined structure before the repository can
be effectively searched. At a minimum, components must be categorized before
they are placed in the repository. This creates a barrier of real and intellectual
capital investment that many organizations cannot overcome [4]. Methods
are needed that can provide adequate retrieval effectiveness with minimal
indexing and structuring efforts, allowing organizations to take advantage
of valuable assets that have accumulated from previous development efforts
without a large up-front investment.
In addition to being costly, the structures required for retrieval are often
static and unable to adapt to dynamic development contexts. Once the structures
are created, they are set in stone, changeable only through complex procedures
and often involving re-designing the entire repository. This is a significant
problem not only because it has proven exceedingly difficult to get the
structure right the first time, but also because the domain is constantly
evolving under the pressures of changing technology, project dynamics, and
the fluid nature of customer needs.
This paper presents a set of tools designed to support an incremental refinement
of component repositories. Initial component representations are created
with minimal up-front structuring efforts. Retrieval techniques are then
employed to compensate for minimal retrieval structure and help developers
find reusable components. As developers use the repository, adaptive indexing
techniques are employed to enhance retrieval structures and evolve the repository
toward a representation of the ways its users reason about development problems.
The following sections begin with a brief overview of current research in
retrieval methods for software components. Then the evolutionary approach
to constructing repositories is introduced followed by sections describing
the repository seeding, retrieval, and adaptive indexing capabilities of
CodeFinder, an experimental prototype designed to investigate issues in
retrieving software components. A discussion of an empirical study of CodeFinder
concludes the paper.
2. Repository Structure and Retrieval Effectiveness
The structure of a repository is generally regarded as key to obtaining
good retrieval results. No matter how "intelligent" the matching
algorithm, if components are indexed or otherwise structured poorly, it
will be difficult to achieve good retrieval performance. The intuitive and
widely held assumption is that up-front investments in structuring a repository
result in a proportional increase in the ease with which components can
be reused [2; p. 16]. While it may be true that good structure is crucial
in a retrieval system's effectiveness, sophisticated information retrieval
techniques have been created that finesse the relationship between repository
structure and retrieval methods.
Retrieval methods used for software repositories can be divided into three
categories: enumerated classification, faceted, and free-text indexing [12].
Hypertext systems have also been used, although some form of retrieval is
necessary to make hypertext effective in large repositories [16, 29]. Enumerated
classification is a well-known retrieval method used by the Dewy Decimal
system and the Computing Reviews Classification System [1]. The appeal of
a classification scheme is the ability to iteratively divide an information
space into smaller pieces that reduces the amount of information that needs
to be perused. Faceted classification avoids enumerating component
definitions in a hierarchy by defining attribute classes that can be instantiated
with different terms [25]. Users search for components by specifying terms
for each of the facets. Free-text indexing (AKA automatic indexing)
methods use the text from a document for indexing [21]. Document text is
applied to a "stop list" to remove frequently occurring words
such as 'and' and 'the'. The remaining text is used as an index to the document.
Users specify a query using key words that are applied to the indices to
find matching documents.
These methods define a continuum from enumerated classification that requires
extensive structure, to faceted classification requiring less structure,
to free-text indexing that requires no structure. Classification schemes
are computationally simple to retrieve from (although usability problems
exist), but difficult to build. Designers must get the taxonomy "right",
and the definition of "right" is elusive, situation-specific,
and dependent on the needs of individuals [17]. Faceted classification and
attribute-value methods suffer from a reduced form of the same problem-one
must get the attributes right, then it becomes a simple matter of choosing
terms within the facets. Free-text indexing involves little up-front costs,
but is most effective when a large corpus of linguistic text is available.
The strength of methods that use sophisticated information structures is
that people can use the knowledge contained in the structures to lead them
to relevant information. The weakness is that no support is possible if
the information is not structured properly. Creating specialized information
is a labor intensive and costly process that is prohibitory where resources
are strained or where large information spaces are concerned. In what follows,
a middle ground is achieved by utilizing retrieval methods that need very
little structure, yet yield effective retrieval performance. The structures
are then augmented in the context of use. As people find information, their
interaction with the system is traced and used to improve the existing structure.
The result is a flexible integration of retrieval and repository construction
methods that improve the underlying structures as the repository is used.

Figure 1: Evolutionary Construction of Repositories.
3. Evolutionary Design of Repositories
Figure 1 shows how repositories can be constructed with minimal up-front
effort and then incrementally improved as the repository is used. In the
initial stages of constructing a repository of software components it is
important to populate the repository with components [6]. A repository is
initially "seeded" [10] with a low-cost repository construction
method that semi-automatically indexes components with terms and phrases.
The system described in this paper, called PEEL (Parse and Extract Emacs
Lisp), extracts components from text files and indexes them through a combination
of automatic extraction and interactive user support [20].
The resulting repository, having been created with minimal retrieval structures,
may suffer from incomplete and inconsistent indexing, making it difficult
for keyword matching algorithms to retrieve relevant information. A combination
of soft-matching and query reformulation techniques is employed to compensate
for the minimal effort in designing the repository, making it easier to
find information. The CodeFinder system is a retrieval tool that uses innovative
retrieval techniques to help users find reusable software [18].
Even with innovative and intelligent retrieval tools, a repository's effectiveness
is directly proportional to the quality of indexing and structures for retrieval.
Relevance feedback [27] and adaptive indexing [14] are two methods that
can be used to improve repository structures to reflect the mental models
used by developers when searching for components. Since the developers on
a team or in an organization will have similar concerns and use similar
vocabulary, the evolving structure will over time become more effective,
while naturally adapting to changes in an organization's development environment.
The overall advantage of this approach is that costs are incurred incrementally
on an as-needed basis instead of requiring a repository design effort before
the repository becomes useful. We initially rely on whatever certification
process a component went through to become part of a production system.
Then, as the component becomes widely re-used, additional certification
procedures can be applied when necessary. This allows a process of natural
selection to take place, and avoids the pure overhead cost of pre-certifying
components that may never be re-used.
3.1 The CodeFinder-PEEL Repository
CodeFinder and PEEL were designed to investigate the cost/benefit tradeoffs
associated with building and using a repository to support the design process.
These issues have been investigated in the domain of Emacs Lisp customizations
for the GNU Emacs text editor. Components are Emacs Lisp functions, variables,
and constants that define e-mail and network news reader applications that
can be executed from the Emacs text editor environment. Emacs Lisp is a
variant of Lisp that has specialized constructs and primitives specially
suited for editing and displaying text on windows. The repository was built
from files downloaded from FTP sites and news groups using PEEL to extract
components and translate them into a CodeFinder representation [20]. Source
code was extracted from various newsgroups devoted to Emacs and Emacs LISP
issues, such as comp.emacs, and gnu.emacs.gnus. A modest effort
with PEEL has resulted in the creation of a repository consisting of more
than 1800 Emacs LISP functions, variables, macros, and constants, and about
2900 distinct terms.
4. Seeding the Repository
Seeding a repository is a matter of creating component representations by
indexing them with key terms and phrases. Components can take on any size
or form, depending on the needs of repository users. PEEL is a re-engineering
tool that translates Emacs Lisp files into individual, reusable, components
in a frame-based knowledge representation language named Kandor [8, 24]
that is used by CodeFinder to index components and create a frame-based
hierarchy of retrieval concepts [18]. PEEL extracts source code definitions
of functions, variables, constants, and macros from a source code file.
Information is extracted from each of the components and translated into
Kandor objects. Kandor enables the expression of constraints on members
of a defined frame. Frame definitions are organized hierarchically and defined
through restrictions on super-frames. Part of the hierarchy for the Emacs
Lisp repository is shown in Figure 3. CodeFinder [18], as well as LaSSIE
[8], Argon [24] and Helgon [11], have used the inferencing capabilities
of inheritance and classification to retrieve information with Kandor representations.
Kandor's inferencing capabilities are described in detail elsewhere [24].
For our purposes, Kandor representations can be viewed as a set of attribute/value
slots that contain information about a given component. PEEL is used to
capture that information and place it in Kandor slots.

Figure 2: Highlighting and Other Indexing Operations.
A knowledge representation system is only as good as the knowledge it
contains. Automatic extraction is certainly cost-effective, but there are
some important slots that demand human attention. PEEL therefore allows
the user to augment the TERMS, AUTHORS, and DESCRIPTION slots. As shown
in Figure 2, for each component, PEEL will display the source code (window
on left-hand side of Figure 2) and allow the user to edit part of the representation
(window on the right-hand side).
Given the use of non-words and esoteric abbreviations that are so common
in source code, terms are especially important for retrieving software components.
For example, one of the mail systems parsed by PEEL uses the abbreviation
'mp' for a "message pointer" throughout the system's source code
and documentation. PEEL uses a three-step procedure to process terms similar
to the process of extracting components in Basili's software factory [6].
The first step is fully automated. Terms are extracted from the name of
the function and any strings or comments within or immediately preceding
the definition. A common stop list is applied to the extracted terms to
remove common words with little semantic meaning such as 'and', 'of', 'the'
and others. PEEL users can also augment this stop list or add terms to a
nonsense word list.
The second step is to display the terms to the user. Users can remove the
terms, truncate the list (a feature that proved to be useful for many components
in which only the name was descriptive), and highlight the term in the source
code window to show how and where the term has been used (see the pop-up
window in the lower part of Figure 2). The third step allows users to add
their own terms, which can include phrases (words separated by spaces) and
any other punctuation.
PEEL uses a rudimentary algorithm for identifying components that simply
parses the source file and finds objects with 'defun', 'defvar', etc. as
the first element in a list. PEEL users then have the option of including
the identified component in the repository. More sophisticated component
identification methods have been designed, although some human intervention
is also needed in these approaches [9].
5. Compensating for Incomplete and Inconsistent
Indexing
Users trying to find relevant objects using a retrieval system are faced
with two interrelated problems. The first is the nature of design problems,
in which people begin the process with an ill-defined notion of what is
needed and how they should go about solving the problem [18]. The second
problem is that document indexing is often inconsistent and incomplete.
Studies have shown that human indexers will disagree on terms used to index
documents and the same indexer will use different terms to index the same
document at different times [28]. Other studies have shown that people use
a surprisingly diverse set of descriptors to describe the same thing [15].
Therefore, retrieval systems that are built on the assumption that the quality
of the indexing is the paramount factor in the efficacy of an information
retrieval service will fall short of supporting its users. The search process
is more than finding an exact match [4]. It includes finding components
that provide a partial or analogous solution to the problem. Inconsistent
indexing will always be a problem, making it necessary to use retrieval
methods and algorithms that go beyond simplistic keyword matching schemes.
The downfall of the traditional matching strategy is that queries are viewed
as precise specifications of user needs and document representations are
precise descriptions of repository objects. An alternative strategy is to
assume a degree of uncertainty in document and query representations [22].
Figure 3: CodeFinder User Interface
In the CodeFinder user interface, the Category Hierarchy
window displays a graphical hierarchy of a repository. The Query
pane shows the current query. The top part of the query specifies two categories
('thing' and 'E-mail') to search within. The bottom part specifies key terms
and related components. The Retrieved Items pane shows all components
matching the current query, by order of relevance to the query. The Example
of the Retrieved Items pane shows the full entry for the
top matching component or a component chosen by the user. The Bookmarks
pane holds a history of viewed objects. The Related Terms pane shows
terms retrieved by the query.
CodeFinder (see Figure 3) has been designed to address these issues through
the employment of intelligent retrieval methods and support for query construction.
"Intelligent" retrieval is provided by an associative spreading
activation retrieval algorithm [22] that transcends matching algorithms
to retrieve items that are associated with a query. Query construction is
supported through retrieval by reformulation [32], a technique that allows
users to incrementally construct queries as they explore the information
space. This combination of techniques yields a flexible retrieval mechanism
that achieves good performance in the face of indexing problems and ill-defined
information needs. It is designed to work effectively with the minimally
structured repository created by PEEL.
5.1 Retrieval with Spreading Activation
CodeFinder uses the KANDOR representation shown in Figure 2 to construct
an associative network with the TERMS, PARAMATERS and DESCRIPTION fields
(see Figure 4). An associative spreading activation process, which is based
on a connectionist relaxation procedure for retrieval [22], is then applied
to the network to retrieve information. This algorithm is specified elsewhere
[20], and works in the following manner: Users specify a query consisting
of either or both term and component nodes. The query nodes are given an
activation value of 1.0, which remains constant during the spreading activation
process. If a node has a positive activation value, the value is passed
through each of its links, otherwise no activation is passed. Each node
computes the sum of incoming activation values, modified by link weight.
The sum of received activation values is then modulated by fan-in, decay
and other parameters. The resulting value is fed into a squashing function
to keep activation values between boundaries of -.2 and 1 (the asymmetrical
values are needed to encourage the flow of positive activation).

Figure 4: An Associative Network
An associative network in CodeFinder consists of two layers
of nodes: terms (represented by circles) and components (rectangles). Boxed
lines are connections between components and terms, as defined by the Terms
field of the component. Link weights are shown in link boxes.
For example, a query specifying the term remove in Figure 4 will
spread a value of .499 to the component nodes 'vm-quit' and 'vm-delete-message'.
On the next cycle, 'vm-quit' and 'vm-delete-message' will propagate their
computed activation values to all nodes they are connected to. For instance,
'vm-delete-message' will activate message and delete. On the
third cycle, all of the previously activated nodes will continue to propagate
their values. Note that message and delete will work together
to activate 'mail-delete-forward' and 'gnus-kill', retrieving these components
even though they are not directly indexed by terms used in the query. Keywords
are dynamically related through the items they index, compensating for inconsistent
indexing through co-occurrence relationships.
5.2 Defining Relationships Through Content
Induced Structure
The power of the combination of PEEL and CodeFinder is that indexing does
not need to be exhaustive for CodeFinder to effectively locate software
objects. A minimal indexing with PEEL combined with the spreading activation
and browsing tools of CodeFinder adequately supports the process of locating
relevant source code. The advantage of the spreading activation process
used by CodeFinder is that local decisions on indexing a component can lead
to effective retrieval performance. This is in contrast with classification
approaches that require more global knowledge of the repository to effectively
classify and retrieve components.

Figure 5: Capturing Word Relationships Through Content Induced Structure.
It is the structure of the repository, and not pre-defined term or component
relationships, that spreading activation uses to retrieve information and
induce relationships between components. A particularly salient way to demonstrate
this effect is to show the related keywords that are retrieved by single-word
queries. Figure 5 shows that the query delete will retrieve a number
of synonyms. The semantic relationship between delete and the components
retrieved has not been programmed into the system. Given only the structure
of the repository in which the terms delete, remove, and kill
co-occur between components, spreading activation is able to induce the
semantic relationship between these words. This is referred to as "content
induced structure" [16]. Figure 5c shows that when the term list
is specified, terms related to the underlying Lisp language such as sequence,
cl, set, alist, member, and others are found.
This accurately reflects the contents of the repository, which are all written
in Emacs Lisp. Other repositories using more conventional definitions of
'list' would find different associated terms. The screen images shown in
Figure 5 were retrieved directly from the Emacs Lisp repository created
by PEEL, and are only a sampling of the kinds of strong semantic relationships
that can be derived by spreading activation.
Not only is spreading activation able to find semantically related words,
it captures idiosyncratic term usage in the repository. For example, the
Emacs Lisp community have over time created their own vernacular to describe
objects and programming methods. In Figure 5, the relationship between delete
and kill captures a programming idiom in which the term kill
often refers to deleting an object. In addition, one of the programs, View
Mail by Kyle Jones, uses the term delete when deleting messages,
and kill when referring to removing files. Therefore, as shown in
the figure, the delete query retrieved terms related to messages
while the kill query found a number of words related to files. This
kind of idiosyncratic term usage will be missed by traditional thesaurus
methods based on equivalence classes for words, but are important to domains
that generate idiomatic terminology to describe complex concepts, such as
software development [18], medical fields, and others. Associative spreading
activation uses the structure of the repository to find idiomatic word relationships
with no additional effort or labor-intensive efforts to pre-define word
relationships.
5.3 Retrieval by Reformulation
Beyond soft matching algorithms, retrieval systems can reduce the effects
of poor repository structure by supporting the process of query construction.
Retrieval by reformulation is a method that supports incremental query formation
by building on query results [32]. Each time a user specifies a query, the
system responds with query reformulation cues that give users an indication
of how the repository is structured and what terms are used to index objects.
Users can then incrementally improve a query by critiquing the results of
previous queries. Rabbit [32] and Helgon [11] are examples of retrieval
systems based on the retrieval by reformulation paradigm.
CodeFinder supports retrieval by reformulation by providing a number of
retrieval cues in its interface. The most important of these is the example
components appearing in the Example of the Retrieved Items
pane (see Figure 3). Query reformulation is accomplished through direct
manipulation techniques which allow users to simply point and click on components
to be included in the query. For example, if a user wanted to add the term
'subject' shown in the "Terms" attribute of the component 'vm-kill-subject'
in Figure 3 would simply click on the term. The term would be placed in
the Terms field of the Query pane and a new retrieval set can be retrieved
by clicking on the "Retrieve" button.
CodeFinder enhances the retrieval by reformulation paradigm in three ways.
First, items are automatically placed in the appropriate part of the query.
As opposed to query languages, where users are responsible for both the
structure and content of queries, users are spared the cognitive overhead
of deciding where to put an attribute, category or term. Secondly, CodeFinder
imposes a ranking criteria on the retrieval set, automatically displaying
the highest rated item for the query. This is important because the example
provides cues that characterize what kinds of items are retrieved by the
query and how it can be improved. The better the example, the easier it
is to converge on a satisfactory solution. Selecting a good example has
not been addressed by previous systems. Helgon, for example, organized the
retrieval set by alphabetic order and displayed the first example in this
ordering [11]. Third, CodeFinder displays terms in the Related Terms
pane (shown in Figures 3 and 5) that provides cues for query reformulation
and gives users an indication of the terminology and concepts supported
by the repository.
6. Supporting Incremental Refinement of Component
Repositories
Once components have been captured with minimal structure, it is necessary
to evolve the structure to meet the needs of a development organization.
This is necessary not only to improve the indexing and structure of the
repository, but also to keep information current as knowledge in the organization
evolves [19].
Curtis notes that empirical studies have shown that as programmers gain
experience, their understanding of the knowledge in a particular domain
gravitates toward a common structure [7]. In other words, while it is difficult
to anticipate all design needs or uses for design artifacts in advance,
a group of users working in a moderately homogeneous environment will tend
to approach problems in the same way as they work together. Studies have
shown that a great deal of effort in a project is directed at creating a
mutual understanding of the problem among the developers [30]. The best
opportunity to capture typical use situations is to provide the means to
easily modify the repository during use.
6.1 Adapting the Repository
Adaptive techniques to improve the repository are applied by recording user
actions and incrementally changing representations based on the actions.
As users interact with the system, the repository migrates toward an indexing
scheme using the kinds of goals typically encountered by users. The system
gradually builds a consensual representation of its user's knowledge [26;
p. 3] that changes as knowledge in an organization evolves.
Adaptive indexing collects term usage data interactively, in the process
of real information seeking sessions, by adding terms to an object representation
when they are used in a search that was satisfied by the document [14].
This allows items in the repository to evolve toward an indexing structure
that supports typical usage situations.
CodeFinder supports unlimited aliasing and adaptive indexing by keeping
track of the terms used during a retrieval session and displaying them to
the user when they have successfully retrieved an item. During the process
of constructing a query and browsing the information space, a user might
remove a number of terms from the query and entered terms that were not
in the system lexicon. There may also be some terms in the current query
that are not in the item's current representation. Adding all of these terms
to a component's representation may enhance the future retrievability of
the item, but we must guard against situations in which misspellings or
undesired terms are used in the retrieval session.

Figure 6: Adaptive Indexing in CodeFinder.
CodeFinder addresses this issue by displaying the information to the
user for inspection and possible modification. These terms are presented
to the user when they click on the "Choose This" button in the
Example of the Retrieved Items" pane (see Figure 3). An adaptive indexing
session is then invoked, as shown in Figure 6. Users are given the option
to add these items to the current representation of the current component
and are allowed to enter any other terms they wish.
Repository users can also modify the repository structure used by the Kandor-based
part of CodeFinder through pull-down menus such as the one shown in the
middle-top of Figure 3. This kind of modification is perhaps best handled
by a repository librarian, but is an important facility that enables evolution
of the repository.
6.2 Adjusting Link Weights
In addition to assigning new terms to a component representation, CodeFinder
uses relevance feedback to adjust the term-component link weight. Relevance
feedback is typically used to reformulate a query, usually by adding a document's
representation to the query [27]. CodeFinder accomplishes this by allowing
users to click on the circles on the right-hand side of the Retrieved Items
pane (see Figure 3). Each item chosen is placed in the query (see the Related
Items portion of the Query pane), where it is used as part of the spreading
activation query.
CodeFinder also uses this action as an indication that the components chosen
are relevant to the terms shown in the Terms portion of the Query pane.
Strengthening the link weight between these components and terms will increase
the probability that similar queries in the future will get better retrieval
results. A connectionist learning paradigm is used which is referred to
as localized reinforcement [3]. The association we wish to strengthen is
between terms in the query and components chosen in relevance feedback.
Therefore, if a component chosen in relevance feedback is indexed by any
of the terms in the query, then each of the weights are adjusted with the
following learning rule [26; p. 21]:
where is the activation value of the document node j, is the feedback
given to node i by the user (a binary yes/no value in CodeFinder), and is
the learning rate, usually a value between 0 and 1.0. The values are adjusted
by an arctan() squashing function to keep values between -1.0 and 1.0. Since
"the knowledge is in the link weights", this is an important contribution
to the effectiveness of spreading activation. Empirical studies have shown
that retrieval performance using this algorithm does indeed improve over
time, especially when used in groups with homogeneous interests [3].
7. Evaluation and Future Directions
An experiment was designed to compare CodeFinder's effectiveness at supporting
retrieval with varying levels of task definition, and to investigate how
the systems fared when user and system terminology mismatched [18, 20].
The study compared CodeFinder to Helgon (a predecessor of CodeFinder) and
a subset of the Document Examiner, the retrieval system delivered with the
Symbolics Lisp Machine. Subjects were given well-defined, directed, and
ill-defined tasks that were adapted from observations of people developing
software in the Emacs Lisp environment. For example, a well-defined task
would state "Find the source code that counts new messages in the Rmail
system," while an ill-defined task would state "You are writing
a mail system in which users can answer messages. Find whatever information
you can that could potentially help you design code to answer messages."
The tasks were also manipulated by vocabulary match. Some tasks referred
to items in the repository that were indexed by terms that matched the wording
of the tasks given to subjects and the others did not match task wording.
The repository used for all three systems was the Emacs Lisp repository
created by PEEL that was manipulated for the vocabulary match/mismatch tasks.
The results indicate that subjects had problems with systems using straightforward
matching algorithms when problems were ill-defined or the vocabulary mismatched
[18, 20]. Differences between the systems were more pronounced when the
tasks were in the ill-defined or vocabulary mismatch categories. This serves
to validate the effectiveness of spreading activation with minimally structured
repositories. The greater degree of success with CodeFinder indicates that
soft-matching retrieval algorithms, such as spreading activation, coupled
with query construction techniques are necessary to adequately support the
kinds of tasks typically encountered by developers searching for information
about reusable components.
There are also a number of improvements needed in the PEEL/CodeFinder tool
set before it can become a viable alternative for development organizations.
At this time, PEEL makes no attempt to assess the reusability of components.
If the repository becomes overwhelmed with large numbers of components,
we may wish to extract only components that score well on various measures
such as Halstead and McCabe indicators [6] or others. Work is also under
way to apply the methods to larger-gain software components and non-source
code artifacts. More studies are also needed to compare CodeFinder to other
repository retrieval methods such as faceted classification [25], although
the up-front repository construction costs of these methods makes such comparisons
difficult.
8. Conclusions
The work described in this paper is rooted in the need for software development
tools that support the process of finding components for reuse. Although
it is true that component repositories that are structured and indexed perfectly
will make it easy to find relevant components, the reality is that pragmatic
and inherent circumstances prevent the creation of such repositories. It
is both difficult and costly to anticipate all the needs a given component
in a repository will be used for, and even if perfection could be achieved,
the fact that people use diverse vocabulary to describe their information
needs will prevent users from always finding the proper components. Methods
are needed that can provide adequate retrieval effectiveness with minimal
indexing and structuring efforts, allowing organizations to take advantage
of valuable assets that have accumulated from previous development efforts
without a large up-front investment.
This paper presents an approach to supporting a kind of "lifecycle"
for repositories. Repositories are initially seeded with structure and index
terms with PEEL, a tool that extracts repository information from Emacs
Lisp Source code. CodeFinder, which supports the retrieval process through
retrieval by reformulation and spreading activation, is then used to help
users find components in the face of less-than-perfect repository structures
and indexing. While using CodeFinder to find components, users are given
the opportunity to add structure and index terms to the repository, improving
the repository to reflect the kinds of problems repository users typically
encounter. This process has the advantage of avoiding costly repository
populating efforts while capturing information that may be useful to people
encountering similar information needs in the future. Empirical studies
have shown that CodeFinder is able to adequately support the process of
finding relevant software components, even with the minimal structuring
and indexing performed by PEEL.
In order for software reuse methodologies to be feasible and effective,
methods are needed to exploit valuable assets that exist in development
organizations in the form of existing code modules that have been developed
and saved with no thought of re-use. Tools for building a repository must
be developed that extract these modules and represent them in a repository
in a form that is amenable to a specific set of retrieval tools. The work
presented here takes some steps toward addressing these issues. Future work
is needed to refine the ideas presented here and subject them to rigorous
empirical studies to assess their effectiveness and learn more about the
nature of component repositories built to support software reuse techniques.
Acknowledgments. I thank Gerhard Fischer and colleagues
at the University of Colorado, especially Peter Foltz, Walter Kintsch, David
Redmiles, and Curt Stevens, with whom many discussions were shared about
the ideas in this paper. Generous donations of software by Helga Nieper-Lemke
contributed to the design of CodeFinder. Grant No. MDA903-86-C0143 from
the Army Research Institute partially supported this research.
References
[1] ACM, "The Full Computing Reviews Classification System," :
ACM, New York, 1992.
[2] B. H. Barnes, T. B. Bollinger, "Making Reuse Cost-Effective,"
IEEE Software, 8(1), 1991, pp. 13-24.
[3] R. K. Belew, Adaptive Information Retrieval: Machine Learning in
Associative Networks, Cognitive Science and Machine Intelligence Laboratory,
University of Michigan, Ann Arbor, MI,1987.
[4] T. J. Biggerstaff, C. Richter, "Reusability Framework, Assessment,
and Directions," IEEE Software, 4(2), 1987, pp. 41-49.
[5] B. A. Burton, R. W. Aragon, S. A. Bailey, K. D. Koehler, L. A. Mayes,
"The Reusable Software Library," IEEE Software, 4(4),
1987, pp. 25-33.
[6] G. Caldiera, V. R. Basili, "Identifying and Qualifying Reusable
Software Components," Computer, 24(2), 1991, pp. 61-70.
[7] B. Curtis, "Cognitive Issues in Reusing Software Artifacts,"
in Software Reusability; Volume II: Applications and Experience,
T. J. Biggerstaff and A. J. Perlis, Eds.: Addison-Wesley, Reading, M, 1989,
pp. 269-287.
[8] P. Devanbu, R. J. Brachman, P. G. Selfridge, B. W. Ballard, "LaSSIE:
A Knowledge-Based Software Information System," Communications
of the ACM, 34(5), 1991, pp. 34-49.
[9] J. C. Esteva, "Automatic Identification of Reusable Components,"
IEEE Seventh International Workshop on Computer-Aided Software Engineering
- CASE '95, Toronto, Ca, 1995, pp. 80-87.
[10] G. Fischer, R. McCall, J. Ostwald, B. Reeves, F. Shipman, "Seeding,
Evolutionary Growth and Reseeding: Supporting the Incremental Development
of Design Environments," Proceeding of the Conference on Computer-Human
Interaction (CHI '94), Boston, MA, 1994, pp. 292-298.
[11] G. Fischer, H. Nieper-Lemke, "HELGON: Extending the Retrieval
by Reformulation Paradigm," Human Factors in Computing Systems,
CHI '89 Conference Proceedings, 1989, pp. 333-352.
[12] W. B. Frakes, P. B. Gandel, "Representing Reusable Software,"
Information and Software Technology, 32(10), 1990, pp. 653-664.
[13] W. B. Frakes, B. A. Nejmeh, "Software Reuse Through Information
Retrieval," The 20th Hawaii International Conference on System
Sciences, Hawaii, 1987, pp. 530-535.
[14] G. W. Furnas, "Experience with an Adaptive Indexing Scheme,"
Human Factors in Computing Systems, CHI '85 Conference Proceedings,
1985, pp. 131-135.
[15] G. W. Furnas, T. K. Landauer, L. M. Gomez, S. T. Dumais, "The
Vocabulary Problem in Human-System Communication," Communications
of the ACM, 30(11), 1987, pp. 964-971.
[16] F. G. Halasz, "Reflections on Notecards: Seven Issues for the
Next Generation of Hypermedia Systems," Communications of the ACM,
31(7), 1988, pp. 836-852.
[17] S. P. Harter, "Psychological Relevance and Information Science,"
Journal of the American Society for Information Science, 43(9),
1992, pp. 602-615.
[18] S. Henninger, "Using Iterative Refinement to Find Reusable Software,"
IEEE Software, 11(5), 1994.
[19] S. Henninger, K. Lappala, A. Raghavendran, "An Organizational
Learning Approach to Domain Analysis," Seventeenth International
Conference on Software Engineering, Seattle, WA, 1995, pp. 95-104.
[20] S. R. Henninger, Locating Relevant Examples for Example-Based Software
Design, Depart. of Computer Science, University of Colorado, Boulder,
CO,1993.
[21] Y. S. Maarek, D. M. Berry, G. E. Kaiser, "An Information Retrieval
Approach for Automatically Constructing Software Libraries," IEEE
Transactions on Software Engineering, 17(8), 1991, pp. 800-813.
[22] M. C. Mozer, Inductive Information Retrieval Using Parallel Distributed
Computation, Institute for Cognitive Science, University of California
San Diego, La Jolla, CA, ICS Report 8406, 1984.
[23] E. Ostertag, J. Hendler, R. Prieto-Diaz, C. Braun, "Computing
Similarity in a Reuse Library System: An AI-Based Approach," ACM
Transactions on Software Engineering and Methodology, 1(3), 1992, pp.
205-228.
[24] P. F. Patel-Schneider, R. J. Brachman, H. J. Levesque, "ARGON:
Knowledge Representation Meets Information Retrieval," Proceedings
of The First Conference on Artificial Intelligence Applications (CAIA '84),
1984, pp. 280-286.
[25] R. Prieto-Diaz, P. Freeman, "Classifying Software for Reusability,"
IEEE Software, 4(1), 1987, pp. 6-16.
[26] D. E. Rose, R. K. Belew, "A Connectionist and Symbolic Hybrid
for Improving Legal Research," International Journal of Man-Machine
Studies, 35(1), 1991, pp. 1-33.
[27] G. Salton, C. Buckley, "Improving Retrieval Performance by Relevance
Feedback," Journal of the American Society for Information Science,
41(4), 1990, pp. 288-297.
[28] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval:
McGraw Hill, New York, 1983.
[29] R. H. Thompson, W. B. Croft, "Support for Browsing in an Intelligent
Text Retrieval System," International Journal of Man-Machine Studies,
30, 1989, pp. 639-668.
[30] D. B. Walz, J. J. Elam, B. Curtis, "Inside A Software Design Team:
Knowledge Acquisition, Sharing, and Integration," Communications
of the ACM, 36(10), 1993, pp. 62-77.
[31] G. Wiederhold, P. Wegner, S. Ceri, "Toward Megaprogramming,"
Communications of the ACM, 35(11), 1992, pp. 89-99.
[32] M. D. Williams, "What Makes RABBIT Run?," International
Journal of Man-Machine Studies, 21, 1984, pp. 333-35.