Databases and Hypertext 130
A Brief History of Databases 131
Meaning and Knowledge 132
Database Example 134
Knowledge Representatio 136
Logic Example: 137
Semantic Network Example 138
Frame Example 139
Connectionist Models 139
Retrieval in Practice: 141
Information Retrieval
To many people, hypertext represents a database of information that facilitates searching and accessing information. In many ways it is. This chapter will try to make this view of hypertext systems as databases more understandable. It will try to show how a hypertext is similar to a database. There will be no real attempt to explain the low level organization of databases; instead, we will look at how a hypertext system organizes information in records, rules, or higher order objects such as frames or semantic networks. We will try to explain what some of these complex terms mean in ordinary language with examples that try to capture the primary functions and uses these of these abstract data structures.
In fact, many hypertexts store the information contained in their nodes in a database, using standard database structures. This makes access to the information fast and accurate when the user links to it. Unlike an ordinary database, which usually provides operators to retrieve the database in record or table form, hypertext usually retrieves knowledge in more complex representations. Hypertext can represent information multi-dimensionally. That is, the links are not constrained by the two-dimensional data structure of a database. Some of the most interesting current work has as its goal the storing and retrieving of knowledge in many forms (visual, iconic, linguistic, and textual) in ways that mimic human intellectual reasoning in all its complexity.
Databases and Hypertext
Since much of the discussion of Hypertext can be made more meaningful from the perspective of database organization, this section will offer a brief introduction and overview of database systems. This cannot substitute for a working knowledge of database systems, but it will lead naturally into a description of some advanced hypertext systems that explicitly make use of databases created for the personal computing world. These systems build on the strength of retrieval and simplified updating that are the hallmarks of good database systems, and the notoriously weak Achilles' heel of most hypertext systems today.
A Brief History of Databases
Vannevar Bush' visionary article is as important to database researchers as it is to the hypertext community. It outlines the dream not only of a numeric and text database that enlarges and supplements human memory, but it adds to that fantasy an exotic optical retrieval system that presages modern experiments with optical computers. Optical computers remain today in the most primitive stage (or most advanced, depending on your perspective) of research, although there is great hope for their rapid development into practical working systems in the near future. What made Bush's vision even more visionary and remarkable was its emphasis on semantic encoding and retrieval, in a way that entailed more than just a database,but in fact a knowledge base. Currently the distinction between databases and knowledge bases is blurring rapidly as the techniques from Artificial Intelligence are infiltrating the database community as Trojan horses brought in with Prolog; advanced knowledge representation languages, such as KLONE; and other object oriented, frame-based, and rule-based systems.
However, the distinction between databases and knowledge bases is an important one that will be drawn out in this brief description. It will be made clear that the issues that raise the distinction between databases and knowledge bases are important for understanding the history and current state of hypertext too.
In the early computer languages, databases were often appended to a particular programs to provide the initial values for variables and more complex structures such as lists and arrays, if these were used in more than one place in the program. This idea of sharing data expanded greatly when timesharing systems arose and data could be shared not only within a program, but between programs and even between individuals at different times. The special file structures and data structures that made this sharing possible arose in the 50's and 60's.
Although the first languages, such as FORTRAN and COBOL, created the need for sharing large bases of data and text ( mainly computer code, rather than such frivolous things as letters and books), little formal database research was published until the 1970's. Then a series of articles, many of them revealing the unnoticed research of the previous decade, came to light.
The 1973 Turing award lecture by Charles Bachman presented the advantages of the first coherent view of machine memory and file systems as a hierarchical network of records. In beginning a debate that lasted through the decade, T. Codd, then with IBM, revealed a general - purpose programming language based on set theory and logic. This relational programming language won out during the next fifteen years and became the basis for most of today's data base management systems (DBMSs) in their many hundreds of different forms.
Meanwhile, research on DBMSs is moving on and the key topic appears to be how to capture semantics, meaning, in database forms, without overloading even the most powerful computers available today. Making the database simple enough for computers to use in a generative way is the problem of "computational tractability" and it is really this problem that AI addresses.
Meaning and Knowledge
Today many outstanding researchers focus on this issue, designing knowledge bases of enormous size and scope. Doug Lenat proposes to create a system containing millions of nodes of information about everyday facts, such as those one would find in an annual world almanac or a child's reference book about the world. He calls his project CYC, short for enCYClopedia. Others, like Ray Reiter, Hector Levesque, and Ron Brachman, are building computational systems that make these sorts of projects possible. By and large, these systems are based on knowledge structures arranged in trees and networks. The knowledge is connected systematically, but loosely. Many heuristics are invoked somewhat capriciously to make connections between nodes. For instance, doctors and lawyers may be connected in many different ways: as professionals; employers of other professionals; degreed individuals; wealthy people; and so on. Furthermore new connections may be formed computationally among these nodes by searching the knowledge base for similarities among nodes.
In general, heuristics dominate the system. There are no axioms or fundamental laws from which to deduce the remainder of the knowledge. Instead comparisons focus on analogy and similarities among the content. For instance, a doctor may be loosely compared with a lawyer, and the whole network of relationships built for a lawyer may be reused to provide the skeleton of structure for the doctor. Usually, by using this copy and edit mode of building knowledge structures, enormous time savings can be had; and in fact, the process can be generative too, since certain relationships that may be obvious for the lawyer may not have suggested themselves for the doctor, except by analogy. Recent work on case based reasoning (Kolodner, et al.) and learning through analogy (Anderson and , 1990) have examined the basis of analogy-based reasoning. (See also Gentner, 1989, and for a dissenting note (Moran and Halasz, 198 ).
Facts are often embedded into higher order structures such as qualitative models and mental models of complex real life situations. The notion that mental reasoning is mediated by qualitative models of the world has a consensual validity among many cognitive scientists , and is the subject of an entire book ( Gentner and Stevens, 1983) and a thoroughly reasoned exposition by Johnson-Laird (1983). The flavor of these models is very much like the earliest philosophers' discussions of physics. For instance, here is what Vitruvius, a Roman architect from around the first century B. C. had to say about the analogy between sound and waves in water:
"Voice is a flowing breath of air, perceptible to the hearing by contact. It moves in an endless number of circular rounds, like the innumerably increasing circular waves which appear when a stone is thrown into smooth water, and which keep on spreading indefinitely from the center unless interrupted by narrow limits, or by some obstruction which prevents such waves from reaching their end in due formation. When they are interrupted by obstructions, the first waves, flowing back, break up the formation of those which follow."
Vitruvius, The ten books of architecture. M. H. Morgan (trans.) Cambridge, Massachusetts: Harvard University Press, 1926.
Diagrams and complex visual simulations appear to be naturally suited to these representations. Often these visual systems are combined into models that mimic mental structures, that reason the way peoples' minds do.
In contrast, the logic community, with representatives that range from Jack Minker and John McCarthy to Robert Kowalski, are continuing to examine the special strengths of logic-based databases. The rapid development of descriptive languages, such as PROLOG, have enabled them to devise logic databases of great complexity. The framework of logic allows them to make deductions about elements of the database with great certainty and accuracy. If something is found there, it has to exist, since the reasoning is tautological, and rigorously certain. However, the inference rules work with very tiny steps, so you need an awful lot of them to find anything new about the database. Fortunately, the Prolog systems turn out to have a convenient and efficient representation form in computers, so these tiny steps can be taken with great rapidity. Even so, however, the purity of these logic systems is almost always compromised with special constructs (such as "cuts") that introduce heuristics and non-logical components into the system.
It is much too early in the development of these systems to judge whether search or logic-based approaches are better for the many uses of hypertext applications; or even for which uses one or the other approach is better. At the moment, they are largely equivalent; although in general, natural language systems seem better suited to PROLOG implementations, and simulations, especially visual simulations that incorporate pattern recognition and updating, are more conveniently implemented in a heuristic, search-based paradigm.
To provide an overview of some of the potential differences and strengths of these approaches, the following sections will provide some examples of popular knowledge representation techniques.
Database Example
Probably the most familiar database structure is a record. A record is divided into fields that can be accessed through the first field, the key term:
dwelling....house..................................................................
dwelling....lodge...................................................................
house.........bungalow....cabin....dacha...cottage................
Each field is often of fixed length, and when one of these fields is used for retrieval it is called a secondary key. The most interesting part of database retrieval occurs when a secondary key on one record is a primary key on another. In the example above, the secondary key of "house" on the "dwelling" record is the primary key on its own record.
Instead of just retrieving on the basis of the primary key in a strictly linear order, the use of secondary keys creates a complex network of relationships, a relational database. When a primary key occurs only once as a secondary key, the implicit network structure is more simply ordered, and is called a tree.
Imagine this simple database about kinds of houses and dwellings, represented in a more logical way, with some of the relationships in the record explicitly laid out:
small(bungalow, house)
log(cabin, house)
country(dacha, house)
summer( cottage, house)
family(house,dwelling)
hunting(lodge,dwelling)
Each line can be read as : "A .... is a ...... ......." For instance, the first line can be read: " A bungalow is a small house."
You could think of these in a kind of table or chart:
house
small bungalow
log cabin
country dacha
summer cottage
dwelling
family house
hunting lodge
or you could read them in a narrative, such as:
a bungalow is a small house;
a cabin is a house made of logs;
a lodge is a house for hunting;
a dacha is a house in the country;
a cottage is a summer house;
and:
a house is a kind of dwelling.
a lodge is a kind of dwelling.
Not only is the database easier to understand now, but that implies that it is also easier to create and update by users with less specialized knowledge. In fact, it now becomes possible to create automatic systems to begin to help maintain the internal consistency of the database and update it with new information; or even to use automatic techniques for deduction and inference to effectively create new knowledge from the same database.
For instance, if one were to query whether a cottage is a kind of dwelling, a simple logic program could retrieve the fact that, since a cottage is a kind of house, and a house is a kind of dwelling, (modus ponens), a cottage is a kind of dwelling. This sort of reasoning, is called deductive. Although it appears that a new fact was created, namely "kindof(cottage, dwelling)." it is not really new since it was already there in the database according to the rules of the system. It can be said to be tautologically true, or true by definition. However, it is still very useful to be able to make these relationships explicit, where they were only implicit before. Both rule-based and logic-based systems offer this useful possibility.
Knowledge Representation
With these, more natural, ways of laying out databases, we have entered the worlds of Artificial Intelligence (AI) techniques and forms of knowledge representation. Underlying all of these forms is the notion of a linked list that developed very early in the formation of higher order languages such as LISP, as the underlying data structure in databases. In some sense, these ideas (AI and databases) have never really separated very far. In some sense, the phenomenon of hypertext is forcing them together again; as large scale knowledge bases become not only possible, but used by people with widely disparate abilities, goals, and experience with computers. Now that hypertext systems are making the incredible power of computers available to users with little computational experience, the question of how to represent the knowledge in these systems is essential for letting these users expand their access to this power. Only by finding representations that are naturally understood by a broad range of users will the growth of access continue to empower users and augment their natural intelligence.
Logic Example:
One of the most common and widespread implementations of these new relational databases makes use of the first order predicate logic embedded in the computing language Prolog.
Most logic examples begin with modus ponens. Starting with a proposition, such as : All men were mortal;
and given another proposition that says: Vannevar Bush was a man;
the logic of modus ponens gives us: Vannevar Bush was mortal!
This use of simple propositions in first order predicate calculus, can be very powerfully employed in databases. Prolog can take these representations and make them into working systems, so that given:
man(Vannevar Bush). and
mortal(man).
The prolog system can return the request:
mortal(who?).
into
Mortal (Vannevar Bush). just like modus ponens, and also just like any good database system. As a result, Prolog-based models of databases are growing with great vigor.
When applied to our little database example, prolog can return a simple query such as:
how?(who?, dwelling).
with
family(house,dwelling)
hunting(lodge,dwelling)
Semantic Network Example
Semantic networks are an alternative to logic based representation forms. Instead of letting the deductive logic of first order predicate calculus dominate the way topics are related, semantic networks, like hypertext, allow anything to be connected to anything else. As with hypertext, semantic networks are built out of nodes and links. Nodes can be single words, or complex functions. When they are complex functions, anything can happen, so simple logical inferencing cannot easily be carried out on the nodes. However, the system can be described by labelled, directed graph and so inferences can be carried out by searching the graph. Order can also be imposed on the graph structure by restricting the links rationally. A common technique uses only "isa" links between nodes. The graph can then be described as a taxonomy, and used to structure concepts.
One of the most theoretically significant techniques is to view these semantic networks as a method for describing problem spaces (Newell and Simon << >>) for any domain.
Here is a simple taxonomy for "thing":
The tree above has been extracted computationally from a much larger network of relationships built into WordNet by George Miller and his lexigang colleagues.
Frame Example:
The same information could have been made available in a slot and value syntax called a frame, below:
thing: non-living
material: man-made
structure: enclosed
building: for-people
dwelling: for-family
house: small
In this representation each node is now called a slot, and each link is called a value. These values may be simple words, or they could be complex functions
Connectionist Models
<<>>
The differences between logic forms, rule-based forms, semantic networks, and frames are manifold, and obviously of great interest to the many fine scholars at work on extending their power and implementing useful databases and knowledge bases in the various forms. For many researchers, some of the most interesting questions are raised by finding ways to create connectionist models within these frameworks. Needless to say, each representation format is having some success in finding adequate ways of representing connectionist kinds of models (cf. Brachman and McGuinness, 1988).
Brachman, R. J., and McGuinness, D. L. Knowledge Representation, Connectionism, and Conceptual Retrieval. Association for Computing Machinery, pp 161 - 174, 1988.
The importance of "knowing NOT": Do you know the name of the third emperor in the Ming Dynasty: If you don't you will know that you don't, virtually right away. How could that work?
Could you possibly have searched all your memory, or did you just look under Ming Dynasty and realize that there were no individuals located there?
Did you know it was Laoshe Tzu?
Instead of the very complex query languages found in most databases, retrieval is simplified ultimately to the act of buttoning a spot or word on the screen.
Instead of the complexities of database programming languages, very simple Do As I Did (DAID) direct manipulation or programming by doing concepts are used. Of course, these are also potential techniques for database systems, and as DBMS adopt these techniques hypertext and databases will become more intertwined.
Certainly the issues of text retrieval are central to many effective hypertext systems. Many hypermedia systems are belatedly adding additional retrieval capabilities that go beyond simple string searches. Thus conditional searches are being added to Hyperties; graphic searches to Hyperties and hypercard; HyperKRS adds boolean and wildcard searches to HyperCard; Notecards is adding full text search to its original capabilities of searching titles; etc.
The relationship between hypertext and standard information retrieval systems is being actively explored ( see also the retrieval effects of hypertext in Fischer, McCall, and Morch's Janus system). Handcrafting links among nodes adds both to the accuracy and specificity of the retrieval process, improving it greatly over standard techniques. For instance, in large databases the probability of finding something in the first place you look is around 20 % (Furnas, Landauer, Gomez, and Dumais, 1987). At the same time the proportion of relevant documents retrieved by the standard boolean search is usually well below 50 % (Salton and McGill, 1983). Given these abysmal figures, when even a 99% figure may be too low in very large databases, the specificity and power of pre digested hypertext may become extraordinarily important. If, as futurists are excited to tell us, the whole world's knowledge base may be at our fingertips over glass fiber lines within our lifetimes, information retrieval will be no small problem and hypertext may be an important part of the solution.
Retrieval in Practice:
John Guthrie's experiments
Ed Friedman's work at Stevens with Galileo:
a 10 megabyte file of text. Students use it to make reports and essays.
Teachers felt it really improved work.
Students loved it.
Some did "original" scientific research, and one disovered that Galileo's idea of heat embodied the modern one when he searched the text for references to Galileo and the "poet" ---
Database IDE
Dan Russell has moved the Lisp version of IDE into Foxbase©. Analyse and refine
this application of hypertext and databases combined for instructional authoring.