Ogden's Basic English as a Lexical Database in Natural Language Processing
by Scott R. Hawkins
Ogden's assumptions about what was necessary provide lexicon designers with surprising advantages, one of which is an unexpected order in the vocabulary. During the sorting process it became clear that many of the words in the vocabulary could be categorized according to which areas of experience they dealt with. Most of the vocabulary fell into one of four categories (physical, animal, human and economic), hereafter referred to as systems.
Design of the Lexicon
That property of the data could be rather significant in light of the
success of Terry Winograd's SHRDLU (Winograd, 1972). Though not
particularly advanced in terms of vocabulary and parsing ability, SHRDLU
was unusual in the depth of understanding of the system. In SHRDLU every
noun used was a label for a physical object of which the system had direct
knowledge, every verb was an operation it knew how to perform. The level
of dialog possible between man and machine was not particularly advanced,
but it was meaningful. The computer displayed understanding.
The question arises: can one incorporate a similar mapping between
real world objects and the essentially arbitrary character strings which form
the vocabulary of Basic English? The best, most obvious solution -- assign
labels to sensory memories similar to those of humans -- is precluded by the
lack of adequate sensory equipment. For purposes of this project, Winograd's
one word => one procedure approach is also precluded by the size of the
vocabulary.
Lacking that, it seems worthwhile to explore the possibility of
constructing some sort of mental model of the systems represented in the
vocabulary, a prospect which I will discuss further in the Directions for
Further Research Chapter of this paper.
Systems and their Components
The ultimate goal of any natural language processing system is to
enable the computer to 'understand' the meaning of natural language input
as text. The subproblems which arise -- lexicon organization, sentence
parsing, and the creation of a logical form -- are each monumental in their
own right, and it is possible to lose sight of this ultimate goal while pondering them.
This project concerns itself with the lexical organization component of
the broader problem. Any NLP system must have a lexicon from which to
draw. While it is possible to envision an NLP system which makes use of a
simple vocabulary list of allowable letter combinations, it seems more elegant
to make the lexicon as supportive of the parsing and logical form
generation phases as possible. In developing the systems, I tried to keep both
of those goals in mind, as well as the ultimate one of aiding understanding.
Each system (physical, animal, human and economic) is composed of
the three components listed below. Often it proved convenient to further
subdivide the components. For example, in the human system, operations
are grouped along the lines of 'physical' or 'cognitive.'
Broadly speaking, the elements of a given system are related by a
commonality of topic. Some terms in the systems can be viewed as almost
purely physical (run, table); others (subtract, idea) seem to require the support
of a cognitive framework in order to have meaning. In addition, many
physical terms also have analogous meanings when applied to non-physical
entities.
- Entities - labels for things which have independent
existence in either the physical or cognitive planes. (As
distinguished from operations, which require some entity to
perform them.) For example, 'structure,' 'animal,' and 'mind'
are all entities.
- Operations - labels for ways in which entities can relate
to themselves or other entities. For example, 'move' and 'attack' are operations.
- Attributes - properties such as size, shape, location,
character, etc. which are associated with entities.
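As a rough illustration of the three components (the class and field names below are my own, not taken from ALAN's source), a word entry can be modeled as a tagged record:

```python
from dataclasses import dataclass

@dataclass
class WordEntry:
    word: str
    component: str  # 'entity', 'operation', or 'attribute'
    system: str     # 'physical', 'animal', 'human', or 'economic'

# A tiny sample lexicon using examples from the text
lexicon = [
    WordEntry('tree', 'entity', 'physical'),
    WordEntry('move', 'operation', 'physical'),
    WordEntry('mind', 'entity', 'human'),
    WordEntry('price', 'attribute', 'economic'),
]

# Selecting only the entities, e.g. as candidate subjects during parsing
entities = [w.word for w in lexicon if w.component == 'entity']
print(entities)  # ['tree', 'mind']
```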
A description of the systems is given below (see also Figure 1):
- Physical - physical systems lie at the top of the hierarchy.
The idea is that the labels for entities and the operations those
entities may perform are general enough to apply to all the other
systems. In the physical system, a typical entity would be 'tree,' a
typical attribute 'color,' and a typical operation 'move.'
- Animal - direct references to elements of the animal
system don't often occur in conversation, but they are obviously
a superset of human systems. For example, the operation 'walk'
may be performed by both humans and dogs. A typical entity of
the animal system is 'horse,' a typical operation particular to
animals is 'eat,' and a typical attribute is 'health.'
- Human - the human system per se is rather small since
the only elements of language which are applied exclusively to
humans are those dealing with cognition. With few exceptions,
physical operations performed by humans are also performed by
animals. A typical entity of the human system is 'girl,' a typical
operation 'think,' and a typical attribute 'character.'
- Economic - though clearly an element of human
systems, some of the relationships (such as ownership) are
unique enough to justify the creation of a separate system. A
typical entity of the economic system is 'industry,' a typical
attribute 'price,' and a typical operation 'buy.'
In addition, there is a separate category of words whose only use is to
specify the relationships between other words in the sentence. These words
fall under the heading of 'Grammar.' This category of words is subdivided
along the lines of the type of relationship expressed.
Grammatical relationships are subdivided as follows:
Cognitive - this category includes such words as 'about,' 'for,' and
'of' which specify the subject of a grammatical construct.
Functional - this category includes such words as 'if' and 'but'
which define cause and effect relationships.
Logical - this category includes words such as 'and,' 'or,' and 'not'
which specify logical relationships.
Modifiers - this category includes words such as 'a,' 'the,' and
'all' whose purpose is to specify or modify the number of an entity or construct.
Temporal - this category includes words such as 'till,' 'while,' and 'again' which specify the time frame of a construct.
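The five grammatical subdivisions above can be sketched as a simple lookup (the groupings use only the example words given in the text; the full membership lists are not shown here):

```python
# Grouping of Basic English 'grammar' words by relationship type,
# using only the examples cited in the text
GRAMMAR = {
    'cognitive':  ['about', 'for', 'of'],
    'functional': ['if', 'but'],
    'logical':    ['and', 'or', 'not'],
    'modifiers':  ['a', 'the', 'all'],
    'temporal':   ['till', 'while', 'again'],
}

def grammar_category(word):
    """Return the subdivision a grammar word belongs to, or None."""
    for category, words in GRAMMAR.items():
        if word in words:
            return category
    return None

print(grammar_category('while'))  # temporal
print(grammar_category('tree'))   # None -- not a grammar word
```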
Another useful feature is the hierarchical organization of the systems
(See Figure 1). Physical systems are highest in the hierarchy; the operations
thus classified may be performed by members of both other systems. Next
comes the animal system, the entities of which may perform the operations
of both the physical system and their own. Finally, we have the human
system, whose entities have access to the operations of all three systems.
Figure 1 - System Hierarchy
To get an idea of how this works, consider the following examples.
The operation 'move' is in the physical system and may thus be performed by
both an artifact (which is an entity in the physical system) and a dog (an entity
in the animal system). Conversely, the operation 'sleep' is in the animal
system and may not be performed by an artifact, though it is perfectly
allowable for both dogs and humans.
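The inheritance rule behind these examples can be sketched in a few lines (the table below encodes the hierarchy of Figure 1; the function name is my own):

```python
# Systems whose operations an entity inherits, per the hierarchy:
# physical at the top, then animal, then human
INHERITS = {
    'physical': ['physical'],
    'animal':   ['physical', 'animal'],
    'human':    ['physical', 'animal', 'human'],
}

def may_perform(entity_system, operation_system):
    """An entity may perform an operation if the operation's system is at
    or above the entity's own system in the hierarchy."""
    return operation_system in INHERITS[entity_system]

print(may_perform('animal', 'physical'))  # True: a dog may 'move'
print(may_perform('physical', 'animal'))  # False: an artifact may not 'sleep'
```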
The economic system is a component of the human system. The
elements particular to this system can be viewed as purely cognitive, existing
only in the minds of people. On the other hand, the economic operations
'buy,' 'sell,' etc. have access to entities of the higher systems: 'tools' (from the
physical system), 'dogs' (from the animal system), and so on.
Analysis of the Model
With that said, we are now in a position to discuss how the nature of
the vocabulary aids in sentence parsing, knowledge representation and
understanding. The vocabulary has both syntactic and semantic features. For
example, the entities category contains only nouns, and thus might be used in
the parsing phase to identify the subject or object of some operation. Hence
the claim of syntactic support is partially satisfied.
However, I chose not to group all nouns under the entity heading. The
reasoning was that many words which are traditionally considered nouns
('red,' 'character,' etc.) gain meaning primarily through association with other
nouns. Such words fall into the attribute category of my vocabulary, which
also includes traditional adjectives.
On the other hand, the ancestor/descendant relationships of the
network can be used in question answering. For example, the question 'Did X
move?' can be answered by looking up 'move' in the vocabulary and
searching its descendants for an operation performed by the entity X.
Therefore, the vocabulary also has a semantic content.
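The descendant search described above can be sketched as a recursive lookup (the tree fragment and the performed-operation record are hypothetical stand-ins for ALAN's actual structures):

```python
# A fragment of the semantic hierarchy: children of 'move' (assumed)
TREE = {
    'move': ['walk', 'run'],
    'walk': [],
    'run':  [],
}
# Hypothetical record of operations observed in some situation
PERFORMED = {('dog', 'run')}

def did(entity, operation):
    """Answer 'Did X <operation>?' by checking the operation itself,
    then recursively searching its descendants in the hierarchy."""
    if (entity, operation) in PERFORMED:
        return True
    return any(did(entity, child) for child in TREE.get(operation, []))

print(did('dog', 'move'))  # True: 'run' is a descendant of 'move'
print(did('dog', 'walk'))  # False: no record of the dog walking
```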
The maximum depth of ancestry was five, the most common depth
For a full listing of the system, see Appendix A.
The implementation of my hierarchy, or 'ALAN' as I came to call it,5 is
largely a program for the creation, storage, and retrieval of the data structure.
What natural language processing features there are are quite superficial and
do not begin to exploit the potential of the data structure. In the following
chapter I discuss some planned modifications which will enable the program
to do a bit more of what it was designed for.
. . .
5 for Alan Turing
. . .
. . .
. . .
. . . Verbs are stored only in present tense -- I plan to add
complete verb conjugations at a later date.
6 Currently, MAXWORD=15 and MAXCHILD=30, but that is easily changed by
modifying a single line in the header file.
Figure 2 -- A node
The type field of a node is a four-character7 string containing letters to symbolize the values of the four fields below:
7 Though there are MAXWORD (=15) characters available, only the first four are used.
. . .
. . .
. . .
Component, Instance, Synonym
To get a feel for how this strategy worked in practice, consider the section of the tree shown in Figure 3 (below).
The type fields are shown below the words with which they are associated.
The type field for beast, 'eaas' indicates that the word 'beast' is an
entity, is part of the animal system, is animate, and is a synonym for its
parent. The type field for toe, 'epac' indicates that toe is an entity which is part
of the physical system, is animate, and is a component of its parent. Note that
the 'animal' node is described as being an entity in the physical rather than the animal system.
Figure 3 - Type Field in the Network
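The letter-to-value encoding can be made concrete with a small decoder. The mappings below are inferred from the two worked examples in the text ('eaas' and 'epac') and the component/instance/synonym distinction; the letters not attested in the text (e.g. 'o' for operation, 'i' for inanimate) are my guesses:

```python
# Decoding the four-character type field; mappings partly inferred
COMPONENT = {'e': 'entity', 'o': 'operation', 'a': 'attribute'}
SYSTEM    = {'p': 'physical', 'a': 'animal', 'h': 'human', 'e': 'economic'}
ANIMACY   = {'a': 'animate', 'i': 'inanimate'}
LINK      = {'s': 'synonym', 'c': 'component', 'i': 'instance'}

def decode(type_field):
    """Expand a four-letter type field into its four semantic values."""
    c, s, an, ln = type_field
    return (COMPONENT[c], SYSTEM[s], ANIMACY[an], LINK[ln])

print(decode('eaas'))  # ('entity', 'animal', 'animate', 'synonym')
print(decode('epac'))  # ('entity', 'physical', 'animate', 'component')
```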
When the data structure has been successfully constructed,
ALAN prints out a message indicating the number of words which have been loaded
and waits for the user to hit return. At that point, the program prints the main menu on the screen. Five options are offered (see Fig 4): insert a
new word, listing menu, question and answer sequence, modify type, and
quit. These options are explained in greater detail below.
Vocabulary and Type Insertion
Inserting a new word is a complex but standard process. It involves ...
. . .
. . .
[ image page 37 ]
In the event that a new word is a semantic ancestor of one or more
words already in the data structure, procedures exist to do the necessary
reshuffling of pointers. However, this feature is a holdover from an earlier
version of the program and is not entirely compatible with the current
implementation. If the ancestor option is invoked, the user must manually
enter the type through the 'modify type' option on the main menu. Failure
to do so will result in inconsistencies in the vocabulary file and will cause the
program to crash the next time the vocabulary is loaded. If a crash occurs
when the vocabulary is loading, a user can copy the backup file 'vocab.bak' to
the vocabulary file 'vocab.a.' A backup copy is automatically created each time
the program successfully loads a vocabulary file.
The listing menu prints out the words and their types, either alone or with the word's children on the same line.
The children of a particular word may also be printed.
Assign Semantic Components
The option allows the user to associate some semantic formula with a word. The semantic formula concept is discussed in detail in the Directions for Further Research section of this paper. The only thing that needs to be
said about it here is that it requires a separate file for storage space and is related to the 'Semantic Features added to the vocabulary' message printed when the system is invoked.
Figure 4 -- ALAN's Flowchart
The quit option writes the vocabulary, together with any changes, to the file 'vocab.a.'
Question and Answer Interface
The question and answer interface is discussed in detail in the next chapter.
The system implementation, ALAN, is a portable, expandable program for the generation and maintenance of a lexical data structure. The current implementation contains all the words of Basic English stored in the format described in Chapter IV.
The question and answer sequence has two options: tell the user about ...
Testing the System Using the Question and Answer Interface
This testing displays some of the capabilities of the data structure. The
concept of a semantic hierarchy with instance and component arcs is not
new. However, this project will contribute to NLP research in that the
vocabulary of Basic English has never before been thoroughly encoded. In
addition, the inclusion of the systemic hierarchy enables the computer to
eliminate from consideration impossible entity-operation combinations such
as 'the rock chews' or 'the dog thinks.'
The current implementation was not designed to be much more than a
lexicon. What abilities it does have in the areas of parsing and the generation
of a logical form are quite superficial. Nonetheless, the system displays some
notable strengths in the area of question answering.
The ability to answer general questions about specific situations using
information about a word's semantic lineage is the main strength of the
program. Perhaps this feature alone could validate the expenditure of time
and resources required to construct and maintain an associative network
Directions for Further Research
Part I - Planned Enhancement Based on Prior Work
Though the system does possess some interesting capabilities in its
current form, it was designed to be a single component of a much larger
system. ALAN's capabilities for both sentence parsing and situation
representation would improve with the incorporation of certain elements of
prior art. This section is a brief outline of the NLP features and techniques
which would be particularly useful, together with a brief description of the
modifications necessary to incorporate them into the current system.
Obviously, there needs to be a provision for verb (operation) tenses.
Currently, only the present tense is stored. It would be tedious but not
particularly difficult to enable the program to store all the tenses of an
operation within the node. All that would be required are some minor
modifications to the node definitions and search procedures, another storage
file -- "tenses.a" -- defining the tenses and associating them with a node in
the tree. Furthermore an addition to the startup/shutdown package would
be required to maintain the file "tenses.a."
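The proposed tense extension can be sketched as a lookup keyed by the present-tense node. The record layout and field names here are assumptions about the hypothetical "tenses.a" file, not ALAN's actual design:

```python
# Hypothetical in-memory form of "tenses.a": each present-tense node
# is associated with its conjugated forms
TENSES = {
    'move': {'past': 'moved', 'present': 'move', 'participle': 'moving'},
    'eat':  {'past': 'ate',   'present': 'eat',  'participle': 'eating'},
}

def base_form(word):
    """Map any stored form back to the present-tense node key,
    so input like 'ate' can be looked up in the hierarchy as 'eat'."""
    for base, forms in TENSES.items():
        if word in forms.values():
            return base
    return None

print(base_form('ate'))     # eat
print(base_form('moving'))  # move
```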
The initial benefits of such an addition are largely cosmetic. The
program's question answering interface would be the primary beneficiary.
However, if and when the hypothetical system of semantic formulae
described in the Directions for Further Research section of this paper is
implemented, a thorough catalog of verb tenses would become essential to
locating the situation in time.
Obviously, the simple 1-entity/M-attribute/N-operation model of
situations which I used is not an adequate tool for the representation of all
natural language sentences. In future versions I hope to add an expanded set
of situation structures based on the case grammar models of Charles Fillmore
(Fillmore, 1968) and others (Schank and Riesbeck, 1981). In fact, the current
situation model is nothing more than a scaled-down implementation of case grammar.
Case Grammar is based on the observation that though entities (noun
phrases) may occur in many forms, their semantic relationship to other
elements of the sentence is largely determined by that sentence's verb.
Furthermore, the number of possible relationships is finite and actually quite
small. For example, the phrase 'the dog' may serve as an AGENT in one
instance: 'the dog ate the food,' BENEFICIARY in another: 'the man petted the
dog,' and EXPERIENCER in a third: 'the dog was hungry.'
I incorporated in my hierarchy many of the same observations about
the nature of situations as are found in Case Grammar. For example, case
grammar also distinguishes between physical and cognitive operations, uses
similar techniques for determining the locations of events, and so on.
Preliminary reading seems to indicate that case grammar parsing
techniques will be quite compatible with my data structure. I plan to parse
sentences in a two-pass fashion: the first pass will distinguish entities and
their associated attributes from operations. The second pass will actually
construct the situation structure, perhaps using information from the first
pass to eliminate incompatible case frames and identify necessary ones. Two examples:
1) The EXPERIENCER9 case frame would be eliminated for use
with an entity whose node type was not either animal or human.
Though this is nothing new, it is an example of the usefulness of associative
networks in NLP.
2) Attributes such as 'here,' 'there,' 'against,' etc. whose ancestor
in the hierarchy dealt with 'location' could be identified as filling
a location related case frame such as 'AT-LOC' or 'TO-LOC'.
9 EXPERIENCER requires a living entity.
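Rule (1) above can be sketched as a filter over candidate case frames (the entity table and the frame inventory are illustrative, not taken from the implementation):

```python
# Example system membership for a few entities (assumed)
ENTITY_SYSTEM = {'rock': 'physical', 'dog': 'animal', 'man': 'human'}
ALL_FRAMES = ['AGENT', 'BENEFICIARY', 'EXPERIENCER']

def candidate_frames(entity):
    """Return the case frames an entity may fill. EXPERIENCER requires a
    living entity, i.e. one whose node type is animal or human."""
    frames = list(ALL_FRAMES)
    if ENTITY_SYSTEM.get(entity) not in ('animal', 'human'):
        frames.remove('EXPERIENCER')
    return frames

print(candidate_frames('rock'))  # ['AGENT', 'BENEFICIARY']
print(candidate_frames('dog'))   # all three frames remain
```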
The last improvement which I envision is the most interesting and the
most difficult. The organization of the vocabulary yields certain semantic
abilities, but the system as a whole still falls short of any real understanding of
the situations with which it deals or even the vocabulary itself.
Due to the current I/O hardware limitations, I do not believe it will be
possible to equip computers with adequate sensory equipment in the
foreseeable future. Consequently, any system designed to handle natural
language symbols must do so using an internal representation of concepts
which is not just physically different10 from that of animals, but different in
essence as well.
Is such a system theoretically possible? Yes. You can create a
mapping/translation from any arbitrary alphabet of symbols to any other,
provided both are finite. The chemical alphabet used to encode memory is
complex and poorly understood, but it is finite (Prochaintz, 1989) . Not that
we need to duplicate every detail of the human system of sensory input and
interpretation--for our purposes a rough approximation would be adequate.
10 Electrical rather than chemical signals.
Is such a system workable? That depends on what you want it to do. It
will not, for example, be able to store detailed Cartesian maps of the shapes of
objects; such a map of a single entity would eat up man-hours and resources
at an extravagant rate. On the other hand, there might be a way to store
enough bare bones information to enable a meaningful situation
representation to be constructed. Let us explore the minimum parameters of
our hypothetical system of sensory shorthand.
Entities and Attributes - For each entity in a situation there must
be associated attributes adequate to serve as input to any
operation which the entity may perform or which may be
performed upon the entity.
Operations - For each operation O to transform a situation S into
a situation S', there must be some function which determines
the new state of every entity/attribute association which is
affected by the operation.
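The operation requirement can be sketched as a pure function from situation S to situation S'. The dictionary representation and the 'see' helper below are my own illustrative choices, not ALAN's structures:

```python
def op_see(situation, viewer, target):
    """An operation as a function S -> S': 'see' changes only the viewer's
    mental state, recording that it now knows about the target."""
    # Copy S so the original situation is left untouched
    s_new = {e: dict(attrs) for e, attrs in situation.items()}
    s_new[viewer].setdefault('knows', []).append(target)
    return s_new

S = {'man': {}, 'roach': {'color': 'red'}}
S2 = op_see(S, 'man', 'roach')
print(S2['man']['knows'])   # ['roach'] -- the man gained information
print('knows' in S['man'])  # False -- S itself is unchanged
```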
Consider the following situation11:
There is a room.
There is a man in the room.
There is a dog outside the room.
There is a red roach in the room.
11 Though my program will not yet handle a stream of input this complex,
the necessary modifications are routine enough to pretend that they have
already been implemented.
Now, let's move on to the interaction phase and ask the following question:
What does the man see?
In order for the machine to answer the question, it is necessary for
there to be a variety of information associated with the operation 'see.' First,
'see' implies the visual recognition of entities in the field of vision.
Therefore, the field of vision must be clearly defined. Second, the consequent
change of state affects only one entity -- the man -- and is entirely mental -- he
has a piece of information that was not previously in his possession. He does
not 'see' the dog which was outside the room, because the operation 'see' does
not work through opaque entities, and 'room' generally implies opaque walls.
(No windows were mentioned.)
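A hedged sketch of the 'see' formula in this example, reducing the field of vision to a shared enclosure (since 'room' implies opaque walls); the location attribute and entity table are assumptions of the sketch:

```python
# The situation from the example, as entity/attribute associations
ENTITIES = {
    'man':   {'location': 'room'},
    'roach': {'location': 'room', 'color': 'red'},
    'dog':   {'location': 'outside'},
}

def sees(viewer):
    """Return the entities visible to the viewer: those sharing its
    enclosure, since opaque walls block the field of vision."""
    here = ENTITIES[viewer]['location']
    return [e for e, attrs in ENTITIES.items()
            if e != viewer and attrs['location'] == here]

print(sees('man'))  # ['roach'] -- the dog is behind opaque walls
```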
The previous example together with the parameter listing above
should illustrate the type of semantic formulae envisioned. Though the
actual implementation would be rather involved, I feel that it is not
unworkable. If it were implemented, it could perhaps be argued that the
system running it had gained a certain real understanding of natural language.
Allen, James. (1987) Natural Language Understanding. Menlo Park,
California: The Benjamin/Cummings Publishing Company, Inc.
Alshawi, Hiyan. (1987) Memory and Context for Language Interpretation.
Great Britain: The University Press, Cambridge.
Brachman, Ronald. (1979) "On the Epistemological Status of Semantic
Networks." In Associative Networks: Representation and
Use of Knowledge by Computers. Ed. Nicholas V. Findler. New
York: Academic Press.
Berwick, Robert C. (1985) The Acquisition of Syntactic Knowledge.
Cambridge, MA: The MIT Press.
Collins, A.M. and Quillian, M. R. (1969) "Retrieval time from semantic
memory." Journal of Verbal Learning and Verbal Behavior 8.
Davis, Ernest. (1990) Representations of Commonsense Knowledge.
San Mateo, CA: Morgan Kaufmann Publishers, Inc.
Fillmore, Charles. (1968) "The case for case." In Universals in Linguistic
Theory. Eds. E. Bach and R. Harms. New York: Holt, Rinehart and Winston.
Fillmore, Charles. (1977) "The case for case reopened," in Syntax and
Semantics 8: Grammatical Relations. Eds. P. Cole and J. Sadock.
New York: Academic Press, 1977.
Hausser, Roland. (1987) Computation of Language. New York:
Hendrix, Gary. (1979) "Semantic Knowledge." In Understanding Spoken
Language. Ed. Donald E. Walker. New York: North-Holland.
Levesque, Hector, and John Mylopoulos. (1979) "A Procedural Semantics for
Semantic Networks." In Associative Networks: Representation and
Use of Knowledge by Computers. Ed. Nicholas V. Findler. New York: Academic Press.
Minsky, Marvin ed. (1968) Semantic Information Processing. Cambridge,
MA: The MIT Press.
Nagao, Makato. (1988) Knowledge and Inference. Boston: Academic Press.
Ogden, C. K. (1934) The System of Basic English. New York: Harcourt, Brace.
Prochiantz, Alain. (1989) How the Brain Evolved. New York: McGraw-Hill.
Quillian, M. Ross. (1968) "Semantic Memory," in Semantic Information
Processing Ed. M. Minsky. Cambridge, MA: The MIT Press.
Rich, Elaine. (1983) Artificial Intelligence. New York: McGraw-Hill.
Schank, R. C. and C. K. Riesbeck. (1981) Inside Computer Understanding.
Hillsdale, NJ: Lawrence Erlbaum.
Schubert, L. K., R. G. Goebel and N. I. Cercone. (1979) "The Structure and
Organization of a Semantic Net for Comprehension and Inference." in
Associative Networks: Representation and Use of Knowledge by
Computers. Ed. Nicholas V. Findler. New York: Academic Press.
Simmons, R. F. (1973) "Semantic Networks: Their computation and use for
understanding English sentences." In Computer Models of Thought
and Language. Eds. Schank, R. C. and K.M. Colby. San Francisco, CA: W. H. Freeman.
Shainberg, Lawrence. (1979) Brain Surgeon. Philadelphia: J. B. Lippincott Co.
Sowa, John F. (1992) Conceptual Structures: Current Research and Practice.
New York: Ellis Horwood.
Waltz, David. (1989) Semantic Structures: Advances in Natural Language
Understanding. Hillsdale, NJ: Lawrence Erlbaum.
Winograd, T. (1972) Understanding Natural Language. New York: Academic Press.
The following is the listing of my categorization of C. K. Ogden's System of Basic English.
To the best of my knowledge, all 850 words in the original document have been represented at least once. A few of them appear more than once. For example, 'change' appears as both a noun and a verb. In addition, I took the liberty of adding approximately 100 words to the vocabulary. The words which have been added are marked with an asterisk (*).
There were two primary reasons for those additions:
1. The word added served as a 'parent' node in the network. For example, I added the word 'texture' to serve as a semantic parent for the set of words 'sticky,' 'fuzzy,' 'rough,' etc.
2. The word served to flesh out a set of which there was a
parent but few children. For example, 'filthy' and 'sterile' were
added to the category attributes / evaluation / cleanliness, which previously contained only the words 'clean' and 'dirty.'
I followed certain notational conventions in listing the vocabulary. Words were indented one and one-quarter inches further than their parents. For example, the network configuration shown below (Figure 5)
would be represented in the listing as:
Figure 5: Network Configuration
rain snow mist
Since nodes which were components of other nodes are the exception rather than the rule, the reader
may assume that any hierarchical relationship is an instance unless told otherwise.
There may be accidental differences between this listing and the actual implementation.
The root of the entire physical system is environment. All the nodes below are components of the environment, either directly or by inheritance.
The next 26 pages of systems require HTML tables, which are tedious; therefore, links to the scanned images are given instead.
See also Economic System: Entities/Country.
[ image page 60 ]
[ image page 61 ]
[ image page 62 ]
[ image page 63 ]
[ image page 64 ]
[ image page 65 ]
[ image page 66 ]
[ image page 67 ]
[ image page 68 ]
See Physical Systems : Entities / Animal.
[ image page 69 ]
[ There may be a page or paragraph missing here for component "move". ]
[ image page 70 ]
See also Physical Systems : Entities / Animal / human [sic]
[ image page 71 ]
[ image page 72 ]
[ image page 73 ]