Project 712

Ogden's Basic English as a Lexical Database in Natural Language Processing
by Scott R. Hawkins


Chapter IV
Design of the Lexicon

    Ogden's assumptions about what was necessary provide lexicon designers with surprising advantages, one of which is an unexpected order in the vocabulary. During the sorting process it became clear that many of the words in the vocabulary could be categorized according to the areas of experience with which they dealt. Most of the vocabulary fell into one of four categories (physical, animal, human and economic), hereafter referred to as systems.

    That property of the data could be rather significant in light of the success of Terry Winograd's SHRDLU (Winograd, 1972). Though not particularly advanced in terms of vocabulary and parsing ability, SHRDLU was unusual in its depth of understanding. In SHRDLU, every noun used was a label for a physical object of which the system had direct knowledge, and every verb was an operation it knew how to perform. The level of dialog possible between man and machine was not particularly advanced, but it was meaningful. The computer displayed understanding.

    The question arises: can one incorporate a similar mapping between real world objects and the essentially arbitrary character strings which form
the vocabulary of Basic English? The best, most obvious solution -- assigning labels to sensory memories similar to those of humans -- is precluded by the lack of adequate sensory equipment. For purposes of this project, Winograd's one word => one procedure approach is also precluded by the size of the vocabulary.

    Lacking that, it seems worthwhile to explore the possibility of constructing some sort of mental model of the systems represented in the vocabulary, a prospect which I will discuss further in the Directions for Further Research Chapter of this paper.

Systems and their Components


    The ultimate goal of any natural language processing system is to enable the computer to 'understand' the meaning of natural language input as text. The subproblems which arise -- lexicon organization, sentence parsing, and the creation of a logical form -- are each monumental in their own right, and it is possible to lose sight of this ultimate goal while pondering minutiae.
    This project concerns itself with the lexical organization component of the broader problem. Any NLP system must have a lexicon from which to draw. While it is possible to envision an NLP system which makes use of a simple vocabulary list of allowable letter combinations, it seems more elegant
to make the lexicon as supportive of the parsing and logical form generation phases as possible. In developing the systems, I tried to keep both of those goals in mind, as well as the ultimate one of aiding understanding.
    Each system (physical, animal, human and economic) is composed of three components: entities, attributes, and operations. Often it proved convenient to subdivide the components further. For example, in the human system, operations are grouped along the lines of 'physical' or 'cognitive.'

    Broadly speaking, the elements of a given system are related by a commonality of topic. Some terms in the systems can be viewed as almost purely physical (run, table); others (subtract, idea) seem to require the support of a cognitive framework in order to have meaning. In addition, many physical terms also have analogous meanings when applied to non-physical systems.

    A description of the systems is given below (see also Figure 1):

    In addition, there is a separate category of words whose only use is to specify the relationships between other words in the sentence. These words fall under the heading of 'Grammar.' This category of words is subdivided along the lines of the type of relationship expressed.

    Grammatical relationships are subdivided as follows:

    Cognitive - this category includes such words as 'about,' 'for,' and 'of' which specify the subject of a grammatical construct.

    Functional - this category includes such words as 'if' and 'but' which define cause and effect relationships.

    Logical - this category includes words such as 'and,' 'or,' and 'not' which specify logical relationships.

    Modifiers - this category includes words such as 'a,' 'the,' and 'all' whose purpose is to specify or modify the number of an entity or construct.

    Temporal - this category includes words such as 'till,' 'while,' and 'again' which specify the time frame of a construct.

    Another useful feature is the hierarchical organization of the systems (see Figure 1). Physical systems are highest in the hierarchy; the operations thus classified may be performed by members of both other systems. Next comes the animal system, the entities of which may perform the operations of both the physical system and their own. Finally, we have the human system, whose entities have access to the operations of all three systems.

[image Fig 1]
Figure 1 - System Hierarchy

    To get an idea of how this works, consider the following examples. The operation 'move' is in the physical system and may thus be performed by both an artifact (which is an entity in the physical system) and a dog (an entity in the animal system). Conversely, the operation 'sleep' is in the animal system and may not be performed by an artifact, though it is perfectly allowable for both dogs and humans.
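The permission rule in the examples above can be sketched as a simple ranking check. This is a sketch under assumed names (ALAN itself derives this information from its node type fields, not from a Python table):

```python
# Rank of each system: physical operations are available to everything,
# animal operations to animals and humans, human operations to humans only.
SYSTEM_RANK = {"physical": 0, "animal": 1, "human": 2}

def may_perform(entity_system: str, operation_system: str) -> bool:
    """An entity may perform operations of its own system and of any
    system higher in the hierarchy (lower or equal rank)."""
    return SYSTEM_RANK[operation_system] <= SYSTEM_RANK[entity_system]

# 'move' is a physical operation: both artifacts and dogs may perform it.
assert may_perform("physical", "physical")    # an artifact may move
assert may_perform("animal", "physical")      # a dog may move
# 'sleep' is an animal operation: dogs and humans may, artifacts may not.
assert not may_perform("physical", "animal")  # an artifact may not sleep
assert may_perform("human", "animal")         # a human may sleep
```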

    The economic system is a component of the human system. The elements particular to this system can be viewed as purely cognitive, existing only in the minds of people. On the other hand, the economic operations 'buy,' 'sell,' etc. have access to entities of the higher systems: 'tools' (from the physical system), 'dogs' (from the animal system), and so on.

Analysis of The Model

    With that said, we are now in a position to discuss how the nature of the vocabulary aids in sentence parsing, knowledge representation and understanding. The vocabulary has both syntactic and semantic features. For example, the entities category contains only nouns, and thus might be used in the parsing phase to identify the subject or object of some operation. Hence the claim of syntactic support is partially satisfied.

    However, I chose not to group all nouns under the entity heading. The reasoning was that many words which are traditionally considered nouns ('red,' 'character,' etc.) gain meaning primarily through association with other nouns. Such words fall into the attribute category of my vocabulary, which also includes traditional adjectives.

    On the other hand, the ancestor/descendant relationships of the network can be used in question answering. For example, the question 'Did X move?' can be answered by looking up 'move' in the vocabulary and searching its descendants for an operation performed by the entity X. Therefore, the vocabulary also has a semantic content.
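The descendant search just described can be sketched as follows. The data layout here is an assumed toy substitute; the real program walks its pointer-linked network rather than Python dictionaries:

```python
# Instance arcs: each operation maps to its more specific descendants.
CHILDREN = {
    "move": ["run", "jump"],
    "run": [],
    "jump": [],
}
# Facts known to the system (assumed for illustration).
PERFORMED = {("dog", "run")}

def did(entity, operation):
    """Answer 'Did X <operation>?' by checking the operation itself and
    then recursively searching its descendants for a recorded fact."""
    if (entity, operation) in PERFORMED:
        return True
    return any(did(entity, child) for child in CHILDREN.get(operation, []))

assert did("dog", "move")      # 'run' is a descendant of 'move'
assert not did("dog", "jump")  # no such fact was recorded
```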

    The maximum depth of ancestry was five; the most common depth was three.

    For a full listing of the system, see Appendix A.

Chapter V
Implementation

    The implementation of my hierarchy, or 'ALAN' as I came to call it, is largely a program for the creation, storage, and retrieval of the data structure. The natural language processing features it does have are quite superficial and do not begin to exploit the potential of the data structure. In the following chapter I discuss some planned modifications which will enable the program to do a bit more of what it was designed for.

    . . .



    . . .
    . . .
    . . .
    . . . Verbs are stored only in present tense -- I plan to add complete verb conjugations at a later date.


[image Fig 2]
Figure 2 -- A node


    The type field of a node is a four-character string containing letters to symbolize the values of the four fields below:

    . . .

    . . .

    . . .


Component, Instance, Synonym

    To get a feel for how this strategy worked in practice, consider the section of the tree shown in Figure 3 (below).

    The type fields are shown below the words with which they are associated. The type field for 'beast,' 'eaas,' indicates that the word 'beast' is an entity, is part of the animal system, is animate, and is a synonym for its parent. The type field for 'toe,' 'epac,' indicates that 'toe' is an entity which is part of the physical system, is animate, and is a component of its parent. Note that the 'animal' node is described as being an entity in the physical rather than the animal system.

[image Fig 3]
Figure 3 - Type Field in the Network
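The four-letter encoding illustrated in Figure 3 can be decoded mechanically. The sketch below infers the letter meanings from the two examples in the text ('eaas' and 'epac'); the full letter inventory used by ALAN may be larger:

```python
# Position of each field and the letter values attested in the text.
FIELDS = (
    ("category", {"e": "entity"}),
    ("system",   {"a": "animal", "p": "physical"}),
    ("animacy",  {"a": "animate"}),
    ("relation", {"s": "synonym", "c": "component"}),
)

def decode_type(type_field):
    """Expand a four-character type field into named values; letters not
    attested in the text are passed through unchanged."""
    return {name: table.get(ch, ch)
            for (name, table), ch in zip(FIELDS, type_field)}

assert decode_type("eaas") == {"category": "entity", "system": "animal",
                               "animacy": "animate", "relation": "synonym"}
assert decode_type("epac")["relation"] == "component"
```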


    When the data structure has been successfully constructed, ALAN prints out a message indicating the number of words which have been loaded and waits for the user to hit return. At that point, the program prints the main menu on the screen. Five options are offered (see Fig 4): insert a new word, listing menu, question and answer sequence, modify type, and quit. These options are explained in greater detail below.

Vocabulary and Type Insertion

    Inserting a new word is a complex but standard process. It involves ...

    . . .

    . . .

[ image page 37 ]


    In the event that a new word is a semantic ancestor of one or more words already in the data structure, procedures exist to do the necessary reshuffling of pointers. However, this feature is a holdover from an earlier version of the program and is not entirely compatible with the current implementation. If the ancestor option is invoked, the user must manually enter the type through the 'modify type' option on the main menu. Failure to do so will result in inconsistencies in the vocabulary file and will cause the program to crash the next time the vocabulary is loaded. If a crash occurs when the vocabulary is loading, a user can copy the backup file 'vocab.bak' to the vocabulary file 'vocab.a.' A backup copy is automatically created each time the program successfully loads a vocabulary file.

Listing


    The listing menu prints out the words and their types, either alone or with the word's children on the same line. The children of a particular word may also be printed.

Assign Semantic Components


    This option allows the user to associate a semantic formula with a word. The semantic formula concept is discussed in detail in the Directions for Further Research chapter of this paper. The only thing that needs to be said about it here is that it requires a separate file for storage and is related to the 'Semantic Features added to the vocabulary' message printed when the system is invoked.
   

[image Fig 4]
Figure 4 -- ALAN's Flowchart


Quit

   The quit option writes the vocabulary, together with any changes, to the file 'vocab.a.'

Question and Answer Interface

    The question and answer interface is discussed in detail in the next chapter.

Conclusion

    The system implementation, ALAN, is a portable, expandable program for the generation and maintenance of a lexical data structure. The current implementation contains all the words of Basic English stored in the format described in Chapter IV.
   

Chapter VI
Testing the System Using the Question and Answer Interface

    The question and answer sequence has two options: Tell the user about ...
   

    . . .

Conclusions

    This testing displays some of the capabilities of the data structure. The concept of a semantic hierarchy with instance and component arcs is not new. However, this project contributes to NLP research in that the vocabulary of Basic English has never before been thoroughly encoded. In addition, the inclusion of the systemic hierarchy enables the computer to eliminate from consideration impossible entity-operation combinations such as 'the rock chews' or 'the dog thinks.'

    The current implementation was not designed to be much more than a lexicon. What abilities it does have in the areas of parsing and the generation of a logical form are quite superficial. Nonetheless, the system displays some notable strengths in the area of question answering.

    The ability to answer general questions about specific situations using information about a word's semantic lineage is the main strength of the program. Perhaps this feature alone could validate the expenditure of time and resources required to construct and maintain an associative network lexicon.

Chapter VII
Directions for Further Research

Part I - Planned Enhancement Based on Prior Work

    Though the system does possess some interesting capabilities in its current form, it was designed to be a single component of a much larger system. ALAN's capabilities for both sentence parsing and situation representation would improve with the incorporation of certain elements of prior art. This section is a brief outline of the NLP features and techniques which would be particularly useful, together with a brief description of the modifications necessary to incorporate them into the current system.

Verb Tenses

    Obviously, there needs to be a provision for verb (operation) tenses. Currently, only the present tense is stored. It would be tedious but not particularly difficult to enable the program to store all the tenses of an operation within the node. All that would be required are some minor modifications to the node definitions and search procedures, plus another storage file -- 'tenses.a' -- defining the tenses and associating them with a node in the tree. Furthermore, an addition to the startup/shutdown package would be required to maintain the file 'tenses.a.'
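A possible shape for the planned tense catalog is sketched below. The file layout is an assumption (the text specifies only the file's name and purpose); here each line holds an operation's present-tense key, which identifies its node in the tree, followed by the other forms:

```python
# Assumed contents of 'tenses.a': present, past, past participle, gerund.
TENSES_A = """\
move moved moved moving
eat ate eaten eating
"""

def load_tenses(text):
    """Map every tense form back to the present-tense node key, so a
    search can resolve any form to a node in the tree."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        present, *rest = line.split()
        for form in [present, *rest]:
            table[form] = present
    return table

tenses = load_tenses(TENSES_A)
assert tenses["ate"] == "eat"       # 'ate' resolves to the node for 'eat'
assert tenses["moving"] == "move"
```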

    The initial benefits of such an addition are largely cosmetic. The program's question answering interface would be the primary beneficiary. However, if and when the hypothetical system of semantic formulae described in the Directions for Further Research section of this paper is implemented, a thorough catalog of verb tenses would become essential to locating the situation in time.

Case Grammar

    Obviously, the simple 1-entry/M-attribute/N-operation model of situations which I used is not an adequate tool for the representation of all natural language sentences. In future versions I hope to add an expanded set of situation structures based on the case grammar models of Charles Fillmore (Fillmore, 1968) and others (Schank and Riesbeck, 1981). In fact, the current situation model is nothing more than a scaled-down implementation of those ideas.
    Case Grammar is based on the observation that though entities (noun phrases) may occur in many forms, their semantic relationship to other elements of the sentence is largely determined by that sentence's verb. Furthermore, the number of possible relationships is finite and actually quite low.

    For example, the phrase 'the dog' may serve as an AGENT in one instance: 'the dog ate the food,' BENEFICIARY in another: 'the man petted the dog,' and EXPERIENCER in a third: 'the dog was hungry.'

    I incorporated in my hierarchy many of the same observations about the nature of situations as are found in Case Grammar. For example, case grammar also distinguishes between physical and cognitive operations, uses similar techniques for determining the locations of events, and so on.

    Preliminary reading seems to indicate that case grammar parsing techniques will be quite compatible with my data structure. I plan to parse sentences in a two-pass fashion: the first pass will distinguish entities and their associated attributes from operations. The second pass will actually construct the situation structure, perhaps using information from the first pass to eliminate incompatible case frames and identify necessary ones. Two examples:

    . . .

    Though this is nothing new, it is an example of the usefulness of associative networks in NLP.
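The two-pass strategy can be sketched as follows. The word lists are assumed stand-ins for lookups in the ALAN lexicon, and the role assignment in the second pass is deliberately crude (first entity is AGENT, the entity after the operation is BENEFICIARY, as in 'the man petted the dog'):

```python
# Assumed toy word lists; a real implementation would consult the lexicon.
ENTITIES = {"dog", "man", "food"}
OPERATIONS = {"ate", "petted"}

def first_pass(words):
    """Tag each word as entity, operation, or attribute."""
    def kind(w):
        if w in OPERATIONS:
            return "operation"
        return "entity" if w in ENTITIES else "attribute"
    return [(w, kind(w)) for w in words]

def second_pass(tagged):
    """Build a crude case frame from the tagged words."""
    frame, seen_op = {}, False
    for word, k in tagged:
        if k == "operation":
            frame["OPERATION"], seen_op = word, True
        elif k == "entity":
            frame["BENEFICIARY" if seen_op else "AGENT"] = word
    return frame

frame = second_pass(first_pass("the man petted the dog".split()))
assert frame == {"AGENT": "man", "OPERATION": "petted",
                 "BENEFICIARY": "dog"}
```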

Semantic Formulae

    The last improvement which I envision is the most interesting and the most difficult. The organization of the vocabulary yields certain semantic abilities, but the system as a whole still falls short of any real understanding of the situations with which it deals or even the vocabulary itself.

    Due to the current I/O hardware limitations, I do not believe it will be possible to equip computers with adequate sensory equipment in the foreseeable future. Consequently, any system designed to handle natural language symbols must do so using an internal representation of concepts which is not just physically different from that of animals, but different in essence as well.

    Is such a system theoretically possible? Yes. One can create a mapping/translation from any arbitrary alphabet of symbols to any other, provided both are finite. The chemical alphabet used to encode memory is complex and poorly understood, but it is finite (Prochiantz, 1989). Not that we need to duplicate every detail of the human system of sensory input and interpretation -- for our purposes a rough approximation would be adequate.

    Is such a system workable? That depends on what you want it to do. It will not, for example, be able to store detailed Cartesian maps of the shapes of objects; such a map of a single entity would eat up man-hours and resources at an extravagant rate. On the other hand, there might be a way to store enough bare bones information to enable a meaningful situation representation to be constructed. Let us explore the minimum parameters of our hypothetical system of sensory shorthand.
Entities and Attributes - For each entity in a situation there must be associated attributes adequate to serve as input to any operation which the entity may perform or which may be performed upon the entity.

Operations - For each operation O to transform a situation S into a situation S', there must be some function which determines the new state of every entity/attribute association which is affected by the operation.

Consider the following situation:

There is a room.

There is a man in the room.

There is a dog outside the room.

There is a red roach in the room.


    Now, let's move on to the interaction phase and ask the following question:

What does the man see?

    In order for the machine to answer the question, it is necessary for a variety of information to be associated with the operation 'see.' First, 'see' implies the visual recognition of entities in the field of vision. Therefore, the field of vision must be clearly defined. Second, the consequent change of state affects only one entity -- the man -- and is entirely mental: he has a piece of information that was not previously in his possession. He does not 'see' the dog outside the room, because the operation 'see' does not work through opaque entities, and 'room' generally implies opaque walls. (No windows were mentioned.)
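A semantic formula of this kind can be sketched over the situation above. The data layout is assumed: each entity records its location, and 'room' is treated as opaque, so the field of vision is limited to the viewer's own location:

```python
# The situation built from the four statements above (assumed layout).
SITUATION = {
    "man":   {"location": "room"},
    "dog":   {"location": "outside"},
    "roach": {"location": "room", "color": "red"},
}

def see(situation, viewer):
    """Return the entities the viewer sees: those sharing its location.
    Opaque walls block sight, so entities elsewhere are excluded. The
    change of state would record this set as new information held by
    the viewer."""
    here = situation[viewer]["location"]
    return {e for e, attrs in situation.items()
            if e != viewer and attrs["location"] == here}

assert see(SITUATION, "man") == {"roach"}  # the dog is outside, unseen
```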

    The previous example together with the parameter listing above should illustrate the type of semantic formulae envisioned. Though the actual implementation would be rather involved, I feel that it is not unworkable. If it were implemented, it could perhaps be argued that the system running it had gained a certain real understanding of natural language.



Bibliography

Allen, James. (1987) Natural Language Understanding. Menlo Park, California: The Benjamin/Cummings Publishing Company, Inc.

Alshawi, Hiyan. (1987) Memory and Context for Language Interpretation. Cambridge: Cambridge University Press.

Brachman, Ronald. (1979) "On the Epistemological Status of Semantic Networks." In Associative Networks: Representation and Use of Knowledge by Computers. Ed. Nicholas V. Findler. New York: Academic Press.

Berwick, Robert C. (1985) The Acquisition of Syntactic Knowledge. Cambridge, MA: The MIT Press.

Collins, A.M. and Quillian, M. R. (1969) "Retrieval time from semantic memory" Journal of Verbal Learning and Verbal Behavior 8, 240-247.

Davis, Ernest. (1990) Representations of Commonsense Knowledge. San Mateo, CA: Morgan Kaufmann Publishers, Inc.

Fillmore, Charles. (1968) "The case for case." In Universals in Linguistic Theory. Eds. E. Bach and R. Harms. New York: Holt, Rinehart and Winston.

Fillmore, Charles. (1977) "The case for case reopened," in Syntax and Semantics 8: Grammatical Relations. Eds. P. Cole and J. Sadock. New York: Academic Press, 1977.

Hausser, Roland. (1987) Computation of Language. New York: Springer-Verlag.

Hendrix, Gary. (1979) "Semantic Knowledge." In Understanding Spoken Language. Ed. Donald E. Walker. New York: North-Holland.

Levesque, Hector, and John Mylopoulos. (1979) "A Procedural Semantics for Semantic Networks." In Associative Networks: Representation and Use of Knowledge by Computers. Ed. Nicholas V. Findler. New York: Academic Press.

Minsky, Marvin ed. (1968) Semantic Information Processing. Cambridge, MA: The MIT Press.

Nagao, Makoto. (1988) Knowledge and Inference. Boston: Academic Press, Inc.

Ogden, C. K. (1934) The System of Basic English. New York: Harcourt, Brace and Co.

Prochiantz, Alain. (1989) How the Brain Evolved. New York: McGraw-Hill.

Quillian, M. Ross. (1968) "Semantic Memory," in Semantic Information Processing Ed. M. Minsky. Cambridge, MA: The MIT Press.

Rich, Elaine. (1983) Artificial Intelligence. New York: McGraw-Hill.

Schank, R. C. and C. K. Riesbeck. (1981) Inside Computer Understanding. Hillsdale, NJ: Lawrence Erlbaum.

Schubert, L. K., R. G. Goebel and N. I. Cercone. (1979) "The Structure and Organization of a Semantic Net for Comprehension and Inference." In Associative Networks: Representation and Use of Knowledge by Computers. Ed. Nicholas V. Findler. New York: Academic Press.

Simmons, R. F. (1973) "Semantic Networks: Their computation and use for understanding English sentences." In Computer Models of Thought and Language. Eds. Schank, R. C. and K.M. Colby. San Francisco, CA: Freeman.

Shainberg, Lawrence. (1979) Brain Surgeon. Philadelphia: J. B. Lippincott Co.

Sowa, John F. (1992) Conceptual Structures: Current Research and Practice. New York: Ellis Horwood.

Waltz, David. (1989) Semantic Structures: Advances in Natural Language Understanding. Hillsdale, NJ: Lawrence Erlbaum.

Winograd, T. (1972) Understanding Natural Language. New York: Academic Press.



APPENDIX   A
Systems Listing

Prefix

    The following is the listing of my categorization of C. K. Ogden's System of Basic English. To the best of my knowledge, all 850 words in the original document have been represented at least once. A few of them appear more than once. For example, 'change' appears as both a noun and a verb. In addition, I took the liberty of adding approximately 100 words to the vocabulary. The words which have been added are marked with an asterisk (*).

    There were two primary reasons for those additions:

    1. The word added served as a 'parent' node in the network. For example, I added the word 'texture' to serve as a semantic parent for the set of words 'sticky,' 'fuzzy,' 'rough,' etc.

    2. The word served to flesh out a set for which there was a parent but few children. For example, 'filthy' and 'sterile' were added to the category attributes / evaluation / cleanliness, which previously contained only the words 'clean' and 'dirty.'

    I followed certain notational conventions in listing the vocabulary. Words were indented one and one-quarter inches further than their parents. For example, the network configuration shown below (Fig 5)

[image fig 5]
Figure 5: Network Configuration

would be represented in the listing as:

    weather
        rain
        snow
        mist

    Since nodes which were components of other nodes are the exception rather than the rule, the reader may assume that any hierarchical relationship is an instance unless told otherwise.

    There may be accidental differences between this listing and the actual implementation.

Physical Systems

Entities-Physical

    The root of the entire physical system is environment. All the nodes below are components of the environment, either directly or by inheritance.

    The next 26 pages of systems require HTML tables, which are tedious.
Therefore I am just adding links to the scanned images. Appendix.

[ image page 59 ]

country
    See also Economic System: Entities/Country.

[ image page 60 ]

[ image page 61 ]

[ image page 62 ]

[ image page 63 ]

Attributes-Physical

[ image page 64 ]

[ image page 65 ]

[ image page 66 ]

[ image page 67 ]

Operations-Physical

Operations-Physical-Inanimate

Operations-Physical-Animate

[ image page 68 ]


Animal System

Entities-Animal


    See Physical Systems : Entities / Animal.

Attributes-Animal

Operations-Cognitive-Animal

Operations-Physical-Animal

[ image page 69 ]


Operations-Physical-Animal-Move
    [ There may be a page or paragraph missing here for component "move". ]

[ image page 70 ]


Human System

Entities-Human

    See also Physical Systems : Entities / Animal / human [sic]
component (human)
mind

Operations-Human

[ image page 71 ]


Attributes-Human

[ image page 72 ]


Economic System

Entities-Economic

Attributes-Economic

Operations-Economic

[ image page 73 ]


Grammatical Framework

Cognitive
about   for   of   than
Functional Relationships
because   through   if   but
Logical
and   or   not
Modifiers
a   the   all   any
every   no   other   some
such   that   this
Temporal
till   while   again   ever
still   then   after   as


finis
