Project 322. Vocabulary Profiler Spell Checking.
    Make simple, long text files for use with Lextutor Web VP (Vocabulary Profiler).

    Scope :     Make two or more simple, long text files for use with Lextutor Web VP (Vocabulary Profiler).

    Background :     Web VP (Vocabulary Profiler) is a well-made program that looks for common words and color-codes each one according to which of two lists it appears in; all other words get a third color. Initially, the two lists are the first and second 1000 root words in common English usage, i.e. the most frequent.
        Paul Nation, Victoria University, Wellington, New Zealand (VUW), provides the statistical program and some useful word lists as a Zip download. The documentation is complete: 15 pages in a Word document. The word lists are text files containing Families (root words) and Types (derivatives).
        There are two programs. Each outputs a full report of all root words (families) found.
      Frequency - analyses the frequency of the words used in the text you are looking at. It is very straightforward.
      Range - is much more analytical. Up to ten vocabularies may be used. The base files hold Families (root words) and Types (derivatives): each root word is followed by a tab-indented list of the derivative words in that root-word family. The same word cannot appear in more than one basewrd list. Results of several types are displayed for each word list. It therefore seems possible to add the Basic 850 words as a base list in the same simple format in basewrd4.txt and the Basic 1500 words in basewrd52.txt, and to turn off basewrd1 through basewrd3.
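      The base-list layout described above (a root word on its own line, followed by tab-indented derivatives) can be read with a few lines of code. This is only a minimal sketch in Python and is not part of the Range program. The function name parse_baselist and the sample words are assumptions for illustration. Real basewrd files may carry extra fields on each line, so the sketch keeps only the first field.

```python
def parse_baselist(text):
    """Map each root word (family) to the list of its derivatives (types).

    Assumed layout: a root word starts at the left margin; each of its
    derivatives follows on its own line, indented with a tab or spaces.
    """
    families = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word = line.split()[0].lower()  # keep only the first field
        if line[0] in " \t":
            if current is None:
                raise ValueError("derivative found before any root word")
            families[current].append(word)
        else:
            current = word
            families[current] = []
    return families


# Hypothetical sample in the assumed layout.
sample = "act\n\tacted\n\tacting\nair\n\tairs\n"
print(parse_baselist(sample))
```

      The same structure could be written back out to produce a Basic 850 or Basic 1500 base list in the same format.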
      The programs analyse text files, which means that Word.doc, OOo.odt, and web.htm files must be saved in text format before processing. An interesting feature is the exclusion list, function.txt, which stops processing of the words on that list. This allows simplifying the output by excluding a, an, the, etc. Any part of the text that is not to be searched can be "commented out" with <   >
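      The two preprocessing ideas above, skipping text inside <   > and ignoring words on an exclusion list, can be sketched as follows. This is an illustration in Python, not the code of the original programs. The function names and the simple word pattern are assumptions, and how the real programs treat nested or unclosed brackets is not known here.

```python
import re


def strip_commented(text):
    """Replace every <...> span with a space so it is not searched."""
    return re.sub(r"<[^<>]*>", " ", text)


def tokenize(text, excluded=frozenset()):
    """Return lower-case words, minus any word on the exclusion list."""
    words = re.findall(r"[A-Za-z']+", strip_commented(text).lower())
    return [w for w in words if w not in excluded]


# Hypothetical exclusion list standing in for function.txt.
function_words = {"a", "an", "the"}
print(tokenize("The cat <ignore this> sat on a mat.", function_words))
```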
      Jargon:
    • Family - root words
    • Type - derivatives within a root family. ("type" can also be the verb "to enter by keyboard," or a noun for "a division of a subject.")
    • Tokens - individual words that are counted.
        If the statistical program "range" is used, then the most interesting statistic is "Types Not Found In Any List" (where "types" is jargon for "words"). This shows all the non-Basic words in the text, intermingled with proper nouns (capitalized names of persons, places, and things).
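        To make the jargon concrete, the sketch below counts tokens, types, and families for a short word sequence and lists the types not found in any list. It is a simplified illustration in Python under assumed names (profile, families), not a reproduction of the report that "range" prints.

```python
from collections import Counter


def profile(tokens, families):
    """Count tokens, types, and families; list types found in no family."""
    # Build a lookup from every type (root or derivative) to its family.
    type_to_family = {}
    for root, derivatives in families.items():
        type_to_family[root] = root
        for d in derivatives:
            type_to_family[d] = root

    counts = Counter(tokens)  # one entry per distinct type
    found = [t for t in counts if t in type_to_family]
    not_found = sorted(t for t in counts if t not in type_to_family)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "families": len({type_to_family[t] for t in found}),
        "types_not_found": not_found,
    }


# Hypothetical word list and text.
families = {"go": ["goes", "going"], "air": []}
tokens = ["go", "goes", "going", "go", "zeppelin"]
print(profile(tokens, families))
```

        Here "go", "goes", and "going" are three types of one family, while "zeppelin" is reported as not found.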

        The Web VP program is a user-friendly application of the analytical files Paul Nation provides. Our hope is to have it give color-coded results: blue for pure Basic English, green for advanced Basic English, and red for non-Basic words.
        These non-Basic words can be looked up with the IDP Companion Translator from full English to Basic.
    However, the Web VP program is not immediately available for download.
        This software is easier to download and install than the full OpenOffice.org office suite of programs, which is 72MB. The Web VP interface is colorful and should appeal to users for popular usage. OOo has full electronic office features and offers room to grow into drop-down translation (thesaurus) and grammar checking for professional use.
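        The three-color scheme hoped for above amounts to a lookup against two word lists. The following Python sketch shows the idea only; it is not how Web VP is implemented. The function name color_code and the tiny sample lists are assumptions.

```python
def color_code(tokens, list1, list2):
    """Pair each token with blue (list 1), green (list 2), or red (neither)."""
    list1 = {w.lower() for w in list1}
    list2 = {w.lower() for w in list2}
    coded = []
    for token in tokens:
        word = token.lower()
        if word in list1:
            color = "blue"
        elif word in list2:
            color = "green"
        else:
            color = "red"
        coded.append((token, color))
    return coded


# Hypothetical stand-ins for a pure Basic list and an advanced Basic list.
basic = ["the", "ship", "is", "at"]
advanced = ["harbor"]
print(color_code(["The", "ship", "is", "at", "harbor", "tonight"], basic, advanced))
```

        Words that come out red are the ones that would be looked up in a translator from full English to Basic.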

    Approach :     We have Basic word lists in various forms: Excel sheets of words, derivatives, and complex words. Creating list 1 from these should be trivial. For testing purposes, the original list "basewrd1.txt" can be renamed basewrd2.txt. This will color-code Basic words as blue, the most common words that are not Basic as green, and words that are neither Basic nor otherwise common as red. List 2, be it the original basewrd1 list or a new list, can be simplified per the Simple English discussion in project #452.
        The Basic 1500 and Simple English word lists have affix options (MySpell/OOo) and will have to be converted by eye into simple text format. This should be a simple clerical activity.
        Alternatives for list 2 (Basic 1500, Basic 2000, and others) will have to be created. The three-way color coding of Web VP can be mixed and matched to user preference. A writer for public media might want list 1 to be Basic 1500 and list 2 to be Basic 2000 or Simple English (of whatever definition). The only complexity is one of efficiency: avoiding redundancy between the selected list 1 and list 2. This should be a simple "match, delete common" routine, but because of the number of potential list 1 and list 2 combinations, we may want to create premade packages of the most commonly expected usages.
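        A "match, delete common" routine of the kind mentioned above could look like the sketch below. It is a minimal Python illustration in which the function name remove_common and the sample words are assumptions. It compares words without regard to case and keeps list 2 in its original order.

```python
def remove_common(list1, list2):
    """Return list 2 without any word that already appears in list 1."""
    seen = {w.lower() for w in list1}
    return [w for w in list2 if w.lower() not in seen]


# Hypothetical samples: list 2 overlaps list 1 on "go" and "air".
list1 = ["a", "go", "air"]
list2 = ["go", "Air", "engine", "market"]
print(remove_common(list1, list2))
```

        Running the same routine over each chosen pair of lists would produce the premade packages suggested above.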
       
    References :
      http://www.lextutor.ca/vp/ - Web VP (Vocabulary Profiler).
      http://www.vuw.ac.nz/lals/staff/paul-nation/nation.aspx - Download of VP Program
      project #452.
      http://www.basic-english.org/down/readsimple.html - ReadSimple

    About this Page: 322.html - Project 322 Vocabulary Profiler .
    Last updated February 15, 2006.
    URL: http://www.basic-english.org/projects/322.html