Project 320. Translation Thesaurus for Open Office.
    PERL TRACK
    Institute side of discussion is in green
    1.  DATA FILE:
    	(discussion of problems with the Alpha Test version, then:)
         I have a procedure in Excel that writes something that looks like
    the data files OpenOffice provides.  The procedure is posted in 
    	http://www.basic-english.org/projects/320.html  linking to
    	http://www.basic-english.org/projects/320thesdat.html
    in an attempt to make a Windows version of the creation programs. 
         Oh, I just realized that you start from Romanian, whereas I
    have started from ten masterfiles in Excel. My offer may be useless. 
         It would probably have been more efficient for me to have started
    with romanian, but that might go away if OOo works well.
         Although, for simple text (Notepad-like), the Companion is an icon
    on the desktop for really fast answers.
         And your Perl program uses it.
    	I see a bunch of things to change in romanian.txt that were good 
    for Companion, but should be different for OOo.  This will occur over time.
    Plans call for 'P', the last of SCOWL 20 vocabulary, to have been completed this month
    and the whole file reissued as version 0.6. Each sub-version 
    has a major addition to one letter plus any remembered updates from other letters. 
    	How often will this file change?  We have had almost monthly 
    updates during the dictionary building phase.  Version six should have 
    about six monthly updates then relative stability of maintenance only 
    until we are satisfied to issue as version 1.0 and offer it to OOo.
    
    2.  INDEX FILE:
    	It seems like there should be no problem creating a cumulative 
    count program to duplicate the index program for Windows use (Excel) -- 
    but I can NOT come up with an algorithm that correctly determines the starting positions. 
    I have tested all starting points with sample data.  There seem to be inconsistent, hidden
    line-control characters or something.
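
    A possible explanation, offered as a guess rather than a diagnosis: the starting positions in the index are byte offsets into the .dat file, so a file saved with Windows line endings (CR+LF) drifts by one extra byte per line compared with a count that assumes a single line-end character. The sketch below is not the OOo index program. It assumes the usual thesaurus layout (first line of the .dat names the encoding, each headword line is word|count followed by that many sense lines, and the .idx lists encoding, entry count, then word|offset in sorted order) and that the file is read as raw bytes. The name build_index_lines and the "example" entry are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not the OOo index program): takes the raw bytes of a
# .dat file as one string and returns the lines of a matching .idx file.
sub build_index_lines {
    my ($dat) = @_;
    my @lines = split /(?<=\n)/, $dat;      # each piece keeps its line ending
    my $first = shift @lines;
    return () unless defined $first;
    my $offset = length $first;             # bytes used by the encoding line
    (my $encoding = $first) =~ s/\r?\n\z//;

    my @entries;
    while (@lines) {
        my $head  = shift @lines;
        my $start = $offset;                # where this headword line begins
        $offset  += length $head;
        (my $text = $head) =~ s/\r?\n\z//;
        my ($word, $count) = split /\|/, $text;
        next unless defined $word && length $word;
        $count = 0 unless defined $count && $count =~ /^\d+$/;
        for (1 .. $count) {                 # skip the sense lines, counting bytes
            my $sense = shift @lines;
            last unless defined $sense;
            $offset += length $sense;
        }
        push @entries, [ $word, $start ];
    }
    @entries = sort { $a->[0] cmp $b->[0] } @entries;
    return ( $encoding, scalar(@entries), map { "$_->[0]|$_->[1]" } @entries );
}

# The same entries saved with Unix (LF) and Windows (CRLF) line endings.
# The encoding name and the "example" entry are assumptions for the demo.
my @sample = (
    'ISO8859-1',
    'arrangement|2',
    'to put in order|set|setting|classification',
    'agreed to|plans|agreement|meeting',
    'example|1',
    'sample sense|instance',
);
for my $eol ("\n", "\r\n") {
    my $dat = join '', map { $_ . $eol } @sample;
    print join("\n", build_index_lines($dat)), "\n\n";
}
```

    The two runs print different offsets for the same entries, which is the kind of mismatch a cumulative character count in Excel would show if the line endings in the file are not what the count assumes. Characters stored as more than one byte would cause a similar drift.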
    
    
Subject: Cleaner files!
From: "Carl Ray"
Date: Wed, September 13, 2006 4:49 am
To: basic-english.org

Ok, these files should have no extra spaces, no extra indexes, and no extra "| ". They're pretty cleaned up. Take a look.

I chose to use separate lines for each sense 1) because that's how my program outputs it, and 2) because I finally figured this out about the thesaurus: when definitions are each on separate lines, they show in the left box when you use the thesaurus. When they are all on a single line (| delimited), they show in the right box. The reason? Each LINE represents a single DEFinition AND the multiple SYNonyms that go with that definition (separated by | delimiters). For example:

    arrangement|2
    to put in order|set|setting|classification
    agreed to|plans|agreement|meeting|

So in this example, when using the actual thesaurus, "to put in order" and "agreed to" will show at the same time in the left box. Then, if you highlight "to put in order" in the LEFT box, the synonyms "set", "setting" and "classification" would list in the RIGHT box ("plans", "agreement", "meeting" would not show at all unless you highlight "agreed to" in the left box).

Well, not sure you needed to know that. All seems to work well with me. Let me know what else I can do or what you need.

Not sure what you're trying to do with this (because I'm not doing any Excel stuff): "It seems like there should be no problem creating a cumulative count program to duplicate the index program with excel -- But I can NOT come up with an algorithm that correctly determines the starting positions."

One thing I do: when I have to use a program other than a text editor for modification (i.e. Word, Excel, OpenOffice), I reopen the file in OpenOffice and do a find-and-replace of $ with \n (with "regular expressions" checked in the "More options" pull-down) and then save it. Sometimes I get replacements, sometimes I don't.
Then, just to be doubly sure, I reopen the file in gedit (text editor), put in an insignificant space or something, and then save it again. I'm sure I don't need to do this, but it makes me feel better, like I'm getting rid of any lower-level unseen formatting that the more sophisticated programs (Word/Excel) might want to add. Again, probably not necessary.

Perl is a piece of cake compared to the old languages we old fogies were trained on (yeah, I'm an '80s guy: basic, fortran, pascal, db1, some html, and Access). That's all I knew before. This is my first shot with Perl (and the first program I've written in over 15 years). Go to the tutorial site linked at the end of this message for a very quick and thorough intro. I did all this from only the first four short lessons. My script is a non-fancy, linear (no loops) script, just a quick mirror of my natural thinking process.

This is the process I do:

To make the input file I use romanian.txt (or my basicDic.txt). Replace tab with a return using "regular expressions", replace $ with \n, then replace \t with |1\n, then replace ^$ with nothing. (You can just look at my input file and see what I did.)

Input file = th.BE.txt. The input file must be flawless because the perl script counts the lines, so no extra hard returns are allowed. (No problem: if you run the script and the data is off, you just trail down to the first bad instance and locate the problem.)

Output file = thBEtemp.txt. I have only set the script to find and consolidate 8 senses. So after running it, go to thBEtemp.txt and find all instances of "8" manually with the find function. If it's truly only 8 senses, do nothing. If it's more, then delete the second headword and simply append the other senses to the end of the first 8. Change myexample|8 to myexample|12 (or however many senses there are). This takes all of 3 minutes.
Next, just take out the extra returns (using regular expressions): replace [space]$ with nothing, then replace ^$ with nothing (or use whatever way you choose to get the extra carriage returns out). Change the name to .dat, make the .idx file, and you are finished.

Notes on the perl script:

    open(OBE, "th.BE.txt");   ## reads the file into the working file I named OBE
    @lines = <OBE>;           ## reads each line of the working file into an array with elements lines[0], lines[1], etc.

    #### to get the output to a text file instead of screen, do from command line:
    #### perl BEtoThes4.pl > thBEtemp.txt

Make sure all files and the .pl script are in the same directory.

Easy perl tutorial: http://www.linuxforums.org/programming/learn_perl_in_10_easy_lessons_-_lesson_1.html

Attachments:
    th.BE.txt          1 M     [ text/plain ]
    th_it_IT_v2.dat    1.2 M   [ video/mpeg ]
    th_it_IT_v2.idx    379 k   [ text/plain ]
    BEtoThes4.pl       1.6 k   [ application/x-perl ]
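
The manual clean-up of headwords with more than 8 senses could in principle be avoided by grouping with a loop. The sketch below is not BEtoThes4.pl, which is described above as linear with no loops. It assumes the input layout implied by the replace steps (alternating "headword|1" lines and single sense lines, with repeated senses of a headword next to each other) and already-chomped lines. The name consolidate_senses and the "example" entry are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from BEtoThes4.pl.
# Input:  a list of lines alternating "headword|1" and one sense line.
# Output: the same data with consecutive repeats of a headword merged into
#         one "headword|N" line followed by its N sense lines.
sub consolidate_senses {
    my @lines = @_;
    my @out;
    my ($current, @senses);

    # Write out the headword collected so far, with its real sense count.
    my $flush = sub {
        return unless defined $current;
        push @out, "$current|" . scalar(@senses), @senses;
        @senses = ();
    };

    while (@lines >= 2) {
        my ($head, $sense) = splice @lines, 0, 2;
        (my $word = $head) =~ s/\|\d+\s*$//;    # drop the "|1" marker
        if (!defined $current || $word ne $current) {
            $flush->();                         # a new headword starts here
            $current = $word;
        }
        push @senses, $sense;
    }
    $flush->();
    return @out;
}

my @input = (
    'arrangement|1', 'to put in order|set|setting|classification',
    'arrangement|1', 'agreed to|plans|agreement|meeting',
    'example|1',     'sample sense|instance',    # made-up entry
);
print "$_\n" for consolidate_senses(@input);
```

Because the count is taken from the number of sense lines actually collected, there is no fixed limit of 8 and no need to correct myexample|8 by hand. A stray blank line would still throw off the pairing, so the requirement for a flawless input file applies here as well.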

    Renamed for Member and Developer usage as Beta version:
    	th_en_BE.dat -- data
    	th_en_BE.idx -- index
    	BEtoThes4.pl -- program
    Hi Carl,
    Looks great. You are announced on the Institute website asking others to try it. We will drop the Beta status in a month as we get reports of usage. Let me know if you find any errors in our presentation or improvements to be made.
    Jim

    About this Page: 320.html - Project 320 Translation Thesaurus for Open Office .
    Last updated September 14, 2006.
    URL: http://www.basic-english.org/projects/320.html