Project 320. Translation Thesaurus for Open Office.
PERL TRACK
Institute side of discussion is in green
1. DATA FILE:
(discussion of problems with the Alpha Test version, then: )
I have a procedure in Excel that writes something that looks like
the data files OpenOffice provides. The procedure is posted at
http://www.basic-english.org/projects/320.html linking to
http://www.basic-english.org/projects/320thesdat.html
in an attempt to make a Windows version of the creation programs.
Ohh, I just realized that you start from romanian.txt, whereas I
have started from ten masterfiles in Excel. My offer may be useless.
It would probably have been more efficient for me to have started
with romanian.txt, but that file might go away if OOo works well.
Although, for simple text (Notepad-like), the Companion is an icon
on the desktop for really fast answers.
And your Perl program uses it.
I see a bunch of things to change in romanian.txt that were good
for Companion, but should be different for OOo. This will occur over time.
Plans call for 'P', the last of SCOWL 20 vocabulary, to have been completed this month
and the whole file reissued as version 0.6. Each sub-version
has a major addition to one letter plus any remembered updates from other letters.
How often will this file change? We have had almost monthly
updates during the dictionary building phase. Version six should have
about six monthly updates then relative stability of maintenance only
until we are satisfied to issue as version 1.0 and offer it to OOo.
2. Index file
It seems like there should be no problem creating a cumulative
count program to duplicate the index program for Windows use (excel) --
But I can NOT come up with an algorithm that correctly determines the starting positions.
I have tested all starting points with sample data. There seem to be inconsistent, hidden line-control characters or something.
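One possible cause, offered only as a guess: if the index stores byte positions, a file saved on Windows ends each line with two bytes (CR LF) where the same file saved on Linux ends it with one (LF), so a count of the visible characters drifts by one per line. Below is a minimal Perl sketch of the cumulative count. It assumes the index wants the byte offset of each "headword|count" line, that each such line is followed by "count" sense lines, and that the data starts directly with the first headword (a real file may carry a header line that would need skipping). The name index_offsets is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given the raw contents of a data file, return a list
# of [headword, byte_offset] pairs, one per "headword|count" line.
# For a real file, read it with binmode so that length() counts bytes.
sub index_offsets {
    my ($data) = @_;
    my @index;
    my $offset  = 0;    # running position of the start of the current line
    my $to_skip = 0;    # sense lines still to pass over
    # Split just after each "\n" so every piece keeps its own line ending
    # (including any "\r") and is counted at its full length.
    for my $line (split /(?<=\n)/, $data) {
        if ($to_skip > 0) {
            $to_skip--;
        }
        elsif ($line =~ /^([^|\r\n]+)\|(\d+)/) {
            push @index, [$1, $offset];
            $to_skip = $2;
        }
        $offset += length $line;
    }
    return @index;
}

my $unix = "arrangement|2\nto put in order|set\nagreed to|plans\nbook|1\npages|volume\n";
(my $dos = $unix) =~ s/\n/\r\n/g;    # same text with Windows line endings

print "$_->[0]|$_->[1]\n" for index_offsets($unix);
print "$_->[0]|$_->[1]\n" for index_offsets($dos);
```

With this sample, "book" starts at offset 50 in the LF version and at 53 in the CR LF version, which is the kind of shift that looks like a hidden character when counting in Excel.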
Subject: Cleaner files!
From: "Carl Ray"
Date: Wed, September 13, 2006 4:49 am
To: basic-english.org
Ok, these files should have no extra spaces and no extra indexes, nor
extra "| "
They're pretty cleaned up. Take a look.
I chose to use separate lines for each sense
1) because that's how my program outputs it. And
2) I finally figured out this about the thesaurus. When definitions are on
separate lines each, they show in the left box when you use the
thesaurus. When they are all on a single line (| delimited) they show
in the right box. The reason? Because each LINE represents a single
DEFinition AND the multiple SYNonyms that go with that definition
(separated by | delimiters). For example
code:
arrangement|2
to put in order|set|setting|classification
agreed to|plans|agreement|meeting|
So in this example, when using the actual thesaurus, "to put in order"
and "agreed to" will show at the same time in the left box.
Then, if you highlight in the LEFt box "to put in order" the synonyms
"set" "setting" and "classification" would list in the RIGht box
("plans," "agreement," "meeting" would not show at all unless you
highlight "agreed to" in the left box)
Well, not sure you needed to know that.
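The layout just described can be made concrete with a short Perl sketch. It assumes the format exactly as in the example: a "headword|count" line followed by "count" lines, each holding a definition and then its synonyms, | delimited. The name parse_entry is hypothetical and is not part of BEtoThes4.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the lines of one entry into a headword plus a
# list of senses, each sense being { definition => ..., synonyms => [...] }.
sub parse_entry {
    my @lines = @_;
    chomp @lines;
    my ($headword, $count) = split /\|/, shift @lines;
    my @senses;
    for my $line (@lines[0 .. $count - 1]) {
        # split drops trailing empty fields, so a stray final "|" is harmless
        my ($definition, @synonyms) = split /\|/, $line;
        push @senses, { definition => $definition, synonyms => \@synonyms };
    }
    return ($headword, \@senses);
}

my ($word, $senses) = parse_entry(
    "arrangement|2\n",
    "to put in order|set|setting|classification\n",
    "agreed to|plans|agreement|meeting|\n",
);

# Left box: one definition per line. Right box: that line's synonyms.
for my $sense (@$senses) {
    print "$sense->{definition}: @{ $sense->{synonyms} }\n";
}
```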
All seems to work well with me. Let me know what else I can do or what you need.
Not sure what you're trying to do with this (because I'm not doing any
Excel stuff) : "It seems like there should be no problem creating a
cumulative count program to duplicate the index program with excel --
But I can NOT come up with an algorithm that correctly determines the starting positions."
One thing I do when I have to use a program other than a text editor
for modification (i.e. Word, Excel, OpenOffice) is reopen the file in
OpenOffice and do a find $ replace with \n (with "regular expressions"
checked in the "More options" pull down) and then save it--sometimes I
get replaces, sometimes I don't. Then just to be double sure, I reopen
the file in gedit (text editor), put in an insignificant space or
something, and then save it again. I'm sure I don't need to do this,
but it makes me feel better, like I'm getting rid of any lower level
unseen formatting that the more sophisticated programs (Word/Excel) might
want to add. Again, prob. not necessary.
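The same cleanup can be done in one pass with Perl instead of reopening the file in several editors. This is only a sketch, assuming the unwanted characters are carriage returns and trailing spaces or tabs; clean_text is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise line endings to "\n" and strip trailing
# spaces and tabs from every line.
sub clean_text {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;     # CR LF (Windows) or a lone CR -> LF
    $text =~ s/[ \t]+$//mg;    # trailing blanks at the end of each line
    return $text;
}

print clean_text("arrangement|2 \r\nto put in order|set\t\r\n");
```

Other invisible characters (a byte-order mark, for instance) would need their own rule; this sketch does not look for them.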
Perl is a piece of cake compared to the old languages we old fogies were
trained on (yeah, I'm an '80s guy, basic, fortran, pascal, db1, some
html, and Access) That's all I knew before, this is my first shot with
Perl (and first program I've written in over 15 years).
Go to this site (the easy perl tutorial linked below) for a very quick and thorough intro. I did all this from
only the first four short lessons. My script is a non fancy, linear
(no loops) script-- just a quick mirror of my natural thinking process,
nothing fancy.
This is the process I do:
To make the input file I use romanian.txt (or my basicDic.txt). I replace each tab
with a return using "regular expressions": replace $ with \n, then
replace \t with |1\n, then replace ^$ with nothing. (You can just look
at my input file and see what I did.)
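The net effect of those replacements can be sketched in Perl. This assumes each line of the source file is "headword, tab, definition", which is a guess from the steps above and not confirmed by the file itself; to_input_lines is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: convert "headword<TAB>definition" lines into the
# two-line form "headword|1" / "definition", dropping blank lines.
sub to_input_lines {
    my @lines = @_;    # copy, so the caller's strings are left alone
    my @out;
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;               # strip the line ending
        next if $line =~ /^\s*$/;           # skip empty lines
        my ($headword, $definition) = split /\t/, $line, 2;
        next unless defined $definition;    # no tab: not a dictionary line
        push @out, "$headword|1", $definition;
    }
    return @out;
}

print "$_\n" for to_input_lines(
    "arrangement\tto put in order\n",
    "\n",
    "arrangement\tagreed to\n",
);
```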
Input file = th.BE.txt
The input file must be flawless because the perl script counts the
lines, so no extra hard returns are allowed.
(No problem: if you run the script and the data is off, you just trail
down to the first bad instance and locate the problem.)
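Since the script counts lines, a quick check before running it can point straight at the first bad line. A sketch, assuming the input alternates strictly between a "headword|1" line and a single definition line; first_bad_line is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based number of the first line that
# breaks the expected pattern, or 0 if every line is in place.
sub first_bad_line {
    my @lines = @_;
    for my $i (0 .. $#lines) {
        (my $line = $lines[$i]) =~ s/\r?\n\z//;
        return $i + 1 if $line eq '';    # extra hard return
        my $is_header = $line =~ /^[^|]+\|1$/;
        # Lines 1, 3, 5, ... must be headers; lines 2, 4, 6, ... must not be.
        return $i + 1 if (($i % 2 == 0) xor $is_header);
    }
    # An odd number of lines means the last header has no definition.
    return @lines % 2 ? scalar(@lines) : 0;
}

my @input = ("arrangement|1\n", "to put in order\n", "\n", "book|1\n", "pages\n");
my $bad = first_bad_line(@input);
print $bad ? "first problem at line $bad\n" : "input looks consistent\n";
```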
Output file thBEtemp.txt:
I have only set the script to find and consolidate 8 senses.
So after running it, go to thBEtemp.txt and find all instances of "8" manually with
the find function. If it's truly only 8 senses, do nothing; if it's more,
then delete the second headword and simply append the other senses to
the end of the first 8. Change myexample|8 to myexample|12 (or however
many senses there are). This takes all of 3 minutes.
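The manual step for headwords with more than 8 senses could be avoided by merging in a loop instead of a fixed number of slots. The Perl sketch below is not BEtoThes4.pl, which is linear; it assumes the input is a list of "headword|count" lines, each followed by "count" sense lines, with repeated headwords next to each other. merge_senses is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: merge consecutive entries that share a headword,
# however many senses they carry, and rewrite the count to match.
sub merge_senses {
    my @lines = @_;
    chomp @lines;
    my @entries;    # each element: [headword, [sense lines]]
    while (@lines) {
        my ($headword, $count) = split /\|/, shift @lines;
        my @senses = splice @lines, 0, $count;
        if (@entries && $entries[-1][0] eq $headword) {
            push @{ $entries[-1][1] }, @senses;    # same word again: append
        }
        else {
            push @entries, [$headword, \@senses];
        }
    }
    my @out;
    for my $entry (@entries) {
        my ($headword, $senses) = @$entry;
        push @out, $headword . '|' . scalar(@$senses), @$senses;
    }
    return @out;
}

print "$_\n" for merge_senses(
    "arrangement|1\n", "to put in order|set\n",
    "arrangement|1\n", "agreed to|plans\n",
    "book|1\n",        "pages|volume\n",
);
```

Here the two "arrangement" entries come out as a single "arrangement|2" entry with both sense lines under it.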
Next, just take out the extra returns (using regular expressions) with
replace [space]$ with nothing, then replace ^$ with nothing (or whatever
way you choose to get the extra carriage returns out)
change name to .dat
make the .idx file
finished.
notes on the perl script:
open(OBE, "th.BE.txt"); ## opens the file on a filehandle I named OBE
@lines = <OBE>; ## reads each line of the file into an array with elements $lines[0], $lines[1], etc.
#### to get the output to a text file instead of screen, do from command
line:
#### perl BEtoThes4.pl > thBEtemp.txt
make sure all files and the .pl script are in the same directory
easy perl tutorial
http://www.linuxforums.org/programming/learn_perl_in_10_easy_lessons_-_lesson_1.html
Attachments:
th.BE.txt 1 M [ text/plain ]
th_it_IT_v2.dat 1.2 M [ video/mpeg ]
th_it_IT_v2.idx 379 k [ text/plain ]
BEtoThes4.pl 1.6 k [ application/x-perl ]
Renamed for Member and Developer usage as Beta version.
th_en_BE.dat -- data
th_en_BE.idx -- index
BEtoThes4.pl -- program
Hi Carl,
Looks great.
You are announced on the Institute website asking others to try it.
We will drop the Beta status in a month as we get reports of usage.
Let me know if you find any errors in our presentation or improvements to be made.
Jim
About this Page: 320.html - Project 320 Translation Thesaurus for Open Office.
Last updated September 14, 2006.
URL: http://www.basic-english.org/projects/320.html