BB,

It's obvious you have already gone some distance in this direction, but after all these months, you still don't seem to know how to make it work. I think I could make it work, but would probably do things differently. Please help me understand the strengths of your approach...

>i need to identify the unique strings in a textfile
>-- delimited by carriage-returns and/or spaces --
>so they can be sorted for storage in one of 53 str# resources.
>1-26 hold strings that start with an upper-case letter,
>27-52 hold strings that start with a lower-case letter,
>and the 53rd holds strings that start with anything else.

Why the division between upper and lower first letters? I fail to see the benefit.

>
>(eventually, these 53 str#'s will be split out into 250 str#'s,
>each with a maximum of 250 items, but don't worry about that now.)
>
>all of this is just to massage text into my e-book program.
>
>i'm working now on some project gutenberg texts, and i think
>their director michael hart is gonna like the outcome a lot.
>so maybe the program will get some real good exposure.
>i will, of course, give you full credit for your contribution.
>(especially since it will be more weighty than mine!,
>i've just hacked out something almost anybody could.)
>
>the general course-of-action flow-chart is this:
>1. load file into handle
>2. munger change spaces to carriage-returns

Why? This seems like an utter waste of time, to me.

>3. load each now-cr-delimited string into an array

Surely you don't intend to include all those CRs in the array? That's wasted space.

>4. sort the array, eliminating duplicate strings

Why sort? If all you need is a lookup, just make a temporary key during the dictionary build. Why duplicate strings in the first place? It would be much easier to check each string as you encounter it to see if it already exists in the array. If so, drop it. (There's a small sketch of this right after the quoted list.)

>5. check each string against the existing dictionary of strings

Like I said.

>6. add the strings that are not in the dictionary to the dictionary

I thought the array we were talking about WAS your dictionary. This looks like wasted effort.

>7. load file into handle again

Why? What happened to the first one we loaded?

>8. replace strings with 2-character tokens (str# and item within)

Why not replace them as we go, the first time through?

>9. save tokenized text

Good idea!

>10. repeat for next file
>
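To make the no-sort point concrete, here is a minimal sketch of check-as-you-go de-duplication. It's plain C rather than FB (the logic carries over either way), and every name in it is mine, invented for illustration, not code from either of our programs:

/* Check each word as you encounter it: one pass, no sort, and no
   duplicate ever enters the dictionary.  The linear scan stands in
   for whatever key scheme you end up using to speed the lookup. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char **dict  = NULL;   /* growing array of unique words */
static int    count = 0, cap = 0;

/* Return the word's item number, adding it only if it's new. */
static int LookupOrAdd(const char *word)
{
    for (int i = 0; i < count; i++)
        if (strcmp(dict[i], word) == 0) return i;
    if (count == cap) {                      /* grow the "handle" */
        cap  = cap ? cap * 2 : 64;
        dict = realloc(dict, cap * sizeof *dict);
    }
    dict[count] = strdup(word);
    return count++;
}

int main(void)
{
    /* split on spaces and CRs, per the stated delimiters */
    char text[] = "the cat and the hat and the cat";
    for (char *w = strtok(text, " \r"); w; w = strtok(NULL, " \r"))
        printf("%-4s -> item %d\n", w, LookupOrAdd(w));
    return 0;
}

Every string gets its item number the moment it is first seen, which is exactly why steps 4 through 8 of the flow-chart collapse into a single pass.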
I have some ideas I think would be faster and smaller than what you're proposing, but here's some pseudocode to show how I would do what you suggest:

1. Initialize your 53 handles, ready to receive data. (We're talking dynamic arrays here!)

2. Read the file into a handle, leaving a few bytes padding at the beginning. We'll use the same handle to hold the tokenized text, just to avoid any space problems.

3. Set InPtr to the first byte of text, and OutPtr to the first byte of the handle.

4. Examine the character at InPtr. That tells you which handle you will store the word in, and what the first byte of the token (the "page number") will be.

5. Count ahead to see how many more chars there are before a space or CR. Replace the first char with this count. You now have a string you can refer to with PSTR$ or (in FB^3) strPtr.0$.

6. Walk through the appropriate handle to see if this string already exists. Using a sort key will speed this up considerably. Using the Alphabetic Continuum would make it unbelievably fast, but is probably unnecessary.

7. If it's not there, copy it to the end of the handle. There's no reason (that I can see) to sort the strings--you can use a key array if you want to speed up this process, but it will already be very fast.

8. Record its page number and string count (the token) at OutPtr and increment by 2.

9. Repeat from 4 until you reach the end of the text file.

Reset the text handle size to the end of your tokens and save it.

I would use page-zero tokens to indicate delimiters. 0-0 would be a single CR. No indicator would be needed for a space, as it will be assumed, but if there are multiple spaces or CRs, the second byte of the token could contain the count (<128 = # of CRs, >128 = 128 + number of spaces).

Now that I've described it, I realize I've already written most of the code, so I may decide to play with it a bit. I'll let you know if anything comes of it.

BTW, since you're using tokens to represent the text anyway, is there any reason not to compress the dictionary, providing it doesn't slow things down too much on re-expanding the words? That would make it possible to use the Alphabetic Continuum approach.
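For what it's worth, here is a sketch of how steps 1 through 9 might hang together, again in plain C rather than FB (and not the code referred to above; all names are invented for illustration). The page numbering (1-26 upper, 27-52 lower, 53 other), the 2-byte tokens, and the page-zero delimiter encoding follow the description above; the separate output buffer instead of in-place padding, and the 127-run limit, are my own simplifications:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGES 54                 /* 1..53 hold words; page 0 = delimiters */

typedef struct { char **word; int count, cap; } Page;
static Page page[PAGES];

static int PageFor(unsigned char c)           /* step 4: first char -> page */
{
    if (c >= 'A' && c <= 'Z') return  1 + (c - 'A');
    if (c >= 'a' && c <= 'z') return 27 + (c - 'a');
    return 53;
}

static int ItemFor(Page *p, const char *w, int len)   /* steps 6-7 */
{
    for (int i = 0; i < p->count; i++)
        if (strncmp(p->word[i], w, len) == 0 && p->word[i][len] == '\0')
            return i;
    if (p->count == p->cap) {
        p->cap  = p->cap ? p->cap * 2 : 32;
        p->word = realloc(p->word, p->cap * sizeof *p->word);
    }
    p->word[p->count] = malloc(len + 1);
    memcpy(p->word[p->count], w, len);
    p->word[p->count][len] = '\0';
    return p->count++;
}

/* One pass over the text, writing 2-byte tokens to a separate buffer.
   (The in-place-with-padding trick works too; a separate buffer just
   keeps the sketch obviously safe when 1-char words expand to 2 bytes.)
   Runs longer than 127 delimiters would need splitting; not handled. */
static long Tokenize(const unsigned char *in, long len, unsigned char *out)
{
    long i = 0, o = 0;
    while (i < len) {
        unsigned char c = in[i];
        if (c == '\r' || c == ' ') {                  /* delimiter run */
            long n = 0;
            while (i < len && in[i] == c) { i++; n++; }
            if (c == ' ' && n == 1) continue;         /* lone space implied */
            out[o++] = 0;                             /* page-zero token:   */
            out[o++] = (c == '\r')                    /* 0-0 = single CR,   */
                     ? (unsigned char)(n == 1 ? 0 : n)/* <128 = # of CRs,   */
                     : (unsigned char)(128 + n);      /* >128 = 128+spaces  */
            continue;
        }
        long start = i;                               /* step 5: word end */
        while (i < len && in[i] != '\r' && in[i] != ' ') i++;
        int pg = PageFor(in[start]);
        int it = ItemFor(&page[pg], (const char *)in + start,
                         (int)(i - start));
        out[o++] = (unsigned char)pg;                 /* step 8: 2-byte */
        out[o++] = (unsigned char)it;                 /* token out      */
    }
    return o;                                         /* new "handle" size */
}

int main(void)
{
    unsigned char text[] = "The cat saw the Cat\r\rthe  cat";
    unsigned char toks[2 * sizeof text];
    long n = Tokenize(text, (long)strlen((char *)text), toks);
    printf("%ld text bytes -> %ld token bytes\n",
           (long)strlen((char *)text), n);
    return 0;
}

Note that "The", "the", "cat", and "Cat" land on four different pages, so the upper/lower split costs nothing at lookup time here; whether it buys anything is the question I raised above.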