In a Cassandra environment, the system can simply save every word (3 letters or more, or whatever rule we settle on) in a table used as an index that references all the pages containing that word.
search_table[word][page] = 1;
"[page]" is specific to a website, but "[word]" is not. We could also restrict the words inserted in this table to a dictionary; however, that could be difficult if we support 200 languages. (We also have to think about speed, and we may want to have one table per letter.)
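The indexing idea above can be sketched in C++ with an in-memory map standing in for the Cassandra table. This is only an illustration under stated assumptions: the names search_table_t, extract_words, index_page and find_pages are hypothetical, the word splitter only handles ASCII letters, and a real implementation would write to Cassandra rather than to a std::map.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for search_table[word][page] = 1.
// Outer key: the word (shared by all websites); inner key: the page (site specific).
using search_table_t = std::map<std::string, std::set<std::string>>;

// Split text into lowercase ASCII words and keep those with at least
// min_length letters (3 by default, as suggested in the text).
std::vector<std::string> extract_words(std::string const & text, std::size_t min_length = 3)
{
    std::vector<std::string> words;
    std::string current;
    auto flush = [&]() {
        if(current.length() >= min_length)
        {
            words.push_back(current);
        }
        current.clear();
    };
    for(unsigned char c : text)
    {
        if(std::isalpha(c))
        {
            current += static_cast<char>(std::tolower(c));
        }
        else
        {
            flush();
        }
    }
    flush();
    return words;
}

// Record every indexable word of the page in the table.
void index_page(search_table_t & table, std::string const & page, std::string const & text)
{
    for(auto const & word : extract_words(text))
    {
        table[word].insert(page);
    }
}

// Return the set of pages that contain the given (lowercase) word.
std::set<std::string> find_pages(search_table_t const & table, std::string const & word)
{
    auto const it(table.find(word));
    return it == table.end() ? std::set<std::string>() : it->second;
}
```

With this sketch, calling index_page() once per page and then find_pages(table, "open") returns every page whose text contains "open". Using a set for the inner level mirrors the "= 1" in the table notation: only the presence of the page matters, not a value.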
When a page is deleted, we need to make sure to delete the corresponding entries in the search index.
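Because the index is keyed by word first, removing a page requires knowing which words it was indexed under. One possible approach, shown here as a sketch and not as the project's actual design, is to keep a reverse map from each page to its words. The search_index structure and its member names are hypothetical, and the in-memory maps again stand in for Cassandra tables.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical index that keeps both directions so a page can be removed
// without scanning every word in the table.
struct search_index
{
    std::map<std::string, std::set<std::string>> pages_by_word; // word -> pages
    std::map<std::string, std::set<std::string>> words_by_page; // page -> words

    // Equivalent of search_table[word][page] = 1, plus the reverse entry.
    void add(std::string const & word, std::string const & page)
    {
        pages_by_word[word].insert(page);
        words_by_page[page].insert(word);
    }

    // Remove every index entry that references the deleted page.
    void remove_page(std::string const & page)
    {
        auto const it(words_by_page.find(page));
        if(it == words_by_page.end())
        {
            return; // page was never indexed
        }
        for(auto const & word : it->second)
        {
            auto const w(pages_by_word.find(word));
            if(w != pages_by_word.end())
            {
                w->second.erase(page);
                if(w->second.empty())
                {
                    // no page uses this word anymore, drop the row
                    pages_by_word.erase(w);
                }
            }
        }
        words_by_page.erase(it);
    }
};
```

The reverse map costs extra storage, but it avoids walking the whole word table each time a page is deleted.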
We want to look at Apache Solr and similar tools to see whether they could be of any help.
Extensions would be systems that allow us to search all kinds of documents (PDF, MS Word, etc.).
See also: http://www.opensearch.org/Home
Snap! Websites
An Open Source CMS System in C++