Subtitle:"How I made IBM cry once"
A RichardHenderson production.
From: DocQueryInSql and TreeInSql.
If you can store parameterised queries then you can push the results into a cache ahead of time. Alternatively you can go buy an additional caching Web application box for a lot of cash (Oracle 9i, Sybase Enterprise Portal) and it still won't be as fast as this, or you can DIY.
This version was on the db server and was faster than any existing commercial solution. Okay the server had 4GB of memory, but only 4 cpus. It drove 4 webservers (4 cpu each) giving out content for all pages plus arbitrary queries and was only just breaking a sweat. Scalability was made infinite using a write/read separable architecture.
This is one of the larger system/process architectural examples for my book. The cache maintenance logic is the primary focus, though I will discuss some of the non-caching advantages of having parameterised queries.
So far we've got a pretty efficient stemming query based on a small but useful logic. We can store any query since it is a list of
Initially this was a convenience for page authors who could embed an ad-hoc query as a JavaScript http GET using a cgi style parameter string (this was actually significantly complicated by the fixed number of parameters in stored procedures, but I digress). As time went on, more and more standard queries were generated and stored by name, thus the term 'stored queries'. Later this functionality was exposed to users so they could store their own libraries of queries and watchdogs.
The result of a query is a list of documents in some order. This list can also be stored in a cache. Thus stored queries can be run at any time and cached. This is the basis for 'push' functionality. Whenever a document was stored in the system, its attributes were stored as
The cache entries could also age out, so the 'push' cycle involved two parts:
The second part was designed to only run queries related to new (or updated) documents, but completely regenerated the cache contents for any particular query. This was much more reliable and simple. Updated document categories would otherwise cause duplicate cache entries, requiring further logic to fix. Specific caches could be manually invalidated where the query itself had changed, or the cache data wasn't there. Thus queries against cached data had to fallback to a full query when necessary. This prevented users from being forced to wait for push-cycle updates if they didn't want to. The flag which indicated cache status was put into the
The
We are starting to get into serious functionality now. Particularly this can be the basis for any dynamic personal portal. User preferences and standard pages are derived from the same query logic. This uniformity makes for a happy bunch of developers and happy users who often heavily customised the portal site by defining their own pages of stored queries. Content publishers used the same system for standard pages, and used a slightly modified version of the 'user preferences' dialogs to define those pages and how they should be displayed.
The 'push' logic allowed writes and reads to be fully separable. This scheme was well suited to scaling.
Oh yes, why did IBM cry? Well they quoted 3.5 megabucks to duplicate our feat using Websphere and about eight very large Sun boxes. Suddenly my extortionate rate didn't look so bad after all :). I don't know the status of the project, but it is still a very successful portal, even if they did strip out a lot of the good bits so IBM could do it.
Cheers
-- RichardHenderson.