Indexed TiddlyWeb Filters
One of the core features of TiddlyWeb is its ability to use filters to constrain the tiddlers that are selected from any collection of tiddlers (bag, recipe, search results, etc.). In the early design discussions that led to the creation of TiddlyWeb, filters were conceived as the mechanism a recipe would use to choose only some tiddlers from a bag. Bags are containers for tiddlers that have been grouped together for some reason. Recipes are lists of bags that lead to the creation of some useful set of tiddlers. When using TiddlyWeb and TiddlyWiki together, a recipe can create a particular application or vertical of TiddlyWiki. In that context one would use a filter to select some tiddlers tagged “systemConfig” from one bag, others tagged “faq” from another, and others with modifier “cdent” from another.
When TiddlyWeb is used generally as a data store, filters are just as useful. When requesting tiddlers from any bag you can select and sort by attributes, and limit the number of tiddlers. Application developers can also make new filters as plugins (see mselect for an example).
This is all quite grand and useful but recent explorations by Mike Mahemoff while developing Scrumptious have revealed some (fairly expected) problems. Imagine a bag called “comments” containing some 10,000 or more tiddlers which are comments on URLs. Now imagine you’d like to get those tiddlers which have the field ‘url’ set to http://cdent.tumblr.com/.
The naive way to do this is to look at each one of those tiddlers, one at a time, and say “Hello tiddler, have you got your url field set to http://cdent.tumblr.com/? Oh you do? Well then I’ll have you, thanks!” This is time consuming and resource intensive.
It is also the way TiddlyWeb does filters. It’s like this for a few different reasons:
- The original design imagined many bags, not few bags with large numbers of tiddlers.
- It preserves the strict separation between the filter system and the storage system, meaning that the storage system can be simple and very adaptable: any filter can work with any store.
- It makes the filter system fairly transparent: There’s no magic going on; a filter works by looking at tiddlers and making a decision.
- It makes the filter system easy to extend: The contract between a filter and the rest of the system is “look at some tiddlers, return some tiddlers”. What the filter does when looking is arbitrary.
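To make that contract concrete, here is a minimal sketch of a naive select filter. The Tiddler class and the select_by_field function are hypothetical stand-ins invented for this illustration, not TiddlyWeb's own code; the point is only that every tiddler gets looked at once.

```python
from dataclasses import dataclass, field


@dataclass
class Tiddler:
    """Hypothetical stand-in for a tiddler: a title plus arbitrary fields."""
    title: str
    fields: dict = field(default_factory=dict)


def select_by_field(tiddlers, name, value):
    """Yield only the tiddlers whose field `name` equals `value`.

    Every tiddler is inspected once, so the cost grows linearly
    with the size of the collection.
    """
    for tiddler in tiddlers:
        if tiddler.fields.get(name) == value:
            yield tiddler


comments = [
    Tiddler("c1", {"url": "http://cdent.tumblr.com/"}),
    Tiddler("c2", {"url": "http://example.com/"}),
    Tiddler("c3", {"url": "http://cdent.tumblr.com/"}),
]
matches = [
    t.title
    for t in select_by_field(comments, "url", "http://cdent.tumblr.com/")
]
```

With three comments this is instant; with 10,000 or more, the same loop is where the time goes.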
The 0.9.74 release of TiddlyWeb includes support for querying an index when doing select-style filters. The support only kicks in under special circumstances (explained below), but when it does it can speed some filters up immensely. In a test (using profile/list_tiddlers.py) of 10,000 tiddlers, a filter that took 13.96 seconds without an index took 0.30 seconds with an index.
That’s great news. Here’s the bad news: In at least this initial implementation the prerequisites for the indexing system to be used (and be useful) are quite complex. Here’s the list:
- The filter being performed must be a select filter (sort and limit filters do not use the index).
- If there are multiple filters being performed on the collection of tiddlers, the select filter must be first in the stack and only that one filter will use the index.
- The collection of tiddlers must be what’s called a “natural” bag. That is, the thing being filtered is a bag that exists in the store and the entire contents of the bag is what’s desired to be filtered.
- That bag should be skinny, meaning that when it was loaded from the store its tiddler contents were not determined. If you are processing recipes or working from URLs this is handled for you by the code. It’s only a concern if you are writing your own handlers.
- tiddlyweb.config['indexer'] is set to a string which is the name of a module which provides an index_query(environ, **kwargs) callable that returns tiddlers which have been loaded from the store. tiddlywebplugins.whoosher has been updated to provide this and the next item.
- Something must provide an index for index_query to query. That index needs to be kept up to date as tiddlers are changed. tiddlywebplugins.whoosher has this functionality. The sql and mappingsql have the guts to make the functionality possible, but have not yet been extended with an index_query method.
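As a rough illustration of the last two requirements, the sketch below pairs an index_query(environ, **kwargs) callable with something that keeps an index up to date. Only the index_query signature comes from the description above; the in-memory STORE and INDEX, the index_tiddler helper, and the use of plain dicts as tiddlers are all assumptions made for the example. It is not how tiddlywebplugins.whoosher works internally.

```python
from collections import defaultdict

# Hypothetical in-memory "store" and index, for illustration only.
STORE = {}                 # (bag, title) -> tiddler dict
INDEX = defaultdict(set)   # (field, value) -> set of (bag, title) keys


def index_tiddler(tiddler):
    """Record a tiddler in the store and index each of its fields.

    Kept deliberately simple: stale index entries are not removed when
    a tiddler is overwritten, which a real index would have to handle.
    """
    key = (tiddler["bag"], tiddler["title"])
    STORE[key] = tiddler
    for field_name, value in tiddler.items():
        INDEX[(field_name, value)].add(key)


def index_query(environ, **kwargs):
    """Return stored tiddlers matching every field=value pair in kwargs."""
    if not kwargs:
        return []
    key_sets = [INDEX.get(item, set()) for item in kwargs.items()]
    matching = set.intersection(*key_sets)
    return [STORE[key] for key in sorted(matching)]


index_tiddler({"bag": "comments", "title": "c1",
               "url": "http://cdent.tumblr.com/"})
index_tiddler({"bag": "comments", "title": "c2",
               "url": "http://example.com/"})
found = index_query({}, bag="comments", url="http://cdent.tumblr.com/")
```

The lookup touches only the index entries for the requested fields rather than every tiddler in the bag, which is where the speed-up comes from.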
The TiddlyWeb at http://tiddlyweb.peermore.com/ has been updated to use the filter indexes. The relevant changes to the tiddlywebconfig.py are:
- Add tiddlywebplugins.whoosher to twanager_plugins.
- Add tiddlywebplugins.whoosher to system_plugins.
- Set indexer to tiddlywebplugins.whoosher.
Then twanager wreindex is run to build the initial Whoosh index.
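Put together, those three changes might look something like the following in tiddlywebconfig.py. This is a sketch showing only the keys named above; any other settings or plugins an instance already has are omitted, and an existing instance would append to its current plugin lists rather than replace them.

```python
# tiddlywebconfig.py (sketch; unrelated settings omitted)
config = {
    # Makes the plugin's twanager commands (such as wreindex) available.
    'twanager_plugins': ['tiddlywebplugins.whoosher'],
    # Loads the plugin in the running server.
    'system_plugins': ['tiddlywebplugins.whoosher'],
    # Name of the module providing index_query for select filters.
    'indexer': 'tiddlywebplugins.whoosher',
}
```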
You can get some sense of the effects of the index by comparing the following two URLs (this is not an exact test, but gives the sense of things):
- Not indexed: http://tiddlyweb.peermore.com/wiki/bags/docs/tiddlers?limit=5000;select=tag:systemConfig
- Indexed: http://tiddlyweb.peermore.com/wiki/bags/docs/tiddlers?select=tag:systemConfig;limit=5000
The docs bag only has a couple hundred tiddlers in it and memcached is involved, so the effect is not huge, but if you imagine bags orders of magnitude bigger…
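The only difference between those two URLs is the order of the filters. The sketch below shows the kind of check implied by the "select must be first" rule. Both functions are hypothetical and written for this example; TiddlyWeb's real filter parser is more involved, and the other prerequisites (a natural bag, a configured indexer) still apply.

```python
def parse_filters(query_string):
    """Split 'a=b;c=d' into a list of (filter_name, argument) pairs."""
    pairs = []
    for part in query_string.split(";"):
        if part:
            name, _, argument = part.partition("=")
            pairs.append((name, argument))
    return pairs


def first_filter_can_use_index(query_string):
    """True when the first filter in the stack is a select filter."""
    filters = parse_filters(query_string)
    return bool(filters) and filters[0][0] == "select"


not_indexed = first_filter_can_use_index("limit=5000;select=tag:systemConfig")
indexed = first_filter_can_use_index("select=tag:systemConfig;limit=5000")
```

In the first query string limit comes first, so the select that follows has to scan tiddlers one at a time; in the second, select leads the stack and is a candidate for the index.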
Astute observers will note that what’s going on here is not particularly innovative: It’s simply the addition of an index to a query system. One can imagine future improvements a la SQL query optimization, wherein the order of the filters is adjusted to allow the most effective use of the index, and the index is used for more queries than just those against “natural” bags. Constant evolution, constantly building on the shoulders of that which has come before.
For more details, a browse of the code will be instructive. tiddlyweb.control:filter_tiddlers_from_bag is a good entry point.
