16 March 2006

Tasks and how to evaluate them while making use of link information

This post is primarily reconstructed from a comment I made on Stephen Green's weblog, which I thought would be good to recreate here.

I think there may be some confusion in Stephen Green's recent post, arising from his reading of a paper by Nick Craswell, Dave Hawking and Steve Robertson, that I'll try to clear up. Nick Craswell is actually in town (on holidays from Cambridge) and I've chatted to him about this since writing the original comment. He had some additional thoughts on the matter as well.

At TREC, the Information Retrieval / Enterprise Search group at CSIRO coordinated a series of tracks involving search tasks over large-scale collections of Web data, from about 1997 to 2004.

During that time, we evolved our understanding of the interplay between classic ad hoc information retrieval evaluation methods, metrics and tasks and those appropriate to the Web.

Early on, we applied classic ad hoc retrieval TREC methods, metrics and tasks (referred to as subject search tasks in the 2001 SIGIR paper of Craswell, Hawking and Robertson) to this Web data. An example subject search task might be captured by a short topic such as "channel tunnel".

Over time, we discovered that link-based methods did not provide great benefit for such subject search tasks, when evaluated using classic measures (such as average precision, which requires large numbers of relevance judgements for each query). Classic ad hoc track evaluation methods use a human assessor to form binary judgements (relevant/not relevant) over a large pool of documents, and across a set of queries.
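To make the metric concrete, here is a minimal sketch of average precision for a single query, assuming binary judgements as described above. The function name and the document ids are made up for illustration; this is not code from TREC or from the paper.

```python
def average_precision(ranking, relevant):
    """Average precision for one query.

    ranking  -- list of document ids in the order the system returned them
    relevant -- set of document ids an assessor judged relevant (binary)
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of this relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical ranking and judgements for one topic.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
print(average_precision(ranking, relevant))
```

Note that the score depends on knowing all the relevant documents for the query, which is why the measure needs so many judgements.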

In fact, there were two things wrong here: subject search tasks are only occasionally performed on the Web, and average precision is not a good metric for most of the kinds of tasks that do get performed.

The 2001 SIGIR paper investigates a particular kind of task (home page finding, which is really a subset of known item finding), and determines that the algorithm (Okapi/BM25) which had done so well at TREC on pure content (subject search) can also do really well over documents represented just by the combination of anchor text for that document's URL, when applied to home page finding tasks.

By 2004, the TREC Web track was considering home page finding, known page finding (specific pages which aren't home pages), and topic distillation (providing a set of relevant home pages about the topic), which are believed to be far more indicative of a lot of Web search activity.

The evaluation metric for known item search is typically the mean reciprocal rank of the first relevant item (and often there is exactly one relevant item, as remarked by Jeremy). Topic distillation can be judged with a number of metrics. One of the lessons from TREC is that you may need to use multiple metrics, since different metrics can give different outcomes. For more information see the overview of the TREC-2004 Web track.
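For comparison with average precision, mean reciprocal rank can be sketched in a few lines. The names and the three toy queries below are illustrative assumptions, not data from the Web track.

```python
def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant item, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs -- list of (ranking, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical home page finding queries, each with one right answer.
runs = [
    (["a", "b", "c"], {"a"}),  # right answer at rank 1
    (["x", "y", "z"], {"z"}),  # right answer at rank 3
    (["p", "q"], {"r"}),       # right answer not returned
]
print(mean_reciprocal_rank(runs))
```

Only one judgement per query is needed here, and all the credit goes to getting that one page near the top.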

So to return to Stephen's original post: the 18.5 million page VLC2 collection is just fine to use for link-based methods. But it's all about what you are measuring and how you are measuring it. What 18.5 million pages can't do is show a benefit for link-based methods on subject search tasks evaluated using average precision.

To comment briefly on Jeremy's analysis, it's not clear that Brin and Page's 1998 paper addresses evaluation, other than informally. In fact, they state that an "extensive user study or results analysis" is out of scope. They do have many good points about the difficulties of academic evaluation of Web search. Our group has a set of papers relating to Web search evaluation that may be of interest in this arena.

In discussions with Nick, he felt that WT2g, the smallest of all the Web collections we put together, could even be sufficient for making use of link information. However, it was never used with a series of tasks/evaluation metrics which could demonstrate this, so it's only a conjecture. That said, some work by Nick and Dave Hawking, reported in their chapter in the TREC book, did examine the link information available and seemed to support this.

02 February 2006

Adding my Contacts to the available addresses in Outlook with Exchange

At my work, Exchange is in use. Mostly that's been great. However, the default setup has not let me use my own Contacts (and distribution lists) when composing email. This has bugged me no end.

There is a solution however! After digging through the help system, I found My contacts don't appear in the Address Book. Indeed, the check box is not available, so clicking on the How link reveals what to do. Basically, you have to "Add a new directory or address book" (in the Tools > Email Accounts popup). And then after restarting, my Contacts are at long last available when composing email.

17 January 2006

Wikipedia gets a Google blessing for biographical searches

I've been looking into definitional question answering of late, and have observed something that intrigues me. No doubt other people have noticed it as well, but I haven't read a reference to it so maybe it's not totally obvious.

Google appears to be doing some quite clever query processing, and is making use of additional indicators that you're asking a question if you type in queries like "who is ...?", "what is ...?". For some time, they have had the Google Answers service, and include a link tip to redirect you to their pay-for-answer facility for more complex queries - e.g. "what symphonies did Beethoven write?".

For simpler "what" queries (e.g. "what is a symphony?"), their Web definitions facility highlights a short factual answer for common terms, selected from one of a number of different Web definitions providers.

But what is most interesting (especially given the recent controversy surrounding Wikipedia involving biographical entries) is that for "who is/was ...?" queries, Google has given a blessing to the authoritativeness of Wikipedia by highlighting the Wikipedia entry (if one exists) for the person, e.g. "who was Beethoven?"

Note that this works for both real people, alive or dead, and for imaginary people or things, provided they have a Wikipedia entry (e.g. "who is Winnie the Pooh?").

You can tell you're getting something different, because Google treats the result differently, placing it above any news results for the person, and acknowledging the source with a tag "According to http://en.wikipedia.org/wiki/...".

In a small number of circumstances, Google appears to prefer references from www.who2.com, which provides a service listing details about famous people, though the existence of a who2.com entry does not by itself mean Google returns it in preference. Given that the two queries which returned me the who2 entry were for "Bill Gates" and "George Bush", it may be that Google uses who2 in cases where significant defacing of Wikipedia entries occurs (even if speedily rectified).

I wonder if this is one of the examples of Google applying some of the research/practices from the enterprise search arena that John Battelle refers to, or clever natural language processing (though that tends to be computationally expensive), or just some simple and efficient query analysis.
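To show how little machinery the "simple and efficient query analysis" option would need, here is a rough sketch. It is purely my own guess at the idea, not how Google does it; the patterns and the category names are assumptions.

```python
import re

# Hypothetical patterns: each maps a question shape to a kind of lookup.
PATTERNS = [
    ("biography",
     re.compile(r"^who\s+(is|was)\s+(?P<subject>.+?)\??$", re.IGNORECASE)),
    ("definition",
     re.compile(r"^what\s+is\s+(an?\s+|the\s+)?(?P<subject>.+?)\??$",
                re.IGNORECASE)),
]


def classify_query(query):
    """Return (kind, subject); anything unrecognised is a general query."""
    text = query.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group("subject")
    return "general", text


print(classify_query("who was Beethoven?"))
print(classify_query("what is a symphony?"))
print(classify_query("what symphonies did Beethoven write?"))
```

A "biography" result could then be routed to a preferred biographical source, and a "definition" result to a definitions provider, with everything else handled as a normal search.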

Overall, my reading of this is that Google believes Wikipedia provides the best results for the vast majority of biographical subjects. I imagine it's unlikely that Wikipedia is doing any deals to get placed in this way, since they don't really need any more traffic than they're getting already! Thus it's a strong statement to make by Google in support of collaboratively-authored and mediated content.

15 November 2005

Software product delivery - working out your bug tradeoffs

Just read Eric Sink's latest article, on how he (and his company) make decisions regarding bugs, after their particularly embarrassing week shipping 3 maintenance releases. Last week I gave an internal talk on delivering products, so this is an issue close to my heart. However, I didn't cover any of the mechanics of software development. This article is probably the best thing I've read on the hard choices you have to make when delivering working software to real customers, while still shipping on time and to an effective degree of quality. I've always enjoyed Eric's writing - read this to see why.

Vale Peter Drucker

Peter Drucker, the person who coined the term "knowledge worker" and who made great contributions to management theory, died a few days ago. McKinsey have a special collection of articles on Peter Drucker and the knowledge manager, which looks like a great read. It's only available to site members, but you can join up for free. McKinsey often have interesting things to read about if you're interested in business, so it's worth doing this for that reason alone.

07 November 2005

Paying for Google

Jason asks: would you pay $5/month to use Google's services? Lots of responses in the comments. I guess the point is that we do pay $/month to use Google's services, just indirectly. Every time we (and that means everyone who uses Google) buy anything from a company that advertises on Google, we are supporting the advertising budget of that company. And it's that advertising budget that funds Google services for "free" to us. So while I individually may not click on an ad link, or indeed ever buy from any company that advertises on Google, there are an awful lot of people who do. Now the interesting thing to work out would be how much we are paying (indirectly).

02 November 2005

Panoptic and CSIRO used as an example of search engine 'Best bets'

James Robertson provides a discussion of why and how to make use of what he calls 'best bets' in a search solution. Although it doesn't say so, the example screenshot he provides is through the Panoptic search engine. Having been involved with the development of Panoptic in its very early days (circa 1999) while at ANU, I thought it might be interesting to recall why we added it into the product back then. (As of the last 3 weeks, I am now employed by CSIRO, as part of the Enterprise Search group, which develops Panoptic still.)

One of the characteristics of any large federated enterprise (such as a university or large corporation) is that there can be a mass of content, of quite varying nature. What makes it doubly hard is that much of the evidence that can be found on the web at large in anchor text (which leads to much better result ranking in general) is not available from within the enterprise's own content. The properties of the search engine's ranking algorithm may mean that the top ranking answer(s) do not include the one(s) perceived by the organisation as the "correct" one(s) for a particular query.

With Panoptic's very first customer (the ANU), from memory it was either "library" or "biology" that led to exactly the scenario described. The ANU Library home page was not listed close to the top (or possibly even on the first page of results). We examined the text of the Library home page, and could see why it would not be ranked highly due to the words used. However, the home page for the Library was serving its purpose just fine. This still holds true today, as can be seen in this search for library at ANU. The library home page is listed as result 18, but as a featured page, it's top of the list.

The solution was to implement what in Panoptic is called "featured pages", or "best bets" as James calls it. This mechanism allows the search engine operators to identify the preferred results for particular queries, thereby guaranteeing they appear at the top of the list, while still distinguishing them from the ranked results.
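As a rough illustration of the mechanism (not Panoptic's actual implementation, which I'm not reproducing here), featured pages can be thought of as an operator-maintained table that is consulted alongside the ranking algorithm. The function names, the URLs and the stand-in ranker below are all made up.

```python
def normalise(query):
    """Lower-case the query and collapse whitespace so lookups are forgiving."""
    return " ".join(query.lower().split())


def search_with_featured(query, ranked_search, featured_pages):
    """Return (featured, ranked) result lists for a query.

    ranked_search  -- function mapping a query string to a ranked list of URLs
    featured_pages -- dict mapping a normalised query to operator-chosen URLs
    """
    featured = featured_pages.get(normalise(query), [])
    # The ranked list is left untouched, so featured pages stay
    # clearly separate from what the ranking algorithm produced.
    ranked = ranked_search(query)
    return featured, ranked


# Hypothetical operator table and a stand-in ranking function.
featured_pages = {"library": ["http://library.example.edu/"]}


def ranked_search(query):
    return ["http://example.edu/page1", "http://example.edu/page2",
            "http://library.example.edu/"]


featured, ranked = search_with_featured("  Library ", ranked_search, featured_pages)
print(featured)
print(ranked)
```

Returning the two lists separately leaves it to the results page to display the featured entries first while marking them as distinct from the ranked ones.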

Ruby, Eclipse, and Ferret - on Windows

I've started a new job, and I need a high productivity toolset for doing information retrieval experiments with.

First, there was the question of what programming language to learn. (There's a goal in the Pragmatic Programmer of learning at least one new language every year.) I've been hearing about Ruby since 2000, and of course in the last year Ruby on Rails has become all the rage. I'm doing quite a lot of web-based stuff at the moment, so having this framework available down the track should be useful. And, again due to OTI, I have a soft spot for anything which uses Smalltalk as such a strong part of its intellectual inheritance, while adding the useful parts of Perl which has always been a good tool for string manipulations. So I decided to finally check it out.

And ok - at long last I've bitten the bullet and started playing with new development environments for what seems like the first time in years. I know, I know, the die-hards will ask, "why would you bother leaving emacs". But since I worked for a time at OTI (back in the Envy Developer days) who I now see are finally known as the IBM Ottawa lab, I know there is an alternative. And of course, it's called Eclipse. Eclipse would be no use on its own, since it was originally built for Java, but luckily, there are the Ruby Development Tools for Eclipse.

These days, I've become acclimatised to running my desktop on Windows, despite for years having been a die-hard Mac boy, and then doing a lot of productive research using Linux as my desktop. So as an experiment, I decided to see whether I could be productive with Ruby and Eclipse on Windows. It turns out not to be complex at all. Here's what I did:
  1. Downloaded and ran the One-Click Ruby Installer. This is a great piece of work - it just works! There's been a security patch release of Ruby (1.8.3) since the most recent version of the installer was released, but I'm running this on an all internal network, so no drama. Given that I was planning on downloading Eclipse as well, I probably didn't need to select some of the extra editing and debugging tools, but I wasn't sure how Eclipse was going to stack up.
  2. Neal Ford's IBM developerWorks article on "Using the Ruby Development Tools plug-in for Eclipse" was a useful resource in working out what to do next. For Eclipse to work, I needed a Java runtime environment. I chose the J2SE v 1.4.2 one available from Sun. Install, "You must reboot" - I hate that. Reboot.
  3. Next I downloaded Eclipse, but just the platform runtime binary as I didn't want to do Java programming or write my own plug-in just yet. It's not exactly clear what you do with the downloaded zip file. I decided to extract it to C:\Program Files\eclipse. And then made a shortcut from the eclipse.exe and put it on my desktop.
  4. Next was the Ruby Development Tools, or RDT for short. The RDT site mentions three ways to obtain them. At first, I tried downloading the zipped file, but Windows refused to unzip it, citing security grounds. Good to see that security is taken very seriously these days at Microsoft! So next I followed the instructions on the RDT download page about using Eclipse's Update Manager. This worked fine. But you have to restart Eclipse.
And that was it - I had Ruby running through Eclipse! Ran some example programs, started developing some of my own. Too easy.

I'm interested in handling UTF-8 encoded files, and at present Ruby still is not the ideal language for treating these natively. Apparently it's coming in Ruby2.0/Rite. In the meantime, there are some Eclipse Run configuration settings that will probably help.
  1. In the Arguments tab, add -Ku as a program argument.
  2. And in the Common tab, set the Console encoding to UTF-8.
While searching for information about Ruby and UTF-8, I came across Dave Balmain's Ferret, a Ruby port of the Lucene search engine. Dave didn't have instructions about how to install this with Eclipse, but it was straightforward.
  1. I downloaded the latest stable Ferret zip file.
  2. Extracted the zip file to a new Ferret directory in my chosen Eclipse workspace folder. This was ferret-0.1.3. Since Dave seems to be moving fast with new releases, I want to be able to download new versions in the same top level directory.
  3. Then I created a new Eclipse run configuration for the setup.rb file that lives in the top level directory of ferret-0.1.3, and ran it. It complained about not having make and a C compiler, but that's because I don't - as yet. Otherwise it all ran fine.
  4. Then I created a testFerret.rb file in my test workspace, and copied Dave's example Ferret program from the comp.lang.ruby group announcement post into it. One more run configuration, and hit run. And it all just worked!
Looks like I can stick with my Windows desktop for now. And I guess it's time to start doing some experiments...

16 October 2005

Information zones

An idea that's been buzzing around in the back of my head for some time is applying Permaculture's zone (and maybe sector) analysis to information.

Broadly speaking, the zone concept in Permaculture works to aid in the design process when making use of energy available on a site. Zones are established according to the patterns "of use, of access, and of time available". [quoted from Permaculture - A Designer's Manual by Bill Mollison]

The zones are pictured as an expanding ring of concentric circles, moving from Zone 0 (the house or village), through intermediate zones containing herb spirals, kitchen gardens, vegetable plots, shelterbreaks, etc and out to Zone 5 (the natural, unmanaged environment).

I believe this could be a helpful way of reviewing the personal information landscape. I think we unconsciously select the information zone when carrying out information retrieval activities, and use this as a contextual aid according to the need we have.

Zone 0 - the individual

First up in Zone 0 is everything I myself know without reference to any external aid.

Zone 1 - immediate personal proximity

Zone 1 contains information that I carry with me pretty much all the time, but don't remember. So this may include information I keep in my wallet, my mobile phone, or my PDA. There needs to be minimal disturbance between thinking of an information need and accessing it through such resources.

Zone 2 - local vicinity

The next zone constitutes any information readily accessible to where I spend much of my day. This zone includes fixed PCs or notebooks at my home or my work. It can also include the books and magazines on my bookshelf, or the contents of my filing cabinet. There is likely to be a greater disturbance between thinking of an information need and accessing it, unless of course I'm currently sitting in front of the computer and that's where the information resides.

Zone 3 - immediate networked neighbourhood

Before leaping onto the Web at large to find an answer, I know that sometimes information resides in the immediate networked neighbourhood. This zone can include other PCs and storage repositories connected to my computer (either at home or at work), or people who I can easily communicate with (in person, by phone or IM).

Zone 4 - the 'net

Next step is everything I can reach through the Internet or other more old fashioned communication channels. This zone includes the contents of the publicly accessible Web, libraries, but also people who I can contact through email or letters. The key here is that there is some substantive degree of physical distribution and/or asynchronicity between my local environment and the environment in which the information resides. I can't rely on an immediate response. It should be noted that this is where search engines such as Google and organisations like the Internet Archive have played an invaluable role by caching information in case it has changed or is temporarily unavailable.

Zone 5 - beyond

The last zone is everything beyond what I can reach in one step. This zone includes huge quantities of information in the "Dark Internet", or that is accessible only by other people (such as within corporate intranets). This information may never become accessible to me, but has at least the potential to do so.
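One way to make the zones a little more concrete is to treat them as an ordered list of places to look, trying the nearest first. This is only a sketch of the idea; the zone names follow the headings above, and the lookup functions and sample data are invented for illustration.

```python
from enum import IntEnum


class Zone(IntEnum):
    INDIVIDUAL = 0               # what I know unaided
    PERSONAL_PROXIMITY = 1       # wallet, mobile phone, PDA
    LOCAL_VICINITY = 2           # my PCs, bookshelf, filing cabinet
    NETWORKED_NEIGHBOURHOOD = 3  # connected machines, people nearby
    THE_NET = 4                  # the Web, libraries, email
    BEYOND = 5                   # not reachable in one step


def find_answer(need, sources):
    """Try each available zone from nearest to furthest.

    sources -- dict mapping a Zone to a lookup function that returns
               an answer, or None if that zone can't satisfy the need
    """
    for zone in sorted(sources):
        answer = sources[zone](need)
        if answer is not None:
            return zone, answer
    return None


# Hypothetical contents for three of the zones.
memory = {"my phone number": "known by heart"}
phone = {"dentist": "number in contacts"}
web = {"channel tunnel": "found via a search engine"}

sources = {
    Zone.THE_NET: web.get,
    Zone.INDIVIDUAL: memory.get,
    Zone.PERSONAL_PROXIMITY: phone.get,
}

print(find_answer("dentist", sources))
print(find_answer("channel tunnel", sources))
```

The ordering stands in for the lag between thinking of a need and reaching the information, which is the part I'd expect tools to keep compressing.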

I'm going to explore this concept some more in future - particularly with respect to how tools have worked to compress the lag in accessing information in outer zones. A nice diagram would be helpful too.

Have I missed any zones that are important do you think?

Starting a new job

After 3 months of being pretty much away from modern communications, I'm back.

Tomorrow I start a new job. I'll be working with the CSIRO, in the Enterprise Search team. I'm getting the privilege of working as Research Scientist in a great group of people, and I'm really looking forward to it.

I hope to continue (and restart) blogging regularly, primarily on matters relating to enterprise search and information retrieval. However, it should be noted that this blog is my personal blog, and the opinions I express on it are my own, not the formal position of CSIRO or the Enterprise Search team. There will naturally be some things I can't blog about, depending on what we're doing within the group.

I've finally posted in my previous post some ideas I had a while ago about semi-structured tagging of data. Since I've been away on holiday for some time, it's quite possible that other people have expressed these ideas, or even gone implementing appropriate systems to make use of them. That's what happens when you duck out of the blogosphere for a while - you've got no idea what has been going on!

About Here Me Now . Info

What

In a nutshell, it's all about identity. Ours. And our data.

The genesis of the name lies in a realisation I had that the logical extension of ad hoc tagging would be in semi-structured tagging of all data. Ad hoc tagging is that found in systems such as Flickr and del.icio.us. It simply consists of keywords chosen by the tagger, and is a way to provide some Info about the content or context of the data being tagged. Tags are a form of meta-data - data about other data.

The received wisdom about human-authored metadata is that in general, metadata is going to be: inaccurate, out of date, expensive to produce, or used to try to spam information retrieval systems. There are some rare exceptions to this rule (think library catalogues). However, web search engines basically ignore nearly all metadata associated with web pages, because of these reasons. In fact, the only metadata that will generally be trusted is that which is computer-created. For example, the last modified date on a file.

Increasingly, large amounts of the data that we produce (and make available to others) will be captured through devices that can reliably produce metadata. Mobile phones, digital cameras, PDAs, tablets are our primary tools for this at present. Observing the convergence of these devices, or at least, aspects of the technology (e.g. it's become much harder to buy a mobile phone without a camera in the last year), I anticipate that in a very short time, these tools will record metadata across at least three axes.

Where

Where was that photo of the sunset taken? Which stadium was that rock concert performed at?

Geo-location through GPS is the sleeping giant of metadata production. At present, its use is restricted due to the relatively high cost of an individual receiver (at least, compared to the cost of a digital camera lens). But if I could have geo-location tagging built into the recording of my photos, it would immediately provide incredible extra value. Naturally enough, various people have shown how to do this in ad hoc ways, but it will require an automated solution to make it commonplace. Google Earth and MSN Virtual Earth, together with the local search with integrated mapping offered by MSN, Google and Yahoo, are demonstrating how much better it is to correlate data with our position in space.

Identity on the geo-location axis is denoted by the place Here. I use identity here in a loosely mathematical sense.

In addition to Cartesian grid coordinates, location can be described relative to other known locations - e.g. north of ..., above the ..., near my ..., between ... and ....

Who

Who authored that article on metadata? Who took that photo of elephants?

The twin trends of micro-identification (think RFID and IPv6) and ubiquitous connection to the Internet through wireless communications mean that our tools will be uniquely addressable and connected.

The notion of assigning our human selves a digital identifier is fraught, but is receiving considerable attention, and involves even more complex issues of trust than for fully digital entities. As usual, there are many de facto mechanisms which may be used in different contexts. For example, I most frequently identify people by the homepage of their blog (if they have one). And people can make reference to me through my blog.

Of course, we sometimes lay claim to our data by associating it with some property we own, such as our blog, our Flickr site, or more traditionally writing our name as one of the list of authors. In future, ways to more directly tie identity to the data will become ubiquitous I suspect.

Identity (mathematical) on the identity (metaphysical) axis is denoted by the entity Me.

When

When was that concert recorded? When was that video of my baby taken - was it his birthday?

Many devices (think digital cameras) already tag data with timestamps highly effectively - provided we can navigate the interface to set the time correctly. In the future, through Internet connections, it will be possible for devices to keep themselves in sync with Coordinated Universal Time wherever they are. There are probably already applications for Flickr that ferret out all the photos taken on a particular day.

Identity for the date-stamp axis is denoted by the time Now.

In addition to standards such as ISO 8601, time (like space) can be relative to other known times - e.g. before the ..., after the ..., around my ....

Why

Why is any of this useful?

My hypothesis is that as these metadata axes become increasingly common in our data, we may wish to modify some of it selectively (e.g. we wish to associate our digital camera's identity directly with our own digital identity to lay claim to photos), and we may wish to retrieve data based on these criteria. Lastly, there will be applications that emerge to mine the (meta)data and repurpose it.

The key to making use of it all will be to allow sloppy and human readable access to the metadata itself and any querying interfaces. Thus the default identity position - here, me, now - becomes essential.
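A small sketch may help show what I mean by a default identity position. It assumes a device that knows its own location, owner and clock; the field names, the coordinates and the blog URL are placeholders I've made up, and the timestamp is written out in ISO 8601.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Context:
    """What the recording device already knows: the identity on each axis."""
    here: tuple    # (latitude, longitude), e.g. from GPS
    me: str        # some identifier for the owner, e.g. a blog home page
    now: datetime  # the device clock, ideally synced to UTC


def tag(data, context, where=None, who=None, when=None):
    """Attach where/who/when metadata, defaulting to here/me/now."""
    return {
        "data": data,
        "where": where if where is not None else context.here,
        "who": who if who is not None else context.me,
        "when": (when if when is not None else context.now).isoformat(),
    }


# Hypothetical context for a camera phone.
ctx = Context(
    here=(-35.28, 149.13),
    me="http://blog.example.org/",
    now=datetime(2005, 10, 16, 9, 30, tzinfo=timezone.utc),
)

print(tag("sunset.jpg", ctx))
print(tag("concert.mp3", ctx, who="http://band.example.org/"))
```

Anything left unsaid falls back to here, me, now, and any one axis can be overridden selectively, which is the kind of sloppy access I have in mind.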

25 May 2005

Selecting an enterprise search engine

I remain surprised at how frequently the most important aspect of selecting a search engine gets overlooked in procurement processes. Namely: does it produce good results?

Google did not get where it is today by having clean design (though it does), or no ads (remember those days?), or at least unobtrusive ads (as it does now), or really fast response times (though it has).

It got where it is today because its results were qualitatively better than its competitors', at least at the time of its emergence.

AGIMO publishes a thoroughly comprehensive guide (in the form of checklists) of things to consider when Implementing an effective website search facility. I think its one flaw is not emphasising sufficiently that quality of results matters more than anything else. It does allude to it, in somewhat opaque terms just before the “Evaluation and selection” section:

It should not be forgotten, however, that the ultimate aim of any search facility is to guide searchers (at least those within the target group) as efficiently and effectively as possible to the information and services they need.

Information retrieval is one of the more scientific areas of computer science. It has decades of research, and with the TREC series, a long history of understanding how to do comparative search engine evaluations. Comparing two or more search engines is not black magic, but a straightforward and scientific process.
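To give a flavour of how straightforward the process can be, here is a minimal sketch that compares two engines over the same queries using precision at k. The choice of metric, the function names and the toy engines are my own assumptions for illustration; a real evaluation would use many more queries and metrics suited to the task.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top k results that were judged relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def compare_engines(queries, engine_a, engine_b, judgements, k=10):
    """Return (mean precision@k per engine, per-query win counts).

    engine_a, engine_b -- functions mapping a query to a ranked list of ids
    judgements         -- dict mapping each query to its set of relevant ids
    """
    totals = {"a": 0.0, "b": 0.0}
    wins = {"a": 0, "b": 0, "tie": 0}
    for query in queries:
        relevant = judgements[query]
        score_a = precision_at_k(engine_a(query), relevant, k)
        score_b = precision_at_k(engine_b(query), relevant, k)
        totals["a"] += score_a
        totals["b"] += score_b
        if score_a > score_b:
            wins["a"] += 1
        elif score_b > score_a:
            wins["b"] += 1
        else:
            wins["tie"] += 1
    n = len(queries)
    means = {name: total / n for name, total in totals.items()} if n else totals
    return means, wins


# Hypothetical results from two engines on two queries, judged by a person.
results_a = {"library": ["d1", "d2"], "enrolment": ["d5", "d6"]}
results_b = {"library": ["d2", "d9"], "enrolment": ["d5", "d6"]}
judgements = {"library": {"d1", "d2"}, "enrolment": {"d5"}}

means, wins = compare_engines(
    ["library", "enrolment"], results_a.get, results_b.get, judgements, k=2
)
print(means)
print(wins)
```

The important part is that both engines are run on the same queries over the same content and judged against the same set of relevance judgements.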

I'll say it again, the most important thing when selecting a search engine is whether it produces better results than its competitors.

Mixing advanced high level programming languages with "standard" programming language environments

Jon Udell writes about “Languages and environments”, and hopes to see deeper integration of dynamic languages with the JVM or CLR.

An old friend of mine, Don Syme, has been doing some interesting things with F#, a derivative of OCaml which, while not a dynamically typed language, is designed to be an “ML that fits with .NET”. I have a soft spot for ML, having used it for studying parallel programming semantics in my graduate student days. It's a fantastic programming language that combines functional and imperative traits with parametric polymorphism, type inference, and a full formal semantics and type theory. Wikipedia has an excellent article on various aspects of type systems and tradeoffs within programming languages. In my opinion, strongly typed programming languages which incorporate polymorphism and type inference give you just about all the advantages of dynamically typed languages, without most of the dangers. But that's just my opinion :-)

Several years ago, I was talking with Don about PHP (which I was then immersed in) and how good it would be to see more support for dynamic languages integrated with .NET. Don was then working on the Generics system for the CLR, which should assist in making some of the features of dynamic languages more readily possible to implement. I'm sure that the CLR is going to provide additional assistance as it evolves, since dynamic languages are so popular.

On a side note, I fully agree with Jon's comment that in fact it's the programming language environment, not the programming language itself, which is hardest to learn. When working with OTI a few years ago, learning Smalltalk was relatively easy, but getting on top of the class libraries available in the Envy development environment took me the best part of 3 months.

Yahoo! debuts 360 degrees

As everyone's been discovering, Yahoo! has launched 360 degrees to the public, for a limited beta trial period. Jeremy Zawodny was kind enough to provide me an invite within a few minutes of sending him an email - thanks very much Jeremy!

Overall, I like the service a lot. It's got quite a clean “Yahoo-ish” feel to it, with a particular focus on the social connectivity model done to such great effect in Flickr. In 360 degrees, they've gone one step further, with a number of points I really like:

  1. an ability to group friends by categories I name
  2. levels of privacy/exposure of information by levels of friend-indirection (e.g. let [ friends | friends of friends | friends of friends of friends ] see). Lucky they didn't go as far as 6 degrees of friends though, as everyone in the world might then be able to see things ... or maybe they did, and just used the shortcut “everyone”
  3. abilities to review what your site looks like to these different network connectivity levels.
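The friend-indirection levels amount to a bounded search over the friend graph. Here is a rough sketch of that check, using a breadth-first search; it says nothing about how Yahoo actually implements it, and the names and the tiny graph are invented.

```python
from collections import deque


def degrees_of_separation(friends, owner, viewer, limit):
    """Hops from owner to viewer, or None if more than `limit` away.

    friends -- dict mapping each person to a list of their direct friends
    """
    if owner == viewer:
        return 0
    seen = {owner}
    queue = deque([(owner, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == limit:
            continue  # don't look beyond the permitted number of hops
        for friend in friends.get(person, ()):
            if friend == viewer:
                return depth + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return None


def can_view(friends, owner, viewer, max_degree):
    """True if viewer is within max_degree hops of the content's owner."""
    return degrees_of_separation(friends, owner, viewer, max_degree) is not None


# Hypothetical chain of friends: ann - bob - cat - dan.
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
}

print(can_view(friends, "ann", "cat", 2))  # friends of friends
print(can_view(friends, "ann", "dan", 2))
print(can_view(friends, "ann", "dan", 3))
```

The same function would also let you preview a page as seen at each level, by asking what a viewer at 1, 2 or 3 hops is allowed to see.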

The latter is a facility I'd love to have with Sytadel, though the complexity of our model of permissions would make this rather harder to accomplish.

Once again, I'm impressed that customised typed micro-content and dynamic page assembly is being done on this scale. Dynamic page assembly while checking security is hard to do really fast (we know - Sytadel does this for every page), and Yahoo have pulled it off from what I can see. Congratulations to the team who've put this together.

There are still some glitches - some of the profile display (e.g. places I've worked) doesn't look particularly pretty (floating commas separating whitespace) when fields are left blank, but these are small things to fix.

I'm really looking forward to growing integration between My Yahoo, 360 degrees, Flickr, and Yahoo messenger. Most of the pieces are in place, now it's a question of oiling the interplay.

Where will all this end up? I wonder if we're likely to head down the path of major battles (between Microsoft, Yahoo, and possibly Google) for ownership of our personal identity projections onto the web. Because presumably with ownership comes eyeballs for ads and other additional services. Let's hope that Yahoo's revitalised mojo will encourage it to continue to play open.

Project management communications - Basecamp style

Okay, so my last experiment in improving project communications was flawed.

This time we're trying Basecamp, the project management software-as-a-service (from 37signals) that's been making an ever growing splash of late. Basecamp, for those who haven't used it, is a supremely elegant example of application-specific content management. It's a particular application (providing effective communication of project tasks through to-dos, milestones, and messages) wrapped up in some nice clean UI. The application logic makes use of some extra concepts - e.g. people, security, lists of content items. But really, it's all just micro-content, neatly arranged and presented (securely) to users in the right kinds of ways.

The team at 37signals have made some good hard choices about what features to provide, and what features not to bother with. In my opinion, the most impressive set of features is the combination of a simple company/contractor/customer model with easy ad hoc notification (via email) of significant new information.

On top of this, we're using a Wiki for document management; though most attachments will just be added to Basecamp messages. (I'd really like to have a simple TWiki plugin, that displays lists of files in a directory, so that from the TWiki we can view uploaded documents.)

Lastly, we'll use Bugzilla, for managing bugs in everything from requirements to code.

Now there's an idea for the 37signals crew - work out how to provide effective change management tracking on projects as well.

I'll report later on how effective this approach of Basecamp/TWiki/Bugzilla has been in practice.

Problem solving

Years ago, I went to a talk given by Professor William Waite, where he discussed problem solving by programmers, in the context of the engineering of Eli. The key points of this talk always stayed with me for some reason. I recently communicated with him to clarify my recollections, and he was very kind and sent me some notes from a similar talk he gave in Karlsruhe.

In a recent post from Jack Vinson, he summarises Popper's "All life is problem solving" theory with the following:

Every person deals with problems and attempts to solve them in ways that reflect her previous experience with this type of problem.  And her level of success at solving a given problem will build on what she does next time.  In other words, she learns from her experiences and knowledge as she encounters new problems. 

These sentences prompted my communication with Professor Waite, and in his analysis (quoting the notes from his talk), the following 3 levels of problem solving apply:

  • Selection - this is an instance of problem X
  • Abstraction - this is an instance of problem type T, with parameter P 
  • Explanation - here's how you solve it

These 3 levels are interesting, and apply not just to programmers I believe. To put these levels into my preferred order and with a different tagline, here goes:

  • Explanation - this problem can be solved by using the following specific steps that are unique to my understanding of the problem
  • Abstraction - this problem can be thought of as an instance of a generic problem, whose solution can be parameterised, and then solved with particular values for these parameters
  • Selection - this problem is identical to this known problem, which has the following solution

Hopefully this clarifies it a bit better.

The explanation level is for beginners - you've never seen the problem before, or if you have, you don't recognise it and solve from first principles of problem solving. Basically, you see it only at the level of understanding that says, here is a problem to be solved. For example, a child first learning to tie shoelaces - seeing how it's done the first time is like magic, and you try to replicate it. Even the few times afterwards, it's still a tricky problem that has to be solved almost like the first time, as you put together the various physical (laces, holes) and structural (loops, tightness) parts of the solution.

The abstraction level is for more experienced practitioners - you have seen the problem before in a different scenario, and now you are faced with a similar problem in a new setting. At this level of understanding, you are able to apply your previous solutions with different raw components. For example, wrapping a present with a ribbon. The structural parts of the problem are the same - how to tie a bow in some cord, but the physical setting is quite different - paper surrounding a present.

The selection level is available only once you've solved a lot of problems at both the explanation and abstraction level - you just "know" the answer, and "this" is what it is. For example, the problem is understood to be, "I need an easily releasable knot that binds together something wrapped around something else", and the answer is, "a reef (or square) knot with two draw loops".

What I like about this analysis is that it helps me to understand why different people solve things more or less easily. We never reach the selection level with all problems; and different people arrive at different levels for different problems. The more experienced we are, the more likely we are to reach the selection level for some particular domain. And while we may be an expert in solving some complex information organisation problem, we may still be unable to fix a leaking tap.

End note: In the context of programming, Professor Waite commented that there is research [Jeffries, Turner, Polson and Atwood, "The Processes Involved in Software Design", 1981] showing that the number of patterns accessible to a practitioner at the Abstraction and Selection level is the only thing which distinguishes novices from experts. In searching for references to this paper, I came across Susan Gasson's various works in progress, including one on "Organisational 'Problem-solving' and Theories of Social Cognition", which provides an excellent overview of much recent research about design more generally.

Disclaimer: Any failures in the explanation of Professor Waite's ideas are my own; for more of his work see this Google scholar search for William Waite problem solving.

Workflow is the wrong metaphor!

James Robertson asks Is workflow the wrong metaphor? Having been building systems that incorporate "workflow" engines as part of our CMS for the past 3 years, and trying to deploy them within various organisations, I'm happy to answer in the affirmative.

James provides an excellent breakdown of several of the issues involved in trying to institute workflow within an electronic system, and also looks to task management as a better alternative for managing such activities.

The core problem as I see it is twofold:

  1. Workflow as traditionally understood is really about the movement of a piece of paper (that can be changed/added to/amended) among a number of different people to create a "final" piece of paper. Since it's a piece of paper, it's incredibly malleable, and the "rules" about who is involved with that piece of paper can be easily determined on an ad hoc basis. This is why email is great for workflow, although terrible for accountability or deciding the "master" version. The flexibility of the rules in practice is what causes any attempt to straitjacket them into a "powerful workflow system with simple customisable rules by administrators" to be doomed to failure.
  2. There are two parts to the pre-publication process - a development cycle (usually ad hoc and messy) and a publication approval cycle (usually linear with one-level fallback if approval is withheld). Recognising that there are in fact two kinds of activities would go some way towards solving this, rather than trying to use the same system for both activities. The classic approach (in pre-CMS days) was of course to have two web sites - a "development" site and a "production" site.

Of course, there are circumstances where there is a highly customised and repeating workflow pattern. (We are doing this at the moment for a metadata registry project for AIHW.) In these circumstances, it is possible to build an electronic replica of the workflow pattern, and expect it to be usable. But I suspect these are few and far between compared to fully ad hoc "workflow" scenarios.

Web search - popularity of citations

John Battelle (with the help of C. Lee Giles) finds that four papers are extremely frequently cited with respect to search. Well, one of them (Modern Information Retrieval) is actually a book, not a paper.

What I'm interested in is understanding why these are the most frequently cited resources in the context of Web search, and by such a large margin of number of citations.

First up, The Semantic web paper by Berners-Lee, Hendler, and Lassila. Actually, I don't think it is really being cited by others for its value as a search reference so much as its authoritativeness as a reference on the semantic web. That's not to say there aren't people writing about search, and citing this paper, but they're doing so because they're writing about how search may work on the semantic web. And a lot of the references are about completely other things to do with the semantic web.

Next, Brin and Page's WWW7 paper on The anatomy of a large-scale hypertextual Web search engine. While Google's founders were still at Stanford, they wrote this paper about building Google. Now it turns out that up to that time, almost no one had (and to the best of my knowledge, still hasn't) written extensively about how to go about building web search engines, and the challenges and problems you face when doing so. This paper in fact lays bare much of the basic architecture of putting together an entire web search engine from scratch and some of the problems you'll encounter in trying to do so. It also describes how (at an abstract level) they made use of additional hypertextual metadata to help with query ranking. Most of the people building web search engines are in companies, and they have very little incentive (or time) to inform the world how they go about solving the major issues for running large scale web search engines, which are the issues of managing large scale data (while retaining performance) and spamming of their ranking algorithms. (Solving these provides a large part of the competitive advantage for search engine companies, since they're so hard to replicate.) So for anyone writing papers about web search, Brin and Page's paper is a fantastic reference point for people, at a unique time in Google's history, before they got big enough to want to shut up about how to do this successfully.

Third, Kleinberg's paper on hubs (documents that reference lots of good authorities) and authorities (documents that are referenced by lots of good hubs, or less reflexively, which have a lot to say about a particular topic and are recognised as such), titled Authoritative sources in a hyperlinked environment. Brin and Page came out with PageRank around about the same time as Kleinberg's work. The key difference with regards to conducting search between the two is that Kleinberg's algorithm is query-based, and thus must be computed every time a user issues a query to a search engine, whereas PageRank can be computed for the document set at large. Thus PageRank becomes just an attribute of a document's general properties (like the number of words it contains, or its format), and can be trivially incorporated into ranking algorithms. Kleinberg's algorithm for computing hubs and authorities is computationally expensive, and must be done every single query, since the hubs and authorities for one query are different for another. Kleinberg's paper gets cited an awful lot because it's an awfully clever way to work out what are the best authorities in a document collection, and good authorities have any number of roles to play when trying to figure out how hyperlinked documents can be understood better. (Much more so than PageRank for example.)
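
To make the hub/authority idea a little more concrete, here is a minimal sketch of the iterative scoring, in the spirit of Kleinberg's algorithm rather than a faithful reproduction of the paper. It assumes the query-specific set of pages and links has already been assembled (which is the expensive per-query part); the function name and the toy link graph are made up for illustration.

```python
import math

def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auths[t] += hubs[src]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hubs = {p: sum(auths[t] for t in links.get(p, ())) for p in pages}
        # Normalise both score sets so the values stay bounded.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hubs, auths

# Hypothetical query-specific subgraph: two hub-like pages, two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hubs, auths = hits(links)
```

Here a1 ends up with a higher authority score than a2 (both hubs point at it), and h1 ends up a better hub than h2 (it points at both authorities). Because the graph itself depends on the query, none of these scores can be precomputed the way a PageRank value can.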

From memory, a late version of the Northern Light web search engine, before it got out of the web search game, and started focusing on enterprise search instead, implemented Kleinberg's algorithm. Intriguingly, I would suspect that Kleinberg's algorithm is less easy to spam than PageRank, but it's only a guess. (Of course, this is probably one of the reasons why PageRank is only a small factor in Google's ranking algorithm, not the secret magic ingredient that people believed it to be for so long.)

Lastly, Modern Information Retrieval, authored and edited by Baeza-Yates and Ribeiro-Neto (plus a cast of experts), and published in 1999. This book is cited by everyone, because everyone working in the field of information retrieval, who needs to cite the current practices of the IR community, can find in this book a good overview of the particular area of IR they're writing about in their own paper. There are a number of chapters, some by the authors, some written by experts in their particular sub-field, detailing the practices and theories current in 1999. So it's more an overview of the whole of information retrieval, than a particularly insightful paper in its own right. Intriguingly, the previous reference overview on IR is Salton and McGill's book, Introduction to Modern Information Retrieval, published in 1984. The latter outstrips Modern Information Retrieval by 2204 references to 1198, according to Google Scholar.

In a loose manner of speaking, following Kleinberg's terminology, the first three papers would be considered authorities, while the last would be considered more of a hub (only it has collected its linked authorities under one umbrella - that of a book). And that's why they all get hyperlinked to, ... sorry, I mean cited.

Managing security with roles

In an earlier entry (simple personas and anthropomorphising systems), I made a passing reference to role-based security. I then realised that perhaps not everyone is familiar with what this really is.

A common facility in most CMS these days is that security should be managed using the concept of roles. At its most basic, individual members (the representation of a person's identity) within the system are assigned a role. So for example, the member Freda is assigned the role System administrator. This means that when the person Freda logs in, her member representation is understood to have the responsibilities and rights of a system administrator. The security system is coded to allow members with the System administrator role to do certain things - for instance, produce reports about the new members who have joined the site recently.

A more complex system allows members to be assigned multiple roles. For example, Freda may be assigned both System administrator and Security administrator; the latter allowing her to maintain security permissions for other users.

The step after this is to allow a role to be assigned another role. For example, it could be decided that if you are able to maintain security permissions, then you should also be allowed to do anything that a system administrator can do. In this case, the Security administrator is assigned the role System administrator. Then Freda need only be assigned the Security administrator role, and she can carry out anything that either Security administrators or System administrators can do. (Obviously, the system needs to be able to follow inherited roles to do this.)

In Sytadel, we went one step further than this, which was to allow roles to be assigned members (which is pretty much never used, since it doesn't really occur in situations we've encountered yet), and more usefully to allow members to be assigned other members. An example where this is handy is the common organisational practice of delegation - Freda goes on leave and is temporarily replaced by Jeff, whose member Jeff is only assigned the System administrator role. During her absence, Freda assigns her member Freda to Jeff, which means that Jeff can then do everything that Freda could do.

So in a sophisticated security system like Sytadel's you can assign permissions to both roles and to members. Now as a rule we find it a good practice to only ever assign permissions to roles, and then assign members to roles, and to assign roles to other roles (where hierarchies of permissions make sense in the particular business scenario).
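
The chain of assignments described above (members to roles, roles to roles, members to members) can be sketched as a small graph walk. This is only an illustrative sketch, not how Sytadel is implemented; the class and method names are made up, and roles and members are simply distinguished by their string names.

```python
class SecurityModel:
    """Toy model: permissions hang off principals (roles or members), which can be assigned to each other."""

    def __init__(self):
        self.assigned = {}     # principal -> set of principals it has been assigned
        self.permissions = {}  # principal -> set of permission names

    def assign(self, principal, other):
        """Give principal everything that other (a role or a member) can do."""
        self.assigned.setdefault(principal, set()).add(other)

    def grant(self, principal, permission):
        self.permissions.setdefault(principal, set()).add(permission)

    def effective_principals(self, principal):
        """Follow assignments transitively; the seen set guards against cycles."""
        seen = set()
        stack = [principal]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.assigned.get(current, ()))
        return seen

    def can(self, member, permission):
        return any(
            permission in self.permissions.get(p, ())
            for p in self.effective_principals(member)
        )

model = SecurityModel()
model.grant("System administrator", "view member reports")
model.grant("Security administrator", "maintain permissions")
model.assign("Security administrator", "System administrator")  # role assigned a role
model.assign("Freda", "Security administrator")                 # member assigned a role
model.assign("Jeff", "System administrator")

before_delegation = model.can("Jeff", "maintain permissions")
model.assign("Jeff", "Freda")                                   # delegation: member assigned a member
after_delegation = model.can("Jeff", "maintain permissions")
```

Freda can view member reports purely through role inheritance, and Jeff can maintain permissions only after Freda's member has been assigned to him.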

The reason for this is that managing permissions by roles is a much better mechanism than trying to manage it by members. Why is this so? Two reasons: first, I suspect that we naturally go through the following thought processes: a press release publisher needs to be able to do the following things, and Joanna is a press release publisher, so let me create a role Press release publisher and assign it the appropriate permissions, and then assign Joanna the Press release publisher role. Second, if Joanna is not the only press release publisher, then maintaining (and understanding) my security model is much easier to do, as I can now provide the other publishers with the same role, and don't have to assign them all the specific permissions associated now with the role.

As an old wise friend once said, just about everything in computing can be solved by an extra level of indirection.

And you want as many layers of abstraction as you can, because security permission systems get complex in real world situations pretty fast. In fact, there's no better way for understanding exactly who does what in your organisation than trying to map out what permissions they should get in the synthetic context of a content management system's security model.

Desktop search - are you outside looking in, or inside looking out?

The release of Microsoft's new desktop search facility shows once again that Microsoft should never be underestimated. It's already a slick polished product, that shows up in all the right places (Outlook, the task bar, IE) ... at least, if you've rolled over and are using Microsoft lock, stock and barrel. I've always been a big fan of the Google toolbar for the same reason, and this makes it just that much easier to search wherever I am contextually at the time, without having to click that extra button or hyperlink.

In Charles Ferguson's article on Google for Technology Review [via John Battelle], he outlines what Microsoft can do in search by virtue of owning the proprietary APIs that underlie so much of the data we manipulate on a daily basis (email and office documents - again, provided you're using Outlook and Office). Microsoft's tool demonstrates this again - in comparison, Google's desktop search, while a good effort, just doesn't provide the same deep level of filtering that is possible. I'm still of the opinion that Google's desktop ranking algorithms appear to be a bit better, but they don't find as much stuff as the Microsoft tool, which hasn't even finished indexing my desktop yet. And Google's ranking isn't much better, not by the kind of margin they've enjoyed in Web search for so long.

I think at heart there remains a fundamental difference of world view between Microsoft's approach and Google's on the desktop. Google is all about the network - and a networked set of documents. Their new project to digitise large collections of academic libraries illustrates this perfectly - make everything connected and thus searchable. The approach they took to desktop search was the same, it's as if you'd put your entire document collection up on the web and let Google search it, and then very nicely, in any search you run with Google, you get to see your information included in the search results as if it was part of this one big enormous document collection that Google indexes. I characterise this as the outside looking in.

Microsoft's approach is just about the opposite, and again reflects its own culture, which is me-centric. I'm on the inside, at the centre or the hub if you like, of my own information spaces (located on my PC), looking out to draw parts of the world into me. Thus it gives me a much richer understanding of the data that exists here locally in my own information space. Of course, results are all wrapped into a web browser display, because you may be wanting to search the web instead, or any other connected information.

Which of these philosophies may ultimately prevail is going to depend on a whole lot of different things, not just whether people feel more comfortable looking out or looking in, but I think it's an interesting difference.

All up the current situation, making a whole lot of people just a little nervous no doubt at Google, reminds me of an African folk tale I heard many years ago in South Africa while visiting the Kruger National Park. (There was a park radio station, which tells you lots of stories about the park, including folk tales about the animals.) I've forgotten the details, but the gist of the story was, don't make your camp between a herd of hippopotami and the river.

Disclaimer: the author holds shares in neither Microsoft nor Google, although he knows people who have shares in each of them.

Simple personas and anthropomorphising systems

A number of people have written about the use of personas in system development in recent months, including Don Norman (ad-hoc personas and empathetic focus), although as Norman points out, they've been in use since at least the early 1990s if not longer. There's been some debate about the degree to which personas need to be realistic or fully developed (i.e. with a life history, career, lifestyle, hobbies, and names).

When building a system from scratch, using personas to help identify the real different kinds of users for the product is a very valuable exercise. Making these personas sufficiently detailed to allow for empathetic focus (as described in Norman's entry) is also going to be valuable.

But what about a different scenario, where instead of building from scratch, you're building a replacement system? This in effect is what we've been doing for AIHW with our metadata registry project. While the new system will allow the involvement of several new types of users who could not interact with the old system, these users have been in existence for a long time. And the team at AIHW are familiar with them - sometimes they even are them.

In most content management systems (such as Sytadel), the concept of roles (and usually role-based security) is well supported. What can be done and by whom is determined by what role or roles you have.

Thus a system role is effectively analogous to a persona.

Does it mean then that personas are superfluous in this context? Well, no, not in our experience.

First, the names of roles are quite long - e.g. metadata developer, metadata registrar - and this becomes awkward to keep saying aloud or writing in documentation.

Second, by using the name of a persona (a person), it assists in anthropomorphising the system to be.

And whereas anthropomorphism is often a practice to be wary of when analysing non-human things in the real or a conceptual world, it's precisely what we want to do when designing systems which incorporate some concept of a user.

With our current project, we found it was sufficient for our personas to be represented by names alone. This is just about the simplest persona possible.

So for example, our metadata registrar was called Reg, and metadata developer was called Dell, and Ursula was our general user. We didn't bother writing any details whatsoever about their background, interests or social lives. And we found ourselves arguing passionately about whether Reg would do this, or Dell would want that, because we knew precisely who they represented within the system.

A quick tip - use names that help you to remember the formal role. Otherwise you'll be forever asking, "Who is Penelope again? Oh yes [consulting persona guide], she's our system administrator."

Another quick tip - don't make the mistake of naming the system roles the same as your persona names. While the personas are meaningful for the design/development team, for long term maintenance the role name "metadata registrar" is going to be far more meaningful than "Reg".

(For some more reading, and some different perspectives, see Personas and Plogs, Using personas in intranet projects [via Column Two], Personas & Scenarios in the wild, and An introduction to persona and how to create them.)

Testing accessibility of your website - a simple and illuminating method

One trick taught me by a visually impaired friend of mine a few years ago for testing accessibility is very easy and very effective. (I'm reminded of this, because I'm writing up some accessibility tests for our current project's acceptance testing.)

Assuming you are running Windows XP, you go to the Control Panel, and select the Accessibility Options category. (Similar things exist on Macs and Linux desktops as well I'm sure.)

Then go through the various options you have available to you - Windows walks you through a wizard if you select the "Configure Windows to work for your vision, hearing, and mobility needs".

Assuming they're not actually totally blind, then vision impaired people often need to see things bigger and with more contrast. This means you want to select larger icons, larger font sizes, and one of the high contrast modes. (Go back after you've done this test, and try one or more of the other modes as well - they're not all the same.) So now you're running in high contrast and large fonts and icons. Maybe turn on the screen magnifier as well, because this can often be useful if you're trying to confirm fine level details, like which button to press.

Now try visiting the website you're working on and see how it looks. You're in for a surprise very probably.

All the text is very big. Most of your pretty design colours have vanished.

What is it like trying to move the mouse around and only read your web page through the magnifying glass screen at the top of the page? Is it easy to find stuff on a page? Do I have to scroll around a lot? Are entry forms and buttons close together in a logical sequence? What's it like to only see 10% of the page at any time?

I think you'll find it illuminating. It's also very sobering, and an excellent reminder that trying to control how a page looks on other people's computers is an exercise in futility.

Of course, automated accessibility evaluation tools are of great help as well, but a multi-page report won't give you the same visual impact as seeing how some people are going to view your work.

Walking a critical path - agility helps

Johanna Rothman displays some lateral thinking on how to refactor a project management task with interdependent sub-tasks. The problem arises partly because there's a critical path of activities with limited expert resources who can be brought to bear on these. Johanna had the interesting approach to refactor the problem by formulating the problem in terms of features instead of architectural components. However, the student's organisation would find this a big change to their operations.

I like her approach, since treating development by prioritised features, instead of macro-level architectural components, makes the problem far more amenable to an agile approach to involving the customer in helping to make the hard tradeoffs against schedules. I'm not sure it makes it any easier to see what the critical path is, but that's often a problem with project scheduling tools once it gets complex.

That is, the time that needs to be invested in making the project scheduling tool sufficiently accurate to model (what is needed to happen to deliver the project optimally), can be so great that by the time you start walking down that critical path towards your solution, you'd have been better off just starting with a good first approximation of what's needed, and spent less time trying to plan everything out as if it was going to run perfectly.

Because once you do commence, you're almost certainly going to find one or more of the following happens:

  • the requirements analysis wasn't sufficiently detailed, and you have to re-plan
  • the critical path was not sufficiently detailed to include all the really critical activities, and you have to re-plan
  • some of your estimates are out, and you have to re-plan
  • some of your resources aren't available, and you have to re-plan
  • the customer changes their mind, and you have to re-plan

Hmm, a pattern seems to be emerging!

This of course is not to say that planning is not important - in fact, it's essential. Or that trying to see how the critical path may fall out is not important - because that's what's going to give you a baseline minimum elapsed time for the project.
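
For readers who haven't computed one by hand, the baseline minimum elapsed time is just the longest chain of dependent tasks. A minimal sketch follows, assuming the dependencies contain no cycles and that resources are unlimited; the task names and durations (in days) are invented for the example.

```python
from functools import lru_cache

def critical_path(durations, depends_on):
    """Return (total_time, path) for the longest chain of dependent tasks."""

    @lru_cache(maxsize=None)
    def longest(task):
        # The longest chain ending at this task is its own duration plus
        # the longest chain ending at any of its prerequisites.
        best_time, best_path = 0, ()
        for dep in depends_on.get(task, ()):
            t, p = longest(dep)
            if t > best_time:
                best_time, best_path = t, p
        return best_time + durations[task], best_path + (task,)

    return max((longest(t) for t in durations), key=lambda result: result[0])

# Hypothetical project plan.
durations = {"requirements": 5, "design": 3, "build": 10, "docs": 4, "test": 6}
depends_on = {
    "design": ["requirements"],
    "build": ["design"],
    "docs": ["design"],
    "test": ["build"],
}
total, path = critical_path(durations, depends_on)
```

For this plan the critical path is requirements, design, build, test, at 24 days; docs has slack. Change any one estimate and the answer can shift, which is exactly why the re-planning keeps happening.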

Delivering a project using an approach (I hesitate to use the word methodology here) that allows for both regular and frequent readjustments to the critical path makes a lot of the planning problem disappear. You accept that it's not going to be right a priori, and thus you plan in rebalancing activities as part of the delivery process. This can be thought of as wobbling down the critical path, rather like an amateur tightrope walker. Just remember to carry the big stick so that you don't get hit on the head.

This would all make more sense in the context of our current approach to project management, so mental note to self - write blog about combining classic software development lifecycle techniques with agile approaches, on fixed budgets and fixed timelines with mostly fixed feature sets.

Archiving for dynamic CMS web sites - a hard problem with no easy answers

As an organisation, Synop does a lot of work for government agencies with our Sytadel CMS.

One of the standards in the Australian Government's Guide to Minimum Website Standards concerns archiving. Archives need to be kept for all sorts of reasons. One very practical reason is Senate Estimates, a periodic Parliamentary process designed to keep the Government of the day accountable. Public servants need to be able to produce tranches of information about what they (and their department or agency) have been up to. Among the issues that occasionally arise is what a member of the public has seen on a web site, or was told by a member of the agency on the basis of the information that they have access to. So a need exists to establish exactly what was able to be seen by the person concerned.

The people in charge of archiving for the Australian Government are the National Archives of Australia (NAA). They've written excellent stuff about how to archive web-based systems ("keep records" in archive-speak), both at a policy level and as guidelines.

(An amusing side story - NAA now inhabit what's known as East Block, a building located next to the original temporary Parliament House. Up to the second world war, all of the Canberra-based public servants were able to fit into this one building, including their archives. Now, not even the archived material can all fit in here.)

The NAA guidelines are a comprehensive outline of the considerations an agency faces when embarking on archiving their material. In particular, for dynamically generated websites and web resources, they recommend that:

The major issues these sites raise for agency recordkeepers is the need to choose whether to use an object-based or activity-based approach to keeping records of web resources and activities. That is, ... [the choice between] keeping records of:

  • the individual transactions between clients (users) and servers (agencies), or
  • the objects that comprise the content of the site at any given time.

If you adopt the second approach, the fundamental problem for archiving that arises with dynamic systems such as Sytadel is that every page that exists may appear slightly or grossly different for different users.

Now, in a complex enterprise system that incorporates Sytadel (or a similar CMS), these issues are compounded by the inclusion of yet other systems, which may interact with the CMS to determine what is shown. For example, you might have an LDAP-based directory system for managing all your users and groups. Depending on what user the person was, and what roles they had, different things are shown to them. Similarly, a search engine is typically incorporated into the system, and its results (and result snippets) may also constitute records of what was seen.

To fully establish then what has been seen by an individual, you need to be able to reassemble the exact state of the system at the time in question.

Now, most CMSs include version control, which means that you can in fact establish what was published at what time at an underlying raw content level. In Sytadel, we even built a mechanism for setting a date in the past, and then being able to browse the site exactly as it was then, complete with all the security and custom assembly of pages.

The problem comes about because you also need to have every other system that you're interacting with to adhere to the same version control principles. The LDAP repository must be able to serve up the contents of its directory - as at the date in question. The search engine needs to know exactly what was in its search indexes - as at the date in question. And so on.
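
To make the "as at the date in question" requirement concrete, here is a minimal sketch of the kind of lookup every participating system would need to support. The `VersionedStore` class, its method names and the sample directory entry are all hypothetical, made up for illustration; this is not how Sytadel or any LDAP server actually does it.

```python
import bisect
from datetime import datetime


class VersionedStore:
    """Hypothetical store that keeps every version of each key."""

    def __init__(self):
        # key -> (sorted list of timestamps, parallel list of values)
        self._history = {}

    def put(self, key, value, when):
        times, values = self._history.setdefault(key, ([], []))
        # Keep timestamps sorted so out-of-order writes still work.
        i = bisect.bisect_right(times, when)
        times.insert(i, when)
        values.insert(i, value)

    def as_at(self, key, when):
        """Return the value current at `when`, or None if none existed yet."""
        times, values = self._history.get(key, ([], []))
        i = bisect.bisect_right(times, when)
        return values[i - 1] if i else None


# Made-up example data: a user's roles changing over time.
store = VersionedStore()
store.put("uid=jsmith/roles", ["staff"], datetime(2004, 1, 10))
store.put("uid=jsmith/roles", ["staff", "editor"], datetime(2004, 6, 1))

print(store.as_at("uid=jsmith/roles", datetime(2004, 3, 15)))  # ['staff']
```

The lookup itself is easy. The hard part is that the CMS, the directory and the search index would all have to keep this kind of history, and answer for the same instant.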

Ultimately then, you need to be able to reconstruct the exact state of a complex enterprise system, consisting of multiple interlocking components, at some time in the past. Anyone who's been involved with setting up such a system will tell you that to do this correctly is going to be either impossible (several months later, due to changes in operating systems and versions of systems), or very very hard and time consuming (read several days to weeks).

Now the alternative is to record every single transaction that occurs between the user and the server. In other words, this means logging (to an archivable store) every single web page that was ever delivered to anyone. From a technical perspective, this is definitely not impossible. In fact, Kent Fitch wrote a nice paper about how to do it that was presented last year at AusWeb, and has a system called pageVault which carries it out. (The technical limitation being that while both IIS and Apache 2 support filtering architectures, Apache 1 does not.) I have a vague recollection of the CSIRO taking out a patent on this kind of approach as well, but I can't recall the name of it.
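
As a rough illustration of the transaction-logging idea (and not a description of how pageVault is built), here is a minimal sketch of a response-recording filter written as Python WSGI middleware. The `ResponseArchiver` name, the record fields and the in-memory list standing in for an archivable store are all assumptions of mine.

```python
import hashlib
import time


class ResponseArchiver:
    """WSGI middleware that records every response body passing through it."""

    def __init__(self, app, archive):
        self.app = app
        self.archive = archive  # list-like; a real store would need to be durable

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        # Buffer the whole body. Simple, but it gives up streaming responses.
        body = b"".join(self.app(environ, recording_start_response))
        self.archive.append({
            "time": time.time(),
            "path": environ.get("PATH_INFO", ""),
            "user": environ.get("REMOTE_USER"),
            "status": captured.get("status"),
            "sha256": hashlib.sha256(body).hexdigest(),
            "body": body,
        })
        return [body]


def demo_app(environ, start_response):
    """Toy application whose output depends on who is asking."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello ", environ.get("REMOTE_USER", "anon").encode()]


archive = []
app = ResponseArchiver(demo_app, archive)
result = app({"PATH_INFO": "/report", "REMOTE_USER": "jsmith"},
             lambda status, headers, exc_info=None: None)
print(result, archive[0]["path"], archive[0]["status"])
```

Storing the hash alongside the body means identical pages could later be de-duplicated, which matters when most users see mostly the same thing.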

But this approach only solves the "what people saw" part of the problem, not the "what was the system" part.

For example, you might wish to establish that a particular document was not uploaded to the site until such and such a time, regardless of when anyone chose to look for it. Or that there was no occurrence of a particular phrase to be found in any of the published materials on such and such a date.

To solve that, you need to go right back to the earlier issue, of archiving the exact system in its entirety, together with all the other enterprise systems it depends on.

A pragmatic compromise may mean crawling and then archiving copies of the site as it would be seen by the majority of users. Add in the actual generated page log record as well, and you'd be 99% of the way there. And then just hope that that special 1% case doesn't ever crop up!

Question answering - Ellen Voorhees and an idea for (web based) definitional search engines

Today I went to Ellen Voorhees' talk on question answering and evaluation in information retrieval, the final in the summer series of search seminars organised by Dave Hawking at CSIRO.

The question answering (or QA for short) track at TREC has been in operation for 5 years now, and has got progressively harder each year. It's now a matter of answering multiple question series. Each question series consists of:

  • factoids (simple questions that have a "correct" answer e.g. what year was Mozart born?)
  • lists (more difficult questions that entail finding information possibly spread across a number of documents e.g. what are the names of all Mozart's symphonies?), and
  • definitions (very hard questions that entail producing summary information e.g. who was Mozart?)

The QA track has been an interesting learning experience in working out metrics to successfully evaluate QA type questions. However, that's the kind of detail really only for aficionados of IR evaluation. The TREC QA track set of documents (a corpus, in IR terminology) over which the questions have been targeted is one consisting primarily of newspaper articles, which provide good overall coverage.

What I thought was particularly interesting was that in some years, various groups have used the Web as either an input into their query formulation (e.g. find documents which talk about the question, and hopefully the answer, then go look for similar documents in the TREC corpus), or as post-validation to check that the answer from the corpus they've produced is correct (e.g. by seeing if there is external evidence which supports it).

Among the definitional questions, Ellen gave a few examples which stuck with me. One of them was "who is vlad the impaler?" and the other was "what is ph in biology?".

I thought I'd try these out quickly on the Web to see whether using web-based resources could assist in answering these particular questions effectively. I had a hunch that my favourite wiki, the Wikipedia, might be an excellent resource to use, so decided to use it as well.

I tried them on a number of search engines:

  • Google
  • AskJeeves (the best known ostensibly question answering search engine)
  • BrainBoost (a search engine I found by searching for question answering search engines on Google)
  • Wikipedia (the native search engine with the web site)
  • Google at Wikipedia.org (combining Google with a restricted site search over my favourite wiki, Wikipedia)

I formulated the questions exactly as I would ask them, one with a question mark and one without one.

search engine              who is vlad the impaler    what is Ph in biology?
Google                     who is vlad the impaler    what is Ph in biology?
AskJeeves                  who is vlad the impaler    what is Ph in biology?
BrainBoost                 who is vlad the impaler    what is Ph in biology?
Wikipedia.org              who is vlad the impaler    what is Ph in biology?
Google at Wikipedia.org    who is vlad the impaler    what is Ph in biology?

On a degree of difficulty level, the Vlad question is pretty straightforward - people love Vlad (these days at least) for the association with our favourite myth of Dracula.

The pH in biology question is much harder, because pH is a concept in its own right, and the answer then has to be restricted to how it applies in the field of biology.

My evaluation methodology consisted of examining the first result, using the result title and snippet as the answer.

Overall, from a very unscientific and simple evaluation, Google at Wikipedia.org was surprisingly successful, and most of the others were substantially less effective. For the pH in biology question especially, everyone except Google at Wikipedia.org did abysmally.

So in conclusion then, combining a really high quality general purpose free text search engine (Google) over a source of really great definitional content (the Wikipedia) comes up with the best of both worlds - a quite good definitional search engine.
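
For anyone wanting to repeat the experiment, the "Google at Wikipedia.org" combination amounts to adding a `site:` restriction to the query. A minimal sketch, assuming the usual public search URL pattern; the `site_restricted_query` helper is my own name, not part of any API:

```python
from urllib.parse import urlencode


def site_restricted_query(question, site):
    """Build a search URL that restricts results to one site via `site:`."""
    query = f"{question} site:{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})


for question in ("who is vlad the impaler", "what is Ph in biology?"):
    print(site_restricted_query(question, "wikipedia.org"))
```

The first result's title and snippet for each URL is then what gets judged as "the answer".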

Which must mean that this technique could definitely assist in either of the ways postulated earlier for QA (definition) search engines.

Now, if only I had some spare time to participate in the TREC QA track this coming year ...

Mass customisation of content

With the launch of MSN Spaces today (as widely reported in the blogosphere), I've been contemplating the mass customisation of content.

MSN Spaces (which seems to be the Powerpoint approach to blogging) is a very powerful concept. Provide a tool that thousands of people a day will use to start communicating, and what do you get? A facility for mass customisation of networked content that all shares much the same general shape and structure. (Of course, this is done by Blogger and others as well.)

Weblogs in general have the same principles, as does indeed the web at large, but with progressively looser degrees of conformance between different content items.

Why should the general shape/structure of content be interesting?

Well, the answer comes down to the degree to which more implicit information can be extracted from it.

The cleverness associated with Google arose because they extracted some of this implicit structure from the general morass of the web (namely recurring patterns associated with hyperlinks) to more effectively calculate relevance to our information needs. 

Now the web is a pretty loose association of content, syntactically strung together just with hyperlinks. The folks involved in the Semantic Web (Tim Berners-Lee no less) are working towards a substrate for recording networked information that will embed meaning (semantics) into the very descriptions of the content. (For example, it might include information that a piece of content is a photo, and that this is a collection of photos belonging to me, and that I'll license people to use them under the Creative Commons license.)

My suspicion is that it's going to be a long time before we see the web in full semantic web glory, if ever. Why? Because I don't believe most of us are librarians - cataloguers - and that's what you need to be good at. (Not to mention the known problems of using a taxonomy.) So the tools for creating semantic web content had better be good, and there's never going to be a general purpose tool that's useful, because content is intrinsically mass customised.

That's why content-specific applications, like MSN Spaces (which let you nominate content as: this is my blog and these are my posts, these are my lists and here is a list item, this is my photo album, and here are my sections, here are my photos; and here are names for all of these things) are gold for finding more implicit structure and messy meaning, without the formal rigour of the semantic web.

Of course, to make use of the patterns that are associated in this implicit structure, you really need access to a very large computing facility, because the volume of data we're talking about is huge, and growing bigger at a staggering rate. The recent article about Google's Urs Holzle describing the technology challenges gives some indication of how much computing power and system management capability needs to be thrown at what is a comparatively straightforward and embarrassingly parallel computing problem. In fact, I suspect there's only a handful of companies in the world (Google, Microsoft, Yahoo!, Ebay, Amazon) who know how to run such computing systems.

And preferably, you also need access to all the user data associated with how people interact with any service you provide over this content.

That's why if you're trying to understand what people do when they search, you need a really big search engine (Yahoo! or Google or MSN Search).

If you're trying to understand what information people need to buy a product, you need a really big shop (Amazon or Ebay).

The bigger you are, the more data you have, and the better service you're going to be able to offer.

That's why MSN Spaces is a smart idea for Microsoft, and even smarter to tie it all in with their other technologies, like Windows Media Player and MSN Messenger. It's not that they couldn't integrate with other open technologies for these facilities, but they can get better instrumentation of the flow of interaction between various communication products that they provide with the hundreds of millions of people who use them. And long term, it's owning that knowledge which delivers you power in the technology arms race.

XML for Microsoft Word rocks - open document formats rule

As part of our metadata registry project for AIHW we are constructing some really nice facilities in our Sytadel CMS product. One of these is the ability to save individual metadata items or entire sets of them as Microsoft Word documents. Strictly speaking, we're not saving Word documents, we're saving WordProcessingML documents. But as that of course is just XML, the content is now manipulable not just by Office applications, but by anyone who wants to write a converter into their own preferred format.

The elegant thing about this is that Sytadel (and the ISO/IEC 11179 metadata registry model we're implementing) uses a component model for content. And a metadata standard is made up of lots of different bits of typed microcontent (data set specifications, data elements, data element concepts, properties, object classes, value domains etc). Typed microcontent is of course just custom XML schemas for different content types.

When a user would like to view or save some unique set of metadata items they are interested in, these items have to be dynamically assembled on demand, and converted into a coherent linear representation. First, on the screen, where, with the benefit of hyperlinks, following relationships between individual items is easy. Second, into a linear paper publication format - Word (WordProcessingML) and/or PDF. (Note, this is one of the rare examples of where structure matters that I've stumbled across.) Generally speaking, the second one is much harder, because designing a screen representation (even one not using hypertext) is distinctly not the same as designing a printed representation. There are various challenges to solve here, as Richard (who's been doing the heavy lifting in developing this facility) would attest. Not least how to convert HTML to a Word representation that is sensible.
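
To give a flavour of what "assembling into a linear WordProcessingML document" means, here is a heavily simplified sketch. It is not Sytadel's code. The `to_wordml` function and the sample metadata item are hypothetical, and a real document would carry styles, headings and tables rather than bare paragraphs. The namespace and the `mso-application` processing instruction are the ones used by the Word 2003 XML format.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def to_wordml(items):
    """Flatten (name, definition) pairs into one linear WordprocessingML doc."""
    doc = ET.Element(f"{{{W}}}wordDocument")
    body = ET.SubElement(doc, f"{{{W}}}body")
    for name, definition in items:
        # One paragraph for the item name, one for its definition.
        for text in (name, definition):
            para = ET.SubElement(body, f"{{{W}}}p")
            run = ET.SubElement(para, f"{{{W}}}r")
            ET.SubElement(run, f"{{{W}}}t").text = text
    markup = ET.tostring(doc, encoding="unicode")
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        '<?mso-application progid="Word.Document"?>\n' + markup
    )


# Made-up metadata item, purely for illustration.
print(to_wordml([("Sex", "The biological sex of a person.")]))
```

Because the output is just XML, the same assembled items could be fed to a different serialiser for another format without touching the assembly step.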

Jon Udell has been asking where the applications are that make use of the openness of XML and custom schemas in combination with Office. I'm not sure whether our work is what he means, since we're generating Word documents from content which conforms to our custom schemas, not generating XML from Office docs, if that's what he meant. But it's pretty cool nevertheless. We're also going to be generating PDF documents. At the moment, we suspect this may be harder than generating the Word docs has been.

The main thing to take home is that if we'd been doing this more than two years ago, we just couldn't have done it at all, since there wasn't an Office XML format we could publish to. (We could of course do the same thing for Open Office's XML document format if we thought this was going to be useful too. Or someone could just write a WordProcessingML to OpenOfficeXML converter, if they haven't already.)

Now however, we can generate nice looking printed publications from dynamic assemblies of typed XML microcontent, on demand, in real time.

And when you compare being able to do that, to walking around with an annually published, 3 volume set of the latest National Data Dictionary for Health, several hundred pages long, you just know that keeping up to date with the latest in metadata standards for the community is about to get a whole lot more speedy and efficient for people.

Addendum:

To the best of my understanding, some people still want printed paper versions of documents; for example, when they're sitting around in a committee meeting reviewing lots of new proposed metadata definitions over several days. Jon Udell and Tim Bray's musings on XHTML were intriguing for their view of conversation in a networked document world, but there are still some people for whom the network isn't always on. Those who are always connected will be able to just browse the system directly anyhow.

I'd be interested to know too whether native XHTML can afford as rich a paper printable rendering as a schema (such as WordProcessingML) custom designed for applications that are intended for print output. That said, most Word-produced documents that I read these days are done in front of my PC, never printed off if I can avoid it. But they could be, and they'd look nice if I did.

Structure vs speed - the perennial question - part 2

Okay, today I'm going to talk about the creating part of the story for whether we choose to emphasise facilities that help a user create structure or facilities that help a user get things done faster.

Unsurprisingly, I suspect that tools which emphasise speed over structure are going to be more acceptable to users.

What is fascinating is how even micro-level improvements can lead to dominance of a tool. As I've written about the benefits and shortcomings of wikis previously, one of the single biggest reasons why wikis are great is that creating hypertext links requires no interruption to the flow of writing. (We've even decided to start capturing some low-level design documentation in a wiki, solely because the driving priority is to get the information out of developer brains and into text, not to get it well structured.)

Another example is weblogs, where both creating a web site and updating it regularly are super efficient. I think I set up a weblog through Blogger but hosted off my own private server in under 15 minutes one time. The remaining missing ingredient from Blogger in my opinion (that is solved in many other weblogging software tools) is categories (which form a light-weight and optional structure mechanism). (Of course, they may have made the deliberate tradeoff that for speed vs structure, 80% was sufficient, and the other 20% could just not exist at all for the moment.)

Another example is Flickr, the photo sharing software service. The aspect of Flickr which impressed me so much was how cunningly they've solved two really hard problems for photo sharing:

  1. specifying the degree of sharing with your family, your friends, and the world for your photos as a whole, and for individual photos (the photo equivalent of ridiculously easy group forming); and
  2. how to provide some simple levels of tagging of photos (light-weight and optional structure for photo organisation/categorisation).

Google provides yet another example of an extremely efficient interface for searching. Yet the Google toolbar trumps even Google - it's one less click - the search form is right there in your browser. (The link to structure is fairly loose here, though it is possible to consider a ranked set of search results as a form of structure that's been applied via your query over the pool of documents that they index.) The comparison is with an "advanced" search query interface, which allows you to be far more precise about the things you're interested in, but much less efficient.

Or the feature in Microsoft's OneNote which automatically inserts the URL citation for you when you copy and paste information from a web page. When I'm copying and pasting text normally, I wouldn't want this feature, but when I'm taking notes, 8 times out of 10, this saves me time in case I want to find out where this text was from later on.

In summary then, tools which trade off structure in favour of the speed with which you can perform a task are going to be better for you than tools which emphasise how structured you can make your output.

Structure vs speed - the perennial question

When designing an information management system (typically these days based around a web site), how do we find the right balance between facilities that support good structuring of information and facilities that support fast access to (creating and consuming) that information?

Today I'm going to talk about the consuming side of the equation.

I'd hazard a guess that about 80% of information architecture work revolves around trying to create better mechanisms for structuring of the information. The theory being, that good structure leads to an improved user experience, a logical place for an author to place their content, integration with information and metadata taxonomies, and so on.

But is all this work going to waste?

Sadly, I fear so.

Why?

The answer is that the user rarely cares about all of your information, just the specific bit they are interested in finding. And if they can't find it, time to go somewhere else where they can.

So all that careful structuring of information is mostly useless.

How could this be true? Well, for the same reason that Google and Yahoo! et al can organise 8 billion or more pages of information into a searchable repository that allows me to discover the things I want to find out in a couple of seconds for about 80% of my searches. And they make use of little more than implicit human (hypertextual) clues about relative values of information. (Plus the deep well of a whole lot of human information on things that interest people.) So why bother with any kind of structure at all then? (Note, I said "mostly useless", not "completely useless".)

There are two kinds of value in careful structuring of a web site:

  1. Making it obvious to users what kind of information is available, as quickly as possible. (Value to end users - at least 80%.)
  2. Being able to comprehend entire sections of your web site in totality. (Value to end users - at most 20%.)

I think we should be focusing 80% of our work (for number 1 above) on making it as easy as possible for users to consume the information on your web site without having to browse for anything. This is why technologies such as RSS and aggregators (very effective means to be informed of new material) and full text search (very effective means to discover specific material on your web site) are great. Spending most of the rest of the effort writing good content, and better yet, editing it effectively, will be more valuable than, for example, creating the best metadata taxonomy for tagging your content.

Why is metadata tagging so hard? The clincher for me here is how difficult it is to organise information in ways that are meaningful for more than a subset of the users of your web site. That's why I used more categories for this post than any other I've written. Is this an article about Ideas? Or information management? Perhaps knowledge management? And there's some search in here too, how about that?

For number 2 above, it is occasionally important to be able to grasp entire sections of a web site as a coherent whole. An organisation's annual report, for example, is a set of hyperlinked documents that really only make sense in totality, not as individual units. But this is a far rarer use, and one for which there is usually a natural structure to be imposed.

Bruce Croft on the Information Retrieval Landscape

I went to Bruce Croft's seminar on the Information Retrieval Landscape today. It was a great high level overview of where we are in the world of IR today, and what are the challenges still in front of us. I'm always so impressed to hear people speak when they have such a sweeping and comprehensive understanding of an entire field of research as Bruce does about IR.

Bruce mapped out an interesting space of queries as the fundamental way to analyse where we are at in "solving" information retrieval challenges. He has a broad categorisation of "simple", "complex", and "hard". Basically, "simple" queries are mostly a solved problem (think Google style homepage finding or simple factoid question answering), but "complex" problems are not solved, and "hard" problems are likely going to remain hard for a very long time.

The major insight was how far we still have to go with problems to solve in IR. In comparison to other fields of text processing (such as categorisation), where researchers are obtaining close to 90% breakeven (a technical measure marking the point where precision is equal to recall), standard "ad hoc" queries under TREC experiments are performing still only at between 20-65%. Even "simple" query performance is currently at best achieving a 70% rating (using a different measure from breakeven, which doesn't really apply for simple queries). (The measure is mean reciprocal rank, for those who want to know exactly what it was - basically a measure of where in the list of results you get the first relevant answer.)
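
For those who like to see the arithmetic, mean reciprocal rank can be sketched in a few lines. The function names and the toy result lists below are mine, not taken from Bruce's talk or from TREC tooling.

```python
def reciprocal_rank(ranked_results, relevant):
    """1/rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_results, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Made-up results for three queries.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant at rank 2 -> 0.5
    (["d2", "d9", "d4"], {"d2"}),  # first relevant at rank 1 -> 1.0
    (["d5", "d6", "d8"], {"d0"}),  # nothing relevant found   -> 0.0
]
print(mean_reciprocal_rank(runs))  # 0.5
```

So a 70% rating on this measure means that, averaged over queries, the right answer is turning up somewhere between first and second place.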

So assuming that it is in fact possible to approach 90% (and the jury is out on whether in fact it is), then there's some way to go! Bruce wrapped up by commenting on what he thought would be the best mechanisms to accomplish progress towards that. At the moment, much of the funding is heading down only one particular path, which, although it may solve some subset of the "complex" query classes, may not solve all of them. But sadly funding isn't always tied to solving the hardest problems.

A highlight for me was his take on current research indicating that language models may be a better and more broadly applicable fundamental IR technique than the standard Okapi BM25 probabilistic ranking algorithm. The jury remains out on this one (not everyone agrees), but it's interesting to hear that this is the trend from researchers at the moment. (During the 1990s Okapi replaced vector space models as the primary algorithm in use for "ad hoc" query processing, so it is a rolling issue to find a better system.)
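
As a rough idea of what "language model" ranking means in practice, here is a minimal sketch of query-likelihood scoring with Dirichlet smoothing, one common variant of the approach. It is not what Bruce presented and not Lemur's API; the function name, the toy documents and the value of the smoothing parameter `mu` are my own assumptions.

```python
import math
from collections import Counter


def query_likelihood(query, doc, collection, mu=2000.0):
    """Log P(query | doc) under a unigram model with Dirichlet smoothing."""
    doc_tf = Counter(doc)
    # Recomputed per call for brevity; a real engine would precompute this.
    coll_tf = Counter(term for d in collection for term in d)
    coll_len = sum(coll_tf.values())
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len
        if p_coll == 0:
            continue  # term unseen anywhere: skip rather than take log(0)
        score += math.log((doc_tf[term] + mu * p_coll) / (len(doc) + mu))
    return score


# Toy collection of tokenised documents.
docs = [
    "the channel tunnel links england and france".split(),
    "the tunnel under the mountain was closed".split(),
    "france and england played football".split(),
]
query = "channel tunnel".split()
ranked = sorted(
    range(len(docs)),
    key=lambda i: query_likelihood(query, docs[i], docs, mu=10.0),
    reverse=True,
)
print(ranked)  # document indexes, best match first
```

Each document is treated as a little probabilistic model of text, and documents are ranked by how likely they would be to "generate" the query, with the whole collection used to smooth over words a document happens not to contain.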

There are some good toolkits for rolling your own language model-based search engine apparently - the Lemur Toolkit in particular - if you want to go play.

Good to know there's still work to be done!

Virtual identity - which one do I choose?

Often when I blog I want to refer to something or someone. When it's a something, then in the world of the web, there is a beautiful thing called a Uniform Resource Identifier or URI for short. Sometimes URIs are also known as Uniform Resource Locators or URLs. These are the hypertext links we see in our web browser. The URI for my weblog is http://www.synop.com/Weblogs/Peter/ for example. These things all exist in a virtual world, and thus are identifiable.

When it's a someone, the answer is not so clear.

Why is this important? When working with computers, it's very important for them to understand the notion of identity. To answer the question, is A identical to B, the computer needs to know the identity of A, and the identity of B. Vice versa, it's at least equally important (often more so) to know when A is not identical to B. Of course, in computing, A may be equivalent to B in some circumstances, but equivalence is not the same as being identical.
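
The difference between identical and equivalent is easy to see in most programming languages. A small Python illustration (the values are made up):

```python
a = ["peter", "canberra"]
b = ["peter", "canberra"]  # same contents, but a separate object
c = a                      # another name for the very same object

print(a == b)  # True:  a and b are equivalent
print(a is b)  # False: a and b are not identical
print(a is c)  # True:  a and c are identical
```

Here `==` asks "do these have the same value?" while `is` asks "are these the same thing?", which is the question that matters for identity.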

As people, we have the same needs. We like to know who we are talking to. It really does matter that I'm talking to my best friend, and not my best friend's friend. We have highly evolved senses that learn to discriminate between individuals, so that we can recognise people that are important to us from fragmentary bits of identification - the tone of their voice, the sound of their footsteps, or the curl of their hair on the back of the neck. We can make this identification in fractions of a second, from considerable distance, and with lots of other distracting information surrounding them.

So when I wish to refer to someone, and provide a hyperlinked virtual "identity" what do I use? I can choose from among the following options:

  • email address (bad, don't like spam, but most people online have one - and there again, which email address do I use if they have several?)
  • home page URI of their website (if they have one)
  • public key block (but only the geeks would know what this is, or have one available for use, and a string of hash digits seems wrong)
  • mobile/cell phone number (if they have one, or I know it, or they're happy to use this publicly; this one seems odd, but with the growing convergence through VOIP of the notions of being "online" and "having a personal telephone", this may not be so bad)
  • instant messaging nym (but which one, if they have 3 or 4 like many people do?)
  • blog URI (but what if they have multiple blogs - a work one, a personal one?)

Of course, organisations such as Microsoft (with Passport) or the Liberty Alliance Project, are working to provide a notion of digital identity. But these are not readily available, or (in the case of Microsoft's Passport) are tied to an email address.

The proliferation of different virtual identities arises from the proliferation of different modes in which we insert our physical presence into a virtual medium. Nearly all of these are focused on communication:

  • email address - asynchronous (mostly) text messages
  • home page - one-to-many (typically infrequent) multimedia publication
  • public key block - security wrapper over asynchronous communications
  • mobile/cell phone number - immediate live (voice) communication
  • IM nym - immediate (synchronous) text messages plus "online" presence information
  • Skype nym - immediate live (voice + text) communication plus "online" presence information with privacy modes
  • blog URI - one-to-many (typically frequent) (mostly) text (but increasingly photos and audio) publication

None of these are perfect, but this may be because the (communication) context in which we are operating differs at times. [An interesting question from a semantic point of view, are these identities the same?]

At the moment, my preference is the person's blog URI. The reason is that it's a good recent snapshot of their public information, it's a conversation between themselves and the world (at least implicitly), and they get to publish what other virtual identities for communication you can share in. Sometimes (as happened recently to Russell Beattie) your virtual identity may even help you get a new job.

So by presenting this as their virtual "identity", I get the best compromise. Particularly in that if you're reading my blog, you're in the right communication context to see someone else identified by what they write publicly on their blog.

As the conversations become deeper, and trust grows, the people concerned may let out more of their virtual identities to each other. The key is in letting them manage the disclosure level, as Jon Udell writes. Ultimately, they may even meet! (Which can be disconcerting as Dave Pollard found.) Of course, other people (like Robert Scoble) are inherently trusting, and share their virtual identities with everyone anyway. Call me paranoid, but I'm not quite ready to do that just yet! Maybe next year ...

Collaboration and communication

Dave Pollard writes about how we can improve collaboration. In the article he writes about the need for a concept he calls intellectual agility, to enable effective rapid collaboration. He also makes a guess that women are inherently better at it than men.

I was working with a client this morning, and the points Dave makes are all relevant. During the visit, we discussed ongoing collaboration possibilities between our organisations in the future, as well as discussing ways we can collaborate effectively as part of the acceptance testing phase of the current project. The ability to rapidly absorb and assimilate other people's ideas is essential to making these meetings productive and successful, as we each bring our own ideas to how we could work together.

At the heart of successful collaboration between organisations is the relationship formed between individuals. (No matter what the contract says.) And the key to all successful relationships is egoless communication. I believe we often think of communication as a one-way process - I speak and you listen. But of course that's just information transmission.

What we're seeking is that I speak and you listen and you speak and I listen, and we both understand each other, and we don't let our egos get in the way of hearing wisdom just because it's not what we would have said, or the way we would have said it. This ability to have egoless communication seems to me to be an essential bedrock to Dave's concept of intellectual agility. 

We are starting to understand much more about biological differences in brains and other bodily structures that contribute to language skills, and there have been studies which show aptitude for language and communication is more highly developed in women than men. [See this linguistics lecture from U. Penn for more details.] So if egoless communication is the key to intellectual agility, then Dave's guess that women are going to be better at it than men is probably correct.

Great summer search seminars coming up

Dave Hawking alerted me to a series of 3 seminars that are being organised by the Enterprise Search group at CSIRO. These should be superb, featuring:

  • Bruce Croft - The information retrieval landscape (Fri 26 Nov 2004 12:00-13:00)
  • Mark Sanderson - SPIRIT - a geographically based search engine (Thu 2 Dec 2004 12:00-13:00)
  • Ellen Voorhees - The state of the art in Question Answering (Mon 6 Dec 2004 12:00-13:00)

These three information retrieval researchers are some of the most experienced and smartest people out there in the IR community. Many of their past colleagues now work at Google (and/or Yahoo! and/or Microsoft Research), and so aren't able to talk publicly much about what they do any more.

So if you're in Canberra on any of those days, get along and hear some IR luminaries talk openly about where they think things are headed in the world of search. Contact seminars at mail . panopticsearch . com for more information.

Usability - simple, obvious and timely

I recently bought a late-model second hand car, and was struck by how simple innovations can continue to be made in terms of usability. You'd think that pretty much all sensible ideas about making the driving experience easier had been thought out years ago for cars, but it's not so.

How often have you jumped in a car, particularly a rental one or a friend's, and gone to fill up with petrol at the service station, only to do the petrol cap dance? (That's the one where you park the car, only to discover the petrol cap is on the wrong side from the bowser, so you see if the hose will reach, or just accept that you have to 3-point turn the car to face the other way.) Or spend time wondering as you're approaching the service station which side it's on, and trying to peer into the rear view mirrors to see if you can see the line of the flap in the car body? (Instead of focusing on the road ahead.) I've discovered an ingenious mechanism where (if you have a remote release) you can pop the flap and see which side it's sticking out on in the mirrors which is a bit easier. But not all cars have remote releases either, and you have to be driving slowly else the wind will keep it pressed in.

The car, a Subaru Impreza, solves all these problems at a stroke.

The solution: two words "Fuel door" (simple) and an arrow (>) to indicate the side it's on (obvious).

These are positioned right above the fuel gauge, so when it's heading towards empty and you're thinking about visiting a service station, the information you need is right there in your face (timely).

Of course, cars are far more simple to consider from a usability perspective than stupendously complex and constantly changing software systems. But the principles remain the same. I believe usability, when boiled down to an essence, is all about making every aspect of a (software) system that a user has to interact with:

  1. Simple - so that anyone can understand.
  2. Obvious - so that it's clear what action to take.
  3. Timely - so that the information is available when you need it.

For more reading, Peter Merholz writes eloquently about the first two aspects - simplicity (in Explicit design's relationship to simplicity), and obviousness (he uses the term explicit, in Explicit labels).

I've struggled a bit to find people writing about timeliness of information with respect to usability, especially in software systems. Even my favourite book of all time on design (Donald Norman's The design of everyday things) doesn't approach this explicitly. Norman does discuss the issue implicitly, using the concepts of knowledge in the world and knowledge in the mind. He writes,

Knowledge in the mind is ephemeral: here now, gone later. We can't count on something being present in mind at any particular time, unless it is triggered by some external event or unless we deliberately keep it in mind through constant repetition (which then prevents us having other conscious thoughts). Out of sight, out of mind. [p. 80]

In the car example above, ultimately knowing that the petrol cap is on the right may move to being knowledge in my mind. Reassuringly for me, it can stay as knowledge in the world, and is there to remind me every time I need it.

Vertical searching - published papers

Great to see today that Google (in the form of Anurag Acharya) has just launched Google Scholar [via Google Blog]. In my recent post on how big are everyone's sites, I talked about how interesting it is to see generic search and vertical industry-specific search facilities. Google Scholar is a great example of both. It searches generically over published papers, but also provides a form of vertical search - that of academe. It also shines a light onto some dark matter of the web (as I talked about in that previous post as well), by crawling subscription-only material, following agreements with publishers.

The name of the principal engineer, Anurag Acharya, seemed awfully familiar to me. So using his new facility, I tried searching for publications written by "Anurag Acharya". Sure enough, Anurag has been involved in computing from way back, and I'd read some of his work on distributed vs shared memory computing when doing my own graduate study in the mid 1990s. Of course, there may be multiple Anurag Acharyas in computer science, just like there are multiple Peter Baileys.

I had to do a vanity search of course for my own publications, and am pleased to see my favourite paper (Engineering a multipurpose test collection for Web retrieval experiments) that I wrote (with Nick Craswell and David Hawking) at number 2, with 41 citations.

I think Google have taken an interesting approach by ranking results apparently almost exclusively by number of citations when searching on author names. This certainly provides a rapid method for establishing paper popularity.

Of course, by adding other subject terms, more of Google's ranking algorithm comes into play, and the citation rank is not the only factor. From first impressions, Google Scholar appears to do a better job than Citeseer, which has long ruled in the area of searching for research papers. (Steve Lawrence, the main developer of Citeseer, now works as a senior research scientist at Google.)

The Scholar search is not yet perfect however, as it appears to add in some papers that are referenced by you in one of your papers.

For example, I tried searching on "peter bailey" information retrieval, and got back the classic Cleverdon paper "The Cranfield tests on index language devices", cited in our paper mentioned earlier. My suspicion is that this arises because of the added value being provided by the ACM, which provides lists of citings of ACM papers by other published papers. Hence, since our paper references it, it is listed in the citings by the ACM, and then Google Scholar picks it up as part of the indexable material for the paper. Personally I think this is a bug, and either ACM should cloak these paper abstracts without citings for Google, or Google Scholar should exclude it. After all, I wasn't even born in 1967 when Cleverdon published that paper!

[More analysis and discussion at Search Engine Watch.]

On the value of acceptance tests

We've been chugging through the creation of a large set of acceptance tests for our major project for AIHW. What's been fascinating is how valuable it is to write the acceptance tests, not just use them. To date, we've been using them to help carry out quality assurance testing when we do a new release (typically every three weeks). The value in writing them has been for retrospectively reviewing the use cases of the user requirements specifications. 

While the user requirements went through several iterations and have been considered carefully by at least 3 people over several months, there have still been anomalies and inconsistencies in how the system was intended to work. Given the system is complex (several person years effort), this is not surprising. The acceptance tests effectively provide a new lens through which to examine the user requirements.

What this suggests to me is that having multiple ways to view your information has intrinsic value from a validation perspective. Dave Pollard has been discussing using the Wisdom of Crowds to help validate decisions and analysis. Simply providing new ways for an individual to view the same information - perhaps just presented in a different way - provides another mechanism to take a fresh perspective on its value and correctness. This perhaps is the heart of why crowds work well, by bringing together very rapidly multiple different perspectives to the same information.

Speed vs accuracy tradeoffs - get it fast, then get it right

I just read a post by Jeremy Zawodny (of Yahoo!) about his desire for speed (or efficiency) more than relevancy in search results, but preferably both.

I agree!

Outside of the context of search, we've been tossing around the same set of choices recently, both on our major new 11179 metadata registry Sytadel project for AIHW, and in our internal new product development project building on Sauce.

Conventional software development wisdom says that: first you get it right, then you get it fast. On any number of projects I've worked on, we followed this approach, building the correct solution, and then if it runs like a dog, looking to start optimising it (and usually finding 1 or 2 orders of magnitude speedup by looking carefully in the right places). That's equivalent to building a car that can drive at 10km/h, and then finding ways to tweak the engine performance so that it runs at 100 to 1000km/h.

Since we often follow a loosely agile approach to building software, the overriding cause of customer frustration however is that when they use a version of the system sometime in the first half of the project, their immediate and dominating perception is of the performance limitations. No matter how much you explain that everything will get much faster by the end, that first impression is still formed, and it's a negative one.

I'm starting to think it might be better to design and build with the opposite choice as the dominating factor. You'd still have explaining to do for the customer: don't worry about bugs, we'll sort those out. (After all, you're always going to have bugs anyway.) But if the system is fast from the very first time it gets used and stays fast, then the customer is always on side with regards to being able to use the system efficiently. Increasingly, I believe we have very low tolerance for inefficient (read slow) systems, perhaps because of our need to get faster.

Despite this going completely against new software development practices such as test driven development, my new software development meme is: get it fast, then get it right.

Taking a leaf out of the big G's book

It's early days yet at Microsoft's new search engine (still in beta), but I'm wondering why they've not taken more of a leaf out of that big scary Google monster's book, and tried to implement some of the better known techniques for getting relevant results towards the top of the ranking. And don't just blindly reimplement PageRank, it's got relatively little to do with it - anchor text is of far more importance.

Robert Scoble suggests that MS need to be more transparent with their search platform than Google - a great idea. But they also need to build technology that's not substantially worse at relevance ranking than their competitors.

For several years, I've always come in on the first page of results for Google when Googling myself. (Often it has had several pages, since I've had various home pages tracing me through the web world.) Since starting my blog at Synop, that has become my de facto identity/number one result for the concept of "me" on the web as far as Google is concerned. The advent of someone who owns the domain name peterbailey.net has put paid to the number one ranking, but that's okay. With MS's search, I can't find myself (where myself is my blog) in the first 10 pages of results! The only reference to "me" is to an entry on Synop's FAQTs site, at about page 7. It's okay for Scoble, he's got a relatively unusual name, and an awful lot of people who link to his blog (just like me in this one!).

Still I hear from friends who should know that the heat is on at Microsoft to do much better in search, and it's not so hard that they couldn't catch up, at least technically. So I look forward to seeing some rapid progress over the coming months. Until then, I'll still be using Google or Yahoo!.

How big are everyone's sites?

Well, Google now is claiming to index 8 billion pages. Microsoft has come out with their new search (still in beta) with an index size of 5 billion [via John Battelle]. And Yahoo! is probably up in the same kind of vicinity (or will be shortly).

It's been a while since we had an attack of the "my search index is bigger than yours" in the publicity of major search engines, or as Danny Sullivan more cordially puts it, in Search Engine Sizes, "Is bigger better?".

Of course, as various people have pointed out (including yours truly in Dark Matter on the Web), these indexes still represent only a fraction of the real information on the Web. [Read the extended technical report for more information on why.]

What's more interesting to me these days is seeing the plethora of different kinds of generic search - images, news, photos, web, shopping, local, music, rss, ... as well as vertical industry-specific searches.

The Google desktop search (and other upcoming local searches) remains to me far more exciting with its emphasis on integrating my personal information world with the outside Web as well.

Now, if only someone could come along and put these together compellingly with enterprise search, I'd be even more delighted. But maybe that turns out to be a variant of highly extended personal search (that reaches out into my local and protected environment).

How large is that site?

Have you ever wondered how many pages some web site has? (We do occasionally.)

A close approximation can be found with a bit of help from Google or Yahoo.

The trick is in knowing some of the advanced search query syntax. Google allows you to search for pages which do not contain some term, using the "-" operator.

So a query to Google such as:

-aMadeUpTermButNotThisOne site:www.synop.com

will report the number of matching pages (currently this says: Results 1 - 10 of about 1,620 from www.synop.com). From this, we can determine that www.synop.com contains about 1620 pages (or at least, that's how many Google has stored in their current index).

What we're really asking Google to do is to compute the following query:

(find me all matching pages which do not contain) (this made up term) (in the site) (name of site I am interested in)

And because the made up term we use hopefully does not exist anywhere, every page on the site will match.
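For anyone wanting to script this, here is a minimal sketch in Python. It only builds the query string and pulls the estimate out of a summary line like the one quoted above; it does not talk to any search engine. The function names are my own invention for illustration, and the summary format is assumed to match the "of about N" wording shown in this post.

```python
import re
from urllib.parse import quote_plus


def site_size_query(site, made_up_term="aMadeUpTermButNotThisOne"):
    """Build the 'every page that lacks a nonsense term' query for one site."""
    return f"-{made_up_term} site:{site}"


def parse_result_count(summary):
    """Pull the estimated total out of a summary line such as
    'Results 1 - 10 of about 1,620 from www.synop.com'.
    Returns None when the line does not contain an estimate."""
    match = re.search(r"of about ([\d,]+)", summary)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


query = site_size_query("www.synop.com")
print(query)              # the text to type into the search box
print(quote_plus(query))  # the same text, encoded for use in a URL
print(parse_result_count("Results 1 - 10 of about 1,620 from www.synop.com"))
```

The parsing step is deliberately loose, since the exact wording of the summary line differs between engines and changes over time.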


The same query works on Yahoo (-aMadeUpTermButNotThisOne site:www.synop.com), due to the increasingly pseudo-standardised search query language in use by the major players. (But Yahoo's results appear to be far smaller [Results 1 - 10 of about 229 for -aMadeUpTermButNotThisOne site:www.synop.com] than Google's for some reason - perhaps less aggressive recognition of individual blog entries?)

So I'd recommend using Google for this kind of search.

[Credit to Nick Craswell, who first showed me this technique several years ago.]

Why wikis?

Jon Udell's "The Wiki way" talks about Wikis more generally and JotSpot in particular. In Jon's words:

The users of a Wiki think of the process as organic growth. Enterprise IT planners tend to regard it as unstructured chaos. They're both correct. JotSpot's aim is to harmonize these opposing views by empowering users to create islands of structure in their seas of unstructured data.

I liked JotSpot's demo; adding user-definable forms (structure, or what we'd call content types in Sytadel) to unstructured information is a nice feature. It's also one we often hear requested for CMS deployments. Interestingly, we moved away in Sytadel 4.0 from supporting this kind of facility (preferring to use just XSD and XSL), whereas in Sytadel 3.0 we had a complete Construction E-gineer which allowed users to build their own new content types.

However I disagree with the basic thesis that by allowing users to create structured information you will address those Enterprise IT planners' concerns. Let me explain ...

I've been trying to understand the exact appeal behind wikis for some years, since using one extensively in a previous company in the year 2000.

Wikis are great - they do let you (as in you, the average user in an organisation) rapidly build a web site, with minimal involvement from technical experts once they've installed the underlying software in your web server. This is a strong attractor.

One of the strongest features of wikis is the ability to hyperlink to other content while continuing to write without losing context. This works by using camel cased titles. (Camel casing is words where there are no spaces and all the individual words are capitalised.) This blog entry would be referred to as WhyWikis for example. (An alternative is to use [['s and ]]'s.) It's intriguing that this simple facility alone is noticeably more efficient than highlighting text, clicking a url insert icon, and entering the destination URL. Thus wikis provide a simple form of content management, where the location of the destination content is irrelevant to you, and the addressing of the content is also trivial (provided you know what it's called).
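To make the camel casing idea concrete, here is a rough Python sketch of how a wiki engine might spot such titles and turn them into links. The regular expression (two or more capitalised words run together) and the /wiki/ URL prefix are assumptions for illustration only; real wiki engines each have their own rules.

```python
import re

# Assumed convention: two or more capitalised words with no spaces, e.g. WhyWikis.
WIKI_WORD = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")


def find_wiki_words(text):
    """Return every camel cased title found in the text, in order."""
    return WIKI_WORD.findall(text)


def link_wiki_words(text, base_url="/wiki/"):
    """Wrap each camel cased title in an HTML link to the page of that name."""
    def to_link(match):
        title = match.group(0)
        return f'<a href="{base_url}{title}">{title}</a>'
    return WIKI_WORD.sub(to_link, text)


print(find_wiki_words("See WhyWikis and ProjectConversations for more."))
print(link_wiki_words("Read WhyWikis first."))
```

Notice that the writer never supplies a URL: the title alone is the address, which is exactly why the location of the destination content stops mattering.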

The Wikipedia is a great example of where Wiki technology is superb. The content is basically unstructured, and it's very unlikely that individual content items run into naming clashes. The one example where this may start to occur more frequently is references to individuals.  For instance, while Bill Clinton is currently the only entry for Bill_clinton in the Wikipedia, what happens if a famous business tycoon called Bill Clinton comes to prominence? (I've run into this problem in a minor way at my old university, where the name "Peter Bailey" was shared by a far more distinguished individual, Peter Bailey of human rights and public law renown.)

And wikis allow simple, collaborative, open content creation. As we found with FAQTs in the past, providing simple, collaborative, open content creation systems to people will generally be used (gratefully) and not abused.

There are a number of problems I suspect for Enterprise IT planners with wikis, and it's not the existence of unstructured (vs structured as in form or content type) content.

First, it represents yet another form of content repository that needs to be backed up, archived, and supported for users. Every new repository adds another lot of complexity to their tasks.

Second, many organisations are yet to embrace free-for-all intranet content. This will (hopefully) change, but there is still strong reluctance to unleashing all users within an organisation to publish whatever they like to whoever they like. (This of course is a question of trust, between managers and non-managers.) The smaller the organisation, the more this is likely to be supported on average I suspect. Nevertheless, electronic workflow/approval rears its ugly head once again.

Third, not all content (and content creators) successfully self-organise. There remains strong value in pre-organisation of the information (this is essentially the field of information architecture), and strong value in editorialising (content managers) and post-organising (librarians). Pre-organisation is an evolving area of research and best practice (but almost completely content domain specific), and editorialising and post-organising is nearly all human-centric. (Some emerging practices such as content tagging in del.icio.us and Flickr are interesting approaches to this problem.) The upshot of these issues is that the content repository becomes rich with information, but lacking authority or easy navigability. Highly effective search helps the navigability, but managing the authority of information is harder to achieve.

Fourth, it's hard to refactor the content. For example, I decide I want to consistently rename various bits of information, and all the links to it. I can do this if I have local access to the file system, and am handy with Perl or some other regular expression tool, but as a general user that's difficult through a wiki's interface.
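As an illustration of the kind of file system level fix I mean, here is a small Python sketch (rather than Perl) that renames one camel cased page title wherever it appears. It assumes the wiki stores its pages as plain text files with a .txt suffix under one directory, which will not be true of every wiki; the function names are hypothetical.

```python
import re
from pathlib import Path


def rename_wiki_word(text, old, new):
    """Replace whole-word occurrences of one page title with another."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    # A function replacement avoids any backslash handling in `new`.
    return pattern.sub(lambda _match: new, text)


def rename_across_pages(root, old, new, suffix=".txt"):
    """Rewrite every page file under root; return the paths that changed."""
    changed = []
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        original = path.read_text(encoding="utf-8")
        updated = rename_wiki_word(original, old, new)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed


print(rename_wiki_word("See WhyWikis, not WhyWikisNot.", "WhyWikis", "WikiAppeal"))
```

Even this much needs shell access and some care with backups, which is the point: the operation is simple for a developer and effectively out of reach for a general user.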

Despite these problems, the emerging number of people using wikis, and more interestingly, companies building on top of wiki technology concepts, demonstrates that there is strong appeal in their fundamentals.

Porous ephemera

Bookmarking the weekend, I went to two great shows.

The first, by Conan the Bubbleman, was called Ephemera. This is a magical show, of bubbles weird and wonderful, and totally organic - nothing that is entirely reproducible. If you ever get a chance to see Conan's show, do it, because there are few people in the world I suspect who can create such beautiful and astonishing bubbles. 

The second was by Faithless, playing a gig in Canberra, and they went off, as did the crowd. Although the support act Way Out West was good, Faithless was great, having the crowd dancing energetically and grooving for 90 minutes.

In the latter, I was intrigued by the actions of a substantial minority of the crowd. For these people (primarily in their early 20's), it appeared that an essential part of their experience was to relay and/or record it by mobile phone, SMS messaging, digital camera, or mobile phone camera.

Personally I couldn't work out how mobile phones would be effective:

"Hi, I'm at the faithless gig"

"What?"

"I'm at the faithless gig I said"

"What? I can't hear you? There's some loud music playing ..."

It was that kind of concert!

The point of this post was that it struck me how, as communications technology becomes ubiquitous, more and more experiences are lived vicariously.

Of course, people don't set out to routinely record every single digital interaction they have or capture. I wonder indeed whether people even care? Although I have a friend who records and archives absolutely every single email he's ever been sent for the past 15 years or more, I think this is the exception rather than the rule.

Businesses on the other hand do care a lot. For them, there is strong concern if something of value is not recorded. Working out what is of value is a much harder challenge, especially automatically.

As information communication channels (email, IM, digital video and audio, blogs, digital photos) become ever more widespread throughout organisations, so too the leakage of digital information will become ever faster.

I like to think of this as the porousness of the information pipe. It's a bit like the currently broken water pipe under the pavement outside my house, which, as I wait for the local water company to come and investigate, exudes more and more water up through the soil, turning it into a sludge. It started off just a slow seeping, but every day I expect to come home and find the street awash.

But we can't turn off the tap. And although some efforts to routinely trap and archive all digital information that flows in and out of an organisation may work for some of the mediums, modern technologies (USB drives anyone?) and work practices (laptops and tablets for that evening or coffee-shop work) will continue to work around them. It's about as effective as trying to fix a pipe that's 3 feet underground and you don't even know where it is.

Hopefully the interesting stuff will bubble up to the surface, like it usually does.

Time and tool slicing

The last two weeks have been quite incredibly busy for me. Work, home life, a major celebration, lots of friends and family around, major project deadlines, tenders to respond to, keeping abreast of news and blogosphere - it never stops, like a wheel within a wheel, never ending or beginning.

Somehow, the day never produces any more than 24 hours to get these things done.

Just as modern operating systems slice their CPU's processing time over tens or hundreds of different tasks to give the illusion that they're all moving forward simultaneously, in such excessively busy times we (humans) have to do the same thing. And just as these operating systems have to invest resources into the task of swapping process state in and out of memory, so do we. Which means we lose some of that precious time just keeping all these things moving forward.
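For the programmers reading, the cost of all that swapping can be shown with a toy round-robin simulation in Python. This is a deliberately simplified sketch, not how any real scheduler works: every task gets a fixed slice in turn, and a fixed switching cost is charged whenever there is more work waiting. The task names and numbers are made up.

```python
from collections import deque


def round_robin(work_ms, slice_ms, switch_ms):
    """Simulate fixed-slice round-robin over tasks given as {name: work in ms}.

    Returns (finish_times, overhead_ms): when each task completed, and how
    much time went on switching rather than on useful work.
    """
    queue = deque(work_ms.items())
    clock = 0
    overhead = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(slice_ms, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))  # back of the line
        else:
            finished[name] = clock
        if queue:
            # Simplification: every hand-over costs the same fixed amount.
            clock += switch_ms
            overhead += switch_ms
    return finished, overhead


finish_times, overhead = round_robin({"tender": 20, "blog": 10}, slice_ms=10, switch_ms=2)
print(finish_times, overhead)
```

Shrink the slice or raise the switching cost and the overhead climbs quickly, which matches how these weeks have felt.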

The problem of course is that as humans we don't have neat address ranges and register values that can be extracted cleanly. Instead we have messy context, consisting of half a dozen different tools and technologies (paper, post it notes, lists, OneNote sheets, spreadsheets, word documents, blog entries, emails, web sites), and the mental coordination to keep these all singing in tune.

As we build more systems and tools here at Synop, I think we realise more and more that trying to minimise the impedance between task swapping is vital to success. Sharp discontinuities are mentally exhausting for users. Building an experience of smooth gradients is challenging however, relying on subtle and rigorous design and implementation. Working with users' preconceptions helps, as Donald Norman describes so brilliantly in The Design of Everyday Things. However it becomes increasingly challenging knowing that nearly all the interactions are going to be with other companies' tools and technologies, and that while we can attempt to make our systems as smooth as possible, it is the jumps from these to the others which are the major points of discontinuity. In other words, design for inter-operation within an ecosystem of other people's tools.

Maybe this is why nerds tend to be early adopters in general. A talent for dealing with abstractions, which is what makes software development fun and enjoyable, lends itself to working with other people's models and abstractions, manifested as new tools and gadgets.

Time to get back to the next major deadline ... extract blog writing context, insert web site development/review context ...

Tool/process inertia

In my earlier post on Project conversations, I spoke about a new project and our desire to experiment some more with different communication mediums:

I've been adopting an approach of trialling different communication/collaboration/information sharing tools and channels, in the hope that at least some of them will prove effective and useful. I'm not exactly sure which of these is going to be the most helpful - maybe it will be the group email, maybe the project weblog, maybe the shared file storage. Guess we'll just have to suck them all and see!

As part of the first phase of the project, it's essential for us to record a number of important design decisions as they are made, so that everyone has a record of what we've agreed. Some of this ends up documented in the requirements specification ultimately. But not necessarily all of it.

The design decisions themselves arise from either conversations we have face to face, or through encountering issues while specifying the requirements. The project team is fragmented across three organisations, without shared file space. So how did we end up communicating and recording these?

Basically, as often happens, we defaulted to email (with a common subject prefix - DD) for non-face to face interactions, which are then collated in a spreadsheet to track the conversations around each decision. Face to face conversations relating to the design decisions are written up and emailed around. Periodically, the current spreadsheet is emailed to everyone, to provide a record of all the updates.

The obvious alternative, a project weblog with a specific category for design decisions, together with commenting mechanisms, was not adopted.

I've been thinking about why.

Martin Roell's recent excellent article about Distributed KM makes a number of interesting points about the ubiquity of an email client.

Today, the email client arguably has become the most intensively used knowledge work tool.  ... The reason for this is email's embeddedness in communicative processes.

Martin makes the following summary:

In my view, email has two characteristics that are its success factors for becoming a "serial killer app":

  1. It embedds managing information with communication
  2. It is personal (it belongs to the user), private (no one else can access it if it is not shared) and personalisable (it can be configured to ones personal needs and work style).

Or, to put it differently: Email is successful because it is personal and social at the same time.

He then goes on to examine weblogs (and weblog readers/publishing tools) and how they can contribute to the kinds of knowledge processes we've been carrying out.

Note that we do have a project weblog. And we do use it. Our decision not to use the weblog for this particular circumstance (which bears the hallmarks of all of the Information Snowflake set of activities) arises for a number of reasons I believe:

  1. Everyone has an email client as part of their work environment. Synop people all have Sauce Reader, but our customer team would have to use a web-based interface (or get changes made to their SOE), and this is not ingrained into people's work habits.
  2. The Blogger-hosted weblog does not support categories, to allow easy discrimination between posts to the project weblog on design decisions versus other information. (Yes, other weblog publishing engines support categories, but without more expert knowledge they are harder to set up in 5 minutes in a way that is instantly accessible to the distributed team.)
  3. Some of the design decisions go to a slightly wider audience than others. This is easy to express within email - as per Jon Udell's analysis of ridiculously easy group forming (and dissolving). This is hard to do still with weblogs. (Not that the information is necessarily private, but it's unnecessary to be distributed to a wider audience all the time.)

The email solution is by no means perfect, and in fact materially suffers in many ways in comparison to a weblog/weblog tools approach:

  1. The current definitive set of design decisions is expressed in an ad hoc medium (email conversations), and only periodically are these summarised and available in one place from the spreadsheet.
  2. There is no cross-linking available (e.g. this design decision is made because this <link to>earlier design decision</link to> has these consequences). Thus there is no easy mechanism for people to explore the background of decisions.
  3. It is more difficult to track how the conversation around an individual design decision unfolded. An email client's view on conversations denies the richness that a threaded (or even serialised) sequence of comments would provide. As much as anything, this is just a straight interface issue because the primary information (timestamps, people) is identical in both mediums.
  4. Design decision communication intermingles with lots of other communication that arrives by email, adding to the filtering lag. 
  5. Both one member of the customer's team and we ourselves have multiple email addresses - which one should be sent to today? Send to all?

What will it take for weblog tools to become the medium of choice for such issues? I believe the critical features are:

  1. A ridiculously easy interface - for managing the intended audience for weblog posts.
  2. Tool ubiquity - everyone has to have one, and they have to be part of the daily set of tools in use.
  3. An easy ability to express the importance of being alerted to new material - it's got to be simple to say that new design decisions are important to me and I want to know straight away (or within a short period of time).
  4. An easy ability to add downloadable attachments - some of the design decisions come with affiliated information that should be preserved in its native format, not converted to an HTML presentation.

Guess those of us involved in building PKM tools better just keep working to make them simpler still! Until then, the "good enough" syndrome of email technologies (basically a tool/process inertia) will prevail over the adoption of more effective technologies.

Project conversations

Synop has recently started a new major project, that is Gantt-charted to take some of us through until early next year. It's going to be a fun project I think, involving solving lots of hard problems, expanding and enhancing Sytadel, and dealing with lots of structured, richly related content (or metadata as they like to call it). I can't say who it is as yet, because we've not yet signed a contract, but will talk more later.

As a consequence, I've been running behind on keeping up with the Joneses in the blogosphere, but was struck this afternoon reading Jack Vinson's recent post on project networks as a social network, pulling together the ideas of Patti Anklam and Dennis Smith. This set of ideas struck home with me, since I'd been spending a considerable amount of time setting up different communication networks for the project - internally for us in Synop and in association with our customer - along with setting up early project management reporting and planning mechanisms.

Along the lines of another recent post by Jack on PKM (starts the conversation), I've been adopting an approach of trialling different communication/collaboration/information sharing tools and channels, in the hope that at least some of them will prove effective and useful. I'm not exactly sure which of these is going to be the most helpful - maybe it will be the group email, maybe the project weblog, maybe the shared file storage. Guess we'll just have to suck them all and see!

Most interestingly of all, I've been struck by the value of conversation. All cross-organisational teams in the early stages of a major project go through a team-forming process. While doing some hard use case analysis with Nigel, an expert consultant to the customer, we occasionally took time out just to sit down and chat about ourselves, our backgrounds, passions and hobbies, and our experiences in life. Nigel related a great story he learned on a team course many years ago. The story goes like this:

The course had 12 or so people, and was split into 3 groups, each with 4 people. His group was left behind in the room, while the other two teams were sent off, the first to be set a task, and the second to await information from the first, and who would then pass more information to the third. The whole thing was to be strictly limited to an hour's duration, and they would be observed by the course organisers throughout the hour. There were two (male) engineers, Nigel himself, and a (female) corporate affairs person for a major company. The two engineers got stuck into doing some risk analysis about what might happen and when; Nigel sat back to chill out and enjoy himself since there was nothing that could be done; and the corporate affairs person said, quietly, calmly, that she was totally confident in the team and its ability to do whatever job was put to them, whenever that might be.

As Nigel remarked, the value of this was the confidence it brought to the other members of the team, and the optimism that whatever their task might turn out to be, it was going to be achievable, and they would do a good job.

The story tells me a lot about Nigel, and is a fine anecdote and antidote to over reliance on Gantt charts for understanding what is happening in a project. Yes, they're important for getting a handle on what may be going to happen at a macro-task level, but they rarely will cover the richness of interactions that occur on a day to day basis. And they definitely don't address the need to tell each other stories, so that we can understand a whole lot more about the nature of the people we are working with, rather than everything that is going on in the project, which is where we began...

The information snowflake - consuming, collating, commenting, collaborating and creating

We spend a lot of time at Synop thinking about information and what happens to it. Richard and Nathan last year came up with a fairly sophisticated but elegantly simple model of the lifecycle of content. This grew out of our experiences in trying to model workflow and activities over content in the context of Sytadel, our CMS product. I'll leave them to write about the full model when appropriate.

However, as I've got some meetings coming up where I need to explain how Sauce Reader (in combination with weblogs) can bring value to an organisation, I've been writing some explanations for the different activities that are involved with information, and thought they might be of wider interest.

We bucket the activities into 5 distinct areas, each of which may be supported (or not) by an information tool. I've created some icons to represent these different areas. This figure is from the perspective of some information - a data-centric view.

Figure 1 - information activities.

Inspired by Peter Morville's User Experience Honeycomb, I've also created an information snowflake, which represents the facets of the user's experience when acting on information. This is a user or activity-centric view, and as such the user is at the heart of the diagram.

Figure 2 - The information snowflake - facets of a user's information experience

A quick explanation of each facet is in order.

  • Consuming. We typically start by consuming information. There are many more information consumers than information creators in the world. The huge success of web browsers testifies to the importance of this versus the number of web logging tools. Human beings are chronic information consumers, probably for genetic reasons as it's the key to our survival.
  • Collating. Once we consume more than one lot of information, we start sorting and collating it. We put different pieces of information together which are related or relevant to each other. This arises from our own mental models and associations. Thus my collations of information may look very different from yours. For example, I file a collection of notes into my OneNote repository of ideas for future blog entries. When I want to write a new entry, I can always look there for inspiration. More sophisticated forms of collating involve active editing of the information as it is stitched together.
  • Commenting. Whether it's just one lot of information or whether we collate many, there will be many circumstances in which we wish to add our own comments to it. Comments help us to remember why we thought the information was useful/relevant/interesting/choose-your-own-adjective. They may also be applied (semi)publicly to other people's information. For example, think weblog comments or the ticks and crosses and remarks from your supervisor on your assignment.
  • Collaborating. Most of us are highly social individuals, and communicate widely with our friends, our colleagues, and people who work in organisations we do business with. Often we will collaborate to produce new information. That can be very formal, passing a document backwards and forwards with MS Word's Track Changes facility on to see what has happened as new material is added or removed. Or it can be very informal, scrawling design notes together on a paper napkin over dinner. We may just wish to share information with a friend, and send them a link to an article using email that we'd like them to read.
  • Creating. Lastly, we create completely unique new information. We may write a blog entry, a journal article, or a press release. Many of us do this every day. And so the cycle continues, because someone else is likely to consume this information we've just created, someone else may collate it, or make comments, or collaborate with it.

By looking at each of these facets, we can start to understand our own use of information or that of other people. Where are the flaws or weaknesses in our own information snowflake? Why would a new information tool help us to process each facet more effectively? When designing a tool that crosses facets, how can we minimise the disruption between user experiences of each activity? Which facets are we particularly good at? Which could be improved?

By breaking down the activities of handling information into these facets, we provide a clearer lens with which to view the multitude of ways we interact with information each and every day.

Morville on unhealthy fixations with the home page

Peter Morville writes elegantly on his approach to user experience design for web sites. There are a couple of great diagrams for information architecture practitioners, which help to capture the different components to be considered when designing. Of most interest to me was his recent focus on findability. I particularly enjoyed the following line:

... in which we used findability concepts and [Search Engine Optimisation] statistics to alleviate an unhealthy fixation on the home page, raising awareness of the need to design findable documents for direct access via the Google, MSN, and Yahoo! search engines.

This echoes a common concern we have here at Synop, that finding useful/relevant information is much more important than almost any other aspect of your web site's design and implementation. That's why in Sytadel we ensure that all URLs encode any parameters using slash (/) notation, not in cgi (?) notation, thereby allowing search engines to discover documents easily. That's why our consulting methodology on information architecture is concerned with usage patterns - how does a user find the information they need / what's their pathway through the web site. That's why we use, distribute and recommend Panoptic, a highly effective enterprise search engine designed to get you the right documents early in your search results.
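To make the slash-versus-query-string point concrete, here is a minimal sketch that folds CGI-style parameters into the path. The function name to_slash_url and the example.com URL are hypothetical, and this is not how Sytadel itself builds its URLs; it only illustrates the idea.

```python
from urllib.parse import parse_qsl, quote, urlsplit, urlunsplit


def to_slash_url(url: str) -> str:
    """Fold query-string parameters into the path as /name/value pairs."""
    parts = urlsplit(url)
    # keep_blank_values so that a parameter such as "?draft=" is not dropped
    params = parse_qsl(parts.query, keep_blank_values=True)
    path = parts.path.rstrip("/")
    for name, value in params:
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")
    # Rebuild the URL with an empty query component
    return urlunsplit((parts.scheme, parts.netloc, path, "", parts.fragment))


print(to_slash_url("http://example.com/articles?topic=search&year=2004"))
```

A crawler that is wary of following URLs containing "?" would then see an ordinary-looking path.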

Of course, a fixation with the home page may reflect an organisation's discovery in their web site statistics that by mandating the intranet home page as the starting page for their standard operating environment, people hit the home page an awful lot. Sadly, it doesn't mean they necessarily read or take in anything on this page, they just haven't learned how to reset it to something more useful.

In fact, I would argue that the effectiveness of home pages should be measured in a similar manner to the way Google and other search engines no doubt measure theirs - how fast do people leave this page having found the information they are looking for. In other words, the less time spent on it the better.

Email overload and task management

Jack Vinson points to some interesting comments by Dennis Kennedy about email management (Email Management - Eating My Own Dog Food) and adds his own in More on email management:

He talks about what you do when you get beyond the "empty your inbox" idea. There are still all those articles you need to handle (those "do later" items). Dennis uses Outlook 2003's follow-up flags, but he has discovered that when he gets too many follow-ups, e-mail paralysis sets in again. This is where he (and I) moves away from email management to basic priority management. Operate intentionally: how frequently do I want to read and respond to email today? How much time do I have to devote to those thoughtful responses? And then stick to my plan!

And the tools I use need to support the way I want to operate. Dennis pines for "touch it once" systems, rather than "handle it later" systems.

Having recently (the last couple of months) discovered/enacted the power of the "empty your inbox" idea, it was interesting to read about Dennis's flagging approach. Since we're still in Outlook 2003 land, I thought I'd mention my own current approach.

Any email that arrives for which there is an associated task (even if that task is "reply to email"), which I can't process immediately while emptying my inbox that minute|hour|day, I file into an appropriate folder somewhere. Then I add a task item in my Outlook task lists, and date it. If I need to refer back to the email, and I suspect I'll forget which of the several possible folders I've filed it into, I add a reminder in the task item where to find the email item.

Of course, what this does is change an email overload problem into a task overload problem. But at least that recognises the nature of the problem - it's not that we've got too little time (or too many emails), it's that we've got too much to do.

The first nice thing about this is that each day I can look in my task list, and see exactly what things must be done today. Hopefully I get to do them. Things that slip behind, get highlighted in red. Things that are not quite essential yet are in a soothing muted black. I get some small satisfaction by seeing a number of crossed out gray tasks completed each day.

Periodically, I review the red tasks of things which have slipped behind. The moment of truth emerges - were they really as essential as I thought they were? Perhaps not. Then I either re-date them to when I think they should be carried out by, or delete them altogether if I decide that in fact they're never going to get done because they just aren't that important.

The second nice thing about this is that it breaks the email-as-to-do-list metaphor that I've operated with for years. Received email is communication, much of it of relatively low value, but some of it very important. It's not a set of tasks. The act of communicating something to someone (perhaps by composing and sending an email, even as a reply to someone else's) is an activity/task - and may be pleasurable or just something to be done.

I'd love to have a "touch it once" system too, but I suspect it would only ever exist if I had a team of highly capable, trustworthy and independent executive assistants to whom I could instantly delegate tasks, together with a lot of spare time and not much to do. As it is, there are things which must be left behind for now, while there are others which it turns out will be left behind for ever.

Now, if only I could also break the email-as-surrogate-archival-file-system metaphor as well ...

Search and connected structured data

The ever interesting Jon Udell writes some more about WinFS and its infrastructure. I mention it really just to draw attention to the support Jon provides in passing for my earlier post about whether search needs metadata (schemas).

Admittedly my "finding versus organizing" distinction was a bit of a cheat, since finding depends sensitively on prior organization. Except when it doesn't: brute-force free-text search routinely trumps navigation and structured search. But OK, we've all got to hope that better organization, someday, will level the playing field. [emphasis added]

In the very next paragraph, Jon goes on to describe some of the similarities between RDF and WinFS and the mechanisms they use to relate different content together.

Today's personal information systems are organized hierarchically. WinFS proposes that they be organized semantically. A number of observers have noted a family resemblance between RDF (Resource Description Framework) "triples" and WinFS relationships. An RDF triple, in geek-speak, is a subject-predicate-object relation. Sets of RDF triples can be (and Semantic Web people say must be) used to represent and organize knowledge.

Sytadel incorporates a very similar mechanism which we call a Relation - just another content type, which happens to implement a useful subset of the XLink standard. We use relations everywhere in Sytadel, to semantically connect two content items together. Relations are typed, which provides the meaning behind the connection.

For example, we have topic hierarchies in Sytadel. Topics are connected in a vertical tree by the Parent topic relation. When a topic doesn't have a Parent topic, it's a root level topic in the tree. When no other topic names it as a Parent topic, it's a leaf topic. (As usual in computing, trees are visualised upside down, so that the "root" is in the sky, and the "leaves" are at the bottom.) Of course, Sytadel can have multiple root topics. Parent topic typed relations only exist with a topic at either end.

Another kind of relation is the one with type Related topic. Just about any kind of content item (not just topics) can have a Related topic relation. This means you can connect topics horizontally across the hierarchy, not just up and down, providing more interesting navigation and discovery opportunities. And while you can only have one Parent topic, you can have as many Related topics as you like. Thus an article or a press release might appear related to many different topics in the hierarchy. Sytadel has many other kinds of typed relations in use as well.
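The two relation types above can be sketched as a small list of typed, subject-predicate-object style records. Everything here (the Relation tuple, its field names, the topic names and the helper functions) is hypothetical and chosen for illustration; it is not Sytadel's actual data model.

```python
from collections import namedtuple

# A relation connects two content items and carries a type.
# "source --Parent topic--> target" means target is the parent of source.
Relation = namedtuple("Relation", ["source", "rel_type", "target"])

topics = {"Search", "Web search", "Enterprise search",
          "Information management", "Metadata"}

relations = [
    Relation("Web search", "Parent topic", "Search"),
    Relation("Enterprise search", "Parent topic", "Search"),
    Relation("Metadata", "Parent topic", "Information management"),
    Relation("Enterprise search", "Related topic", "Metadata"),
    Relation("Press release 42", "Related topic", "Web search"),
    Relation("Press release 42", "Related topic", "Metadata"),
]


def parent_of(topic):
    """Return the single parent of a topic, or None for a root topic."""
    for r in relations:
        if r.rel_type == "Parent topic" and r.source == topic:
            return r.target
    return None


def root_topics():
    """Topics with no Parent topic relation of their own."""
    return sorted(t for t in topics if parent_of(t) is None)


def leaf_topics():
    """Topics that no other topic names as its parent."""
    parents = {r.target for r in relations if r.rel_type == "Parent topic"}
    return sorted(topics - parents)


def related_topics(item):
    """Any content item, not just a topic, may have many Related topics."""
    return sorted(r.target for r in relations
                  if r.rel_type == "Related topic" and r.source == item)


print(root_topics(), leaf_topics(), related_topics("Press release 42"))
```

Note that the press release is not a topic itself, yet it hangs off two different branches of the hierarchy through its Related topic relations.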

These typed relations provide a much richer set of data to mine for creating and discovering associative meaning as Udell points out. However, elegant and intuitive interfaces for searching these relations remain difficult to design and implement. Hopefully the resources of Microsoft may assist with that.

My suspicion is that generic ad hoc search interfaces over connected structured data will remain a pipe dream. No one I know voluntarily goes and types in SQL statements to interrogate their relational database. In controlled environments (by which I mean, content whose types and relationships are understood by the supporting system, such as Sytadel), interfaces to search specific kinds of relationships within this structured data may well provide valuable new ways to find information.

In the meantime, we'll continue to resort to brutally effective free text search. It's interesting to note that most effective Web search engines now use algorithms which incorporate the extraction of additional information (such as anchor text, URL text, surrounding paragraph text) about the hyperlink connections between items. This is a form of untyped semantic meaning.
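As a rough sketch of what "extracting anchor text" involves, the snippet below gathers the link text that other pages use when pointing at a URL, so that the target can be described by those words. The class name, the helper and the example.org pages are all made up for illustration; real engines do considerably more than this.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the visible text of each <a href=...>, keyed by target URL."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            if text:
                self.anchors[self._href].append(text)
            self._href = None


pages = [
    '<p>Fill in the <a href="http://example.org/leave-form">leave form</a> first.</p>',
    '<p><a href="http://example.org/leave-form">apply for annual leave</a> or go '
    '<a href="http://example.org/home">home</a></p>',
]

collector = AnchorCollector()
for html in pages:
    collector.feed(html)


def anchor_document(url):
    """All anchor text pointing at url, joined into one searchable string."""
    return " ".join(collector.anchors[url])


print(anchor_document("http://example.org/leave-form"))
```

The target page never has to contain the words "annual leave" itself; the pages linking to it supply them.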

Configuring the weights of a result ranking algorithm - you can't turn a VW Beetle into a Porsche Boxster

One of the requirements I often come across in RFTs which talk about search concerns the ability of the customer to "configure search engine weightings and other settings".

 

This comes up frequently enough that I know people believe it to be important. However, I have always felt this to be blatantly misguided.

 

Some years ago, I worked in a research group which, among other things, organised the Web track at TREC. Highly trained information retrieval researchers spend years designing, testing, and evaluating search engine ranking algorithms. Conducting tests to see whether the changes you make to algorithms actually improve them is difficult, and relies on standardised test collections, including queries and the known relevant answers to these queries.

 

I suspect the requirement arises from a fundamental dissatisfaction with the performance (in particular, the result ranking algorithm) of poor quality search engines.

 

People appear to believe that by being able to "fiddle with the knobs" they will somehow be able to solve the problem. The sad reality is that this is not the case - a poor quality ranking algorithm doesn't improve by such fiddling; it requires replacement by a proven higher quality ranking algorithm, in a high quality search engine. It's analogous to thinking that maybe if I could have fiddled with the engine settings on my old Volkswagen Beetle, I could have made it perform like a brand new Porsche Boxster.

 

Sometimes of course, the problem also arises because of poor quality data. If you don't have good content, with the right information in it, no search engine is going to be able to find it.

 

Another problem is that there is a vanishingly small number of people with the appropriate expertise to meddle with ranking algorithms, and this usually requires deep understanding of the algorithms being used, and the effects of modifying any such settings. A not insignificant percentage of these people now work at Google Labs, who assiduously went around vacuuming up many good information retrieval researchers from 1999 onwards. Others work at Microsoft Research, and for Yahoo, among others. Universities and research institutes retain still more.

 

That said, what I suspect is really the business requirement is that there are particular searches (e.g. "Panoptic search engine") for which there is a known answer (e.g. http://www.panopticsearch.com), and customers would like to be able to instruct the search engine to return at least this answer as a highly visible result, regardless of what else is returned. If that's the case, then that should be what people ask for. After all, we don't usually buy a car so that we may practice becoming motor mechanics.

Does search need metadata (schemas)?

I recently read through a case study that James Robertson points to on metadata based search and browsing functionality. I'm not going to talk about the browsing aspect of that study. I am going to talk about the metadata-based search facility however.

First up, let me state my opinion: I do not believe that metadata is a solution / aid / requirement for good search. Either inside an enterprise or on the Web.

Why so? After all, governments around the world have recommended or mandated the use of metadata with the stated intention of improving discoverability.

A short digression about precision

When I talk about search, what I often really mean is precision - that is, finding relevant results within some number of resources that are retrieved.

In the information retrieval community (as exemplified by the people who organise and participate in TREC), precision is often measured with respect to a cutoff value such as 10 or 50. For example, precision @ 10 is a measure of how many results are relevant for your search in the first 10 results returned.

One of the reasons Google was so successful (there are several), was that they did/do a great job of returning high precision @ 1! This was their "I'm feeling lucky" search. In other words, very often the first result was a relevant result. If you want to see how precision works in practice, try a search on your favourite search engine, and count how many of the results are relevant to you in the top 10 that are returned.

Average precision is a metric used to assess how well a search engine performs over a number of queries, for a particular precision level. So if a search engine on average obtains 4 out of 10 documents relevant, then its average precision will be 0.4.
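The arithmetic is simple enough to sketch. The function names and the document identifiers below are invented for illustration, and the code follows the simplified description given here (precision at a cutoff, averaged over queries); TREC's formal average precision measure is computed a little differently, by averaging the precision at the rank of each relevant document.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the first k results that are judged relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, k=10):
    """Average of precision@k over several (ranking, relevant set) pairs."""
    scores = [precision_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)


ranking = ["d%d" % i for i in range(1, 11)]   # d1 .. d10, in rank order

runs = [
    (ranking, {"d1", "d2", "d4", "d5", "d7", "d9"}),  # 6 of 10 relevant
    (ranking, {"d3", "d8"}),                          # 2 of 10 relevant
]

print(precision_at_k(ranking, {"d1"}, k=1))   # the "I'm feeling lucky" case
print(mean_precision_at_k(runs, k=10))
```

With one query scoring 6 out of 10 and another 2 out of 10, the engine averages 4 relevant results in 10, which is the 0.4 figure used in the example above.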

Why does all this stuff from information retrieval research matter?

Well, firstly, studies have shown that two humans agree about whether a document is relevant or not only 80% of the time. Therefore, it is understood that if humans agree only 80% of the time (at best), then search engines will only be able to achieve the same (at best). Average precision is thus absolutely limited to 0.8.

Secondly, in practice, for common kinds of general information queries, search engines typically have difficulty achieving much beyond 0.4 in their average precision, and for difficult queries, often as little as 0.2.

So what can be done to try and improve this?

Back to metadata

Several years ago, people decided that having metadata - information about a resource - would be really helpful for search engines trying to improve their ranking algorithms.

And it turns out that some kinds of metadata (in the loose sense of the term) are useful. Google for instance uses information about the links between documents to help score their relevance.

This kind of metadata however, is not the kind of metadata referred to in the case study. The latter kind is that captured in metadata schemas, such as AGLS, Dublin Core, or even just plain old Netscape metatags. It's usually embedded within a document (at least when that document is delivered through a web server). This is the type of metadata that is often recommended should be added to documents in an enterprise setting to help discoverability, and it's this kind of metadata that I'm going to refer to throughout the rest of this post.

Due to the presence of people who would like to influence their document rankings on search engines for particular queries, metadata embedded in a document is completely ignored by search engines on the Web.

Why? Because the metadata would contain all sorts of query terms that were just not relevant at all to the document itself. So for example, a porn site might include metadata subject terms about farming, to try and make their site appear serendipitously when people were searching for farming information.

But search engine designers got smarter and decided to just ignore all embedded textual metadata for the purpose of ranking. (A side note, Andrei Broder, once chief researcher for Alta Vista and now at IBM Almaden, described the tension between the interests of web site publishers and of search engine companies, as a constant war being waged against the spammers.)

And guess what, we've got fantastic search engines on the Web that can search 3 billion documents in a couple of seconds and give you a really good set of results. All without metadata at all.

Enterprise search and metadata

So how come a much simpler task, such as searching a few thousand documents, in an enterprise environment where there is no spamming going on, suddenly requires metadata to help out the search?

Answer: it doesn't.

What it does need is a good search engine (e.g. Panoptic), that takes advantage of all the other contextual information that exists in documents and web sites to help produce good results. There are a whole lot of very clever algorithms used in modern search engines to improve ranking of results. These algorithms use not just probabilistic scoring, but lots of other information about the hyperlinked documents from web sites.

In an enterprise setting, it may be that certain queries have known answers. For example, a search for leave form in an organisation with only one leave form, should return as its first result the link to the leave form. Enterprise search engines often provide a facility for mapping particular queries to known results to address this issue. Their ranking algorithms may do a good job, but identifying common queries which have known answers can help people searching for this information. However, this doesn't rely on metadata being present, other than as directed search engine mappings.
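A query-to-known-answer mapping of this kind can be sketched in a few lines. The table contents, the intranet URL and the function names are hypothetical, and this is not how Panoptic or any particular product implements the feature; it only shows the shape of the idea.

```python
def normalise(query: str) -> str:
    """Lower-case and collapse whitespace so trivially different queries match."""
    return " ".join(query.lower().split())


# Hypothetical table of queries that have a single known answer.
KNOWN_ANSWERS = {
    "leave form": "http://intranet.example.com/hr/leave-form",
}


def search_with_known_answers(query, ranked_results):
    """Put the known answer (if any) first, then the engine's own ranking."""
    best = KNOWN_ANSWERS.get(normalise(query))
    if best is None:
        return list(ranked_results)
    # Avoid listing the known answer twice if the engine also found it
    return [best] + [url for url in ranked_results if url != best]


engine_results = [
    "http://intranet.example.com/news/leave-policy-update",
    "http://intranet.example.com/hr/leave-form",
]
print(search_with_known_answers("Leave  Form", engine_results))
```

The ranking algorithm itself is untouched; the mapping simply guarantees that the one form everyone wants is the first thing they see.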

Even subject-specific thesauri and/or controlled vocabularies - common elements of metadata schemas - do not need to be applied to documents themselves. Again, enterprise search engines can use such thesauri either by expanding queries (e.g. you search for vitamin C and the search engine automatically puts ascorbic acid into the query for you), or by returning the results for vitamin C and asking if you would also like to search for ascorbic acid. (After all, you might have been looking for all documents that had vitamin C in them, so that you could go and change them to say ascorbic acid instead.)
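Both behaviours, silent expansion and "did you also mean", can be sketched with a thesaurus held at query time rather than stamped onto documents. The thesaurus entries and function names below are illustrative assumptions only.

```python
# Hypothetical thesaurus: preferred term -> equivalent terms
THESAURUS = {
    "vitamin c": ["ascorbic acid"],
    "leave form": ["leave application"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the original query followed by any thesaurus equivalents."""
    key = " ".join(query.lower().split())
    return [query] + list(thesaurus.get(key, []))


def suggest_alternatives(query, thesaurus=THESAURUS):
    """Return only the equivalents, so an interface can offer them as a follow-up search."""
    return expand_query(query, thesaurus)[1:]


print(expand_query("Vitamin C"))          # terms to search for automatically
print(suggest_alternatives("Vitamin C"))  # terms to offer instead
```

Either way the documents themselves are left alone, which also serves the person who really did want only the literal term.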

Where now, metadata?

Don't get me wrong, I do think there is a role for metadata. It's great for record-keeping purposes. Say I want to find all articles authored by a certain person, or created in a particular year. In these circumstances, accurate metadata is essential.

Such search activities are measured in the information retrieval community by the recall metric (which counts how many relevant documents - of the entire set of relevant documents - have been retrieved when some number of documents overall have been retrieved).
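For comparison with precision, recall can be sketched as follows. The function name and the document identifiers are made up for the example.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the first k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when no document is relevant")
    found = set(ranked_ids[:k]) & relevant
    return len(found) / len(relevant)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "d", "z"}   # "z" is relevant but was never retrieved

print(recall_at_k(ranked, relevant, k=2))
print(recall_at_k(ranked, relevant, k=4))
```

Recall can only rise as more documents are retrieved, which is why it suits record-keeping questions where every matching item matters.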

In a modern context however, assuming the existence of a content management system such as Sytadel, this activity is much better left to the CMS: firstly in assigning the metadata accurately and secondly in carrying out the retrieval activity (which is typically just a straightforward database query).
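The "straightforward database query" might look something like the sketch below, using an in-memory SQLite table. The table layout, author names and years are invented for illustration and say nothing about how Sytadel stores its content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (title TEXT, author TEXT, created_year INTEGER)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("Does search need metadata?", "A. Author", 2004),
        ("Project conversations", "A. Author", 2005),
        ("Release notes", "B. Writer", 2004),
    ],
)


def articles_by(author, year=None):
    """Exact-match lookup on metadata fields the CMS assigned itself."""
    sql = "SELECT title FROM articles WHERE author = ?"
    args = [author]
    if year is not None:
        sql += " AND created_year = ?"
        args.append(year)
    sql += " ORDER BY title"
    return [row[0] for row in conn.execute(sql, args)]


print(articles_by("A. Author"))
print(articles_by("A. Author", 2004))
```

Because the CMS filled in author and year when the item was created, the answer is complete and exact, with no ranking involved.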

But in terms of improving your users' ability to search for documents in the way people expect to search these days (that is, by issuing two or three query terms to the search engine and getting back a relevant set of results in a couple of seconds), metadata is completely irrelevant (excuse the pun).


For more reading on this issue, read Cory Doctorow's delightful article Metacrap - Putting the torch to seven straw men of the meta-utopia.

For more reading on the fundamentals of search, see Tim Bray's series On Search. (Tim takes the broad view of metadata, not the narrow schema-based view I discuss here. I also think he ascribes too much weight to Google's PageRank value as a significant component of Google's result ranking algorithm, but that's another story.)

I'm indebted to David Hawking for discussions over several years on the subject of metadata and search. I'm also looking forward to an upcoming study from him and Justin Zobel that he mentioned to me yesterday, which sets about objectively measuring the effectiveness of metadata-based search versus non-metadata based search in an enterprise setting with extensive metadata.

Relatively ancient knowledge

At lunch today, catching up with some people from the Panoptic team, I found out that someone from the information retrieval community I've not seen in years is in town tomorrow. I'm looking forward to speaking with him, and hearing what's been going on in his life and research. Had I not gone to lunch, I doubt I'd have ever known. Once again, a serendipitous discussion, not mediated by any form of technology, will allow me to expand my horizons.

At Synop, we're currently trying to expand our horizons a little too. Expanding a horizon is a peculiar concept - I assume it is related to the practice at sea of climbing up a mast to see a greater distance than you can from sea level. I wonder if the phrase is in use in any other language but English.

What we've been doing is looking beyond our historical strengths in content management systems (their construction, deployment and use) to more generalised consulting and tools to assist knowledge workers. In the modern setting, climbing up the mast means taking time out from the daily list of tasks which must be completed to do some research (on the Web).

Some of the questions we've had to ask ourselves are:

  • Will organisations pay for their knowledge workers to use more productive tools?
  • How can you sell fuzzy concepts of improving knowledge work(er) productivity?
  • What would you do to even measure the benefits for an organisation in using such tools?
  • What organisations employ knowledge workers? How many in each organisation?
  • How much of their work is about the gathering and consumption of information, how much about publishing their insights, and how much is about the creation, synthesis and processing of new ideas and conglomerations of existing ideas in new ways but for internal (as in, inside a person's head) use only?

So given these questions all revolve around knowledge workers (the users of our tools and services) it seemed best to understand first who/what a knowledge worker is/does. Using my new favourite search engine (interface) - A9 - I went looking for "knowledge worker". In what would be considered relatively ancient knowledge (but is a mere decade old), I came across Peter F. Drucker's 1994 Godkin Lecture on Knowledge Work and Knowledge Society - The Social Transformations of this Century. Drucker, who coined the term "knowledge worker", writes lucidly and knowledgeably about the shifts within the nature of work over the last 100 years, and the implications for the future. That it's ten years on since he gave this talk hasn't changed any of the fundamental remarks he made.

Drucker's position is that a knowledge worker (an educated person) is someone who has learned how to learn and goes on learning throughout their lifetime, either in or out of formal education systems.

What does that mean for society? Sometimes it's best just to quote other people's words, rather than trying to reformulate them - this is one such time:

In the knowledge society, knowledge basically exists only in application.

...

The central work force in the knowledge society will, therefore, consist of highly specialized people. In fact, it is a mistake to speak of generalists. What we mean by that term, increasingly, will be people who have learned how to acquire additional specialties, and especially to acquire rapidly the specialized knowledge needed for them to move from one kind of work and job to another. But generalists in the sense in which we used to talk of them are becoming dilettantes rather than educated people.

...

That the knowledge in the knowledge society has to be highly specialized to be productive implies two new requirements: 1. knowledge workers work in teams; and 2. knowledge workers have to have access to an organization which, in most cases, means that knowledge workers have to be employees of an organization.

What does this mean for us?

First, knowledge workers are going to work in teams, and they are going to work inside or affiliated with organisations. Most of their organisation's work will be for other organisations. We should try to sell to teams, not to organisations, because organisations don't change rapidly, but teams change all the time and generate lots of information and knowledge very quickly, with a consequent need for the team's knowledge workers to stay in the loop with it all.

Second, generalists (in Drucker's new characterisation of them) are the people who have the most to gain from tools which make their assimilation of information and knowledge more productive. Specialists (who are happy to stay in one speciality) have much less use for these tools, as they generally know most of what they need to know already, and are highly connected in professional networks or communities of practice to keep themselves informed about important changes.

Do the tools we provide matter very much then?

... in the knowledge society the employees, that is knowledge workers, again own the tools of production. …
Increasingly, the true investment in the knowledge society is not in machines and tools. It is in the knowledge of the knowledge worker. Without it, the machines, no matter how advanced and sophisticated, are unproductive.

...

In the knowledge society the most probable assumption and certainly the assumption on which all organizations have to conduct their affairs is that they need the knowledge worker far more than the knowledge worker needs them.

So in other words, no and yes. Any tool we may provide is itself meaningless without a knowledge worker sitting behind it, absorbing information and creating their own. The knowledge worker might be equally happy with a different set of tools. An organisation, however, would dearly like to make sure its knowledge workers can be as productive as possible while working, even if it only benefits while it can keep its knowledge workers (happy). Thus we need to help teams make the case to their organisation that improving their productivity is valuable.

How can teams accomplish this?

One final conclusion: Because the knowledge society perforce has to be a society of organizations, its central and distinctive organ is management.

… the essence of management is not technique or procedure. The essence of management is to make knowledges productive. Management, in other words, is a social function. And, in its practice, management is truly a liberal art.

Which brings it all back to the people once again. From a management point of view, look after them, reward them, encourage them, make the most of the opportunities they present and try to help them help you. And from the team point of view, it's exactly the same. Even management are people. 

Tool swapping and impedance mismatch

Dave Pollard asks some good questions about why we haven't developed good work-arounds for blogging limitations. And why switching from different textual or electronic communication tools to meeting in person can be so awkward.

Many years ago in 1992, I was heading to the USA to present my first paper at a workshop. I was pretty excited, both about the paper and also because it was the first time I was visiting the USA. Since Australia is a long way away from pretty much anywhere, it seemed like a good idea to fit in some visits to people along the way. One of these was David Tarditi, then working at CMU, doing a bunch of great work with Standard ML, which I was interested in. In those days, overseas phone calls were expensive from Australia, and only academics had access to them from their office. So I organised all the logistics of meeting up with David via email.

When I arrived at the airport at Pittsburgh, and got off the plane, David and I somehow recognised each other, even though he wasn't holding a sign saying “Peter Bailey” or anything. We must have gone through some interesting mental calculations - male, early 20s, searching for someone, academic minded. I don't remember exactly what we said at first, but it was indeed slightly awkward - David probably asked me how my flight was, I probably thanked him for coming to pick me up from the airport. I do recall commenting later on how amazing it was to be able to organise all this without ever using the phone - email was not ubiquitous in 1992, hard as that is to remember. By the end of my visit, however, any awkwardness between us was long gone, as we'd got to know each other.

My take is that such initial awkwardness is due to the impedance mismatch between tools. Impedance mismatch refers to a situation from electrical engineering where two transmission lines or circuits are joined together and their different impedance or resistance can cause problems such as signals being reflected or lost to some degree. Computer science co-opted the term to refer to the difficulties that arise when two conceptual models, such as functional and object-oriented programming, are brought together and have to be connected.

Different textual communication tools, such as blogs, email, and IM, while sharing a common conversational medium (text), are quite different in their application. Blogs tend to lend themselves more to a reflective style (though they can be used for short conversation pieces, especially with comments), and have an unknown audience. Emails are more restricted - directed to a few individuals with subtle rules about who the primary reader is and who should also be in on the exchange. There is an expectation that the email will be read, but not exactly when. IM is used when immediacy is required - to other people who are online at the time, and lends itself to a very rapid exchange of comments. All of them thus sit on a gradient of immediacy, and typically also of quantity of text to be written/read.

For many years, I've typed faster than I could write. However, I can talk faster than I can type. What I can't do is talk as fluently as I can write. My conversation, like almost everyone's, is littered with ums and ers, with half finished sentences, side tracks, digressions. Frankly, it's a miracle that anyone can understand me at all!

As Steven Pinker discusses in The Language Instinct, the ability of humans to follow such conversations is remarkable, but essential, and probably genetically manufactured. And we are able to use all sorts of cues - tone of voice, hand or body gestures, immediate environmental context, past history, shared cultural understanding - to help filter out all of these barriers to understanding what the person is trying to say. Of course, this is why transcribed speeches are always cleaned up, as in a written form they would otherwise be almost unintelligible.

To my mind, the dramatic difference between how we communicate in person and how we communicate in text - the impedance mismatch - is why things are a little awkward when first we meet. We have to overcome this impedance mismatch - mapping our understanding of the person's textual conversation style onto their real world embodiment, with all the slips and hesitancies.

My work day is usually filled with a number of quite different activities. At Synop, we refer to the time lost in mentally swapping between these as swap time. It is much like a computer operating system having to swap programs in and out to give us the impression that they are all running at the same time, even though they are not. Swap time is significant, and thus expensive.

As we use different tools for communication, swap time between them becomes substantial, and again there is an impedance mismatch between every two of these tools - a different user interface, a different purpose to which we are putting them. Even if a new tool comes along which helps us do this swapping, it still becomes a new tool we have to integrate into our mental models - understanding how it works, what the techniques are for making it work. Most of us, I believe, are fundamentally lazy - we like to spend the least amount of time learning something, and then only to enable us to get the job done just a little bit better. Once we've invested time learning how to use one tool effectively (for example, I use Outlook for my email), we don't wish to spend an equal amount of time learning something just a little better. We don't want to invest lots of time learning something unless we really believe it's going to make a substantial difference - say a 100% improvement in our efficiency, or the ability to do something we've never been able to do before.

The challenge for tool builders like Synop is therefore to create tools which really are 100% more efficient or add the capability to do things we've not been able to do before at all. One of the things we hope to do with Sauce Reader is add the ability to blog directly while reading posts.

Postscript: Having written this entry, I added a category, and our current web based blogging system managed to lose the entire 45 minutes of writing into thin air. I'm looking forward to having my blog reading and writing integrated, preferably with something that autosaves anything I write every 5 or 10 minutes. Then I won't have to spend another 30 minutes re-writing the entire thing from scratch, as I did this time.

Capturing knowledge in a distributed organisation

I've just got back from a few days on holiday. Synop (being a busy organisation) thankfully hasn't stopped working just because I'm not here. As a company with offices distributed across the country, we rely on daily standups (borrowed from agile software development methodologies) to keep us all in the loop with what's going on. However, when we go away for a week at a time, a whole lot of stuff can happen. Today while catching up with what's been going on in the blogosphere, I caught up first with what Synop people have been writing about in their blogs. Most exciting of all to me was Nathan's news that FAQTs has been fixed at last. This is exciting news indeed, as it's been on our list of important but not urgent things to address for some time.

The reason I draw attention to it is that it serves as an excellent story (Dave Pollard's Principle 9 for Knowledge Management) to illustrate why knowledge capture is so important in a distributed organisation. I'll define a distributed organisation to be one where you don't either see or speak to everyone in the organisation every day. It may be that I would have found out about the change to FAQTs either serendipitously or by Nathan telling me about it. But I couldn't rely on it. Whereas, once he wrote it down, it became available for me to discover on my own, at a time convenient to me.

Providing people with an ability to record significant events in their daily worklife is invaluable. If Synop was an organisation 20 times larger than it currently is, we might need a small team of information editors to gather together interesting news for company-wide dissemination. (This is the role Richard describes as an A-list conduit.) In the meantime, we can rely on subscribing to individual feeds.

Recording the interesting events of course is not sufficient. You also need ways to archive and search the information so that other people can discover it in the future. Lastly you need ways of converting the short term contextual information which is important from a long term historical perspective into a distilled and structured format.

Memex and associating a distributed information repository

In Vannevar Bush’s prescient paper As We May Think, the foundations of modern information retrieval were laid. The memex as described by Bush is an intriguing combination of tools to aid humans in discovering, cataloguing and/or commenting upon, and associating information. Bush discusses the capacity to make associative leaps between information as one to which humans are supremely adapted, whereas our ability to memorise and mechanically sift information ourselves is relatively very poor. As a consequence, we’ve developed some amazing technologies for recording information and retrieving it.

Most, if not all, of the pieces of the technology puzzle which are needed to build an effective memex exist today. However, they are yet to be combined into a single seamless environment.

I have: a web browser for general information access; highly effective Internet search engines such as those from Google and Yahoo to find information from the Web at large; Microsoft’s OneNote for writing notes and cataloguing them; this weblog to record publicly my ideas and thoughts; and our own Sauce Reader for subscribing to chosen high value information feeds. But nothing exists to tie them all together and to provide highly effective search over both Internet resources and my own personal information.

(Sue Dumais’s work at Microsoft Research on Stuff I’ve Seen is a great example of where Microsoft is heading with this kind of concept, as are their much reported intentions of integrating search thoroughly into the Longhorn release of the Windows operating system, sometime in the coming years.)

These tools are essential as part of our armoury to prevent us from drowning under the information deluge. By not yet having them integrated, we impose appreciable cognitive overloads on our brains as we switch between different software packages, leaving us less productive than we could be. I am constantly staggered by how much information is available to me, almost instantaneously. I am also convinced that if only I could have less disruption between different information manipulation tasks, the task of associating information in useful new ways would become easier.

Bush envisaged the memex as being capable of storing, recording and retrieving vast quantities of information, all within a desktop environment. While the massive increase in storage capacities and processing power over the past decades has increased our capacity to carry out this vision, we have simultaneously accelerated the quantity of information being produced and recorded. A UC Berkeley report, How Much Information, estimates that written information has been growing at the rate of 36% a year in the three years since their previous study, and in 2002 was estimated at 1.6 petabytes (a petabyte is 1000 terabytes, and a terabyte is 1000 gigabytes). Even compressed, the data is approximately 0.3 petabytes. Right now, no one can afford the money or space to store this volume of information locally on their desktop. (It may be the case that this balance will change over time given ongoing improvements in technology, but ongoing increases in the amount of information may continue to prevent it being accomplished.)

But then, why would you, given the highly effective distributed nature of the Web? Building a memex to meet Bush's grand and almost 60-year-old vision will rely not just on building an integrated suite of highly effective tools for accessing, indexing and recording information in this vast distributed information repository, but also on building tools that enable people to create or augment their own associative content architecture over it.
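To put the 36% a year figure from the How Much Information report in perspective, here is a minimal sketch of the compounding arithmetic. It uses only the two numbers quoted above (1.6 petabytes in 2002, 36% annual growth). The function names, and the assumption that the rate stays constant, are mine and not the report's.

```python
import math

TERABYTES_PER_PETABYTE = 1000  # the decimal definition used in the text


def projected_size_pb(start_pb: float, annual_growth: float, years: int) -> float:
    """Size in petabytes after `years` of compound growth at a constant rate."""
    return start_pb * (1 + annual_growth) ** years


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for the volume to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    for years in (0, 3, 5, 10):
        size_pb = projected_size_pb(1.6, 0.36, years)
        size_tb = size_pb * TERABYTES_PER_PETABYTE
        print(f"after {years:2d} years: {size_pb:6.1f} PB ({size_tb:,.0f} TB)")
    print(f"doubling time: {doubling_time_years(0.36):.2f} years")
```

If the rate held, the volume would double in a little over two years, which is why improvements in local storage alone are unlikely to close the gap.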

Why distribution requires standardisation

I have recently been reflecting on distributed computing again. Or more precisely, what it is that should be distributed, and how we can build effective services with it.

Years ago, I worked as part of a team building a distributed Smalltalk implementation for OTI (part of IBM). The lesson from this work was that ad hoc distributed computation is hard -- very hard. The current GRID computing initiative is addressing this difficult problem in a generic fashion. I wish them luck. The difficulty lies, I believe, in the ad hoc nature of the systems which must be connected when carrying out generic computation.

Ad hoc distributed data is much easier to handle, so long as it’s text (often in some commonly readable format such as HTML). Google and other general purpose search engines are a great example of a service over distributed data. Perhaps the single biggest factor which makes it possible for them to work so comprehensively is that there is no additional complexity required from the individual computer systems which manage the data -- as they already have web server software serving up the pages.

As soon as we wish to build more complex services over distributed data, or the data itself gets more complex to generate, we need additional computing resources involved. Instant messaging is a great example -- the distributed data is people who wish to have ad hoc textual conversations in real time. To this end, people who wish to participate must install and operate a new bit of software – an instant messaging client such as Trillian or Yahoo Messenger.

Another example I was told about today is the Open Archives Initiative, supported by software from OCLC such as OAICat. Finding ways to efficiently harvest publication information from a wide variety of collection sources (especially large national libraries) means that you can’t just take the obvious approach of sucking down all the catalogue metadata, or you’ll be there for several days whenever three new publications are added. The OAICat software provides information about the catalogue on an incremental basis. For example, tell me about all the new publications in the last week.
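As a rough sketch of what such an incremental request looks like, assuming the repository speaks the OAI protocol for metadata harvesting (OAI-PMH): the `ListRecords` verb and the `from` and `metadataPrefix` arguments come from that protocol, while the base URL and the function name below are made up for illustration. The code only builds the request URL; it does not contact any server.

```python
from datetime import date
from urllib.parse import urlencode


def incremental_harvest_url(base_url: str, since: date,
                            metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request for records changed since `since`."""
    params = {
        "verb": "ListRecords",              # ask for full records
        "metadataPrefix": metadata_prefix,  # oai_dc = unqualified Dublin Core
        "from": since.isoformat(),          # lower bound, as YYYY-MM-DD
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical repository address, used only as a placeholder.
    print(incremental_harvest_url("https://example.org/oai", date(2004, 5, 1)))
```

The harvester remembers the date of its last visit and passes it as `from`, so three new publications cost three records rather than the whole catalogue.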

In Sytadel, we call this filtering, and it is used everywhere when deploying web sites, as it allows us to provide links to restricted sets of information. For example, creating a link to all the news releases published in a particular year.
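Purely as an illustration of the idea (this is not Sytadel's actual interface, and the record fields and function name are invented for the example), a filter of this kind boils down to a predicate applied to a collection:

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class NewsRelease:
    title: str
    published: date


def releases_in_year(releases: Iterable[NewsRelease], year: int) -> List[NewsRelease]:
    """Keep only the news releases published in the given calendar year."""
    return [release for release in releases if release.published.year == year]


if __name__ == "__main__":
    archive = [
        NewsRelease("Old announcement", date(2002, 11, 3)),
        NewsRelease("Product launch", date(2003, 6, 17)),
        NewsRelease("Year in review", date(2003, 12, 19)),
    ]
    for release in releases_in_year(archive, 2003):
        print(release.published, release.title)
```

A link on a web site then only needs to carry the filter's parameter (here, the year) rather than a list of every matching page.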

Both OAICat and Sytadel are examples of the more complex computing resources needed to make better use of rich data. You definitely do not wish to trawl through all the news releases ever published by an organisation. This approach is all about providing custom data query interfaces.

Building a central service which draws together distributed collections of rich data relies on knowing what the computational systems do and how to interact with their data query interfaces. Thus it will rely on standardised software running on each of the distributed nodes, or on the distributed nodes responding to standardised query interfaces. Even in the case of general purpose text search engines, the distributed nodes (web sites) all respond to a standard query interface (HTTP). And without standardisation, you can't build interesting services over the top.
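The argument can be sketched in a few lines. In this hypothetical example, every node agrees to answer the same `query` call, and that agreement is the only thing the central service needs to know about them. The class and function names are my own, and the in-memory nodes stand in for real remote systems.

```python
from typing import Iterable, List, Protocol


class QueryableNode(Protocol):
    """The standardised interface every distributed node agrees to support."""

    def query(self, term: str) -> List[str]:
        ...


class InMemoryNode:
    """A stand-in for a remote system; it holds a few documents locally."""

    def __init__(self, name: str, documents: List[str]) -> None:
        self.name = name
        self.documents = documents

    def query(self, term: str) -> List[str]:
        # Case-insensitive substring match, labelled with the node's name.
        return [f"{self.name}: {doc}" for doc in self.documents
                if term.lower() in doc.lower()]


def federated_query(nodes: Iterable[QueryableNode], term: str) -> List[str]:
    """Central service: ask every node the same question and merge the answers."""
    results: List[str] = []
    for node in nodes:
        results.extend(node.query(term))
    return results


if __name__ == "__main__":
    nodes = [
        InMemoryNode("library", ["Catalogue of new publications", "Opening hours"]),
        InMemoryNode("news", ["New publications announced", "Staff changes"]),
    ]
    print(federated_query(nodes, "publications"))
```

The central service never inspects how a node stores its data. If one node answered a different call, the service would need special-case code for it, which is exactly the ad hoc integration cost that standardisation avoids.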