WordOwlAdv. SearchAboutBlogCopyrightFAQPrivacyRandomStatistics
WordOwl: Kapow!

Archive for the ‘Technical’ Category

Technical - The riddles of the Sphinx index

Saturday, December 6th, 2008

Since July 2008, WordOwl has relied on a technology known as Sphinx in order to handle the searching of all the comics. It’s no small job, of course, but here we’re going to look at the riddle of the Sphinx and how it applies to WordOwl.

First up, I’m going to delve a little into the theory of searching text, and how Sphinx does it, before I get down and dirty with the details for WordOwl.

How do you search text?

It sounds so easy, doesn’t it, but actually it isn’t. Consider a test case, looking for the word “the” in the phrase “The quick brown fox jumped over the lazy dog.”

Now, you and I can see the phrase is present twice, but a computer doesn’t know that. All it knows is that it has two collections of characters, one three characters long, one 45 characters long. Other than that, it knows nothing.

Approach 1: compare letter by letter

Well, we have two strings of letters. To simplify matters we’ll assume lower case instead of mixed case.

Now, the first one begins with a ‘t’. So we start looking through the second string letter by letter until we find a ‘t’. Bing! We found one at the start. Is it followed by the next letter, an ‘h’? Yeah. Is it followed by an ‘e’? Yes! We found a match.

But are there any more? We repeat the process, and find a ‘t’, ‘h’ and ‘e’ later. So, two matches.

But we’re scanning through the string letter by letter. It’s slow and unwieldy, not to mention problematic. So far we haven’t mentioned what would happen if the word wasn’t ‘the’ but ‘these’. This approach would fail since it would match.

We can introduce the concept of a word boundary, a character that isn’t a letter, that it additionally has to hit - in this case it’s a space, but you’re checking each letter one at a time. For those programming types, try it. See how fast it is.

Approach 2: word-based matching a.k.a. text indexing

There is a better way. How about instead of going through and matching every letter, we match a word at a time. In order to do this we have to know what the different words are that we’re looking for.

So we’re going to look through the words we will be searching through and trying to work out what are words. We can give it a set of rules and have the computer do the work for us.

A word is:

a sequence of one or more consecutive letters, separated by one or more characters that are not letters.

Let’s look back at our test phrase: “The quick brown fox jumped over the lazy dog.”

Starting at the start: t is a letter, thus it’s part of a word. h is a letter and follows t, it’s part of the same word. e is a letter and follows t and h, it’s part of the same word. Then there is a space, we’re no longer in a word. Then we have a q, so we start a new word… and so on.

We can conclude then, that our phrase is the words “the”, “quick”, “brown”, “fox”, “jumped”, “over”, “the”, “lazy”, and “dog”.

Now comes the clever part. We give them numbers.

“the” is 1, “quick” is 2… “over” is 6… hang on, we’ve seen “the” before as it’s number 1… “lazy” is 7…

So now we have a word list of unique words and also know that ‘the’ occurs twice to the once of every other word. All we now do is store a list of what words are in our source phrase in what order.

Our phrase is words: 1 2 3 4 5 6 1 7 8

Now, when we want to find, say, ‘quick’, we can look through our big list of words and say that it is word number 2, so we need to find documents where word number 2 is present.

You could even have it as: word 1 is in document 1, positions 1 and 7. word 2 is in document 2, position 2. Then you know that if you have a word, you can very quickly find out which documents it is in.

This process is pretty much what Sphinx does. You give it a set of documents, it breaks them down according to a series of rules, and it makes a list of what words there are, where they are and so on.

Admittedly it doesn’t number them 1, 2, 3, but a special system. It turns the words it finds into special numbers, using a method known as CRC32, which turns it into a number somewhere between 0 and 4,296,000,000. Since it is then purely a number, there are many efficient computational methods for searching through it, such a binary search (start in the middle, if the number you want is lower, eliminate the top half. Look at the middle of what you have left, if the number is lower, eliminate the top half, repeat ad nauseum until finished)

To Sphinx, all of the transcriptions are just lists of numbers it can search through, and since it is numerical comparisons, it can be done by the processor extremely quickly (a few machine instructions each)

What does this all mean for WordOwl?

Right, glad you asked. Of course, all of the text gets mashed up into long sequences of numbers, but that doesn’t help me find the strip, not directly. It’s only part of the story, in fact.

Sphinx, as well as storing lists of words-mashed-into-numbers, stores other numbers alongside the wordlists. These numbers can represent whatever you want in your application.

Consider it like a single tab on a spreadsheet - the left most column is the row number, then several columns for storing useful numbers (called attributes), and lastly a column that holds my text (called a field)

I have plenty of attributes:

  • cid - comic id number. Bruno is comic 1, C&H is comic 2 and so on.
  • sid - strip id number.
  • gid - grouped id number, made up of both the comic and strip id’s and is unique at strip level. It’s worked out from 65536 times the strip id plus the comic id, which means I could in theory work back to find the details from it, but I don’t - they’re already stored anyway. I use this to establish distinction, since as we’ll see each strip can and will appear multiple times in the Sphinx index.
  • sdate - the strip’s date calculated using MySQL’s TO_DAYS() function, ie. the number of days since year 0. It gives a nice round number that I can use for ordering strips and other things.
  • type - the type of text in this row, whether it is speech (=1), thought (=2) and so on
  • speaker - if the type of text is related to a character, such as speech or thoughts, this will contain their character number, otherwise 0.
  • unsafe - whether the strip is marked NSFW (=1) or not (=0)
  • guest - whether the strip is a guest strip (=1) or normal (=0)
  • animated - whether the strip is animated (=1) or not (=0)
  • tag - a special column. tag contains the different tags that can be applied to a strip, represented numerically based on the drop-down order as seen on the advanced search page. (The first entry other than ‘all strips’ is 1 and so on.) This is a special column since it can contain multiple values at once, e.g. a strip might belong to tags 5 and 6 - the column effectively holds both 5 and 6 at once.

It might already be clear what is going on at this point, but as you can see, this contains all the different things - other than text itself - that can be searched on, and the comic and strip ids are present so that when the search is done, it gives back that information (plus all the others, but I don’t really worry about that)

What happens is that there is a script which runs through the database and pulls out all the text for a given strip, and prepares it for Sphinx so that I have one row for each person/text-type combination per strip.

For example, take the first strip from InkTank. I have two different characters speaking, Sophia and InkTank. But instead of having a separate row in the index for each line of dialogue, I can combine them into one row for all of Sophia’s dialogue, and one for all of InkTank’s dialogue.

The first row - Sophia’s - looks something like this:

  • cid - 12
  • gid - 65548
  • sid - 1
  • sdate - 733498
  • speaker - 1
  • type - 1
  • unsafe - 0
  • guest - 0
  • animated - 0
  • tag - 0
  • text: hey, are you drawing again ~ what are you working on ~ aww ~ just look at you ~ youre skinny in the future ~

What we’re saying is that this is: comic id 12 (InkTank), strip 1, unique id (65536*1+12), date of 733498 (1 Apr 2008), speaker 1 (Sophia, she’s the first person to speak in the first strip), text type 1 (spoken), not unsafe, not guest strip, not animated, no tags and then finally the text. (Tildes replace most non word characters)

There is a similar row for InkTank himself immediately afterwards, of course, and this applies to all of the dialogue in all the comics.

There would be two rows if a character both spoke and thought in the same strip - one for their thought and one for their speech.

Additionally there is another text type that deserves explanation: present. In order to be able to search and track characters who have not said anything but are otherwise present in a strip, there is a special marker for them - the rest of the attributes are as expected, but the text simply consists of a tilde, a placeholder. (A tilde is not searchable by the user; it is simply an internal placeholder and is treated as another word in the index.)

Having actually stored all the data like that - at the time of writing, 7,533 strips translates into 26,691 rows - we must now actually search it.

This presents two problems: filtering out what we don’t want (on the advanced page), and making sure we only get unique strips, or just the best individual strip (for the front page)

In both cases, Sphinx comes to the rescue - the first problem is neatly taken care of with Filters, and the second by GroupBy.

Simply put, Sphinx has the ability to restrict data and matches based on set criteria. I can specify (comic id = 12) as a filter rule, to limit to InkTank only, for example. Or I say (comic id = 12, speaker = 2) to limit to strips with InkTank present/speaking within the InkTank comic. Since these are separate to the text searches, they can be applied to return all values - it is possible to hit (comic id = 12, speaker = 2) and no search text, and see all the strips where InkTank is present or speaking, in one hit.

The second point, regarding uniqueness, brings us back to the point of the groupid id. In the advanced search, I want to find the best match amongst the strips, with the best hits in a single strip first. All I do is tell Sphinx to group results based on the grouped id, so that it can combine the multiple entries into a single block (subject to filters) for the purposes of ranking and scoring.

Similarly for the front page search, I need to group the strips by comic, to get the best strip in each comic, but at the same time I need it to group by unique comic strips - again, not a problem. I tell it to group on the comic id, and treat the grouped id as a distinct value; and instantly I can see the best strip in a group, and how many distinct strips that incorporates.

It’s all done by Sphinx - I’d be lying if I claimed any credit for that.

What is meant by the ‘best strip’?

This is probably the most complex question with regards to Sphinx yet. How does it establish the best scores, and for that matter, how does it match it? Does it look for all words? Some words?

Whow, let’s slow down a minute. As I’ve said, Sphinx is doing the work regarding searching. It’s also doing all of the work regarding ranking, but I still need to explain what it is actually doing.

First up, you can give it a combination of words, and it will attempt to match the best strip on having all the words you give it, followed by all but one, etc. Where you have multiple combinations of “all but one”, the order is generally based off the rarest words (i.e. fewest appearances) being ranked higher since they are rarer. The total number of documents is taken into account too - it’s a general form of the BM25 ranking method.

It also takes into account the longest common subsequence - i.e. if you give it four words, an instance of the four words consecutively in the right order will be considered higher than the same strip having the four words separated across it.

It is possible that some matches sometimes buck the trend; I recently found a phrase with four words where all four were only in one strip, but since I had the order wrong, a strip where three of them were in the right order got ranked very slightly (1 point) higher than the strip with all four words in separately, but this is not at all common.

Again, this is all handled internally within Sphinx, and behaves consistently when doing this. In the event of a tie-breaker, the order is to take earliest strip first, then lowest id first (e.g. two strips inside a comic that both have the word ‘the’ in will find themselves ranked earliest first). The logic for providing the id as well is that some comics have strips with duplicate dates in - e.g. the first 40 or so xkcd strips are all listed as having the same date. By giving it an order criteria that cannot ever result in a tie-breaker, it will always return the order consistently.

Anyway, I think that about wraps it up for Sphinx for now.

Technical - The Comic XML storage format

Friday, December 5th, 2008

Some people have asked me about the format I use for the comic storage. Now, this will give you a background in how it works, however this does not include the true subtleties of the format, as the format is always under review for new features or other changes.

However, I’m going to go through the things that are pretty well set in concrete and can’t be changed without revisiting several MB of files.

For the technical people: it’s a UTF-8 format, straightforward XML. It’s usually also given a .comic extension to mark it out from other files and also means my text editor knows how to highlight it specifically.

At the top level, there is a block called comic, which contains everything else. It also specifically stores the comic id information, such as its master ID, like ‘blankit’ or ‘angsttech’.

Within the comic block, two, three or four blocks will exist, usually three. They are:

  • comicmeta - stores meta information about the comic itself, such as if it has an RSS feed, or a forum.
  • strips - stores the strip data; by far the biggest block
  • aliases - stores mappings for shortname to longname (e.g. the strip display says ‘Bruno’, but the advanced search drop down states ‘Bruno Bunkleyutz’. This is stored here.)
  • tagmap - stores the master list of tags, their descriptions and what order they should be displayed in

Let’s look at those in turn.

<comicmeta> contains the following data:

  • <comicname> - the comic’s name (for use in dropdowns and so on)
  • <baseurl> - the front page of the comic (for the Copyright page)
  • <copyright> - who the strip is copyrighted to (not necessarily who created the individual strips, e.g. C&H is individually credited at strip level, but is blanket-copyrighted to Explosm.net)
  • <feedurl> - the URL of the primary RSS feed if there is one, optional. No distinction is made for RSS vs. Atom since we haven’t seen any (except WordPress-based ones) that use Atom only.
  • <forumurl> - URL of any discussion forum if there is one, optional.
  • <earliest> and <latest> - not every comic uses these, but for comics where the archive is only partly transcribed, the dates are provided so that the timeline can display the full range of dates.

It’s all pretty straightforward stuff, really.

<strips> contains the following data:

  • <url> - strip’s URL link
  • <title>, <caption>, <mouseover> - the additional metadata for a strip if these are present
  • <date> - the date the strip was published, optional.
  • <authors> - contains the names of everyone who created a given strip, individually in <author> blocks.
  • <merchandise> - contains any merchandise for this strip, which is blocked in <merchitem> blocks, one per item. Each <merchitem> consists of an <itemdesc> for description/link text, and <itemurl> for the page to link to.
  • <tags> - contains one or more <tag> links to bring this strip into a given group. Used by C&H and xkcd for occasional linked strips, used by Bruno, Melonpool and others to hold storylines. A tag is usually a short text, without punctuation, e.g. “mrblackhat”.
  • <panels> - contains the meat of the strip, in one or more <panel> blocks.

Right, I’ve ended that list to be able to explain <panel> properly.

Each strip is divided up into panels, explicitly defined by <panel>. This might be empty (denoted in the code as <panel /> and no ending </panel>) if the panel has no text at all. A panel can also be borderless (with <panel border=”no”>)

Having established the different panel blocks inside the <panels> block, each <panel> will contain 0 or more <text> blocks.

<text> all follow the same basic format: <text type=”what it is” speaker=”who said it”>what was said</text>

For example, if Mr Black Hat says something, it would look like:

<text type=”spoken” speaker=”Mr Black Hat”>Attention…</text>

If the type of text is ’spoken’, ‘thought’, or ‘present’ (for strips where a character is drawn but not saying anything), there needs to be a speaker attribute to confirm who is speaking/thinking/present. For other types, ‘caption’, ‘inset’, ’sound’, ‘typed’, ‘written’, no speaker is required (indeed, the parser will throw an error if one is provided)

That’s really all there is to the main block of the file, just that it can get very wordy.

<aliases> contains the following data:

  • <alias> blocks, each consisting of a <shortname> and a <longname> block, simply for mapping from shortname to longname on the advanced search screen.

Additionally there is an attribute on shortindex, called colourindex, used to indicate the colour that this character should be. (There are currently 18 set colours used in dialogue in WordOwl, all numbered)

Simply put:

<aliases>
  <alias>
    <shortname colourindex="8">Bruno</shortname>
    <longname>Bruno Bunkleyutz</longname>
  </alias>
</aliases>

There is one entry for every alias, and currently if a colour is to be assigned, the entry has to be present in full, even if no alias is being created. (E.g. if the character is Mr Black Hat, the entry to force it to grey does actually have Mr Black Hat as both shortname and longname.) This is something to be rectified in due course.

<tagmap> contains the following:

  • <tagitem> blocks, one per tag, with <tagid> and <tagname> blocks inside.

A tag is simply a short name I can use to refer to a strip, and a fuller name to be displayed for users. For example, the “Shatner Saga” strips on Melonpool all have the tag “shatnersaga”.

The order is also important; the tags are ordered in the same order they should be displayed in, on the advanced search page. If the tags are arbitrary strips (e.g. as in xkcd or C&H), the list should be kept alphabetically, while if they represent storylines, they should be in chronological order. There is also a flag to state whether a tag is for storylines (story=”yes”) or not (story=”no”, defaults to no normally)

Well, that’s how a data file works in theory. Here’s an extract from Blank It’s comic file (it’s the smallest), and demonstrates how most things work.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE comic SYSTEM "comics.dtd">
<comic id="blankit">
  <comicmeta>
    <comicname>Blank It</comicname>
    <baseurl>http://blankitcomics.com/</baseurl>
    <copyright>Aric McKeown and Lemuel 'Lemmo' Pew</copyright>
    <feedurl>http://feeds.feedburner.com/BlankIt</feedurl>
    <forumurl>http://blankitcomics.com/forums/</forumurl>
  </comicmeta>
  <strips speakers="named">
    <strip id="strip1">
      <url>http://blankitcomics.com/2008/06/08/blankit-0001/</url>
      <title>Well, what would YOU do?</title>
      <date>2008-06-08</date>
      <mouseover>Well, what would YOU do?</mouseover>
      <authors>
        <author>Aric McKeown</author>
        <author>Lemuel Pew</author>
      </authors>
      <panels>
        <panel>
          <text type="present" speaker="Aric"></text>
        </panel>
        <panel>
          <text type="spoken" speaker="Aric">Hello...?</text>
        </panel>
        <panel>
          <text type="spoken" speaker="Aric">Hmm... all alone...</text>
        </panel>
        <panel>
          <text type="present" speaker="Aric"></text>
        </panel>
      </panels>
    </strip>
  </strips>
  <aliases>
    <alias>
      <shortname colorindex="3">Aric</shortname>
      <longname>Aric</longname>
    </alias>
  </aliases>
</comic>

Anyway, that about wraps it up for our data format. If you have any questions, please do ask. (Oh, and I use two spaces to indent here, but the live files use 1 tab each level.

Technical - Generation 4: November 2008 to present

Friday, December 5th, 2008

Many minor changes have occurred since the July move to generation 3 (the move to Sphinx), such as removing the original storyline code and upgrading the tags system to deal with it. Tags, by the way, were introduced somewhere around generation 3 but I don’t remember precisely when I did implement it. Certainly it was not before the move to Sphinx.

But what about the fourth generation? The structure works great with Sphinx. What could I possibly want to change?

Actually I wanted to change as little as possible; it worked and it worked pretty well (except for needing a few tweaks to the tag system along the way). But I wanted to offer an XML-RPC interface.

I want to offer the ability to integrate my search facilities into the wider world, so that instead of site-hopping, you can keep everything in a single place.

And so toward the end of October I figured out what I was going to do, and into November I went and did it.

The changes are superficially different but monumentally so underneath. What’s happened is that the entire base for searching, for information retrieval, for input gathering etc has all been merged into four classes, WordOwlSearch, WordOwlData, WordOwlOutput and WordOwlInput_HTTP. (This last one was named as I might yet provide other API types, so there may yet be a WordOwlInput_SOAP or something, but I don’t know yet)

Anyway, the point is that these four structures between them do all the heavy lifting; dosearch.php and doadvsearch.php do little more than form the skeleton between the calls to these methods now.

That’s really good for me since it means that writing new interfaces such as the XML-RPC interface become a ton easier. I no longer have multiple copies of the same code, either, which I did before.

The reason input, output, searching and data are all separate is because not every module will need everything. WordOwlData manages the backend and all the render files, such that if I later move it or do it differently, I update WordOwlData’s methods and boom - every interface inherits the changes.

WordOwlSearch does what it says on the tin; but for modules that work with strips and data without searching (e.g. Share This Strip, the “What I want to find” page), they won’t need the search object, but will need the data.

It now looks something like this:

The fourth generation structure for WordOwl

The fourth generation structure for WordOwl

There are no plans at present to restructure WordOwl into a fifth major generation of code, however if the opportunity presents itself to either increase its speed or improve its function, it will be investigated.

Technical - Generation 3: July 2008 - November 2008

Friday, December 5th, 2008

After the hasty scramble in June to strip out MySQL use (it was done in 2 days total), I took some time to reflect on where things were going.

As June wore on I did more transcriptions, but was getting further and further fed up with Lucene. Having to write code to trap errors is not uncommon, but we’re not talking code-thrown exceptions from Lucene. It was throwing errors due to index corruption, and I was getting fed up with it, plus the memory usage.

Around the 12th of June I responded to an email on the Zend Framework general mailing list - someone was having problems with ZSL, and I suggested ways on keeping the memory use down (based off my own experience pruning the word list, this seemed to be the easiest method of keeping memory use down)

I began to have more and more doubts about ZSL as memory use tended to creep up to 50MB. Given that the server is a modest VPS, it doesn’t have the RAM capacity to keep handling 50MB allocations from PHP, not with any regularity.

So, after I’d posted, somone else posted a message referring me to Sphinx. It wasn’t a database, simply an indexing and searching engine. Even more so than ZSL, this was what I wanted.

I did a bit more reading, did a few calculations in my head about assigning numeric ids to everything since Sphinx can’t handle text data other than that it will be searching, and began trying it out.

Three things struck me about Sphinx, the first two sold me on it:

  • Memory footprint was about 2MB in total (compared to 50MB) and this would grow only as my database grew, but in a predictable and measurable way.
  • It’s Fast. ZSL is fast considering it is purely PHP. Sphinx is FAST. (Even now, with a database 5x bigger, searches regularly take less than 0.03 seconds to occur.)
  • Index size is comparable, perhaps slightly smaller than ZSL’s.

Here is an author that has thrown off the regular shackles of design and built something truly stupendous. I was sold after only a short read of the manual. And reimplementation wasn’t actually that difficult either.

And so, I began the task of ripping apart the search code from the ground up and rebuilding it. It shrank by a surprising margin since I didn’t have to do all the fluff associated with Lucene. It also meant I didn’t have 33 separate file includes that I got with Lucene, I simply included one master API file.

Oddly, though, the core structure remained largely unaffected. The only real difference is that instead of incrementally building the Sphinx index, the contents of MySQL are dumped each day into a bulk XML file that Sphinx reads. (On a technical note, in October, this was split into two files simply so I didn’t have to upload strips that don’t change every day or so - things like Ashfield Online, Lethal Doses et al that don’t change are ‘archived’, as opposed to ‘current’, but the principle holds true)

I let Sphinx do all the heavy lifting now. The core of the program simply gets all of the information and fits it into the boxes Sphinx asks for, which are really no different to the boxes Lucene asked for, and then it flies.

Although it was a major shot in the arm and ultimately means that I don’t have to worry about WordOwl scaling until it gets well into the tens of thousands of strips (when it might start to outgrow the VPS), there really isn’t a lot to say about this generation. Not a lot happened really. The replacement took about 3 weeks and involved me rewiring everything from Lucene mentality to Sphinx mentality - same base ideals, but the devil’s in the details.

That all important structure chart:

The third generation of WordOwl

The third generation of WordOwl

The only real difference is that instead of the data going from the parser to the index, the render process picks up the database in a single gulp from MySQL and dumps it to the index.

And this structure works extremely well. So well, in fact, that I only changed to the fourth generation in November because I wanted to start doing even more magical things.

Technical - Generation 2: June 2008

Friday, December 5th, 2008

At this point, WordOwl was motoring along quite nicely. I had three comics underway and transcriptions were flowing kinda nicely. And then it all started to go horribly wrong.

I had two problems, the first could be mitigated without a complete rewrite but the second could not as much.

Problem 1: Memory use

For those not familiar with Lucene in general, it’s pretty memory intensive. It loads the word list into memory and works on it, before working on the data.

Now a typical document might not have too many unique words but WordOwl sure did. A unique word, by Lucene’s definition, is a pattern of letters (and numbers) that was not repeated elsewhere.

Even with the dataset I had then, I was looking at 9,800 different unique words. There are many reasons for this, but the most common is intentional spelling variations, such as Bruno saying “ansheshtor” instead of “ancestor” because he was drunk.

Or the ever popular “AAAAARGH” type scream, with various numbers of letters.

The solution to this, and the major part of mitigating memory usage, was to reduce the dictionary size. Now, since I had a list of all the words in the dictionary, I could see precisely what words were there and what could be collapsed into equivalent words.

I could also solve a very large problem with this: the issue of English versus American English spellings. I just wrote the code to normalise to the American spelling.

Essentially it’s a block of search/replace commands, half using basic strings, half using regular expressions. The first set catches comic author spelling mistakes, such as ‘professionsal’, or ’sandwhich’ and the aforementioned Englishisms such as ‘colour’.

This is run on both sides of the search so that it remains consistent, but I was already despairing of the memory use; I was regularly seeing 40MB usage in larger queries, but that was actually not my pressing concern. Problem 2 was.

Problem 2: massive server load

Ironically, the problem here was not Lucene bottlenecking the process; MySQL was. I perhaps haven’t been clear about the situation.

At this point, strips were stored entirely in MySQL; Lucene spat out the strip and comic IDs that needed displaying and the search programs looked those strips up in MySQL and displayed them. Sounds harmless enough, but they were just looking at the raw data in MySQL.

Let me explain the information needed to render a single strip:

  • the comic’s name (for the link above the strip)
  • the strip’s date
  • the strip’s URL
  • any title, caption, mouseover or description
  • details of the panels (in case there is no text, we still have to know what panels are necessary)
  • character colours (if the character has their own set colour)
  • the text in each of the panels
  • details of any story lines
  • plus a new feature I introduced: merchandise (i.e. a strip has linked merchandise in a store somewhere)

Now, looking back at our structure we already had (Comic -< Strip -< Panel -< TextEntry) it’s clear we need to query each of those 4 tables, plus looking up the characters and storylines tables and now also the merchandise table.

Don’t forget we need to do this for every strip we display. The decision to stick to 10 to a page now suddenly becomes clear as if I needed reminding.

I’m not going to dwell on this too much; I only implemented it reasonably well the first time, hindsight shows I could have done it a variety of ways, but ultimately it ended up being 1 big query covering the 4 big tables with a link to the characters table, plus separate queries to storylines and merchandise separately.

We’re suddenly doing 3 queries per strip - 30 queries a page! - and unsurprisingly performance was… um… lacklustre.

Now, before I hear cries of ‘WordPress does more than that’ it does. However, consider what it is querying. It’s querying a relatively small number of rows - a few hundred in most cases, with strong indexes. As a result it’s possible to cache it reasonably well. Also if you look at the queries actually being made, the permutations of queries are pretty small.

But in my case the 30 queries are virtually random-access. There is no guarantee of the order queries will be performed.

So I had to do something drastic. And I did.

I completely rewrote the display system. This time, instead of calling on MySQL to get the data to display strips, it has each strip dumped to a single file. That file is loaded, parsed and the strip data extracted. Here’s the catch: the strip is ‘pre-rendered’, the file contains all the HTML needed to display the strip.

Every time a strip is updated, this file is reproduced and a simple synchronise with WinSCP makes sure the most recent are always on the server.

Consider the difference: with MySQL the odds were in the favour that the data was on disk and would need to be loaded, then output back to PHP, for PHP to then build the display. This time, it’s definitely on disk, but since we already know that, we can skip some of the operation. We’re now I/O bound, as opposed to being probably I/O bound, but we’re doing less I/O operations than before. And less processing, though at the cost of space.

But it is faster (benchmarked at up to 10% faster on my machine), and actually consumes less bandwidth than sending the XML files over, making updates each day rather than the originally considered weekly rota, much more practical

And it scales better - it’s now essentially limited by how fast the HD can locate and load the file only. Since we’re not doing disk-seeks within large files, we’re almost always doing the read on a single cluster - it’s 1 HD I/O operation. Plus, the operating system can actually cache the inode (the live server is a Linux server with ext2 or 3 file system, can’t remember which off-hand though) so it is faster.

But wait, I should point something out. With Generation 1, the process map was identical between my PC and the live server, both do the same operations. Now there is a shift.

My PC holds the XML files. It also holds the master MySQL database, still. But now that’s used to output the rendered strips. Since that is the master repository it knows what strips were there before and what details were there, so I’m not rebuilding 7,000 renders every day and uploading them, I’m only re-doing the changed/new ones and uploading those. Very efficient too.

It’s time for the handy-dandy process flow.

The structure for the second generation of WordOwl

The structure for the second generation of WordOwl

I had stabilised its growth in that respect; server load dropped off massively - MySQL was consuming 90% of server resources!

And although it wasn’t intentional, the reshaping of the process did make it more efficient to do smaller updates rather than big updates. At this time I entirely stopped running processes on the server, instead copying the Lucene index rather than trying to build it online.

As a useful by-product this meant that the index would be more consistently ‘good’ - Lucene did have a habit of corrupting the index during updates, but all of the problems with Lucene would lead to Generation 3…

Technical - Generation 1: February 2008 - June 2008

Friday, December 5th, 2008

The origins of WordOwl are probably not interesting, except from a technical perspective.

I wanted to be able to search comics, based on people speaking, as well as within storylines. That was, actually, the only design brief I gave myself, just the ability to do searching on all dialogue, and additionally by speaker and/or storyline.

Many of the decisions I made in this time are more than still present in the current workings.

Original design criteria for implementation:

  • Must be able to continue searching whilst adding new strips
  • New strips must be able to be transcribed offline, and added online later.
  • Must be able to store different kinds of text within a strip and be able to filter on different kinds of text.
  • Must be able to store the speaker against speech and search against them.
  • Must be able to store all storyline details and search against them.

Decision 1: How to store the data overall

Right up front I realised that if I was doing an offline/online thing I either had to somehow synchronise a regular relational database, or store the strips in a flat file type construction, probably one file for each webcomic.

I realised even earlier than that, that data was necessarily in four levels, with extra metadata separately:

Comic -< Strip -< Panel -< TextEntry

Essentially one comic has many strips (and each strip belongs to a single comic), one strip has many panels, one panel has many text instances etc.

There are also other tables, too. I had a table listing the story lines, and a table for characters. But they sit outside of the above structure, and are pulled in when necessary.

Decision 2: Storage nuts ‘n’ bolts

I need to be able to move the data around between machines and keep it in a single file per webcomic. Since most of it is text with metadata, an XML variant is most helpful, so that’s what I did.

I also need to centrally store it, though, to not dig through the files each time to display it, plus be able to aggregate it. I’m familiar with MySQL and that’s what the server already has so that’s a fairly straightforward argument.

Decision 3: FInd: see find

So, having solved the data storage issue, I needed to work out how I would actually do the searching. Back in February, I started doing a little research. I already knew that MySQL wasn’t up to the job with its full-text searching (FTS), so I was looking elsewhere. I had - unfortunately - learnt the lesson the hard way before of the pitfalls of MySQL FTS.

Around the time, Wikipedia had just stopped using MySQL’s FTS system and jumped to using Lucene. Lucene is part of the Apache project and is a solution for providing searching in a Java environment. Now, I’m not in the slightest familiar with Java and didn’t really want the hassle of setting up just a Java runtime environment on both my home PC and the live server, let alone the system resource hit of doing that - the server is a limited VPS solution, so I can’t afford to devote too many resources away from things already running - MySQL is already in use there, so that’s not a problem. Additionally I didn’t want to have to learn Java if I wanted to do any customisation of the keyword analyser.

The Wikipedia page pointed me to a native PHP implementation of Lucene, as part of Zend Framework. Now I figured that if it has the Zend name on it, it must be of at least a reasonable standard. I did a fair bit of reading on Zend_Search_Lucene and figured it was a viable solution.

And so, after requesting permission on the first comic, I sat down and began to design the structures involved, the XML format, the Lucene index, the MySQL data structures and everything else.

I also quickly coded up a tool to take a basic tabular structure in OpenOffice, and turn that into the XML blocks as used by the comic data files, but more on that later. I should briefly mention that the base tool first created on 19th February for Bruno is still in use and with only really minor customisations.

By late March I had a working system in place, and about 400 Bruno the Bandit strips indexed. I even had the first vestiges of advanced search in place. It wasn’t ideal, but it worked.

Around then I also got permission to add Cyanide & Happiness and carried on transcriptions.

I’ve rambled enough on how we got there. Here’s the structure of Generation 1:

The first architecture for WordOwl

The first architecture for WordOwl

Comic XML files get fed into the parser, which reviews it against MySQL and updates the MySQL tables and the Lucene index. (At first to keep it fast, I actually had two indexes, one that didn’t filter on anything, just contained the strip text amalgamated together, just for the front page’s use, since that would be smaller and faster to search)

The search code called the Lucene index, performed the search, which returned the comic and strip ID information, which was then looked up in MySQL in order to display it. Unlike later generations, the developmental server (my PC) was a true duplicate of the live search.

This continued until June, with 1457 strips in 3 webcomics, when Generation 2 was required.

Already, though I was finding problems with this approach. The two-index approach was efficient on the front page, but I started to get issues with different result orders between the two indexes, so at the end of March I folded it into a single index. I did also have huge memory issues with the parser eating too much memory throughout April, but did resolve it into an issue with ZSL. It was clear even here that ZSL was not a long-term proposition (indeed, one person did actually describe it as a toy search engine) but I didn’t really worry too much about it throughout Generation 1. As June emerged, I had more serious problems to contend with…