Technical - The riddles of the Sphinx index
Saturday, December 6th, 2008Since July 2008, WordOwl has relied on a technology known as Sphinx in order to handle the searching of all the comics. It’s no small job, of course, but here we’re going to look at the riddle of the Sphinx and how it applies to WordOwl.
First up, I’m going to delve a little into the theory of searching text, and how Sphinx does it, before I get down and dirty with the details for WordOwl.
How do you search text?
It sounds so easy, doesn’t it, but actually it isn’t. Consider a test case, looking for the word “the” in the phrase “The quick brown fox jumped over the lazy dog.”
Now, you and I can see the phrase is present twice, but a computer doesn’t know that. All it knows is that it has two collections of characters, one three characters long, one 45 characters long. Other than that, it knows nothing.
Approach 1: compare letter by letter
Well, we have two strings of letters. To simplify matters we’ll assume lower case instead of mixed case.
Now, the first one begins with a ‘t’. So we start looking through the second string letter by letter until we find a ‘t’. Bing! We found one at the start. Is it followed by the next letter, an ‘h’? Yeah. Is it followed by an ‘e’? Yes! We found a match.
But are there any more? We repeat the process, and find a ‘t’, ‘h’ and ‘e’ later. So, two matches.
But we’re scanning through the string letter by letter. It’s slow and unwieldy, not to mention problematic. So far we haven’t mentioned what would happen if the word wasn’t ‘the’ but ‘these’. This approach would fail since it would match.
We can introduce the concept of a word boundary, a character that isn’t a letter, that it additionally has to hit - in this case it’s a space, but you’re checking each letter one at a time. For those programming types, try it. See how fast it is.
Approach 2: word-based matching a.k.a. text indexing
There is a better way. How about instead of going through and matching every letter, we match a word at a time. In order to do this we have to know what the different words are that we’re looking for.
So we’re going to look through the words we will be searching through and trying to work out what are words. We can give it a set of rules and have the computer do the work for us.
A word is:
a sequence of one or more consecutive letters, separated by one or more characters that are not letters.
Let’s look back at our test phrase: “The quick brown fox jumped over the lazy dog.”
Starting at the start: t is a letter, thus it’s part of a word. h is a letter and follows t, it’s part of the same word. e is a letter and follows t and h, it’s part of the same word. Then there is a space, we’re no longer in a word. Then we have a q, so we start a new word… and so on.
We can conclude then, that our phrase is the words “the”, “quick”, “brown”, “fox”, “jumped”, “over”, “the”, “lazy”, and “dog”.
Now comes the clever part. We give them numbers.
“the” is 1, “quick” is 2… “over” is 6… hang on, we’ve seen “the” before as it’s number 1… “lazy” is 7…
So now we have a word list of unique words and also know that ‘the’ occurs twice to the once of every other word. All we now do is store a list of what words are in our source phrase in what order.
Our phrase is words: 1 2 3 4 5 6 1 7 8
Now, when we want to find, say, ‘quick’, we can look through our big list of words and say that it is word number 2, so we need to find documents where word number 2 is present.
You could even have it as: word 1 is in document 1, positions 1 and 7. word 2 is in document 2, position 2. Then you know that if you have a word, you can very quickly find out which documents it is in.
This process is pretty much what Sphinx does. You give it a set of documents, it breaks them down according to a series of rules, and it makes a list of what words there are, where they are and so on.
Admittedly it doesn’t number them 1, 2, 3, but a special system. It turns the words it finds into special numbers, using a method known as CRC32, which turns it into a number somewhere between 0 and 4,296,000,000. Since it is then purely a number, there are many efficient computational methods for searching through it, such a binary search (start in the middle, if the number you want is lower, eliminate the top half. Look at the middle of what you have left, if the number is lower, eliminate the top half, repeat ad nauseum until finished)
To Sphinx, all of the transcriptions are just lists of numbers it can search through, and since it is numerical comparisons, it can be done by the processor extremely quickly (a few machine instructions each)
What does this all mean for WordOwl?
Right, glad you asked. Of course, all of the text gets mashed up into long sequences of numbers, but that doesn’t help me find the strip, not directly. It’s only part of the story, in fact.
Sphinx, as well as storing lists of words-mashed-into-numbers, stores other numbers alongside the wordlists. These numbers can represent whatever you want in your application.
Consider it like a single tab on a spreadsheet - the left most column is the row number, then several columns for storing useful numbers (called attributes), and lastly a column that holds my text (called a field)
I have plenty of attributes:
- cid - comic id number. Bruno is comic 1, C&H is comic 2 and so on.
- sid - strip id number.
- gid - grouped id number, made up of both the comic and strip id’s and is unique at strip level. It’s worked out from 65536 times the strip id plus the comic id, which means I could in theory work back to find the details from it, but I don’t - they’re already stored anyway. I use this to establish distinction, since as we’ll see each strip can and will appear multiple times in the Sphinx index.
- sdate - the strip’s date calculated using MySQL’s TO_DAYS() function, ie. the number of days since year 0. It gives a nice round number that I can use for ordering strips and other things.
- type - the type of text in this row, whether it is speech (=1), thought (=2) and so on
- speaker - if the type of text is related to a character, such as speech or thoughts, this will contain their character number, otherwise 0.
- unsafe - whether the strip is marked NSFW (=1) or not (=0)
- guest - whether the strip is a guest strip (=1) or normal (=0)
- animated - whether the strip is animated (=1) or not (=0)
- tag - a special column. tag contains the different tags that can be applied to a strip, represented numerically based on the drop-down order as seen on the advanced search page. (The first entry other than ‘all strips’ is 1 and so on.) This is a special column since it can contain multiple values at once, e.g. a strip might belong to tags 5 and 6 - the column effectively holds both 5 and 6 at once.
It might already be clear what is going on at this point, but as you can see, this contains all the different things - other than text itself - that can be searched on, and the comic and strip ids are present so that when the search is done, it gives back that information (plus all the others, but I don’t really worry about that)
What happens is that there is a script which runs through the database and pulls out all the text for a given strip, and prepares it for Sphinx so that I have one row for each person/text-type combination per strip.
For example, take the first strip from InkTank. I have two different characters speaking, Sophia and InkTank. But instead of having a separate row in the index for each line of dialogue, I can combine them into one row for all of Sophia’s dialogue, and one for all of InkTank’s dialogue.
The first row - Sophia’s - looks something like this:
- cid - 12
- gid - 65548
- sid - 1
- sdate - 733498
- speaker - 1
- type - 1
- unsafe - 0
- guest - 0
- animated - 0
- tag - 0
- text: hey, are you drawing again ~ what are you working on ~ aww ~ just look at you ~ youre skinny in the future ~
What we’re saying is that this is: comic id 12 (InkTank), strip 1, unique id (65536*1+12), date of 733498 (1 Apr 2008), speaker 1 (Sophia, she’s the first person to speak in the first strip), text type 1 (spoken), not unsafe, not guest strip, not animated, no tags and then finally the text. (Tildes replace most non word characters)
There is a similar row for InkTank himself immediately afterwards, of course, and this applies to all of the dialogue in all the comics.
There would be two rows if a character both spoke and thought in the same strip - one for their thought and one for their speech.
Additionally there is another text type that deserves explanation: present. In order to be able to search and track characters who have not said anything but are otherwise present in a strip, there is a special marker for them - the rest of the attributes are as expected, but the text simply consists of a tilde, a placeholder. (A tilde is not searchable by the user; it is simply an internal placeholder and is treated as another word in the index.)
Having actually stored all the data like that - at the time of writing, 7,533 strips translates into 26,691 rows - we must now actually search it.
This presents two problems: filtering out what we don’t want (on the advanced page), and making sure we only get unique strips, or just the best individual strip (for the front page)
In both cases, Sphinx comes to the rescue - the first problem is neatly taken care of with Filters, and the second by GroupBy.
Simply put, Sphinx has the ability to restrict data and matches based on set criteria. I can specify (comic id = 12) as a filter rule, to limit to InkTank only, for example. Or I say (comic id = 12, speaker = 2) to limit to strips with InkTank present/speaking within the InkTank comic. Since these are separate to the text searches, they can be applied to return all values - it is possible to hit (comic id = 12, speaker = 2) and no search text, and see all the strips where InkTank is present or speaking, in one hit.
The second point, regarding uniqueness, brings us back to the point of the groupid id. In the advanced search, I want to find the best match amongst the strips, with the best hits in a single strip first. All I do is tell Sphinx to group results based on the grouped id, so that it can combine the multiple entries into a single block (subject to filters) for the purposes of ranking and scoring.
Similarly for the front page search, I need to group the strips by comic, to get the best strip in each comic, but at the same time I need it to group by unique comic strips - again, not a problem. I tell it to group on the comic id, and treat the grouped id as a distinct value; and instantly I can see the best strip in a group, and how many distinct strips that incorporates.
It’s all done by Sphinx - I’d be lying if I claimed any credit for that.
What is meant by the ‘best strip’?
This is probably the most complex question with regards to Sphinx yet. How does it establish the best scores, and for that matter, how does it match it? Does it look for all words? Some words?
Whow, let’s slow down a minute. As I’ve said, Sphinx is doing the work regarding searching. It’s also doing all of the work regarding ranking, but I still need to explain what it is actually doing.
First up, you can give it a combination of words, and it will attempt to match the best strip on having all the words you give it, followed by all but one, etc. Where you have multiple combinations of “all but one”, the order is generally based off the rarest words (i.e. fewest appearances) being ranked higher since they are rarer. The total number of documents is taken into account too - it’s a general form of the BM25 ranking method.
It also takes into account the longest common subsequence - i.e. if you give it four words, an instance of the four words consecutively in the right order will be considered higher than the same strip having the four words separated across it.
It is possible that some matches sometimes buck the trend; I recently found a phrase with four words where all four were only in one strip, but since I had the order wrong, a strip where three of them were in the right order got ranked very slightly (1 point) higher than the strip with all four words in separately, but this is not at all common.
Again, this is all handled internally within Sphinx, and behaves consistently when doing this. In the event of a tie-breaker, the order is to take earliest strip first, then lowest id first (e.g. two strips inside a comic that both have the word ‘the’ in will find themselves ranked earliest first). The logic for providing the id as well is that some comics have strips with duplicate dates in - e.g. the first 40 or so xkcd strips are all listed as having the same date. By giving it an order criteria that cannot ever result in a tie-breaker, it will always return the order consistently.
Anyway, I think that about wraps it up for Sphinx for now.




