WordOwl: Kapow!

Archive for June, 2008

(no title)

Monday, June 30th, 2008

The last two days have seen a massive upsurge in comics. Two days ago I had done 525 Ashfield and 401 C+H; since then I’ve done over 200 more Ashfield, and Liz did a few more C+H for me this morning. The result is that we broke through 2,000 strips indexed this morning.

The end for Ashfield is in sight: 741 indexed, leaving 316 to do.

The to-do list right now, in order of implementation:

* Finish Ashfield.

* Implement the strip titles, captions, mouseovers, keywords and descriptions data into the Sphinx index.

* Go back and ensure that every strip in Ashfield has something it can be searched against; there are a few strips at the moment that have no dialogue; these will need either a description or keywords to match against.

* Tidy up the advanced search form now I’ve added the guest strip and animated strip options.

* Add the keywords and descriptions options to the advanced search options.

* Add the lookups for animated/guest strips (and by extension, keywords and descriptions) to the advanced search XHR requests.

* Add graphs for animated and guest strips %ages.

* Compile and test Sphinx on the live server.

* Do the update!

* Start work on some of the user account ideas. Need to talk to a few other people about this, though, since I’m still really not 100% sure this is where I’m actually going with this. Maybe when the site gets bigger it might be more worthwhile… (Ideas: investigate how SMF handles sessions, as this seems a nice reliable way of doing it. Also investigate the APIs for ReCAPTCHA, as this seems to be pretty reliable.)

Current Mood: busy

(no title)

Sunday, June 29th, 2008

Random comic jump is implemented, the user stuff is still in my brain.

I have also today written a mass one-off optimisation to the indexer scripts which re-ranks everything in the database in a more balanced way. I won’t get to do this one differently later but it’s something I should have done before. And as a bonus it halves the database by removing needless duplications.

Possible whizzy bits

Friday, June 27th, 2008

The problem I had before seems to have an end in sight; I just need to implement the fix now… I still have to figure out a couple of implementation quirks before I can nail it.

In other news, I’ve been thinking about new features for the site.

User registration, which allows:
* Not only Share this Comic (available anyway) but also Email this Comic
* Flag up errors in comic
* Preferences for the comics (e.g. user selected “hide comics x and y”)
* Comic browser (frameset allowing browsing buttons, which allows “not this comic” type stuff)
* Suggest a comic

Also:
* Jump to a random comic (random strip out of all indexed)

Since the earliest days of design I did consider a forum for comic-related matters but I don’t think that’s necessary or ideal, so I’d be leaving it at the above.

Current Mood: curious

More comics

Friday, June 27th, 2008

Last night and this morning have proved productive; I added 93 C+H and 100 Ashfield last night to the archive, and a further 120 to Ashfield and 20-odd to Bruno this morning, but even with that I feel I’ve been neglecting Bruno a bit.

In all fairness, C+H and Ashfield are considerably quicker to add to the archive - Ashfield is usually a single caption or a couple of lines of dialogue, C+H is usually 4-6 panels with maybe one line per panel, while Bruno has exercised the limits of my engine before. No matter, though, as Sphinx has pretty much nuked any problems I had with size.

At the actual moment of posting, I’ve added a total of 1,786 strips to my database. I’ll definitely not update before hitting 2,000 strips, but part of me thinks I should wait until Ashfield’s finished (another 600 or so to do) and then add the entire load up, plus Sphinx, in a single massive update. Doable, but I’m not sure yet. I’ll see how it goes. (I haven’t hit any of the Multiple Mondays yet.)

Further testing against some of the new Bruno strips has revealed some interesting facts about the way Sphinx was optimising its searches, and it has told me that I can’t rely on the default behaviour; it’s just not suited to what I want.

Thus: the plan of action:

* Rewrite chars table to include a numeric ID, this should be in order of appearance but as long as it’s reasonably consistent it shouldn’t matter too much.

* Include this chars numeric field instead of the current speaker field; given the size of Bruno’s entries within the chars table, 9 bits would be enough but I may as well give it the full 16 to play with; I don’t believe significant optimising can be done against that otherwise.

* Refactor the advanced search to filter using the numeric id instead of the current appended term (use SetFilter against the field), this will be faster and mean I can separately optimise term list input without worrying about any other meta data now.

* Work out how the required/explicitly not required/optional term patterns must fall into “SphinxQL” then implement it.
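The first three points of the plan can be sketched in a few lines. This is only an illustration under stated assumptions: the speaker names, the `(speaker_id, text)` row layout and all function names are hypothetical, and the filtering is done in plain Python to show the idea rather than through Sphinx's own SetFilter call.

```python
from typing import Dict, Iterable, List, Tuple


def assign_char_ids(speakers: Iterable[str]) -> Dict[str, int]:
    """Give each distinct speaker a numeric ID in order of first appearance (from 1)."""
    ids: Dict[str, int] = {}
    for name in speakers:
        if name not in ids:
            ids[name] = len(ids) + 1
    return ids


def bits_needed(max_id: int) -> int:
    """Smallest number of bits that can hold every ID from 0 up to max_id."""
    return max(1, max_id.bit_length())


def filter_by_speaker(
    lines: List[Tuple[int, str]],
    char_ids: Dict[str, int],
    wanted: Iterable[str],
) -> List[str]:
    """Keep only the lines whose numeric speaker ID belongs to a wanted speaker."""
    wanted_ids = {char_ids[name] for name in wanted if name in char_ids}
    return [text for speaker_id, text in lines if speaker_id in wanted_ids]


# Hypothetical data: placeholder names, not real characters from any strip.
char_ids = assign_char_ids(["Alice", "Bob", "Alice", "Carol"])
lines = [(1, "Hello."), (2, "Hi there."), (1, "Nice day."), (3, "Is it?")]
alice_lines = filter_by_speaker(lines, char_ids, ["Alice"])
```

A 9-bit field holds IDs up to 511 and a 16-bit field holds IDs up to 65,535, which is what `bits_needed` reports at those boundaries. Filtering on the integer keeps the speaker out of the term list entirely, so the text query can be tuned on its own.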

This will all happen before the next site update, so yeah. Actually the list looks huge; it’s not really.

Update

Thursday, June 26th, 2008

Well, I’m actually happy with the output, so I don’t think much more testing is required.

All I’m going to do now is get the comics up to 2,000 then I’ll sweep up all the changes at once.

Current Mood: busy

Sphinx

Thursday, June 26th, 2008

Hah. And Hah, again. Hah, I say.

Having now rewritten the search results processor and restructured the index very slightly, I’ve solved the problems I was having.

Just want to do a little more testing and then I’ll begin work on deployment. (This will be so much fun…)

Current Mood: accomplished

Argh

Wednesday, June 25th, 2008

Things have actually proved more complex than first thought.

The internal data structure where result sets were assembled and parsed after leaving Lucene has now needed to be completely rewritten to utilise Sphinx correctly and effectively.

I had hoped to deploy Sphinx-based searching today but it just isn’t going to happen. Hopefully it shouldn’t take too long; tomorrow is possible, depending on how the rewrite and refactor go.

Current Mood: curious

Today’s update

Wednesday, June 25th, 2008

Well, in between everything else I’ve done some stuff today.

The Sphinx core is settling nicely; having stopped trying to shoehorn the existing data model into it, and relaxed some of the internal data structures very slightly, I’ve been able to finish the majority of the advanced search tools; I just need to be completely sure it’s running correctly before I deploy.

I’ve also been doing some more work with the social sharing page around linking to various services. I’m getting there with getting the services up; plenty more still to do though…

Current Mood: contemplative

Sphinx update

Tuesday, June 24th, 2008

It’s been interesting, using Sphinx.

After a slight swearing escapade trying to make the search daemon run this morning, I’ve been adapting the search code to query Sphinx rather than Lucene.

So far I’ve got the front-page search running, now to rewrite the advanced search. That’s a much more complex task due to all of the permutations.

I do need to do some work around and/or permutations, re-allowing + and - prefixing, but now I’m also considering how I might add quoted and fuzzy searching options.
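As a rough illustration of the + and - prefixing and quoted terms, here is a minimal sketch that splits a raw search box string into required, excluded and optional terms and then joins them into a query string using the `|` (OR) and `-` (NOT) operators of Sphinx's extended query syntax. Everything named here (`ParsedQuery`, `parse_terms`, `to_extended_query`) is hypothetical, and the rule that optional terms are only OR'ed together when nothing is marked as required is an assumption made for the sketch, not how the site handles it. Escaping of special characters is not covered.

```python
import re
from dataclasses import dataclass, field
from typing import List

# An optional +/- prefix followed by either a "quoted phrase" or a bare word.
_TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


@dataclass
class ParsedQuery:
    required: List[str] = field(default_factory=list)
    excluded: List[str] = field(default_factory=list)
    optional: List[str] = field(default_factory=list)


def parse_terms(raw: str) -> ParsedQuery:
    """Sort each term into required (+), excluded (-) or optional (no prefix)."""
    parsed = ParsedQuery()
    for prefix, phrase, word in _TOKEN.findall(raw):
        term = f'"{phrase}"' if phrase else word
        if not term:
            continue  # skip empty quotes such as ""
        if prefix == "+":
            parsed.required.append(term)
        elif prefix == "-":
            parsed.excluded.append(term)
        else:
            parsed.optional.append(term)
    return parsed


def to_extended_query(parsed: ParsedQuery) -> str:
    """Build a query string: required terms ANDed, optional terms OR'ed, excluded negated."""
    parts = list(parsed.required)
    # Assumption: optional terms only drive matching when nothing is required.
    if parsed.optional and not parsed.required:
        parts.append("(" + " | ".join(parsed.optional) + ")")
    parts.extend("-" + term for term in parsed.excluded)
    return " ".join(parts)
```

For example, `to_extended_query(parse_terms('cat dog -mouse'))` gives `(cat | dog) -mouse`, while `'+cat -dog bird'` gives `cat -dog` because the optional term is dropped once something is required.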

Sphinx update

Monday, June 23rd, 2008

I’ve been playing with Sphinx, and once I got my head round the setup (only on my development machine!), I realised how awesome it actually is.

OK, it will require a fair amount of change to implement, but to be honest I’d rather that than not right now. I gave it a complete dump of comic data to index, which took less than half a second, compared to approximately 15 minutes for the current process.

It will need a bit of jiggling about in terms of reporting fields, but nothing that can’t be dealt with relatively easily.

The biggest issue is still going to be the switchover at update time. I’m still not entirely comfortable about updating the server using regular indexing updates; it’s still going to hammer the server IO, so I still need to be able to copy over the pre-built indexes and simply switcheroo. Hmm.
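One generic way to do a switcheroo of that kind is to keep each copied-over index build in its own directory and have the live path be a symlink that gets repointed in a single rename. The sketch below assumes a POSIX filesystem; the directory layout and the `swap_in_index` name are hypothetical, and telling the search daemon to reopen its files afterwards is out of scope here.

```python
import os
from pathlib import Path


def swap_in_index(new_index_dir: Path, live_link: Path) -> None:
    """Repoint the live_link symlink at new_index_dir with one atomic rename."""
    tmp_link = live_link.with_name(live_link.name + ".tmp")
    # Clear any leftover temporary link from an interrupted earlier swap.
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    # Build the new link next to the live one, then rename it over the top.
    os.symlink(new_index_dir, tmp_link, target_is_directory=True)
    os.replace(tmp_link, live_link)
```

Because the rename either happens fully or not at all, readers of the live path see the old build or the new one, never a half-copied mix. The old directory stays on disk, so rolling back is just another call pointing at it.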

Current Mood: contemplative