a while ago i helped a colleague move from mysql full-text search based on MATCH .. AGAINST to sphinxsearch to improve performance. unsurprisingly, this improved the response time a lot. recently i was asked to take another look at the setup and see if we could sort the results by relevance. the current setup did not use any form of ordering and, for that particular data, sorting by weight() did not yield good results either.
to simplify things i’ll skip some details and assume that each row of indexed data contains:
- numerical id [ primary key ]
- text field containing a company name,
- text field containing more details about the company.
results with a match on the name are ‘better’ than results with a match on the details field; ‘fuller’ matches are better than partial ones. here is what i came up with:
for the indexing we’ll need:
- hitless_words= [ left empty ] – so not only the presence but also the position of each keyword found is stored in the index
- index_field_lengths=1 so we can calculate how ‘full’ the match is – how many words from the field were matched
- separate fields for the name and the details – so it’s possible to tell matches on one from matches on the other
the initial query, without a custom ranker, looked as follows:
SELECT id, WEIGHT() AS _weight,RANKFACTORS() FROM index0 WHERE MATCH ('ericss*') ORDER BY _weight DESC
it gave me plenty of results with Ericsson in the name or description, but none of the first 25 was the expected Ericsson AB.
i started experimenting with custom rankers and ended up with the following addition to the original query:
OPTION ranker=export('(300/sum(min_best_span_pos)+bm25/100)+(field_mask&1==1)*300/name_len+sum(lcs)*100')
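put together, the full statement is just the initial query with the OPTION clause appended. a minimal sketch of assembling it in python; the helper name with_export_ranker is made up for illustration, and actually sending the string to searchd is left out:

```python
# the initial query and the ranker expression, copied from the post
BASE_QUERY = (
    "SELECT id, WEIGHT() AS _weight,RANKFACTORS() FROM index0 "
    "WHERE MATCH ('ericss*') ORDER BY _weight DESC"
)
RANKER_EXPR = (
    "(300/sum(min_best_span_pos)+bm25/100)"
    "+(field_mask&1==1)*300/name_len+sum(lcs)*100"
)


def with_export_ranker(base_query: str, ranker_expr: str) -> str:
    """Append OPTION ranker=export('...') to a SphinxQL query (hypothetical helper)."""
    # the expression is wrapped in single quotes, so it must not contain any
    if "'" in ranker_expr:
        raise ValueError("ranker expression must not contain single quotes")
    return f"{base_query} OPTION ranker=export('{ranker_expr}')"


full_query = with_export_ranker(BASE_QUERY, RANKER_EXPR)
print(full_query)
```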
the higher the value returned by the export expression, the higher on the results list the given row will show up:
- 300/sum(min_best_span_pos) – the lower the position of the best match, the better, so i take its inverse to get a higher value for a match at position 1 [ 300/1 ] than at position 5 [ 300/5 ]
- bm25 – a built-in ranking factor based on the frequency of the matched keywords; it was not very useful for this dataset, so i downplay it a lot
- (field_mask&1==1)*300/name_len – my field #1 corresponds to the name; i prioritize matches on it, yet try to get the shortest of the matching names to the top of the results, hence the /name_len part. “Ericsson ab” will be a better match than “Lennart Ericsson Holding AB” for the query “ericss*”
- sum(lcs) – helps to prioritize results where several of the search keywords appear as one continuous string over results that contain the same keywords but not directly following one another.
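to see how the parts add up, here is a small python re-implementation of the expression. it is only a sketch: the factor values below are invented for illustration (a real run would read them from RANKFACTORS()), both rows are assumed to match in the name field only, and the division is treated as floating point.

```python
def rank_score(sum_min_best_span_pos, bm25, field_mask, name_len, sum_lcs):
    """Mirror of the export() expression, fed with hand-picked factor values."""
    position_part = 300 / sum_min_best_span_pos   # earlier match -> bigger value
    bm25_part = bm25 / 100                        # heavily downplayed
    name_part = (field_mask & 1) * 300 / name_len # only counts if bit 0 (name) matched
    lcs_part = sum_lcs * 100                      # longer continuous match -> bigger value
    return position_part + bm25_part + name_part + lcs_part


# hypothetical factors for the query "ericss*"; bm25=500 is an arbitrary value
# "Ericsson AB": keyword at position 1, name is 2 words long
ericsson_ab = rank_score(sum_min_best_span_pos=1, bm25=500, field_mask=1,
                         name_len=2, sum_lcs=1)
# "Lennart Ericsson Holding AB": keyword at position 2, name is 4 words long
lennart_holding = rank_score(sum_min_best_span_pos=2, bm25=500, field_mask=1,
                             name_len=4, sum_lcs=1)

print(ericsson_ab, lennart_holding)  # 555.0 330.0
```

with these made-up inputs the shorter name wins on both the position part and the name_len part, which is the ordering i was after.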
all of the variables can be found here.