{"id":2103,"date":"2014-03-17T22:37:45","date_gmt":"2014-03-17T21:37:45","guid":{"rendered":"http:\/\/kudzia.eu\/b\/?p=2103"},"modified":"2016-09-07T10:00:20","modified_gmt":"2016-09-07T09:00:20","slug":"sphinx-text-search-sorting-the-results-by-relevancy","status":"publish","type":"post","link":"https:\/\/kudzia.eu\/b\/2014\/03\/sphinx-text-search-sorting-the-results-by-relevancy\/","title":{"rendered":"sphinx text search &#8211; sorting the results by relevancy"},"content":{"rendered":"<p>a while ago i helped a colleague with moving away from mysql-based text search based on <a href=\"https:\/\/dev.mysql.com\/doc\/refman\/5.5\/en\/fulltext-search.html\">MATCH .. AGAINST<\/a> to <a href=\"http:\/\/sphinxsearch.com\/\">sphinxsearch<\/a> to improve the performance. without much surprise this helped a lot to improve the response time. recently i was asked to take a look at the setup again and see if we can do anything to sort results based on relevancy. in the current setup there was no any form of ordering used and &#8211; for that particular data &#8211; sorting by weight() &#8211; did not yield good results either.<\/p>\n<p><!--more--><\/p>\n<p>to simplify things i&#8217;ll skip some details and assume that each row of indexed data contains:<\/p>\n<ul>\n<li>numerical <i>id<\/i> [ primary key ]<\/li>\n<li>text field containing a company <i>name<\/i>,<\/li>\n<li>text field containing more <i>details<\/i> about company,<\/li>\n<\/ul>\n<p>results with match on the <i>name<\/i> are &#8216;better&#8217; than results with match on the <i>details<\/i> field; &#8216;fuller&#8217; matches are better than partial matches. that&#8217;s what i came up with:<\/p>\n<p>for the indexing we&#8217;ll need:<\/p>\n<ul>\n<li><a href=\"http:\/\/sphinxsearch.com\/docs\/current.html#conf-hitless-words\">hitless_words=<\/a> &#8211; so not only the presence but also position of each of found keywords is stored in the index<\/li>\n<li><a href=\"http:\/\/sphinxsearch.com\/docs\/current.html#conf-index-field-lengths\">index_field_lengths=1<\/a> so we can calculate how &#8216;full&#8217; the match is &#8211; how many words from the field were matched<\/li>\n<li>separate fields for the <i>name<\/i> and <i>details<\/i> &#8211; so it&#8217;s possible to distinguish between matches on both\n<\/ul>\n<p>the initial query &#8211; without custom ranker looked as follows:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nSELECT id, WEIGHT() AS _weight,RANKFACTORS() FROM index0 WHERE MATCH ('ericss*') ORDER BY _weight DESC\r\n<\/pre>\n<p>would give me plenty of results having Ericsson in the <i>name<\/i> or <i>description<\/i> but none of the first 25 would be expected Ericsson AB<\/p>\n<p>i&#8217;ve started experimenting with custom rankers and ended up with the following addition to the original query:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nOPTION ranker=export('(300\/sum(min_best_span_pos)+bm25\/100)+(field_mask&amp;1==1)*300\/name_len+sum(lcs)*100')\r\n<\/pre>\n<p>the higher value returned from the export expression &#8211; the higher on the results list given row will show<\/p>\n<ul>\n<li>300\/sum(min_best_span_pos) &#8211; the lower value of the best match position the better &#8211; so i&#8217;m taking the inverse of it so i get higher value for match on position 1 [ 300\/1 ] than on position 5 [ 300\/5 ]<\/li>\n<li>bm25 &#8211; it&#8217;s a built in ranker &#8211; based on the frequency of matched keywords; it was not very useful for this dataset so i downplay it a lot<\/li>\n<li>(field_mask&#038;1==1)*300\/name_len &#8211; my field #1 corresponds to the <i>name<\/i> &#8211; i&#8217;m prioritizing the matches on it yet try to get shortest of available match at the top of results &#8211; hence the \/name_len part. &#8220;Ericsson ab&#8221; will be better match than &#8220;Lennart Ericsson Holding AB&#8221; for the query &#8220;ericss*&#8221;<\/li>\n<li>sum(lcs) &#8211; will help to prioritize results with match containing few of the search keywords in a continuous string over other matches that contain the same keywords but not directly following one another.\n<\/li>\n<\/ul>\n<p>all of the variables can be found <a href=\"http:\/\/sphinxsearch.com\/docs\/current.html#field-factors\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>a while ago i helped a colleague with moving away from mysql-based text search based on MATCH .. AGAINST to sphinxsearch to improve the performance. without much surprise this helped a lot to improve the response time. recently i was asked to take a look at the setup again and see if we can do [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17],"tags":[44],"class_list":["post-2103","post","type-post","status-publish","format-standard","hentry","category-tech","tag-sphinx-search"],"_links":{"self":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/2103","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/comments?post=2103"}],"version-history":[{"count":5,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/2103\/revisions"}],"predecessor-version":[{"id":2125,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/2103\/revisions\/2125"}],"wp:attachment":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/media?parent=2103"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/categories?post=2103"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/tags?post=2103"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}