solotimes
4/19/2011 - 4:05 PM

Simplistic Full-Text Search With Redis' Sorted Sets

Howto

git clone git://gist.github.com/923934.git redisearch

cd redisearch

./redisearch.rb index file1.txt file2.txt file3.txt

./redisearch.rb search ruby
./redisearch.rb search ruby programming
./redisearch.rb search ruby diamond
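
To see what these commands compute without a running Redis server, here is a minimal in-memory sketch of the same idea. The `TinyIndex` class and the file names are hypothetical and not part of redisearch.rb. The sketch assumes each token maps to per-document occurrence counts (as ZINCRBY builds them) and that a multi-term search intersects those sets and sums the scores, which is the default aggregate of ZINTERSTORE. Ties are broken by document id here, an arbitrary choice that differs from Redis.

```ruby
# Hypothetical in-memory stand-in for the Redis sorted sets used by the script.
class TinyIndex
  def initialize
    # token => { document_id => occurrences }, i.e. one "sorted set" per token
    @postings = Hash.new { |hash, token| hash[token] = Hash.new(0) }
  end

  # Mirrors ZINCRBY: bump the document's score once per token occurrence
  def add(document_id, tokens)
    tokens.each { |token| @postings[token][document_id] += 1 }
  end

  # Mirrors ZINTERSTORE (summed scores) followed by ZREVRANGE (highest first)
  def search(terms, limit = 10)
    return [] if terms.empty?
    sets   = terms.map { |term| @postings.fetch(term, {}) }
    common = sets.map(&:keys).reduce(:&)
    scored = common.map { |id| [id, sets.map { |set| set[id] }.reduce(:+)] }
    scored.sort_by { |id, score| [-score, id] }.first(limit)
  end
end

index = TinyIndex.new
index.add "gem.txt",  %w[ruby gemstone ruby diamond]
index.add "lang.txt", %w[ruby programming language]

p index.search(%w[ruby])          # => [["gem.txt", 2], ["lang.txt", 1]]
p index.search(%w[ruby diamond])  # => [["gem.txt", 3]]
```

A document must contain every term to appear in the result, and a term that occurs more often contributes a higher score.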

Resources

The ruby is a pink to blood-red colored gemstone, a variety of the mineral corundum (aluminium oxide). The red color is caused mainly by the presence of the element chromium. Its name comes from ruber, Latin for red. Other varieties of gem-quality corundum are called sapphires. The ruby is considered one of the four precious stones, together with the sapphire, the emerald, and the diamond.
Ruby is a dynamic, reflective, general-purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. Ruby originated in Japan during the mid-1990s and was first developed and designed by Yukihiro "Matz" Matsumoto. It was influenced primarily by Perl, Smalltalk, Eiffel, and Lisp.
"Ruby" is a song by English rock band Kaiser Chiefs and is the lead track on their second album, Yours Truly, Angry Mob. It was released as the lead single from that album in the United Kingdom as a download on February 5, 2007 and as a limited edition 7 in and CD single on February 19 that year. It became the band's first ever #1 single on February 25, 2007, and ended 2007 as the year's 10th biggest-selling single in the UK with total sales of 313,765.
#!/usr/bin/env ruby

require 'rubygems'
require 'redis'
require 'benchmark'

module SimpleSearch

  def index input
    document    = File.new(input)
    document_id = document.path
    tokens      = analyze document.read

    store document_id, tokens
    puts "Indexed document #{document_id} with tokens:", tokens.inspect, "\n"
  end

  def analyze content
    # >>> Split content by words
    content.split(/\W/).
    # >>> Downcase every word
    map    { |word| word.downcase }.
    # >>> Reject stop words, digits and empty tokens
    reject { |word| STOPWORDS.include?(word) || word =~ /^\d+/ || word == ''  }
  end

  def store document_id, tokens
    tokens.each do |token|
      # >>> Update score for this posting
      R.zincrby "search:index:#{token}", 1, document_id
    end
  end

  def search query
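    # >>> Analyze the query like a document, so case and stop words match the index
    terms = analyze query
    return puts("No searchable terms in query") if terms.empty?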
    # >>> Perform intersection (logical AND) of sets for the terms
    R.zinterstore "search:result", terms.map { |term| "search:index:#{term}" }
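    # >>> Load top 10 matches (ranks 0..9) from the resulting sorted set, as {<DOCUMENT_ID> => <SCORE>}
    # >>> flatten + each_slice copes with both flat and paired replies across redis-rb versions
    results = Hash[ R.zrevrange("search:result", 0, 9, :withscores => true).flatten.each_slice(2).to_a ]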

    results.each do |document_id, score|
      puts "* #{document_id} (Score: #{score})"
    end
  end

  R = Redis.new

  STOPWORDS = %w|a an and are as at but by for if in is it no not of on or that the then there these they this to was will with|

  extend self
end


if __FILE__ == $0
  case command = ARGV.shift
    when 'index'
      elapsed = Benchmark.realtime do
        SimpleSearch::R.keys("search:*").each { |key| SimpleSearch::R.del key }
        ARGV.each { |file| SimpleSearch.index file }
      end
      puts '-'*80, "Indexing done in #{sprintf("%1.2f", elapsed)} seconds", '-'*80

    when 'search'
      puts '='*80
      query = ARGV.join(' ')
      elapsed = Benchmark.realtime do
        SimpleSearch.search query
      end
      puts '-'*80, "Query '#{query}' finished in #{sprintf("%1.5f", elapsed)} seconds"

    else
      puts "USAGE:\n  #{$0} index <FILE>\n  #{$0} search <QUERY>"
  end
end
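
The analysis step can be tried on its own, without Redis. The sketch below copies `analyze` and `STOPWORDS` from the script into a standalone file and runs them on one sample sentence. The per-token count at the end is an illustration of the scores that `store` would accumulate for a single document, not something the script itself prints.

```ruby
STOPWORDS = %w|a an and are as at but by for if in is it no not of on or that the then there these they this to was will with|

# Same pipeline as SimpleSearch.analyze: split, downcase, filter
def analyze(content)
  content.split(/\W/).
    map    { |word| word.downcase }.
    reject { |word| STOPWORDS.include?(word) || word =~ /^\d+/ || word == '' }
end

tokens = analyze("The ruby is a pink to blood-red colored gemstone.")
p tokens
# => ["ruby", "pink", "blood", "red", "colored", "gemstone"]

# Occurrences per token, i.e. the score each posting would end up with
counts = tokens.each_with_object(Hash.new(0)) { |token, tally| tally[token] += 1 }
p counts["ruby"]  # => 1
```

Note that splitting on `\W` breaks hyphenated words such as "blood-red" into two tokens, and that any token starting with a digit is dropped.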