coderplay
12/21/2012 - 6:02 AM

搜索相关知识

搜索相关知识

Grouping 与Facet的区别

  • Grouping was first released with Lucene 3.2, its related jira issue is LUCENE-1421: it allows to group search results by specified field. For example, if you group by the author field, then all documents with the same value in the author field fall into a single group. You will have a kind of tree as output. If you want to go deeper into using this lucene feature, this blog post should be useful.
  • Faceting was first released with Lucene 3.4, its related jira issue is LUCENE-3079: this feature doesn't group documents, it just tells you how many documents fall in a specific value of a facet. For example, if you have a facet based on the author field, you will receive a list of all your authors, and for each author you will know how many documents belong to that specific author. After, if you want to see those documents, you have to query one more time adding a specific filter (author=whatever). The faceted search is in fact based on browsing documents applying multiple filters to progressively reach the documents you're really interested in.

Join实现

这篇文章

IndexSearcher articleSearcher = ...
IndexSearcher commentSearcher = ...
String fromField = "id";
boolean multipleValuesPerDocument = false;
String toField = "article_id";
// This query should yield article with id 2 as result
BooleanQuery fromQuery = new BooleanQuery();
fromQuery.add(new TermQuery(new Term("title", "byte")), BooleanClause.Occur.MUST);
fromQuery.add(new TermQuery(new Term("title", "norms")), BooleanClause.Occur.MUST);
Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, articleSearcher);
TopDocs topDocs = commentSearcher.search(joinQuery, 10);

看Lucene 4.0的代码,是这样实现的

  /**
   * Method for query time joining.
   * <p/>
   * Execute the returned query with a {@link IndexSearcher} to retrieve all documents that have the same terms in the
   * to field that match with documents matching the specified fromQuery and have the same terms in the from field.
   * <p/>
   * In the case a single document relates to more than one document the <code>multipleValuesPerDocument</code> option
   * should be set to true. When the <code>multipleValuesPerDocument</code> is set to <code>true</code> only the
   * the score from the first encountered join value originating from the 'from' side is mapped into the 'to' side.
   * Even in the case when a second join value related to a specific document yields a higher score. Obviously this
   * doesn't apply in the case that {@link ScoreMode#None} is used, since no scores are computed at all.
   * </p>
   * Memory considerations: During joining all unique join values are kept in memory. On top of that when the scoreMode
   * isn't set to {@link ScoreMode#None} a float value per unique join value is kept in memory for computing scores.
   * When scoreMode is set to {@link ScoreMode#Avg} also an additional integer value is kept in memory per unique
   * join value.
   *
   * @param fromField                 The from field to join from
   * @param multipleValuesPerDocument Whether the from field has multiple terms per document
   * @param toField                   The to field to join to
   * @param fromQuery                 The query to match documents on the from side
   * @param fromSearcher              The searcher that executed the specified fromQuery
   * @param scoreMode                 Instructs how scores from the fromQuery are mapped to the returned query
   * @return a {@link Query} instance that can be used to join documents based on the
   *         terms in the from and to field
   * @throws IOException If I/O related errors occur
   */
  public static Query createJoinQuery(String fromField,
                                      boolean multipleValuesPerDocument,
                                      String toField,
                                      Query fromQuery,
                                      IndexSearcher fromSearcher,
                                      ScoreMode scoreMode) throws IOException {
    switch (scoreMode) {
      case None:
        TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument);
        fromSearcher.search(fromQuery, termsCollector);
        // termsCollector.getCollectorTerms()是一个Lucene自己写的高效hashmap, 保存了fromField到ids的映射
        return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms());
      case Total:
      case Max:
      case Avg:
        TermsWithScoreCollector termsWithScoreCollector =
            TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode);
        fromSearcher.search(fromQuery, termsWithScoreCollector);
        return new TermsIncludingScoreQuery(
            toField,
            multipleValuesPerDocument,
            termsWithScoreCollector.getCollectedTerms(),
            termsWithScoreCollector.getScoresPerTerm(),
            fromQuery
        );
      default:
        throw new IllegalArgumentException(String.format(Locale.ROOT, "Score mode %s isn't supported.", scoreMode));
    }
  }