MichaelB.
7/29/2018 - 6:30 PM

CREATE explicit FULLTEXT INDEX [with TEXT ANALYSIS]

The fulltext index is attached as a hidden/shadow column to the source table column. In SQL you specify:

CREATE FULLTEXT INDEX <index_name> ON <table_name> ( <column_name> ) [<fulltext_parameter>] The subsequent fulltext index created can only be accessed indirectly via search, i.e. in SQL using SELECT with the CONTAINS predicate. The fulltext index is not a table, which you could directly access.

In contrast, text analysis results are found in a table and is an option to the fulltext index. In SQL you add the following parameter:

CREATE FULLTEXT INDEX <index_name> ON <table_name> ( <column_name> ) TEXT ANALYSIS ON The subsequent $TA_<index_name> table created in the database allows you to enjoy the benefits of SQL operations such as calculate and aggregate functions. In addition, you could utilize other HANA engines like Spatial or Predictive on the text analysis annotations.

Requirements: Columnar table with character or binary content

  • VARCHAR and CLOB for ASCII text
  • NVARCHAR and NCLOB for Unicode text
  • BLOB for binary document formats (and text using multi-byte encodings)

FAST PREPROCESS OFF is the key to activate text analysis -> then uses text analysis for tokenization and for the stemming TEXT ANALYSIS ON does not have to be specified

An explicit full-text index is created after the source table (needs to already be there). You must drop and re-create an explicit index to modify text analysis parameters

By default, all inputs are assumed to be English! Specify the possible languages when creating the full text index. If the confidence is low for the first language, then it will bypass to the next one. There is a parameter where you can change this confidence level. You can also specify one language for each row with LANGUAGE COLUMN and bypass automatic language detection.

  • Specify ISO 639 language code (check SYS.M_TEXT_ANALYSIS_LANGUAGES for values)
  • NULL to fall back to automatic language detection You can also combine language for each row with automatic language detection.

Binary Document Formats:

  • Format is automatically detected
  • Text content is extracted
  • Officially supports formats listed in SYS.M_TEXT_ANALYSIS_MIME_TYPES

You can also specify the input format

  • Only meaningful for HTML, XML, and plain text
  • Use registered IANA MIME types (e.g. "text/html")

You can change the tokenization behavior of the system using TOKEN SEPARATORS parameter. It is an additional level of token-breaking applied after text analysis processing. The default if not specified: /;,.:-_()[]<>!?*@+{}="& Can be disabled by specifying empty TOKEN SEPARATORS

Specify TEXT ANALYSIS ON to access text analysis results. Result:

  • Full text index is created, plus SAP HANA automatically generates an additional table containing text analysis results
  • Generated table name is $TA_[INDEX_NAME]

Synchronous processing

  • indexing
-- Standard syntax explicit fulltext index
CREATE FULLTEXT INDEX <index_name> ON <schema_name> <table_name> ( <column_name> ) [<fulltext_parameter>]

-- Explicit fulltext index with text analysis activation in MLP engine for tokenization and normalization
CREATE FULLTEXT INDEX <index_name> ON <schema_name> <table_name> ( <column_name> ) FAST PREPROCESS OFF

-- Access text analysis results
CREATE FULLTEXT INDEX <index_name> ON <schema_name> <table_name> ( <column_name> ) TEXT ANALYSIS ON

-- Drop a full text index
DROP FULLTEXT INDEX [INDEX NAME];

-- Example of how to specify the possible languages when creating the full text index
CREATE FULLTEXT INDEX PRODUCT_REVIEWS_IDX ON PRODUCT_REVIEWS(CONTENT) FAST PREPROCESS OFF
    LANGUAGE DETECTION ('EN', 'DE', 'FR');
    

-- Example of how to specify one language for each row
CREATE FULLTEXT INDEX PRODUCT_REVIEWS_IDX ON PRODUCT_REVIEWS(CONTENT) FAST PREPROCESS OFF
    LANGUAGE COLUMN language;
    
    
-- Example of how to combine language for each row with automatic language detection
CREATE FULLTEXT INDEX PRODUCT_REVIEWS_IDX ON PRODUCT_REVIEWS(CONTENT) FAST PREPROCESS OFF
    LANGUAGE COLUMN language LANGUAGE DETECTION ('EN', 'DE', 'FR')
    

-- Example of how to specify the input format with Text Analysis
CREATE FULLTEXT INDEX PRODUCT_REVIEWS_IDX ON PRODUCT_REVIEWS(CONTENT) FAST PREPROCESS OFF
    MIMETYPE COLUMN MIMETYPE
        MIMETYPE 'text/plain';


-- How to customize token breaks
CREATE FULLTEXT INDEX <index_name> ON <schema_name> <table_name> ( <column_name> ) TEXT ANALYSIS ON
    TOKEN SEPARATORS 'token_separator';
    
    
-- How to disable token breaks
CREATE FULLTEXT INDEX <index_name> ON <schema_name> <table_name> ( <column_name> ) TEXT ANALYSIS ON
    TOKEN SEPARATORS '';
    

-- Define a configuration for the fulltext index
CREATE FULLTEXT INDEX <index_name> ON
<schema_name> <table_name> ( <column_name> ) FAST PREPROCESS OFF
    LANGUAGE DETECTION ('EN', 'DE', 'FR') TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE/EXTRACTION_CORE_ENTERPRISE/EXTRACTION_CORE_PUBLIC_SECTOR/EXTRACTION_CORE_VOICEOFCUSTOMER/
               GRAMMATICAL_ROLE_ANALYSIS/LINGANALYSIS_BASIC/LINGANALYSIS_FULL/LINGANALYSIS_STEMS';