How do I use full text search in MongoDB?

MongoDB, one of the leading NoSQL databases, is well known for its fast performance, flexible schema, scalability and great indexing capabilities. At the heart of this fast performance are MongoDB indexes, which support efficient query execution by avoiding full collection scans and thus limiting the number of documents that MongoDB searches.

MongoDB introduced full-text search with text indexes as an experimental feature in version 2.4. The feature is now an integral part of the product and is no longer experimental. With MongoDB’s full-text search, you can define a text index on any field in a document whose value is a string or an array of strings. When you create a text index on a field, MongoDB tokenizes and stems the field’s text content and builds the index entries accordingly.

In this tutorial, we will explore the full-text search features of MongoDB.

CREATING A MONGODB SERVER ON HIDORA

First, we need to install MongoDB, so let’s see how quickly and easily MongoDB can be installed on the Hidora PaaS:

[Screenshot: MongoDB environment created on Hidora]

CONNECTING TO MONGODB WITH SSH

Now let’s see how you can access your Hidora account with all its environments and containers.

Note. SSH access is granted for the whole account, not for an individual environment.

[Screenshot: Hidora account settings]

In the Settings tab of the open account, navigate to SSH Keychain > Public.

Note. This option is available only to billing customers. If you need this access during the trial period, please let us know and we will grant it.

[Screenshot: PuTTY and the Hidora SSH gateway]

CREATING SAMPLE DATA

Data in MongoDB has a flexible schema. Unlike SQL databases, where you must determine and declare the schema of a table before inserting data, MongoDB collections do not enforce a document structure by default. This flexibility makes it easy to map documents to an entity or object. Each document can match the data fields of the represented entity, even if the data varies significantly. In practice, however, documents in a collection share a similar structure.

The main challenge in data modelling is to balance the needs of the application, the performance characteristics of the database engine and the data retrieval patterns. When designing data models, always consider how the application uses the data (i.e. queries, updates and data processing) as well as the inherent structure of the data itself.

MongoDB stores data records as BSON documents. BSON is a binary representation of JSON documents, although it contains more data types than JSON. For the BSON specification, see bsonspec.org.

MongoDB stores BSON documents (data records) in collections, and collections in databases.

To select a database to use, in the mongo shell, issue the use statement, as in the following example:

use myDB

Creating a database

If a database does not exist, MongoDB creates the database when you first store data for it. So you can switch to a non-existent database and perform the following operation in the mongo shell:

use myNewDB
db.myNewCollection1.insertOne( { x: 1 } )

The insertOne() operation creates both the myNewDB database and the myNewCollection1 collection if they do not already exist.

MongoDB stores documents in collections. Collections are analogous to tables in relational databases.

Create a collection

If a collection does not exist, MongoDB creates the collection when you first store data for it.

db.myNewCollection2.insertOne( { x: 1 } )
db.myNewCollection3.createIndex( { y: 1 } )

The operations insertOne() and createIndex() create their respective collections if they do not already exist.

Explicit creation

MongoDB provides the db.createCollection() method to explicitly create a collection with various options, such as setting the maximum size or document validation rules. If you do not specify these options, you do not need to create the collection explicitly, since MongoDB creates new collections when you first store data in them.
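As a minimal sketch of explicit creation, the snippet below prepares options for a capped collection and for a collection with a validation rule. The collection names (log, contacts), the sizes and the rule itself are hypothetical and chosen only for illustration. The db.createCollection() calls run only inside the mongo shell, where db is defined.

```javascript
// Options for a capped collection: `size` is the maximum size in bytes and
// `max` the maximum number of documents (both values are illustrative).
const cappedOptions = { capped: true, size: 5242880, max: 5000 };

// Options with a document validation rule: every document must contain a
// string `phone` field (a made-up rule for a made-up collection).
const validatedOptions = {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["phone"],
      properties: { phone: { bsonType: "string" } }
    }
  }
};

// `db` exists only in the mongo shell; the guard lets the file load elsewhere.
if (typeof db !== "undefined") {
  db.createCollection("log", cappedOptions);
  db.createCollection("contacts", validatedOptions);
}
```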

DOCUMENT SEARCH

MongoDB 3.2 introduced version 3 of the text index.

MongoDB provides text indexes to support text search queries on the contents of strings. Text indexes can include any field whose value is a string or an array of string elements.

Create a text index

IMPORTANT: A collection can have a maximum of one text index.

To create a text index, use the db.collection.createIndex() method. To index a field that contains a string or array of string elements, include the field and specify the string literal "text" in the index document, as in the following example:

db.reviews.createIndex( { comments: "text" } )

You can index multiple fields for the text index. The following example creates a text index on the subject and comments fields:

db.reviews.createIndex( { subject: "text", comments: "text" } )

A compound index can include text index keys in combination with ascending or descending index keys. To drop a text index, use the index name.

Specify weight

For a text index, the weight of an indexed field indicates the importance of that field relative to the other indexed fields in terms of the text search score.

For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the document score.
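The multiply-and-sum step described above can be sketched in plain JavaScript. This only illustrates the weighted sum, not MongoDB's scoring code: the final score that MongoDB derives from the sum is not reproduced here, and the function name weightedMatchSum and the sample numbers are made up.

```javascript
// matches: number of matches per indexed field, e.g. { subject: 1, comments: 3 }
// weights: weight per indexed field; fields without an entry default to 1
function weightedMatchSum(matches, weights = {}) {
  let sum = 0;
  for (const [field, count] of Object.entries(matches)) {
    const weight = weights[field] ?? 1; // the default weight is 1
    sum += count * weight;
  }
  return sum;
}

// One match in `subject` (weight 10) and three in `comments` (default weight 1).
const exampleSum = weightedMatchSum({ subject: 1, comments: 3 }, { subject: 10 });
console.log(exampleSum); // 13
```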

The default weight is 1 for indexed fields. To adjust the weights of indexed fields, include the weights option in the db.collection.createIndex() method.
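A short sketch of the weights option, reusing the reviews collection and its subject and comments fields from the earlier examples. The weight of 10 and the index name are arbitrary choices for illustration.

```javascript
// Text index keys: the same two fields as in the earlier example.
const weightedKeys = { subject: "text", comments: "text" };

// A match in `subject` counts ten times as much as one in `comments`,
// which keeps the default weight of 1. The index name is hypothetical.
const weightedOptions = { weights: { subject: 10 }, name: "ReviewTextIndex" };

// `db` exists only in the mongo shell; the guard lets the file load elsewhere.
if (typeof db !== "undefined") {
  db.reviews.createIndex(weightedKeys, weightedOptions);
}
```

Because a collection can have only one text index, an existing text index on reviews would have to be dropped before this call can succeed.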

Wildcard text indexes

When creating a text index across multiple fields, you can also use the wildcard specifier ($**). With a wildcard text index, MongoDB indexes each field containing string data for each document in the collection. The following example creates a text index using the wildcard specifier:

db.collection.createIndex( { "$**": "text" } )

This index allows textual searches to be performed on all fields containing strings. Such an index can be useful with very unstructured data if one does not know which fields to include in the textual index or for ad-hoc queries.

Wildcard text indexes are text indexes over multiple fields. As such, you can assign weights to specific fields when creating the index to control the ranking of the results.

Wildcard text indexes, like all text indexes, can be part of a compound index. The following example creates a compound index on the field a and the wildcard specifier:

db.collection.createIndex( { a: 1, "$**": "text" } )

As with all compound text indexes, since the field a precedes the text index key, the query predicate must include an equality condition on a to perform a $text search with this index.
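To make the equality requirement concrete, here is a sketch of query documents for a compound index of the form { a: 1, "$**": "text" }. The value 5 and the search term are placeholders, and db.collection refers to the same placeholder collection as the example above.

```javascript
// Usable with the compound index: an equality condition on `a` plus $text.
const validQuery = { a: 5, $text: { $search: "coffee" } };

// Not usable with this index: no condition on `a` at all ...
const missingPrefixQuery = { $text: { $search: "coffee" } };
// ... or a range condition on `a` instead of an equality condition.
const rangePrefixQuery = { a: { $gt: 1 }, $text: { $search: "coffee" } };

// `db` exists only in the mongo shell; the guard lets the file load elsewhere.
if (typeof db !== "undefined") {
  db.collection.find(validQuery).forEach((doc) => printjson(doc));
}
```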

Case insensitivity

The version 3 text index supports common case folding (C), simple case folding (S) and, for Turkish languages, special case folding (T), as specified in the Case Folding data of the Unicode 8.0 character database.

Case folding extends the case insensitivity of the text index to include characters with diacritics, such as é and É, and characters from non-Latin alphabets, such as ‘И’ and ‘и’ in the Cyrillic alphabet.

Version 3 of the text index is also insensitive to diacritics. Thus, the index does not distinguish between é, É, e and E.

Previous versions of the text index are case insensitive for [A-z] only; that is, case insensitive for non-diacritical Latin characters only. For all other characters, previous versions treat upper and lower case as distinct.

Insensitivity to diacritics

With version 3, the text index is insensitive to diacritical marks. In other words, the index does not distinguish between characters that contain diacritical marks and their unmarked counterparts, such as é, ê and e. Specifically, the text index strips the characters categorized as diacritics in the Prop list of the Unicode 8.0 character database.

Version 3 of the text index is also case insensitive for characters with diacritics. For example, the index does not distinguish between é, É, e and E.

Previous versions of the text index treat characters with diacritics as distinct.
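The effect of case and diacritic insensitivity can be approximated in plain JavaScript. This is a rough sketch only: it decomposes each character, strips code points with the Unicode Diacritic property and lower-cases the rest, whereas MongoDB follows the Unicode case folding data, so the two are not identical. The function name foldTerm is made up.

```javascript
// Approximate folding: decompose (NFD), remove diacritic code points,
// then lower-case. Not MongoDB's actual implementation.
function foldTerm(term) {
  return term
    .normalize("NFD")
    .replace(/\p{Diacritic}/gu, "")
    .toLowerCase();
}

console.log(["é", "É", "ê", "E"].map(foldTerm)); // [ 'e', 'e', 'e', 'e' ]
console.log(foldTerm("И") === foldTerm("и"));     // true
```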

Tokenization delimiters

For tokenization, the version 3 text index uses delimiters categorized as Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space in the Prop list of the Unicode 8.0 character database.

For example, given the string "He said he «was the best player in the world»", the text index treats «, » and spaces as delimiters.

Previous versions of the index treat « as part of the term "«was" and » as part of the term "world»".
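Delimiter-based tokenization of this kind can be imitated with Unicode property escapes in a JavaScript regular expression. This is an illustrative sketch, not MongoDB's tokenizer: the Hyphen category is left out because the sketch relies only on properties that JavaScript regular expressions expose, and the names DELIMITERS and tokenize are made up.

```javascript
// Delimiter categories from the text, minus Hyphen (omitted in this sketch).
const DELIMITERS =
  /[\p{Dash}\p{Pattern_Syntax}\p{Quotation_Mark}\p{Terminal_Punctuation}\p{White_Space}]+/u;

// Split on runs of delimiters and drop the empty strings at the edges.
function tokenize(text) {
  return text.split(DELIMITERS).filter((token) => token.length > 0);
}

const sample = "He said he «was the best player in the world»";

// Quotation marks act as delimiters, so "was" and "world" come out clean.
console.log(tokenize(sample));

// Splitting on spaces only keeps the quotation marks attached to the terms.
console.log(sample.split(" "));
```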

Index entries

The text index tokenizes and stems the terms in the indexed fields for the index entries. It stores one index entry for each unique stemmed term in each indexed field for each document in the collection. The index uses simple language-specific suffix stemming.

Supported languages and stop words

MongoDB supports text searching in multiple languages. Text indexes ignore language-specific stop words (e.g., the, an, a, and) and use simple language-specific suffix stemming. For a list of supported languages, see Text search languages.

If you specify the value "none" for the language, the text index uses simple tokenization without stop word lists or stemming.
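The language is set with the default_language option when the index is created. In the sketch below, the articles collection and its body field are hypothetical. Only one of the two option documents can be used, since a collection holds a single text index.

```javascript
// Use French stop words and stemming for the index.
const frenchOptions = { default_language: "french" };

// Alternative: plain tokenization, with no stop words and no stemming.
const noLanguageOptions = { default_language: "none" };

// `db` exists only in the mongo shell; the guard lets the file load elsewhere.
if (typeof db !== "undefined") {
  db.articles.createIndex({ body: "text" }, frenchOptions);
}
```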

Sparse property

Text indexes are sparse by default and ignore the sparse: true option. If a document has no text index field (or if the field is null or an empty array), MongoDB does not add an entry for the document to the text index. For inserts, MongoDB inserts the document but does not add anything to the text index.

For a compound index that includes a text index key as well as keys of other types, only the text index field determines whether the index refers to a document. The other keys do not determine whether the index refers to documents or not.

Restrictions

Only one text index per collection. A collection can have a maximum of one text index.

Text search and hints

You cannot use hint() if the query includes a $text query expression.

Text index and sort

Sort operations cannot obtain the sort order from a text index, even from a compound text index; that is, sort operations cannot use the order of the text index.

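Compound index

A compound index can include a text index key in combination with ascending/descending index keys. However, these compound indexes are subject to restrictions on the other index types they can include and on the equality conditions that queries must supply; both are described in the performance section below.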

Drop a text index

To delete a text index, pass the index name to the db.collection.dropIndex() method. To get the index name, run the db.collection.getIndexes() method.
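A possible way to combine the two methods is sketched below. It assumes that getIndexes() reports a text index with the special key field _fts set to "text", which is how the shell displays text indexes. The helper name textIndexNames is made up, and reviews is the collection from the earlier examples.

```javascript
// Return the names of all text indexes in a getIndexes() result.
function textIndexNames(indexes) {
  return indexes
    .filter((index) => index.key && index.key._fts === "text")
    .map((index) => index.name);
}

// `db` exists only in the mongo shell; the guard lets the file load elsewhere.
if (typeof db !== "undefined") {
  for (const name of textIndexNames(db.reviews.getIndexes())) {
    db.reviews.dropIndex(name);
  }
}
```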

Storage requirements and performance costs

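Text indexes carry storage and performance costs. Because the index stores an entry for each unique stemmed term in each indexed field of each document, it can grow large, and every insert must add those entries, which lowers insertion throughput.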

Text search support

The text index supports $text search operations. For examples of text search, see the $text reference page. For examples of $text operations in the aggregation pipelines, see Searching for text in the aggregation pipeline.
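A few typical $text queries are sketched below, assuming the reviews collection already has a text index. The search terms are placeholders.

```javascript
// Match documents containing "coffee" or "shop" (terms are OR-ed).
const anyTermQuery = { $text: { $search: "coffee shop" } };

// Match the exact phrase by wrapping it in escaped double quotes.
const phraseQuery = { $text: { $search: "\"coffee shop\"" } };

// Exclude a term by prefixing it with a minus sign.
const excludeQuery = { $text: { $search: "coffee -shop" } };

// Used both as projection and as sort document to expose the relevance score.
const byScore = { score: { $meta: "textScore" } };

// `db` exists only in the mongo shell; the guard lets the file load elsewhere.
if (typeof db !== "undefined") {
  db.reviews
    .find(anyTermQuery, byScore)
    .sort(byScore)
    .limit(10)
    .forEach((doc) => printjson(doc));
}
```

Sorting by the textScore metadata returns the most relevant documents first; without it, results come back in no particular order of relevance.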

IS THERE A WAY TO IMPROVE PERFORMANCE?

Full-text search in MongoDB has limitations that matter most on very large datasets.

The legacy text command from MongoDB 2.4 returned all matches in a single document and had no skip parameter for retrieving results page by page. Even when only the _id field was returned for each match, a huge result set was cut short once it exceeded MongoDB’s limit of 16 MB per document. The $text query operator avoids this problem: it is used with find(), which returns a cursor, so results can be paged with skip() and limit().

A compound text index cannot include other special index types, such as multikey or geospatial index fields. In addition, if your compound text index includes index keys before the text index key, all queries must specify equality conditions for the preceding keys.

Text indexes add overhead when new documents are inserted, which reduces the insertion rate. Some queries, such as phrase searches, can be relatively slow.
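With the $text query operator, as opposed to the legacy text command, results come back through a cursor, so page-by-page retrieval can be sketched with skip() and limit(). The helper pageWindow, the page size and the search term are made up, and reviews is assumed to have a text index.

```javascript
// Translate a 1-based page number into skip/limit values.
function pageWindow(pageNumber, pageSize) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError("pageNumber must be a positive integer");
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be a positive integer");
  }
  return { skip: (pageNumber - 1) * pageSize, limit: pageSize };
}

const thirdPage = pageWindow(3, 20);
console.log(thirdPage); // { skip: 40, limit: 20 }

// `db` exists only in the mongo shell; the guard lets the file load elsewhere.
if (typeof db !== "undefined") {
  const byScore = { score: { $meta: "textScore" } };
  db.reviews
    .find({ $text: { $search: "coffee" } }, byScore)
    .sort(byScore) // a stable order is needed for paging to make sense
    .skip(thirdPage.skip)
    .limit(thirdPage.limit)
    .forEach((doc) => printjson(doc));
}
```

Large skip values still make the server walk past the skipped results, so deep pages get progressively slower.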

CONCLUSION

MongoDB’s full-text search is not offered as a complete replacement for dedicated search engines such as Elasticsearch or Solr. However, it can be used effectively for the majority of applications that are built with MongoDB today.

Written By

Matthieu ROBIN

16/11/2017

Matthieu Robin is the CEO of Hidora and an experienced strategic leader. A former system administrator who has managed and configured more environments manually than anyone else on the planet, he created Hidora SA after realising that the same work could be done with a few clicks.
