Concept Search – Intelligent context sensitive morphological search
The Semitic languages (e.g. Hebrew, Arabic) have a very complex linguistic structure. This complexity makes it a challenge to obtain high-quality full-text search on materials such as Internet sites, archives, and any database that includes textual information.
The complexity of Semitic languages stems from two main facts. First, a single word can have hundreds or even thousands of different inflections. Second, a single root may be manifested in different words that share the same semantic concept. This prevents regular search engines from identifying all the occurrences of the search word, which may appear in very different forms in the text. Additionally, Semitic scripts do not regularly mark vowels, so texts contain a lot of ambiguity: a single word can often be interpreted in several ways, and only the context enables the reader to determine the right interpretation. As a result, a regular search engine returns on average only around 50% of the relevant tokens, often much less, while some of the returned results are irrelevant due to the lack of disambiguation.
An example of the challenge:
A user searches for the Hebrew word אישה (“woman”). The plural, “women”, is not only very different (נשים) but also happens to be spelled exactly like a totally different word: the first person plural future form of the verb “to put”, נשים. The word נשים in the sense of “women” has its own inflections, e.g. ושנשי means “and that the women of”. So, to return all the relevant documents, and only the relevant documents, the search engine needs to address all of this complexity.
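The problem can be made concrete with a toy example. The sketch below is only an illustration and not Melingo's implementation: the document names, the concept labels WOMAN and PUT, and both function names are hypothetical, and the concept labels are assigned by hand where a real system would have to derive them from context.

```python
# Toy illustration: exact string matching versus matching on a concept label.
# Each token carries a hand-assigned concept label (an assumption of this
# sketch) standing in for the output of a context-sensitive analyzer.
DOCS = {
    "doc_singular": [("אישה", "WOMAN")],
    "doc_plural": [("נשים", "WOMAN")],
    "doc_inflected": [("ושנשי", "WOMAN")],
    "doc_verb": [("נשים", "PUT")],  # same spelling, unrelated verb
}

def exact_match(query, docs):
    """Return the documents that contain the query string verbatim."""
    return sorted(d for d, toks in docs.items()
                  if any(form == query for form, _ in toks))

def concept_match(concept, docs):
    """Return the documents that contain any token with the given concept."""
    return sorted(d for d, toks in docs.items()
                  if any(label == concept for _, label in toks))

print(exact_match("אישה", DOCS))     # misses the plural and inflected forms
print(exact_match("נשים", DOCS))     # also returns the unrelated verb
print(concept_match("WOMAN", DOCS))  # all three relevant documents, no verb
```

Searching for the literal singular finds one document out of three, and searching for the literal plural pulls in the verb. Only the concept-level match returns exactly the relevant documents.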
This is just what Melingo’s Concept Search does. It is a module that enables search engines not to miss any of the inflections of each word, while overcoming ambiguity and not returning excess irrelevant results. In addition, Concept Search includes a complete Hebrew and/or Arabic thesaurus and an option for user-defined synonyms. Everything is morphologically enhanced, so that when a synonym is defined, all of its inflectional variants are captured as well.
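One way to picture morphologically enhanced synonyms is that a synonym is declared once, between concepts rather than between surface strings, so every inflection of either word is covered. The following is a minimal sketch under that assumption. The tables, the English placeholder tokens, and the function names are all hypothetical and do not describe the product's actual API.

```python
# Hypothetical analyzer output: surface form -> concept key.
# English placeholder tokens are used here purely for readability.
CONCEPT_OF = {
    "car": "CAR", "cars": "CAR",
    "automobile": "AUTOMOBILE", "automobiles": "AUTOMOBILE",
}

# A user-defined synonym is declared once, at the concept level.
SYNONYMS = {"CAR": {"AUTOMOBILE"}, "AUTOMOBILE": {"CAR"}}

def expand(query_token):
    """Return the query's concept together with its synonym concepts."""
    concept = CONCEPT_OF.get(query_token, query_token)
    return {concept} | SYNONYMS.get(concept, set())

def matches(query_token, doc_tokens):
    """True if any document token belongs to one of the expanded concepts."""
    wanted = expand(query_token)
    return any(CONCEPT_OF.get(tok, tok) in wanted for tok in doc_tokens)

print(matches("car", ["automobiles"]))  # synonym found through an inflected form
```

Because the synonym links concepts, no separate entry is needed for each inflected form of either word.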
How Concept Search works
Melingo’s Concept Search is a modular product that integrates seamlessly with leading enterprise search engines, such as Microsoft SharePoint and MS SQL Server, Oracle, Attivio, dtSearch and more, as well as the open-source search solutions Lucene and Solr.
Melingo’s module intervenes both at the indexing level and at the query level. At indexing time, all words are disambiguated and normalized to a single form that represents the essence of each word: the concept behind it, regardless of the surface form of the actual token. For example, Hebrew words as different as נחטף, חטיפתו, החוטפים, החטופים, ייחטפו, חוטפיהם all share the concept of “kidnapping”, and therefore will all be returned when the search query contains any word of this root, such as חטיפה. Similarly, in Arabic, when one searches for a form such as أمـوال (“money”), Concept Search will return forms such as مـال, المال, ممول, المالية, أموالنا. Rather than being treated as separate search terms, these forms are recognized as belonging to the same concept, money and financing. In real-life queries, which usually contain more than one word, this is even more crucial, since the number of legitimate surface forms of the query is the product of the numbers of inflections of the individual words.
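The two-stage idea (normalize at indexing time, then normalize the query the same way) can be sketched as follows. This is a simplified illustration, not the product's code: the `CONCEPT_OF` table, the KIDNAP label, and the function names are hypothetical. A plain dictionary lookup stands in for the morphological analyzer, so unlike the real module it cannot disambiguate a word by its context.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a morphological analyzer.
# The surface forms are the Hebrew examples from the text; the concept
# label is an assumption made for this sketch.
CONCEPT_OF = {
    "נחטף": "KIDNAP",
    "חטיפתו": "KIDNAP",
    "החוטפים": "KIDNAP",
    "החטופים": "KIDNAP",
    "ייחטפו": "KIDNAP",
    "חוטפיהם": "KIDNAP",
    "חטיפה": "KIDNAP",
}

def normalize(token):
    """Map a surface form to its concept key; unknown tokens map to themselves."""
    return CONCEPT_OF.get(token, token)

def build_index(docs):
    """Indexing time: file each document under the concept keys of its tokens."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[normalize(token)].add(doc_id)
    return index

def search(index, query_tokens):
    """Query time: normalize the query the same way, then intersect postings."""
    postings = [index.get(normalize(tok), set()) for tok in query_tokens]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: ["נחטף"],
    2: ["החוטפים", "ייחטפו"],
    3: ["נשים"],  # unrelated document
}
index = build_index(docs)
print(search(index, ["חטיפה"]))  # documents 1 and 2, although neither contains חטיפה
```

With concept keys, a two-word query needs two index lookups. Without them, a search engine would have to enumerate every combination of surface forms: with purely hypothetical counts of 300 and 500 inflections for the two words, that would be 300 × 500 = 150,000 variants.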
Contact us with any questions or demo requests.
Melingo’s Concept Search characteristics
Overcoming multiple meanings
Precise morphologic analysis
Search by sound - Soundex
Search by root and semantic family
Missing and full spelling
Name oriented search
Highlighting of searched words
Support for word combinations
Support for 'near' searches
Search including thesaurus
Example of a Concept Search
In the following example, the Wikipedia website was searched for the word ‘virus’. The first search was without the morphologic component, and the second was with Melingo’s morphologic component.
Without using the morphological element
In this example, 137 results were returned, all of them containing only the exact word ‘virus’. Results containing other forms of the word were not returned.