Concept Search – Intelligent context-sensitive morphological search
The Semitic languages (e.g. Hebrew, Arabic) have a very complex linguistic structure. This complex structure makes it a challenge to obtain high quality full text search on materials such as Internet sites, archives, and any database that includes textual information.
The complexity of Semitic languages stems from two main facts. First, a single word can have hundreds or even thousands of different inflections. Second, a single root may be manifested in different words that share the same semantic concept. This prevents regular search engines from identifying all the tokens of the search word, which may appear in very different forms in the text. Additionally, Semitic scripts do not regularly mark vowels, so texts contain a lot of ambiguity: a single word can often be interpreted in many ways, and only the context enables the reader to tell which interpretation is right. As a result, a regular search engine returns on average only around 50% of the relevant tokens, often much less, while some of the returned results are irrelevant due to lack of disambiguation.
An example of the challenge:
A user searches for the Hebrew word אישה (“woman”). The plural, “women”, is not only very different (נשים) but also happens to be ambiguous with an entirely different word: the first person plural future form of the verb “to put” is spelled exactly the same, נשים. The word נשים in the sense of “women” has its own inflections; e.g. ושנשי means “and that the women of”. So, to get all the relevant documents, and only the relevant documents, the search engine needs to address all of this complexity.
This is just what Melingo’s Concept Search does: it is a module that enables search engines not to miss any of the inflections of each word, while overcoming ambiguity and not yielding excess irrelevant results. In addition, Concept Search includes a complete Hebrew and/or Arabic thesaurus and an additional option for user-defined synonyms. Everything is morphologically enhanced, so that when a synonym is defined, it will capture all its inflectional variants.
How Concept Search works
Melingo’s Concept Search is a modular product that integrates seamlessly with leading enterprise search engines, such as Microsoft SharePoint and MS SQL Server, Oracle, Attivio, dtSearch and more, as well as the open source search solutions Lucene and Solr.
Melingo’s module intervenes both at the indexing level and at the query level. At indexing time, all words are disambiguated and normalized to a single form representing the essence of each word: the concept behind it, regardless of the superficial form of the actual token. For example, Hebrew words as different as נחטף, חטיפתו, החוטפים, החטופים, ייחטפו, חוטפיהם all share the concept of “kidnapping”, and therefore will all be returned when the search query contains any word of this root, such as חטיפה. Similarly, in Arabic, when one searches for a form such as أمـوال (“money”), Concept Search will return forms such as مـال, المال, ممول, المالية, أموالنا. Rather than being treated as separate searches, these forms are recognized as belonging to the same concept, money and financing. In real-life queries, which usually contain more than one word, this is even more crucial, since the number of different legitimate forms of the query is the product of the numbers of inflections of each word. For instance, a two-word query in which each word has 1,000 inflections corresponds to 1,000,000 possible surface combinations.
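The two-level idea described above can be illustrated with a minimal Python sketch. It is not Melingo’s implementation: the `LEXICON` table, the function names and the three sample documents are assumptions made up for illustration, and a hard-coded lookup table stands in for the real morphological analysis and context-based disambiguation. The point is only that, once tokens and query words are both mapped to the same concept, any form of the word retrieves documents containing any other form.

```python
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> concept label.
# A real analyzer would derive this through morphological analysis and
# disambiguation; a fixed table only stands in for that step here.
LEXICON = {
    form: "kidnapping"
    for form in ["נחטף", "חטיפתו", "החוטפים", "החטופים", "ייחטפו", "חוטפיהם", "חטיפה"]
}

def to_concept(token):
    """Map a token to its concept; unknown tokens are kept as they are."""
    return LEXICON.get(token, token)

def build_index(docs):
    """Indexing time: store each document under the concepts of its tokens."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[to_concept(token)].add(doc_id)
    return index

def search(index, query):
    """Query time: normalize the query the same way, then intersect postings."""
    concepts = [to_concept(token) for token in query.split()]
    if not concepts:
        return set()
    result = set(index.get(concepts[0], set()))
    for concept in concepts[1:]:
        result &= index.get(concept, set())
    return result

# Made-up sample documents, for illustration only.
docs = {1: "החוטפים נמלטו", 2: "חטיפתו הסתיימה", 3: "מזג האוויר נאה"}
index = build_index(docs)
hits = search(index, "חטיפה")  # documents 1 and 2, although neither contains this exact form
```

Because the same normalization runs on both sides, the index never needs to store every inflection of a word, and the query never needs to enumerate them.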
Melingo’s Concept Search characteristics
Overcoming multiple meanings
Semitic languages have many instances of words written the same way but which have completely different meanings. Melingo’s Concept Search overcomes this issue. A search for a word returns the instances where it appears in the intended sense, but not instances where the same spelling is a contextually irrelevant homonym.
Precise morphologic analysis
Each word of the text is morphologically analyzed, so a search for the word ‘doctor’, for example, will also pull up instances of ‘doctors’, ‘the doctor’, ‘their doctors’ etc. In other words, results will include documents containing all inflected forms of the word, in all tenses and with all suffixes, without homonyms, while properly treating varying spellings, verb conjugations and Soundex.
Search by sound - Soundex
Melingo’s Concept Search has a sound search option. This is particularly relevant when searching names – Auerbach will also give results including the name with different spellings, as will Mazda or other name searches. This capability includes specific extensions for treating names of non-Hebrew origin, such as Arabic and Persian names written with Hebrew letters.
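As a rough illustration of how a sound-based key groups spelling variants, here is a sketch of the classic English-language Soundex code in Python. This is an assumption for illustration only: the text does not say which algorithm Melingo uses, and names written in Hebrew letters would need a different letter table. The function name `soundex` and the spelling variants in the usage lines are likewise chosen here, not taken from the product.

```python
# Classic Soundex letter groups: letters in the same group share a digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character classic Soundex key of an ASCII name."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    key = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so the same code may be emitted again
    return (key + "000")[:4]  # pad with zeros and cut to four characters

# Different spellings of the same name receive the same key.
same_key = soundex("Auerbach") == soundex("Auerbakh") == soundex("Auerback")
```

A search engine can index such keys alongside the words themselves, so that a name query also matches documents that spell the name differently.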
Search by root and semantic family
Melingo’s Concept Search can identify words which are not conjugations of the same basic form, but share the same root and are semantically related. Use of the broadened search capability will allow a search for ‘journal’, for example, to give results including the words ‘journalist’, ‘journalism’ and ‘journals’. Searching for ‘photo’ will give results including ‘photographer’, ‘photography’ etc. The broadened search can be controlled by users according to their needs.
Melingo’s Concept Search can recognize the various conjugations of verbs, in all their tenses, thus allowing for the return of results including all those conjugations. For example, a search for the word ‘went’ will give results including ‘go’, ‘goes’ etc.
Missing and full spelling
Melingo’s Concept Search can switch between missing and full spelling, so searching one format will also give results which use the other.
Name oriented search
Melingo’s Concept Search gives higher precision when searching for names. When searching for the word ‘Barak’, for example, results will include all forms of the word, both in its meaning of lightning and as a name. But when making a name oriented search (a setting determined by the user) the results will only include instances where the word is used as a name. Intonation will also be broadened to include names.
Hebrew has many cases of words which can be spelled in more than one way. Melingo’s Concept Search allows the search engine to give results with all possible spellings.
Highlighting of searched words
Melingo’s Concept Search has the option of highlighting all the search terms in the results, including all their inflected forms, synonyms etc.
Support for word combinations
Melingo’s Concept Search supports conjugations of word combinations. For example, a search for ‘attorney at law’ will give results including the morphological conjugations of the combination as it appears in the text: ‘attorneys at law’ etc.
Supports 'near' searches
Melingo’s Concept Search supports near searches, when these are supported by the search engine. Near searches are used for searching for a number of words which appear in proximity to each other in a text.
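A near search combined with morphological normalization might look like the following Python sketch. The `BASE_FORMS` table, the `normalize` and `near` functions and the sample sentence are all hypothetical; in practice the proximity operator belongs to the host search engine, and the tiny lookup table here only stands in for real morphological analysis.

```python
# Hypothetical toy normalizer standing in for real morphological analysis.
BASE_FORMS = {"doctors": "doctor", "patients": "patient"}

def normalize(token):
    """Lowercase a token and map it to its base form when one is known."""
    token = token.lower()
    return BASE_FORMS.get(token, token)

def near(text, term_a, term_b, max_distance):
    """True if the two terms occur within max_distance words of each other."""
    tokens = [normalize(t) for t in text.split()]
    a, b = normalize(term_a), normalize(term_b)
    positions_a = [i for i, t in enumerate(tokens) if t == a]
    positions_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)

text = "The doctors examined all of their patients"
close_enough = near(text, "doctor", "patient", 5)  # the two words are 5 positions apart
too_far = near(text, "doctor", "patient", 3)
```

Distances are measured on the normalized tokens, so the singular query terms still match the plural forms found in the text.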
Search including thesaurus
Melingo’s Concept Search allows for the recognition of a wealth of synonyms, in all their formations. This is an ability based on dictionaries which automatically recognize these identities. Using a thesaurus opens up a much broader expanse of results, without losing the regular morphologic results. The product even allows one to broaden the thesaurus by combining it with a user dictionary adapted to the customer’s needs and the type of texts being searched.
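The combination of a built-in thesaurus, a user dictionary and morphological expansion can be sketched in Python as follows. All names and data here (`THESAURUS`, `USER_SYNONYMS`, `INFLECTIONS`, `expand`, `matches`) are assumptions for illustration, with English toy entries in place of the Hebrew and Arabic dictionaries the product relies on.

```python
# Hypothetical data: a built-in thesaurus, a customer-specific user
# dictionary, and a toy table of inflected forms for each base word.
THESAURUS = {"doctor": {"physician"}}
USER_SYNONYMS = {"doctor": {"medic"}}
INFLECTIONS = {
    "doctor": {"doctor", "doctors"},
    "physician": {"physician", "physicians"},
    "medic": {"medic", "medics"},
}

def expand(term):
    """Return the term, its synonyms from both sources, and all their forms."""
    bases = {term} | THESAURUS.get(term, set()) | USER_SYNONYMS.get(term, set())
    forms = set()
    for base in bases:
        # Words missing from the toy table contribute only themselves.
        forms |= INFLECTIONS.get(base, {base})
    return forms

def matches(text, term):
    """True if the text contains any expanded form of the term."""
    return bool(expand(term) & set(text.lower().split()))

found = matches("Two medics arrived", "doctor")  # matched through the user dictionary
```

Here the user-defined synonym ‘medic’ is expanded to its plural in the same way as the built-in entries, which is what it means for the synonyms to be morphologically enhanced.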
Example of a Concept Search
In the following example the Wikipedia website was searched for the word ‘virus’. The first search was without the morphologic component, and the second was with Melingo’s morphologic component.
Without using the morphological element
In this example 137 results were returned, including only the word ‘virus’. Results in which other formations of the word appear were not returned.