ICA – Intelligent Content Analysis
Automatic text analysis that gives structure to unformatted data.
Melingo’s Intelligent Content Analysis (ICA) is a system built on algorithm-based text analysis and entity extraction tools. Given a text in Hebrew, Arabic or Persian, the system produces two outputs:
Complete analysis of the text – the system takes the input text and analyzes each word by its root, part of speech, the word combination it belongs to, prefix, tense, etc.
Textual entities found in the text – the system extracts the main entities appearing in the input text and sorts them into categories such as names, places, organizations and addresses, as well as non-verbal strings such as telephone numbers, car license numbers, credit card numbers, email addresses, URLs, etc.
The extracted entities are collected in a synopsis that lists them by type, subtype and number of appearances.
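As a rough illustration of what such a synopsis might contain, the Python sketch below groups a list of already-extracted entities by type and subtype and counts their appearances. The Entity record, its field names and the sample values are assumptions made for this example; they are not taken from the ICA API or from real ICA output.

```python
from collections import Counter
from typing import NamedTuple


class Entity(NamedTuple):
    """Hypothetical record for one extracted entity mention."""
    text: str
    entity_type: str
    subtype: str


def build_synopsis(entities):
    """Count mentions per (type, subtype, text), most frequent first."""
    counts = Counter((e.entity_type, e.subtype, e.text) for e in entities)
    # Sort by descending count, then alphabetically for a stable listing.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Illustrative mentions only; not real ICA output.
mentions = [
    Entity("Haifa", "place", "city"),
    Entity("Melingo", "organization", "company"),
    Entity("Haifa", "place", "city"),
]

synopsis = build_synopsis(mentions)
for (entity_type, subtype, text), count in synopsis:
    print(f"{entity_type}/{subtype}: {text} ({count})")
```

Keying the counter on the full (type, subtype, text) triple keeps entities with the same spelling but different categories on separate rows.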
ICA is exposed as an open interface under Windows for .NET, C++ and Java. It is in effect an API that lets the user make broad and flexible use of its output and is easy to integrate into existing software.
UDK – User Defined Keyword
The UDK component is an add-on that lets the customer add and enrich organizational categories for entity extraction, in the form of a category dictionary that the customer adapts and manages. With it, words or names can be assigned to new categories or added to existing ones.
For example, the user can define the word ‘lily’ as an entity in the existing category ‘weapons’, or as an entity in a new category defined according to their needs, for instance ‘flowers’ or ‘plants’.
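The idea of a customer-managed category dictionary can be sketched in Python as follows. The class KeywordDictionary and its methods are hypothetical names invented for this illustration and do not reflect the actual UDK interface; the sketch also matches words exactly, whereas ICA is described as matching morphological variants as well.

```python
from collections import defaultdict


class KeywordDictionary:
    """Hypothetical customer-managed dictionary mapping categories to keywords."""

    def __init__(self, built_in=None):
        self._categories = defaultdict(set)
        for category, words in (built_in or {}).items():
            self._categories[category].update(w.lower() for w in words)

    def add_keyword(self, word, category):
        """Assign a word to a category, creating the category if needed."""
        self._categories[category].add(word.lower())

    def categories_of(self, word):
        """Return the sorted list of categories that contain the word."""
        word = word.lower()
        return sorted(c for c, words in self._categories.items() if word in words)


# "weapons" stands in for an existing category; "flowers" is created on demand.
udk = KeywordDictionary(built_in={"weapons": {"rifle"}})
udk.add_keyword("lily", "weapons")
udk.add_keyword("lily", "flowers")

print(udk.categories_of("lily"))  # prints ['flowers', 'weapons']
```

A word may belong to several categories at once, which is why the lookup returns a list and not a single category.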
The organizational lexicon component is another add-on to the ICA system, one that directly affects the resulting analyses. The customer can influence how homonyms are resolved by scoring the candidate results and giving the desired result a higher score.
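A minimal sketch of score-based homonym selection, assuming the candidate analyses are already available and the customer's lexicon is a simple mapping from analysis to score. The function pick_analysis, the score values and the candidate labels are all hypothetical; the real component presumably combines such scores with contextual analysis, which this sketch leaves out.

```python
def pick_analysis(candidates, lexicon_scores, default_score=0.0):
    """Return the candidate analysis with the highest customer-assigned score."""
    # max() keeps the first candidate when scores tie, so input order is the fallback.
    return max(candidates, key=lambda c: lexicon_scores.get(c, default_score))


# Two candidate readings of the same surface form (illustrative labels).
candidates = [("barak", "noun"), ("Barak", "proper name")]

# Hypothetical organizational lexicon: the customer prefers the name reading.
lexicon_scores = {("Barak", "proper name"): 2.0, ("barak", "noun"): 1.0}

chosen = pick_analysis(candidates, lexicon_scores)
print(chosen)  # prints ('Barak', 'proper name')
```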
Characteristics of Melingo’s ICA
Recognizes entities from a wide world of concepts
ICA can identify central entities from many built-in categories without the need for manual definitions. Among them: names of countries, cities and people, medical terms, weapons, names of organizations and more.
Use of morphology
Concepts are identified in the text even when they appear in different conjugations, spellings and forms, so that the central concepts of the text are recognized according to their context.
Overcoming multiple meanings
The system analyzes the text precisely and resolves words that have multiple meanings. For instance, the noun ‘barak’ (Hebrew for ‘lightning’) is identified and analyzed differently from the name ‘Barak’.
ICA can be personalized and adapted to the customer's needs and content domain, giving preference to concepts from that domain. The customer can also define new concepts to be identified as needed.
Support for many programming languages
The system works as an API with .NET, Java and C++ wrappers, so it is easy to integrate with systems written in these languages.
Implemented in large systems
ICA is currently deployed successfully in large systems.
Possible uses for ICA
Analysis and comprehension of texts
Automatic extraction of keywords from the entire document
Cataloguing and labeling of documents
Identification of business opportunities - identification of texts dealing with a particular product
Formatting unformatted data
Integration of ICA in the search/indexing process
Example of Melingo’s ICA function
The following example illustrates the entity extraction capability of Melingo’s ICA. An article was analyzed: the textual entities it contains were highlighted, their appearances were counted, and they were grouped into categories by subject.
The following table shows an example of the text’s analysis. The left column lists individual words from the article, and the remaining columns give the analysis of each word: its part of speech, base form, etc.