Carabao Language Kit

A customizable language construction framework
Publisher: Digital Sonata Pty Ltd
Category: Language
License: freeware
Cost: $0
Size: 115.14 MB
Updated: 17 Aug 2009
Carabao is a family of multipurpose linguistic tools. It provides the following capabilities:

* Sense disambiguation
* Detailed, sentence-by-sentence domain extraction
* Deep morphological analysis and synthesis
* Automatic linguistic profiling
* Idiom extraction
* Universal measure conversion
* Transliteration between scripts
* Machine readability evaluation of texts
* Automatic translation between languages

The most distinctive feature of Carabao is that its source code is completely abstracted from linguistic logic. All the linguistic logic resides in a database that comes with a powerful GUI data editor. Removing the linguistic logic from the source code achieves several goals:

* Separation of tasks between software developers and linguists
* Faster and more reliable development of new linguistic engines, without requiring the participation of IT people
* Ease of programmatic use and customization

Version: May 2010
* Handling of control priority greater than 2 when some of the members have no feasible agreement graph. As a result, some parts of the sequence worked and some didn't.
* Truncation of very long sentences

* A utility to validate and correct rule unit values
* Generic support for processing formatted text (e.g. HTML, XML, SGML), including formatting elements embedded in the text flow
* Automatic conversion of double-byte space characters into standard single-byte spaces
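
The space conversion mentioned above can be illustrated with a minimal sketch. It assumes that "double-byte space" refers to the ideographic space U+3000 common in CJK text; the function name is hypothetical and is not part of Carabao's API.

```python
# Hypothetical helper for illustration only; not Carabao's actual code.
# Assumption: the "double-byte space" is the ideographic space U+3000.
IDEOGRAPHIC_SPACE = "\u3000"


def normalize_spaces(text):
    """Replace every ideographic (full-width) space with an ASCII space."""
    return text.replace(IDEOGRAPHIC_SPACE, " ")


if __name__ == "__main__":
    sample = "東京\u3000タワー"
    print(normalize_spaces(sample))
```

Normalizing spaces before tokenization lets later stages treat a single character as the word separator.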

* Regular expressions for segmentation into character classes for double-byte languages
* Perl-compatible regular expressions have been introduced for unknown heuristics
* Frequency-based backtracking added to the tokenization algorithm
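
As a rough illustration of segmenting text into character classes with regular expressions, here is a minimal sketch. The Unicode ranges and the function name are assumptions chosen for the example; the patterns Carabao actually stores in its database are not documented here.

```python
import re

# Assumed character classes for illustration; each alternative matches
# a maximal run of characters from a single class.
CLASS_RUNS = re.compile(
    "[\u4e00-\u9fff]+"    # CJK unified ideographs (Han)
    "|[\u3040-\u309f]+"   # Hiragana
    "|[\u30a0-\u30ff]+"   # Katakana
    "|[A-Za-z]+"          # Latin letters
    "|[0-9]+"             # ASCII digits
)


def split_by_character_class(text):
    """Return the runs of same-class characters found in text, in order."""
    return CLASS_RUNS.findall(text)


if __name__ == "__main__":
    print(split_by_character_class("東京タワーは333m"))
```

This only splits on class boundaries. Dividing a run of a single class into words would need a further step, such as the dictionary-driven tokenization the changelog mentions.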
Version: Feb 2009
* Regression: "phantom capitalization" of re-used words
* Regression: sequence style forcing / avoiding
* Repositioning errors in sentences with attached tokens
* Sequence processing in languages that do not use white space

* Lattice-based processing for speech recognition an
Version: Dec 2008
* Handling of single quotes as syntax delimiters in English

* A segmentation mode more effectively handling languages that don't use white spaces (e.g. Chinese, Japanese, Korean, Thai). In this mode, different character classes are broken into tokens (e.g. Chinese, and t
Version: Sep 2008
* Unknown patterns were translated as hypernyms
* Regression: certain category-based sequences were omitted on second execution because of a malfunctioning guess scan caching mechanism
* In analytical mode (Carabao DeepAnalyzer), there was a mismatch between word index number and an idiom member index in sentences with attached tokens such as 'em, 'm
* When copying a token with one rule unit or fewer, the text is always reset to the original

* Capability to match numbers as patterns
* When a translation is not found, the engine tries to fall back to a matching hypernym instead
* New methods to Carabao DeepAnalyzer that enable accessing the members of the detected idioms
* New methods to Carabao CDA that enable accessing the unknown heuristics table
* New sequences
* Russian morphological exceptions
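
The hypernym fallback listed above can be sketched as follows. The dictionaries and the function name are hypothetical toy data invented for the example; Carabao keeps its linguistic data in a database, and its actual lookup logic may differ.

```python
# Hypothetical toy data for illustration only.
TRANSLATIONS = {"dog": "perro", "animal": "animal"}
HYPERNYMS = {"poodle": "dog", "dog": "animal"}


def translate_with_fallback(word):
    """Translate word, falling back to its nearest translatable hypernym.

    Returns None when neither the word nor any of its hypernyms
    has a translation.
    """
    seen = set()  # guards against cycles in the hypernym table
    current = word
    while current is not None and current not in seen:
        if current in TRANSLATIONS:
            return TRANSLATIONS[current]
        seen.add(current)
        current = HYPERNYMS.get(current)
    return None


if __name__ == "__main__":
    print(translate_with_fallback("poodle"))
```

In this sketch, "poodle" has no translation of its own, so the lookup climbs to "dog" and returns its translation.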

* If an "unknown pattern" is forced to match a known word, it will not create a new guess if a guess with the same hypernym already exists. For example, if you force a check of whether a known word can be a city, no new record is created if a guess with a known city already exists
* Automatic input language switching in locator fields
* Locator fields are pre-filled with the list of all existing languages in the database, eliminating the need to jump to the next language
Version: Mar 2008
* Crash when using sequence extraction option (regression from

* Capability to import sequences by data entry directly from the Sequence Sheet
* Capability to manually set sequence descriptions
* Some sequences

* Processing speed and memory consumption - further boost
* Token GUI

* Volatility of newly assigned rule units in late sequences
* Inconsistencies in the generation of inflected forms in design time

* All (or nearly all) the Russian morphological exceptions - over 1,000 new prefixes
* Friendly GUI of meta-rules such as lemmatized forms and generation of inflected forms
* MorphoLogic now inspects the design time data generation meta-rules when generating inflected forms

* Processing speed and memory consumption
* Increased maximum length of the meta-rule content field
* Increased the size of some fields to accommodate large sequences and large amounts of grammatical data

* Various tagging problems
* A bug in the priority setting of mid-sentence sequences

* A button to tag new entries morphologically
* A handful of commonly used business entities (e.g., address, phone, fax, business hours)

* Accuracy of sequences
* Domains

* Inflection generation problems of TagLemma results (words not in the dictionary) in Carabao MorphoLogic

* Capability to inspect other guesses. For example, in a sequence like "adverb" + "adverb", it is possible to quickly discard the entire sequence if the second adverb can be a preposition
* Comprehensive morphology of the Russian language

* Removed descriptions of negative constraint elements (those that do not have an identity) from sequences to make the descriptions less cluttered
* Performance of sequence processing
* Accuracy of sequences
* Domains reviewed

* Various validation problems with attached tokens
* Lookup windows are no longer maximized on opening
* Incorrect tooltips after deletion in the dictionary table

* GUI support for negative constraints in sequences
* Handling of irregular 'smart quotes' in Translation Console
* Manual disambiguation table in Carabao Linguist Edition
* Style tags to the tooltips in the dictionary table

* Supplied sequences
* In the translation console, the original thesaurus article is suppressed when the word is part of an idiom, to prevent confusion

Carabao Language Kit has been released

download (carabaoFree.exe - 115.14 MB)