Multilingual Semantic Similarity of Unstructured Enterprise Content
To communicate with their end users, businesses regularly produce written documents such as letters, notices, and statements in various languages. A set of rules is usually used to ensure that the information in these documents is 'correct' and consistent across languages and communication channels. However, as the volume and variety of information sent to clients increase, it becomes difficult to preserve the semantics of client messages across variations in vocabulary and language. State-of-the-art algorithms that address this problem rely on a translation step, which adds overhead and makes the entire pipeline less practical to use.
This project aims to create algorithms that measure the semantic similarity between pieces of unstructured enterprise content with less overhead than translation-based pipelines, regardless of the natural language in which each document is written. The set of similarity algorithms must scale with the size of the corpus being used.
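
As an illustration only, one common way to compare documents without translating them is to map each document into a shared vector space and compare the vectors directly. The sketch below assumes such vectors already exist; the project description does not specify an encoder or a similarity measure, so the cosine measure, the function name, and the three hard-coded vectors are all hypothetical placeholders rather than the project's method.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; treat it as dissimilar to everything.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical placeholder vectors. In practice they would come from some
# language-agnostic document encoder, which is an assumption of this sketch.
notice_en = [0.9, 0.1, 0.3]      # e.g. an English payment notice
notice_fr = [0.8, 0.2, 0.3]      # e.g. the same notice written in French
statement_en = [0.1, 0.9, 0.0]   # e.g. an unrelated English statement

same_message = cosine_similarity(notice_en, notice_fr)
different_message = cosine_similarity(notice_en, statement_en)
print(same_message > different_message)
```

Under these assumptions, no translation step is needed at comparison time, and each pairwise comparison costs time linear in the vector dimension. Scaling to a large corpus would additionally require some form of indexing, which this sketch does not attempt.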