Information extraction from cross-lingual short text

The main goal of this proposed project is to develop CL-BTM, a novel bilingual topic model that explicitly models cross-lingual word co-occurrence in document-aligned comparable data using a novel merging and shuffling strategy. Given a document-aligned multilingual corpus, CL-BTM can be used to extract the latent cross-lingual topics that best describe the observed data, yielding topic distributions shared across languages together with language-specific per-topic word distributions for each language. Ideally, these hierarchical representations of text would be well suited to text understanding and classification. As a further application, topic coherence and the correlations between entities in a document can be extracted accurately by jointly modeling and exploiting context compatibility, using both local information (represented as biterms) and global knowledge (topic knowledge) in a knowledge base.
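
To make the idea of cross-lingual biterms concrete, the following is a minimal illustrative sketch, not the project's actual method. It assumes that "merging and shuffling" means pooling the tokens of two aligned documents into one language-tagged sequence, randomly permuting it, and then collecting unordered word pairs (biterms) that fall within a window. The function names, the `L1`/`L2` language tags, the `window` parameter, and the sample tokens are all hypothetical.

```python
import random


def merge_and_shuffle(doc_a, doc_b, seed=0):
    """Pool the tokens of two aligned documents and shuffle them.

    Each token is tagged with a hypothetical language label ("L1" or "L2")
    so that words from the two languages stay distinguishable.
    """
    merged = [("L1", w) for w in doc_a] + [("L2", w) for w in doc_b]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rng.shuffle(merged)
    return merged


def extract_biterms(tokens, window=None):
    """Return unordered pairs of distinct tokens.

    If `window` is None, every pair in the sequence is a biterm;
    otherwise only tokens at most `window` positions apart are paired.
    """
    biterms = []
    n = len(tokens)
    for i in range(n):
        end = n if window is None else min(n, i + window + 1)
        for j in range(i + 1, end):
            if tokens[i] != tokens[j]:
                biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


def is_cross_lingual(biterm):
    """True when the two words of a biterm carry different language tags."""
    (lang_a, _), (lang_b, _) = biterm
    return lang_a != lang_b


# Toy aligned pair (made-up tokens, for illustration only).
english = ["topic", "model"]
french = ["sujet", "modele"]
sequence = merge_and_shuffle(english, french)
pairs = extract_biterms(sequence)
cross_pairs = [b for b in pairs if is_cross_lingual(b)]
```

Under this assumption, shuffling matters mainly when a finite `window` is used: it mixes the two languages within the merged sequence so that windowed biterms can pair words across languages rather than only within one language.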

Faculty Supervisor:

Yong Zeng

Student:

Partner:

Huazhong University of Science and Technology

Discipline:

Computer science

Sector:

Education

University:

Concordia University

Program: