Topic Segmentation for Text Mining on Legal Documents
Text mining is the process of automatically extracting knowledge from unstructured, natural language documents. It aims to support users in dealing with large amounts of textual information. Examples of specific text mining tasks are entity detection, summarization, and opinion mining. Because natural language is complex and ambiguous, the analysis is broken down into individual processing steps that draw on techniques from machine learning, natural language processing, and semantic computing.
The goal of this project is to extend the text mining pipelines developed at KeaText for the processing of legal documents. Specifically, the analysis is to be enriched with a topic segmentation module tailored to the domain and application requirements. Automatic topic segmentation, sometimes called text tiling after the TextTiling algorithm, divides a document into contiguous parts, each covering a distinct theme. Topic segmentation is known to improve several information retrieval and text analysis tasks. The project comprises the following tasks: (1) a survey of existing research literature to identify suitable methods and tools; (2) the design of a new topic segmentation algorithm specifically for legal documents; and (3) the implementation and evaluation of this algorithm based on the General Architecture for Text Engineering (GATE) framework.
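To make the idea of topic segmentation concrete, the following is a minimal sketch of a lexical-cohesion segmenter in Python. It is not the algorithm to be designed in this project, and it does not use GATE; it is a simplified, threshold-based variant of the text tiling idea. The function names, the stopword list, the window size, the threshold, and the sample clauses are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {
    "the", "a", "an", "of", "to", "is", "on", "by", "in", "be",
    "shall", "may", "must", "this", "each", "and", "or",
}

def bag_of_words(sentences):
    """Count lower-cased, non-stopword tokens over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def segment(sentences, window=2, threshold=0.1):
    """Return each index i at which a new segment is assumed to start.

    The gap before sentence i is a boundary when the vocabulary of the
    `window` sentences before it and the `window` sentences from it onward
    have a cosine similarity below `threshold`.
    """
    boundaries = []
    for i in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, i - window):i])
        right = bag_of_words(sentences[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries

# Hypothetical contract clauses: three about rent, two about termination.
clauses = [
    "The tenant shall pay the rent monthly.",
    "The rent is due on the first day of each month.",
    "The tenant shall pay rent to the landlord.",
    "Either party may terminate this agreement by written notice.",
    "Notice of termination must be given by either party in writing.",
]

boundaries = segment(clauses)
print(boundaries)  # [3]
```

In this sketch, a boundary is reported before the fourth clause, where the vocabulary shifts from rent to termination. A fixed threshold is the simplest possible decision rule; a segmenter for real legal documents would likely need domain-specific features and a more robust boundary criterion.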