Improving textual summarization of source code using Latent Dirichlet Allocation (LDA)

To change large software systems correctly, software developers must communicate efficiently and effectively about the intended change and must perform the associated technical work precisely and completely. Currently, as developers collaborate with each other and interact with the many artifacts involved in a software change task, they must work frequently and intensively with highly detailed information, such as long discussions in bug reports and the many lines of code associated with the change. Dealing with all of these details all of the time raises the complexity of an already difficult task, increasing both the time it takes to complete the task and the likelihood of introducing errors. The proposed project is part of a research program that aims to improve software developers' productivity and the quality of their work by enabling them to work in terms of tasks rather than being continually mired in the details of each task.

As part of this research program, we are developing approaches for summarizing the various artifacts involved in a software change task, such as bug reports and source code. Our goal is to raise the level of abstraction of much of a developer’s work by allowing developers to interact with project artifacts in terms of automatically generated summaries, delving into the details of the artifacts only when needed. This particular project will involve enhancing an abstractive summarization approach we have recently developed for summarizing crosscutting source code. In our approach, we populate an ontology describing the crosscutting concern, identify patterns in that ontology, and generate a natural language summary of the crosscutting code based on the patterns identified. We plan to investigate enhancing this ontology with the results of applying Latent Dirichlet Allocation (LDA) to identify topics both in the crosscutting code being summarized and in the source code for the entire system. We will investigate how this topic analysis can be used to improve a summary generated from the ontology by increasing the ranking of patterns we identify in the ontology. We will evaluate the improvement using human judges.
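To make the idea of raising a pattern's ranking with topic information concrete, here is a minimal Java sketch of one possible scheme. It is not the project's design: the class name TopicBoostedRanking, the additive scoring formula, the boost parameter, and the representation of patterns as sets of terms are all assumptions made for illustration. The topic term weights are assumed to have been produced beforehand by some LDA implementation; no LDA is run here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from the project itself.
final class TopicBoostedRanking {

    /** Sums the topic weights of the terms a pattern mentions; unknown terms count as 0. */
    static double topicAffinity(Set<String> terms, Map<String, Double> topicTermWeights) {
        double sum = 0.0;
        for (String term : terms) {
            sum += topicTermWeights.getOrDefault(term, 0.0);
        }
        return sum;
    }

    /**
     * Returns pattern names ordered from highest to lowest boosted score, where
     * boosted score = base score + boost * topic affinity (an assumed formula).
     */
    static List<String> rerank(Map<String, Double> baseScores,
                               Map<String, Set<String>> patternTerms,
                               Map<String, Double> topicTermWeights,
                               double boost) {
        // Compute every boosted score once, before sorting.
        Map<String, Double> boosted = new HashMap<>();
        for (Map.Entry<String, Double> entry : baseScores.entrySet()) {
            Set<String> terms =
                    patternTerms.getOrDefault(entry.getKey(), Collections.<String>emptySet());
            boosted.put(entry.getKey(),
                    entry.getValue() + boost * topicAffinity(terms, topicTermWeights));
        }
        List<String> ranked = new ArrayList<>(baseScores.keySet());
        // Sort in descending order of boosted score.
        ranked.sort((a, b) -> Double.compare(boosted.get(b), boosted.get(a)));
        return ranked;
    }
}
```

With boost set to 0 the original ordering by base score is preserved, so the parameter gives a simple way to compare summaries generated with and without the topic information.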

The student will be involved in the design of how we enhance our abstractive summarization approach with LDA and in the implementation of the designed enhancements in Java. If time permits, the student will also be involved in the evaluation with human judges. Our goal is to have a publishable result at the end of the project.

Faculty Supervisor:

Dr. Gail Murphy


Kalyana Sundaram



Computer science


Information and communications technologies


University of British Columbia


Globalink Research Internship
