Hate speech and toxicity pose a serious threat in online spaces, particularly for marginalized communities. Detecting and preventing harmful speech in online games is challenging, and current methods lack transparency and reliability. To address this issue, this project aims to develop a robust and trustworthy toxicity detection model for in-game chat. On the robustness side, we will iterate on an existing context-aware toxicity detection model to address four main areas: rare categories, continuous learning, adversarial learning, and human-in-the-loop.