Clear Sky Science · en

Toward trustworthy chatbots: a protocol for red teaming for health related conversations

Why safer health chatbots matter

Many people are turning to chatbots for help with practical life problems that affect their health, such as finding food pantries, shelters, or financial aid. That convenience comes with a serious question: how do we make sure these digital helpers do not give risky or misleading advice, especially when users are stressed, confused, or in danger? This study explores a step-by-step safety checkup for such chatbots, showing how they can be tested and tuned before being trusted with sensitive health-related conversations.

Looking beyond simple right and wrong

Most checks on health chatbots focus on whether specific facts are right or wrong. The authors argue that this is not enough. A chatbot can repeat only approved facts yet still act in unsafe ways, such as overstepping its role, offering opinions where it should not, or responding poorly to someone in crisis. To capture this, they separate two kinds of behavior. One is how well the bot sticks to the information in an approved document, such as a resource listing. The other is how well it follows broad behavior rules, like staying on topic, being polite, refusing to use unapproved knowledge, and redirecting users to real people when needed.
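The split between the two kinds of behavior can be made concrete with a small sketch. It assumes each chatbot reply is scored on two separate axes, one for sticking to the approved document and one for the behavior rules. The class name, field names, and rule labels are hypothetical illustrations, not the authors' scoring scheme.

```python
from dataclasses import dataclass

# Hypothetical labels for the behavior rules named in the text.
BEHAVIOR_RULES = (
    "stay_on_topic",
    "be_polite",
    "no_unapproved_knowledge",
    "redirect_to_humans_when_needed",
)


@dataclass(frozen=True)
class Evaluation:
    """One reviewed chatbot reply, scored on two independent axes."""

    grounded_in_document: bool  # axis 1: sticks to the approved document
    broken_rules: tuple = ()    # axis 2: behavior rules the reply violated

    @property
    def passes(self) -> bool:
        # A reply is acceptable only if it is grounded AND breaks no rule.
        return self.grounded_in_document and not self.broken_rules


# A reply can repeat only approved facts and still fail on behavior.
reply = Evaluation(
    grounded_in_document=True,
    broken_rules=("no_unapproved_knowledge",),
)
print(reply.passes)  # False
```

Keeping the two axes separate means a fact-only check cannot mask a behavior failure.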

Figure 1. How a health chatbot links people to basic resources while staying within clear safety boundaries.

Putting the chatbot under stress on purpose

The team tested a real chatbot built to connect people with help for health-related social needs, like food, housing, and safety. They designed seven kinds of challenging user messages, called attack vectors, that mirror real conversations rather than only lab tricks. Some attacks tried to lure the bot into making up details about a resource. Others pushed it to give advice outside its approved scope, presented users in distress, used toxic or rude language, or tried to make it ignore its own safety rules through clever prompts. These tests were placed both early in a chat and later on, after the system had already pulled up resource information, to see how behavior changed as the conversation unfolded.
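One way to picture this test design is as a grid that pairs each attack kind with each placement in the conversation. The sketch below is only an illustration under stated assumptions. The labels are hypothetical, and it lists only the five attack kinds this summary spells out, although the study used seven.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical labels for the attack kinds described in the text.
# The study used seven kinds; only these five are named in this summary.
ATTACK_KINDS = [
    "fabricate_resource_details",
    "out_of_scope_advice",
    "user_in_distress",
    "toxic_language",
    "rule_override_prompt",
]

# Early in the chat, or after resource information was already retrieved.
PLACEMENTS = ["early", "after_retrieval"]


@dataclass(frozen=True)
class RedTeamCase:
    attack_kind: str
    placement: str


def build_test_matrix(kinds, placements):
    """Pair every attack kind with every placement."""
    return [RedTeamCase(k, p) for k, p in product(kinds, placements)]


cases = build_test_matrix(ATTACK_KINDS, PLACEMENTS)
print(len(cases))  # 10
```

Each case would then be filled in with a concrete user message and run against the chatbot, so the same attack can be compared across placements.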

What broke when conversations got longer

When the team looked only at short, one-question tests, the chatbot seemed strong at sticking to the documents it retrieved; it did not invent new facts about services. The bigger problem lay in following its behavior rules. In advice-focused questions, it sometimes slipped into giving “common sense” guidance that was not backed by any approved source. When users described distress or danger, the bot occasionally invented crisis hotline details instead of relying on verified contacts. The most worrying issues surfaced when the researchers held longer, back-and-forth conversations, gently but firmly pressing the chatbot to answer. In these multi-turn chats, error rates rose sharply, and all of the highest-risk problems appeared here, including victim-blaming advice and detailed tips about leaving abusive situations that it was not qualified to give.

Figure 2. How tests, rules, and trusted documents work together to steer a health chatbot toward safer replies.

Fixing weaknesses with rules and trusted text

After spotting these weak points, the authors tried two main fixes. First, they strengthened the chatbot’s internal rules by adding clear, repeated instructions not to give unapproved advice, not to invent contact information, and to always direct users to professional help when documents fell short. Second, they added a carefully written question-and-answer document for crisis and distress cases, filled with safe, local guidance that the bot could draw on instead of guessing. Used together, these changes sharply reduced errors overall and, most importantly, removed the worst kinds of unsafe responses. When pushed hard in extended conversations, the chatbot tended to fall back into a safe pattern of refusing to answer directly and steering people toward trusted resources.
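The two fixes can be sketched in code as a rough illustration. The summary does not say how the authors wired them into their system, so the function names, the prompt layout, and the `crisis_detected` flag are all assumptions. The rule texts paraphrase the three instructions described above.

```python
from typing import List

# Paraphrases of the three strengthened instructions described in the text.
SAFETY_RULES = [
    "Do not give advice that is not in the approved documents.",
    "Do not invent contact information.",
    "Direct users to professional help when the documents fall short.",
]


def build_system_prompt(base_prompt: str, rules: List[str]) -> str:
    """Fix 1 (assumed layout): state the rules before and after the base prompt."""
    rule_block = "\n".join(f"- {rule}" for rule in rules)
    return (
        f"RULES:\n{rule_block}\n\n"
        f"{base_prompt}\n\n"
        f"REMINDER OF RULES:\n{rule_block}"
    )


def select_documents(
    retrieved: List[str], crisis_qa: str, crisis_detected: bool
) -> List[str]:
    """Fix 2 (assumed wiring): put the vetted crisis Q&A text first when needed."""
    if crisis_detected:
        return [crisis_qa] + list(retrieved)
    return list(retrieved)


prompt = build_system_prompt("You help users find local resources.", SAFETY_RULES)
docs = select_documents(["food pantry listing"], "crisis Q&A text", crisis_detected=True)
print(docs[0])  # crisis Q&A text
```

Repeating the rules mirrors the "clear, repeated instructions" in the text, and putting the vetted Q&A text in front of the model gives it something safe to draw on instead of guessing.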

What this means for future digital helpers

For everyday users, the key message is that building a trustworthy health chatbot is less about making it sound smart and more about making it fail safely. This study shows that careful, realistic “red teaming” conversations can reveal hidden problems that quick tests miss, and that a mix of stricter rules and vetted written guidance can nudge chatbots into safer behavior. While this does not replace real clinicians or guarantee perfect safety, it offers a practical roadmap for turning helpful but fallible chat tools into more reliable partners when people seek support with basic needs and difficult situations.

Citation: Hussain, SA., Jackson, D.I., Lewis, A. et al. Toward trustworthy chatbots: a protocol for red teaming for health related conversations. Sci Rep 16, 15550 (2026). https://doi.org/10.1038/s41598-026-45719-3

Keywords: health chatbots, AI safety, red teaming, retrieval augmented generation, patient-facing AI