Ringr.ai logo, an artificial intelligence platform specialized in call automation to enhance customer service in various business sectors.
Ringr.ai logo, an artificial intelligence platform specialized in call automation to improve customer service across various business sectors.

Test now

Can AI keep your secrets?

Nov 12, 2025

Rubén Castillo Sánchez

Abstract illustration representing confidentiality and data security in artificial intelligence systems.
Abstract illustration representing confidentiality and data security in artificial intelligence systems.
Abstract illustration representing confidentiality and data security in artificial intelligence systems.

In a world where automation through artificial intelligence is becoming the norm, language models (LLMs) seem to be the ideal solution for managing complex tasks. From customer service to data validation, these systems process information smoothly and quickly. However, there is a latent danger that many companies are overlooking: the ease with which LLMs can be manipulated to access and reveal sensitive information.

What is a jailbreak in an LLM?

The term jailbreak refers to the process of manipulating a system to perform tasks or reveal information that should not be accessible. In the case of LLMs, this may include the unauthorized exposure of confidential data or the evasion of security measures. Attackers can exploit the weaknesses of LLMs through various techniques.

One of the greatest risks of a jailbreak in LLMs is the leakage of sensitive data. An attacker could, for example, manipulate the flow of the conversation to induce the model to reveal passwords, credit card numbers, personal data, or confidential information that should be protected. Since LLMs lack an internal capability for identity verification or context, any ambiguity or manipulation in the inputs can lead to the model providing unauthorized access to this data.

Even if the model is configured not to store this information long-term, system failures or weaknesses can allow attackers to easily access data that, under normal conditions, should only be accessible after proper verification.

The ease of manipulating the models

One of the biggest concerns with LLMs is the ease with which they can be manipulated. Jailbreak techniques in these models are surprisingly quite simple. By injecting prompts, an attacker can guide the model to get answers to questions or requests that would normally be restricted, such as accessing unauthorized databases or revealing confidential information that the model should keep secret.

This process does not necessarily require advanced technical knowledge. By only slightly manipulating the inputs or altering the instructions given to the model, an attacker can cause the system to reveal data it should not. Even ambiguous signals or confusing stimuli in the inputs are sufficient to alter the model's behavior, leading to the exposure of sensitive data beyond its scope.

For example, an attacker could modify a prompt to make the model reveal details of a bank account, even though the information is supposedly restricted to verified users. Alternatively, they could insert an implicit message that causes the model to "forget" certain security boundaries and reveal personal information without conducting the proper verification.

Even the newest and most advanced models, like GPT-5, which have much stronger safeguards, are not immune to these attacks. Less than 24 hours after its release, ways to bypass its defenses had already been found, demonstrating that despite efforts to improve security, jailbreaks remain a significant threat (see article).

At Ringr, aiming to raise awareness about the risks associated with the misuse of AI, we have created this GPT for you to attempt to obtain sensitive information, such as the verification code or debt. If you try with GPT-4o (still operational in many production applications), you will find that it is surprisingly easy to make it leak the code. However, with GPT-5, due to its reinforced safeguards, this process is much more difficult and requires much more complex techniques to attempt to breach it.

How to mitigate the risk of leakage

The leakage of sensitive data through language models is a significant threat, but it can be mitigated by implementing appropriate security measures. Below are some key practices to reduce these risks and ensure that AI is used safely and control.

1. Limit the exposure of sensitive information

Preventing LLMs from processing confidential data is the first and most important mitigation measure. We should never allow the model to handle personal data, passwords, or any sensitive information that should not be revealed as part of the management.

2. Implement external controls and validations

AI should be used as a support tool, never as the only barrier to accessing sensitive systems. Additional controls, such as identity verification and human validation, must ensure that the model can only operate in predefined scenarios and with the proper supervision.

3. Use auditing systems

Establishing an auditing system to monitor LLM interactions and detect any attempts of manipulation or leakage is essential. Detailed logs and monitoring tools can help identify suspicious behavior patterns and correct vulnerabilities before they become a serious problem.

4. Isolate sensitive data

In cases where necessary, traditional systems should manage sensitive data separately. This means that the LLM should only have access to non-confidential information, while any critical data should be processed through external and controlled mechanisms.

Conclusion: AI but with added security

LLMs offer great potential, but they also present significant risks when used without appropriate precautions. Jailbreak and data leakage are inherent vulnerabilities that we cannot overlook. While these models have the capacity to transform various processes, delegating tasks that involve sensitive information without the appropriate controls can have serious consequences. It is essential to implement robust security measures, restricting access to confidential data and combining artificial intelligence with human oversight and validation to ensure safe, responsible, and ethical use of this technology.

 

Try it yourself.

Custom designed, ready in 3 weeks, from €600 per month.

Try it yourself.

Custom designed, ready in 3 weeks, from €600 per month.

Try it yourself.

Custom designed, ready in 3 weeks, from €600 per month.

Try it yourself.

Custom designed, ready in 3 weeks, from €600 per month.