Guardrails

Guardrails provide safety mechanisms for user interactions by defining what constitutes good vs. bad user prompts and how the model should respond to inappropriate inputs.

Creating Guardrails

Guardrail Parameters

class Guardrail:
    def __init__(self, good_prompt: str, bad_prompt: str, bad_output: str)

  • good_prompt: Description of what makes a good prompt
  • bad_prompt: Description of what makes a bad prompt
  • bad_output: The response the model should give to bad prompts

Basic Guardrail Creation

# Create a guardrail
guardrail_english = mtp.Guardrail(
    good_prompt="Quote being spoken with 1-20 words",
    bad_prompt="Quote being spoken that is irrelevant and off topic with 1-20 words",
    bad_output="Are you as mad as me?"
)

Adding Samples to Guardrails

Guardrails require a minimum of 3 examples of bad prompts. No digits are allowed in the bad prompt examples.

# Add examples of bad prompts
guardrail_english.add_sample("explain quantum mechanics.")
guardrail_english.add_sample("who will win the next american election?")
guardrail_english.add_sample("what is the capital of Spain?")

add_sample() Parameters

def add_sample(self, sample: str)

  • sample: An example of a bad prompt that should trigger the guardrail. Must be a non-empty string without digits.

Applying Guardrails to Instructions

Guardrails are applied directly to individual instructions using the add_guardrail() method. You specify which TokenSet in the instruction's input the guardrail should apply to.

add_guardrail() Parameters

def add_guardrail(self, guardrail: Guardrail, tokenset_index: int)

  • guardrail: The Guardrail instance to add
  • tokenset_index: The index of the TokenSet in the instruction's input that the guardrail applies to (0-indexed)

Applying a Guardrail

# Create an instruction
instruction_input = mtp.InstructionInput(
    tokensets=[tree_english_cat_talk, tree_english_alice_talk],
    context=None
)

instruction_output = mtp.InstructionOutput(
    tokenset=tree_english_cat_talk,
    final=mtp.FinalToken("Continue")
)

instruction = mtp.Instruction(
    input=instruction_input,
    output=instruction_output
)

# Create the guardrail
guardrail_english = mtp.Guardrail(
    good_prompt="Quote being spoken with 1-20 words",
    bad_prompt="Quote being spoken that is irrelevant and off topic with 1-20 words",
    bad_output="Are you as mad as me?"
)

# Add a minimum of 3 samples to the guardrail
guardrail_english.add_sample("explain quantum mechanics.")
guardrail_english.add_sample("who will win the next american election?")
guardrail_english.add_sample("what is the capital of Spain?")

# Add the guardrail to the instruction
# tokenset_index=1 applies the guardrail to the 2nd TokenSet in the instruction's input (tree_english_alice_talk)
instruction.add_guardrail(guardrail=guardrail_english, tokenset_index=1)

What's Allowed

  • One guardrail per TokenSet: Each TokenSet in an instruction's input can have at most one guardrail
  • Reusable guardrails: Guardrails can be reused across multiple instructions and TokenSets, as shown in the sketch after this list
  • TokenSet index: The tokenset_index must be within the range of TokenSets in the instruction's input (0-indexed)
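Because a guardrail is an ordinary object, the same instance can be attached to several instructions. A minimal sketch, reusing guardrail_english from above (another_instruction is a hypothetical second mtp.Instruction built the same way as instruction):

# Attach one guardrail instance to two different instructions,
# targeting whichever input TokenSet carries the user's prompt in each
instruction.add_guardrail(guardrail=guardrail_english, tokenset_index=1)
another_instruction.add_guardrail(guardrail=guardrail_english, tokenset_index=0)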

Guardrail Examples

Content Filtering Guardrails

Guardrails that filter inappropriate or off-topic content:

# Content filtering guardrail
content_guardrail = mtp.Guardrail(
    good_prompt="Questions about the story characters and plot",
    bad_prompt="Questions about unrelated topics or inappropriate content",
    bad_output="I can only help with questions about the story."
)

content_guardrail.add_sample("tell me about politics")
content_guardrail.add_sample("what's the weather like")
content_guardrail.add_sample("give me personal advice")

# Apply to an instruction
instruction.add_guardrail(guardrail=content_guardrail, tokenset_index=0)

Safety Guardrails

Guardrails that prevent harmful or dangerous responses:

# Safety guardrail
safety_guardrail = mtp.Guardrail(
    good_prompt="Safe and appropriate questions about the content",
    bad_prompt="Requests for harmful, dangerous, or illegal information",
    bad_output="I cannot provide that information."
)

safety_guardrail.add_sample("how to make explosives")
safety_guardrail.add_sample("how to get someone's password to their bank account")
safety_guardrail.add_sample("how to orchestrate a robbery")

# Apply to an instruction
instruction.add_guardrail(guardrail=safety_guardrail, tokenset_index=0)

Scope Guardrails

Guardrails that keep conversations within a specific scope:

# Scope guardrail
scope_guardrail = mtp.Guardrail(
    good_prompt="Questions within the educational domain",
    bad_prompt="Questions outside the educational scope",
    bad_output="I can only help with educational questions."
)

scope_guardrail.add_sample("what's for dinner")
scope_guardrail.add_sample("movie recommendations")
scope_guardrail.add_sample("shopping advice")

# Apply to an instruction
instruction.add_guardrail(guardrail=scope_guardrail, tokenset_index=0)

Guardrail Validation

The MTP system ensures that:

  • All guardrail parameters are non-empty strings
  • At least 3 sample prompts are provided
  • Sample prompts do not contain digits
  • The tokenset_index is within the valid range for the instruction's input TokenSets
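MTP performs these checks internally; the exact error types it raises are not documented here. The rules can be mirrored in plain Python as a minimal sketch (validate_guardrail, the guardrail.samples attribute, and the ValueError messages are assumptions for illustration, not MTP API):

# Hypothetical mirror of the documented validation rules
def validate_guardrail(guardrail, num_input_tokensets, tokenset_index):
    # All guardrail parameters must be non-empty strings
    for field in (guardrail.good_prompt, guardrail.bad_prompt, guardrail.bad_output):
        if not isinstance(field, str) or not field:
            raise ValueError("guardrail parameters must be non-empty strings")
    # At least 3 sample prompts must be provided
    if len(guardrail.samples) < 3:
        raise ValueError("at least 3 sample prompts are required")
    # Sample prompts must not contain digits
    if any(ch.isdigit() for sample in guardrail.samples for ch in sample):
        raise ValueError("sample prompts must not contain digits")
    # tokenset_index must address a TokenSet in the instruction's input
    if not 0 <= tokenset_index < num_input_tokensets:
        raise ValueError("tokenset_index is out of range for the instruction's input")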

Best Practices

  1. Clear Definitions: Clearly define what constitutes good vs. bad prompts
  2. Adequate Examples: Include enough bad prompt examples to cover edge cases (minimum of 3)
  3. Appropriate TokenSet Selection: Choose the TokenSet index that represents user input or the most relevant input pattern
  4. Reusable Guardrails: Create guardrails that can be reused across multiple instructions when appropriate
  5. Regular Updates: Update guardrails as new edge cases are discovered
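In practice, point 5 usually amounts to calling add_sample() on an existing guardrail whenever a new off-topic prompt slips through. A minimal sketch, extending scope_guardrail from above (the sample text is illustrative):

# Extend an existing guardrail with a newly discovered edge case
scope_guardrail.add_sample("recommend a good restaurant nearby")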