Guardrails
Guardrails provide safety mechanisms for user interactions by defining what constitutes good vs. bad user prompts and how the model should respond to inappropriate inputs.
Creating Guardrails
Guardrail Parameters
class Guardrail:
    def __init__(self, good_prompt: str, bad_prompt: str, bad_output: str):
- good_prompt: Description of what makes a good prompt
- bad_prompt: Description of what makes a bad prompt
- bad_output: The response the model should give to bad prompts
Basic Guardrail Creation
# Create a guardrail
guardrail_english = mtp.Guardrail(
    good_prompt="Quote being spoken with 1-20 words",
    bad_prompt="Quote being spoken that is irrelevant and off topic with 1-20 words",
    bad_output="Are you as mad as me?"
)
Adding Samples to Guardrails
Guardrails require a minimum of 3 examples of bad prompts. No digits are allowed in the bad prompt examples.
# Add examples of bad prompts
guardrail_english.add_sample("explain quantum mechanics.")
guardrail_english.add_sample("who will win the next american election?")
guardrail_english.add_sample("what is the capital of Spain?")
add_sample() Parameters
def add_sample(self, sample: str):
- sample: An example of a bad prompt that should trigger the guardrail. Must be a non-empty string without digits.
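To illustrate the digit restriction, the sketch below extends the guardrail_english example from above; the sample strings themselves are hypothetical:
# Allowed: non-empty and digit-free
guardrail_english.add_sample("explain the rules of chess")
# Not allowed: "2 + 2" contains digits, so this sample fails validation
# guardrail_english.add_sample("what is 2 + 2")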
Applying Guardrails to Instructions
Guardrails are applied directly to individual instructions using the add_guardrail() method. You specify which TokenSet in the instruction's input the guardrail should apply to.
add_guardrail() Parameters
def add_guardrail(self, guardrail: Guardrail, tokenset_index: int):
- guardrail: The Guardrail instance to add
- tokenset_index: The index of the TokenSet in the instruction's input that the guardrail applies to (0-indexed)
Applying a Guardrail
# Create an instruction
instruction_input = mtp.InstructionInput(
    tokensets=[tree_english_cat_talk, tree_english_alice_talk],
    context=None
)
instruction_output = mtp.InstructionOutput(
    tokenset=tree_english_cat_talk,
    final=mtp.FinalToken("Continue")
)
instruction = mtp.Instruction(
    input=instruction_input,
    output=instruction_output
)
# Create the guardrail
guardrail_english = mtp.Guardrail(
    good_prompt="Quote being spoken with 1-20 words",
    bad_prompt="Quote being spoken that is irrelevant and off topic with 1-20 words",
    bad_output="Are you as mad as me?"
)
# Add a minimum of 3 samples to the guardrail
guardrail_english.add_sample("explain quantum mechanics.")
guardrail_english.add_sample("who will win the next american election?")
guardrail_english.add_sample("what is the capital of Spain?")
# Add the guardrail to the instruction
# tokenset_index=1 means we are applying to the 2nd TokenSet in the instruction's input (tree_english_alice_talk)
instruction.add_guardrail(guardrail=guardrail_english, tokenset_index=1)
What's Allowed
- One guardrail per TokenSet: Each TokenSet in an instruction's input can have at most one guardrail
- Reusable guardrails: Guardrails can be reused across multiple instructions and TokenSets (see the sketch after this list)
- TokenSet index: The tokenset_index must be within the range of TokenSets in the instruction's input (0-indexed)
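The following sketch shows guardrail reuse; instruction_cat and instruction_alice are hypothetical instruction objects built as in the example above, and only Guardrail, add_sample(), and add_guardrail() come from this page:
# One guardrail shared by two instructions
shared_guardrail = mtp.Guardrail(
    good_prompt="Quote being spoken with 1-20 words",
    bad_prompt="Quote being spoken that is irrelevant and off topic with 1-20 words",
    bad_output="Are you as mad as me?"
)
shared_guardrail.add_sample("explain quantum mechanics.")
shared_guardrail.add_sample("who will win the next american election?")
shared_guardrail.add_sample("what is the capital of Spain?")
# Apply the same guardrail to different TokenSet positions in two instructions
# (instruction_cat and instruction_alice are hypothetical)
instruction_cat.add_guardrail(guardrail=shared_guardrail, tokenset_index=0)
instruction_alice.add_guardrail(guardrail=shared_guardrail, tokenset_index=1)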
Guardrail Examples
Content Filtering Guardrails
Guardrails that filter inappropriate or off-topic content:
# Content filtering guardrail
content_guardrail = mtp.Guardrail(
    good_prompt="Questions about the story characters and plot",
    bad_prompt="Questions about unrelated topics or inappropriate content",
    bad_output="I can only help with questions about the story."
)
content_guardrail.add_sample("tell me about politics")
content_guardrail.add_sample("what's the weather like")
content_guardrail.add_sample("give me personal advice")
# Apply to an instruction
instruction.add_guardrail(guardrail=content_guardrail, tokenset_index=0)
Safety Guardrails
Guardrails that prevent harmful or dangerous responses:
# Safety guardrail
safety_guardrail = mtp.Guardrail(
    good_prompt="Safe and appropriate questions about the content",
    bad_prompt="Requests for harmful, dangerous, or illegal information",
    bad_output="I cannot provide that information."
)
safety_guardrail.add_sample("how to make explosives")
safety_guardrail.add_sample("how to get someone's password to their bank account")
safety_guardrail.add_sample("how to orchestrate a robbery")
# Apply to an instruction
instruction.add_guardrail(guardrail=safety_guardrail, tokenset_index=0)
Scope Guardrails
Guardrails that keep conversations within a specific scope:
# Scope guardrail
scope_guardrail = mtp.Guardrail(
    good_prompt="Questions within the educational domain",
    bad_prompt="Questions outside the educational scope",
    bad_output="I can only help with educational questions."
)
scope_guardrail.add_sample("what's for dinner")
scope_guardrail.add_sample("movie recommendations")
scope_guardrail.add_sample("shopping advice")
# Apply to an instruction
instruction.add_guardrail(guardrail=scope_guardrail, tokenset_index=0)
Guardrail Validation
The MTP system ensures that:
- All guardrail parameters are non-empty strings
- At least 3 sample prompts are provided
- Sample prompts do not contain digits
- The tokenset_index is within the valid range for the instruction's input TokenSets
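These rules can be mirrored in plain Python. The helper below is a hypothetical sketch for pre-checking inputs before building a guardrail; MTP performs its own validation, and the helper's name and error behavior are assumptions, not part of the MTP API:
# Hypothetical pre-check mirroring the documented validation rules.
def validate_guardrail_setup(good_prompt: str, bad_prompt: str, bad_output: str,
                             samples: list, num_input_tokensets: int,
                             tokenset_index: int) -> None:
    # All guardrail parameters must be non-empty strings
    for text in (good_prompt, bad_prompt, bad_output):
        if not isinstance(text, str) or not text:
            raise ValueError("guardrail parameters must be non-empty strings")
    # At least 3 sample prompts are required
    if len(samples) < 3:
        raise ValueError("a guardrail needs at least 3 sample prompts")
    # Sample prompts must not contain digits
    if any(ch.isdigit() for s in samples for ch in s):
        raise ValueError("sample prompts must not contain digits")
    # tokenset_index must be a valid 0-based index into the instruction's input TokenSets
    if not 0 <= tokenset_index < num_input_tokensets:
        raise ValueError("tokenset_index is out of range")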
Best Practices
- Clear Definitions: Clearly define what constitutes good vs. bad prompts
- Adequate Examples: Include enough bad prompt examples to cover edge cases (minimum of 3)
- Appropriate TokenSet Selection: Choose the TokenSet index that represents user input or the most relevant input pattern
- Reusable Guardrails: Create guardrails that can be reused across multiple instructions when appropriate
- Regular Updates: Update guardrails as new edge cases are discovered