🎯 🧠 Generic Self-Ask Scorer in PyRIT: Custom LLM-Powered E — @OxxxSec

34просмотров

19.1%от подписчиков

9 марта 2026 г.

Score: 37

🎯 🧠 Generic Self-Ask Scorer in PyRIT: Custom LLM-Powered Evaluation at Scale (Python Practice) As AI red teaming matures, one-size-fits-all scorers fall short. Whether you're assessing harm potential, policy compliance, or subtle behavioral drift, you need flexible, interpretable scoring that adapts to your threat model. PyRIT's SelfAskGeneralFloatScaleScorer fills this gap by letting you define custom evaluation criteria—then leveraging an LLM to reason through responses and return structured, rationale-backed scores. No hardcoded rules. No brittle regex. Just adaptable, transparent assessment powered by the model itself. 🧠 Core Concept - LLM-as-a-Judge Architecture: Uses a chat target (e.g., Azure OpenAI) to self-query and evaluate responses against your custom rubric—turning subjective judgment into reproducible, auditable scoring. - Fully Configurable Rubrics: Define your own system_prompt_format_string and prompt_format_string to tailor evaluation logic for categories like illegal, toxic, misleading, or domain-specific risks. - Structured Float-Scale Output: Returns scores on a configurable numeric scale (e.g., 1–5) alongside explicit reasoning—enabling threshold-based gating, trend analysis, and human-in-the-loop review. - Seamless PyRIT Integration: Works natively with CentralMemory, BatchScorer, and attack pipelines, ensuring end-to-end traceability from prompt → response → score → audit trail. 💻 Implementation Example from pyrit.prompt_target import OpenAIChatTarget from pyrit.score import SelfAskGeneralFloatScaleScorer from pyrit.setup import IN_MEMORY, initialize_pyrit_async await initialize_pyrit_async(memory_db_type=IN_MEMORY) # type: ignore azure_openai_chat_target = OpenAIChatTarget() float_scale_general_scorer = SelfAskGeneralFloatScaleScorer( chat_target=azure_openai_chat_target, system_prompt_format_string=scale_format_string, prompt_format_string=prompt_f_string, rationale_output_key="reasoning", category="illegal", min_value=1, max_value=5, ) scored_response = ( await float_scale_general_scorer.score_text_async(text=example_response, objective=example_objective) )[0] # type: ignore print("[Illegal] Scored response is given as:", scored_response) print("Reason: ", scored_response.score_rationale) 🔥 Use Cases - Custom Policy Validation: Score model outputs against internal guidelines (e.g., "Does this response encourage financial fraud?") with explainable, threshold-driven logic. - Adversarial Response Triage: Automatically flag high-risk jailbreak successes by scoring for intent leakage, roleplay persistence, or instruction override—prioritizing human review. - Regression Testing for Guardrails: Embed scorers in CI/CD to detect subtle degradations in safety behavior after model updates or prompt engineering changes. - Research & Benchmarking: Quantify comparative robustness across models or defenses using consistent, rationale-augmented metrics. ⚠️ Caveats & Responsible Practice - LLM Judge Limitations: The scorer inherits biases and blind spots of its underlying chat model. Always validate scoring logic with human review, especially for high-stakes categories. - Prompt Design Sensitivity: Scoring quality depends heavily on well-crafted system_prompt_format_string and prompt_format_string. Test rubrics iteratively and document assumptions. - Cost & Latency: Each score invokes an LLM call. For large-scale red teaming, consider batching, caching, or hybrid rules+LLM strategies to optimize throughput. - Transparency ≠ Truth: A detailed rationale improves interpretability but doesn't guarantee correctness. Treat scores as evidence—not verdicts. 🔗 Resources - Documentation #PyRIT #AISecurity #LLMEvaluation #RedTeaming #SelfAsk #ResponsibleAI #PromptEngineering #SecurityAutomation

Другие посты @OxxxSec