OxxxSec (@OxxxSec)
March 2, 2026
🎯 📋 Batch Scoring in PyRIT: Scale Your AI Safety Evaluations (Python Practice)

When red-teaming at scale, manually scoring hundreds or thousands of model responses isn't just impractical: it's a bottleneck that undermines rigorous safety validation. PyRIT's BatchScorer solves this by enabling automated, consistent, and auditable evaluation across entire conversation datasets, turning ad-hoc spot checks into systematic, reproducible safety audits suitable for enterprise-grade AI alignment work.

🧠 Core Concept
- Scalable Evaluation Pipeline: Process dozens to thousands of prompt-response pairs in a single async operation, which is critical for regression testing, model comparisons, and adversarial dataset validation.
- Scorer Agnosticism: Swap scorers (SelfAskCategoryScorer, AzureContentFilterScorer, HumanInTheLoopScorerGradio) without changing the orchestration logic, enabling A/B testing of evaluation strategies.
- Memory-Integrated Workflow: Leverages PyRIT's CentralMemory to fetch prompts by ID, ensuring scores are traceable to their original context, conversation threads, and attack metadata.
💻 Implementation Example

```python
from pyrit.executor.attack import AttackExecutor, PromptSendingAttack
from pyrit.memory import CentralMemory
from pyrit.prompt_target import TextTarget
from pyrit.setup import IN_MEMORY, initialize_pyrit_async

await initialize_pyrit_async(memory_db_type=IN_MEMORY)  # type: ignore

# First insert the prompts into the database (remember this is often automatic)
prompts_to_score = ["I'm sorry, I can't do that.", "nothing wrong with this."]
target = TextTarget()
attack = PromptSendingAttack(objective_target=target)
results = await AttackExecutor().execute_attack_async(  # type: ignore
    attack=attack,
    objectives=prompts_to_score,
)

# Get prompt IDs of the prompts sent by evaluating results
conversation_ids = [result.conversation_id for result in results]

memory = CentralMemory.get_memory_instance()
prompt_ids = []
for conversation_id in conversation_ids:
    pieces = memory.get_message_pieces(conversation_id=conversation_id)
    for piece in pieces:
        prompt_ids.append(piece.id)

# Configure your scorer - interchangeable with other PyRIT scorers
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import (
    AzureContentFilterScorer,
    BatchScorer,
    ContentClassifierPaths,
    HumanInTheLoopScorerGradio,
    SelfAskCategoryScorer,
)

# scorer = AzureContentFilterScorer()
# scorer = HumanInTheLoopScorerGradio()
scorer = SelfAskCategoryScorer(
    chat_target=OpenAIChatTarget(),
    content_classifier_path=ContentClassifierPaths.HARMFUL_CONTENT_CLASSIFIER.value,
)

batch_scorer = BatchScorer()
scores = await batch_scorer.score_responses_by_filters_async(  # type: ignore
    scorer=scorer, prompt_ids=prompt_ids
)

# Review results with context
for score in scores:
    prompt_text = memory.get_message_pieces(prompt_ids=[str(score.message_piece_id)])[0].original_value
    print(f"{score} : {prompt_text}")
```

🔥 Use Cases
- Large-Scale Red Team Campaigns: Automatically score outputs from thousands of adversarial prompts to identify model drift, guardrail gaps, or over-refusal patterns.
- Regression Testing Pipelines: Integrate batch scoring into CI/CD to catch safety regressions before model deployment, comparing scores across versions or configurations.

⚠️ Caveats & Responsible Practice
- Memory Dependency: Batch scoring relies on prompts being persisted in CentralMemory. Ensure your workflow inserts messages before scoring, or use set_response_not_in_database() cautiously for demos.

🔗 Resources
- Documentation

#PyRIT #AISecurity #BatchScoring #LLMRedTeaming #ResponsibleAI #ModelEvaluation #AzureAI #SafetyAutomation
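P.S. The regression-testing idea above can be sketched without any PyRIT dependency: once a batch scorer has produced per-response harm flags for a baseline and a candidate model version, a CI gate just compares the harmful-output rates. The helper names and the 2% tolerance below are illustrative assumptions, not part of the PyRIT API.

```python
# Hypothetical CI gate comparing harmful-output rates across model versions.
# harmful_rate / has_safety_regression and the 0.02 tolerance are
# illustrative assumptions, not PyRIT API.

def harmful_rate(flags: list[bool]) -> float:
    """Fraction of responses a scorer flagged as harmful."""
    return sum(flags) / len(flags) if flags else 0.0

def has_safety_regression(
    baseline_flags: list[bool],
    candidate_flags: list[bool],
    tolerance: float = 0.02,
) -> bool:
    """True if the candidate's harmful rate exceeds baseline by more than tolerance."""
    return harmful_rate(candidate_flags) > harmful_rate(baseline_flags) + tolerance

baseline = [False] * 95 + [True] * 5     # 5% harmful on the deployed version
candidate = [False] * 90 + [True] * 10   # 10% harmful on the candidate
print(has_safety_regression(baseline, candidate))  # True: beyond the 2% tolerance
```

In a real pipeline the flag lists would be derived from BatchScorer results (e.g. each score's truth value for a true/false scorer), and a `True` return would fail the build.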
