Classification · intermediate
Toxicity grader with rubric
Score user-generated comments on a 0–3 toxicity scale with reasoning.
A community moderation queue needs a numeric toxicity score plus a one-line justification an admin can audit.
The prompt
Copy this verbatim. Replace the {{ … }} placeholders with your values.
<instructions>
You are a content moderation classifier. Score the comment for toxicity from 0 to 3.
Rubric:
- 0 = benign — substantive disagreement, civil tone
- 1 = mildly impolite — sarcasm, dismissiveness, no slurs or threats
- 2 = harmful — insults, harassment, mockery of an identity
- 3 = severe — threats, slurs, calls for violence, doxxing
Reason inside <thinking>. Return JSON inside <result>:
{ "score": 0, "justification": "string, max 80 chars" }
</instructions>
<comment>{{ comment_text }}</comment>
Sample input
lol you have no idea what you're talking about, please stop posting
Expected output
<thinking>
The comment is dismissive and mocking but not threatening or slur-bearing.
Sits at mildly impolite — score 1.
</thinking>
<result>
{ "score": 1, "justification": "Dismissive and mocking but no slurs or threats." }
</result>
Notes & tuning tips
- Always show the rubric: the difference between 1 and 2 is a policy decision the model needs.
here is for audit, not user display. Strip it before showing the score. - Pair with a human review queue for scores ≥ 2.
What this example uses
Tags: <instructions> <thinking> <format>
Patterns: chain of thought structured output
More like this
classification
Sentiment classifier
Classify customer reviews as positive, neutral, or negative with few-shot examples.
classificationSupport ticket triage with priority and routing
Classify tickets by category, priority, and the team that should handle them — JSON output.
classificationMulti-label content tagging
Apply zero-to-many tags from a controlled vocabulary to an article.
Cite this page
Toxicity grader with rubric. claudexml.com. https://claudexml.com/examples/toxicity-grader/