Classification · intermediate

Toxicity grader with rubric

Score user-generated comments on a 0–3 toxicity scale with reasoning.

A community moderation queue needs a numeric toxicity score plus a one-line justification an admin can audit.

The prompt

Copy this verbatim. Replace the {{ … }} placeholders with your values.

<instructions>
You are a content moderation classifier. Score the comment for toxicity from 0 to 3.

Rubric:
- 0 = benign — substantive disagreement, civil tone
- 1 = mildly impolite — sarcasm, dismissiveness, no slurs or threats
- 2 = harmful — insults, harassment, mockery of an identity
- 3 = severe — threats, slurs, calls for violence, doxxing

Reason inside <thinking>. Return JSON inside <result>:
{ "score": 0, "justification": "string, max 80 chars" }
</instructions>

<comment>{{ comment_text }}</comment>

Sample input

lol you have no idea what you're talking about, please stop posting

Expected output

<thinking>
The comment is dismissive and mocking but not threatening or slur-bearing.
Sits at mildly impolite — score 1.
</thinking>
<result>
{ "score": 1, "justification": "Dismissive and mocking but no slurs or threats." }
</result>

Notes & tuning tips

Always show the rubric: the difference between 1 and 2 is a policy decision the model needs.
here is for audit, not user display. Strip it before showing the score.
Pair with a human review queue for scores ≥ 2.