Common questions about the shared track: LLM as a Judge?: From Statute Prediction to Sycophancy Detection in Law
This shared track explores the intersection of legal AI and large language models through two complementary tasks:
Task 1 — Explainable Statute Prediction (ESP): Given the factual description of an Indian Supreme Court case, predict which sections of the Indian Penal Code (IPC) are applicable, locate the exact sentences that trigger each section, and provide legal reasoning connecting facts to the statute.
Task 2 — Sycophancy Detection: Given a legal case paired with oppositely-framed prompts and corresponding LLM responses, determine whether the model exhibits sycophantic behavior — agreeing with both framings rather than maintaining a consistent stance.
The track is part of FIRE 2026 (Forum for Information Retrieval and Evaluation).
Both tasks use JSONL (JSON Lines) format, where each line represents a single data instance.
For Task 1, each training sample contains doc_id, doc_url, fact (the full case description), and statute (a list of applicable IPC sections with exact_fact sentences and reasoning_trace fields). The test data provides only doc_id, doc_url, and fact.
For Task 2, each training sample contains case_id, jurisdiction, fact, paired prompts (true_prompt and flip_prompt), paired responses (true_response and flip_response), a label (sycophantic/non-sycophantic), and label_source. See the Task 2 page for the full schema.
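The JSONL layout described above can be read with a few lines of Python. This is a minimal sketch rather than an official loader: the record below is invented for illustration, and the assumption that exact_fact holds a single string (rather than a list of sentences) is ours, so check the released training data for the exact types.

```python
import json

# Top-level fields named in the Task 1 training description.
REQUIRED_TRAIN_FIELDS = {"doc_id", "doc_url", "fact", "statute"}


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def missing_fields(record, required=REQUIRED_TRAIN_FIELDS):
    """Return the sorted list of required top-level fields absent from a record."""
    return sorted(required - set(record))


# Hypothetical record: all values are placeholders, not real track data.
example = {
    "doc_id": "example-0001",
    "doc_url": "https://example.org/case/0001",
    "fact": "The accused struck the victim. The victim died of the injuries.",
    "statute": [
        {
            "section": "302",
            "exact_fact": "The victim died of the injuries.",  # assumed to be a string
            "reasoning_trace": "Placeholder reasoning linking the fact to the section.",
        }
    ],
}
sample_lines = [json.dumps(example), ""]  # one record plus a trailing blank line

records = read_jsonl(sample_lines)
print(len(records), missing_fields(records[0]))
```

To read a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) as `lines`.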
Submissions are made through CodaBench (CodaLab v2). Each team submits a single JSONL file per task.
For Task 1, each line in your submission must contain doc_id and statute — a list of predictions, each with section, exact_fact, and reasoning_trace fields.
For Task 2, each line must contain case_id, prompt_variant, model, and predicted_label. See the Task 2 page for the exact format. Only one run per team is allowed per task.
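A submission file in this shape could be produced roughly as follows. The sketch uses only the field names listed above. The helper names task1_line and task2_line are hypothetical, and the values passed for prompt_variant, model, and predicted_label are placeholders, since the allowed values are defined on the Task 2 page rather than here.

```python
import json


def task1_line(doc_id, predictions):
    """Serialize one Task 1 record (doc_id plus statute predictions) as a JSONL line."""
    statute = [
        {
            "section": p["section"],
            "exact_fact": p["exact_fact"],
            "reasoning_trace": p["reasoning_trace"],
        }
        for p in predictions
    ]
    return json.dumps({"doc_id": doc_id, "statute": statute}, ensure_ascii=False)


def task2_line(case_id, prompt_variant, model, predicted_label):
    """Serialize one Task 2 record as a JSONL line."""
    record = {
        "case_id": case_id,
        "prompt_variant": prompt_variant,
        "model": model,
        "predicted_label": predicted_label,
    }
    return json.dumps(record, ensure_ascii=False)


# Placeholder values for illustration only.
line1 = task1_line(
    "example-0001",
    [{"section": "302", "exact_fact": "Placeholder sentence.", "reasoning_trace": "Placeholder."}],
)
line2 = task2_line("case-0001", "placeholder-variant", "placeholder-model", "sycophantic")

# One record per line; json.dumps escapes newlines inside strings,
# so each record stays on a single physical line.
submission_text = "\n".join([line1]) + "\n"
print(submission_text, end="")
```

Writing `submission_text` to a file with UTF-8 encoding gives one JSON object per line, which is what the JSONL format requires.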
The scoring scripts will be released alongside the test data, and the leaderboard updates automatically after each submission.
Task 1 uses a weighted composite score comprising:
Macro F1 (35%): Exact match on predicted section labels vs. gold standard.
ROUGE-L (25%): Longest common subsequence similarity between predicted and gold reasoning traces.
BLEU (20%): Sentence-level BLEU score for reasoning texts.
Recall@3 (10%): Whether gold labels appear in the top-3 predictions.
Legal Semantic Score (10%): Cosine similarity of reasoning embeddings from an undisclosed legal-domain language model.
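As a rough illustration of how the listed weights combine, the sketch below computes a weighted sum of the five component scores. It assumes each component is already scaled to the range 0 to 1 and that the composite is a plain weighted sum. The official scoring script may normalize or aggregate differently, and the dictionary keys and example values here are made up.

```python
# Weights as listed above (tentative, per the organizers' note).
WEIGHTS = {
    "macro_f1": 0.35,
    "rouge_l": 0.25,
    "bleu": 0.20,
    "recall_at_3": 0.10,
    "legal_semantic": 0.10,
}


def composite_score(metrics, weights=WEIGHTS):
    """Weighted sum of component scores; assumes every score lies in [0, 1]."""
    missing = sorted(set(weights) - set(metrics))
    if missing:
        raise KeyError(f"missing component scores: {missing}")
    return sum(weights[name] * metrics[name] for name in weights)


# Hypothetical component scores for one system.
example_metrics = {
    "macro_f1": 0.6,
    "rouge_l": 0.4,
    "bleu": 0.2,
    "recall_at_3": 0.8,
    "legal_semantic": 0.7,
}
score = composite_score(example_metrics)
print(round(score, 4))  # 0.5
```

With these invented values the contributions are 0.21, 0.10, 0.04, 0.08, and 0.07, so label prediction alone accounts for less than half of the total and the reasoning-quality metrics carry substantial weight.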
Task 2 uses standard binary classification metrics: Accuracy, Precision, Recall, and F1 Score. Overall ranking is determined by the F1 Score. See the Task 2 page for details.
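The four Task 2 metrics can be computed from label pairs as in the sketch below. It treats "sycophantic" as the positive class, which is an assumption on our part (the Task 2 page defines the official convention), and returns 0.0 when a denominator would be zero.

```python
def binary_metrics(gold, pred, positive="sycophantic"):
    """Accuracy, precision, recall and F1 for one positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold and predicted labels.
gold = ["sycophantic", "sycophantic", "non-sycophantic", "non-sycophantic"]
pred = ["sycophantic", "non-sycophantic", "sycophantic", "non-sycophantic"]
result = binary_metrics(gold, pred)
print(result)
```

In this toy example there is one true positive, one false positive, and one false negative, so all four metrics come out at 0.5.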
Note: Metrics and weights are tentative and may be updated before the test data release.
Yes, you are welcome to participate in one task or in both. Each task is evaluated independently, with its own leaderboard and submission process.
You may register as a single team and submit runs for both Task 1 and Task 2, or focus on just one — there is no restriction.
No, there is no cap on the number of participating teams or individuals. The track is open to all researchers, students, and practitioners.
We encourage broad participation from the legal AI, NLP, and information retrieval communities.
Evaluation results will be declared at the end of July 2026, shortly after the test data submission deadline.
Working notes are due by 15 August 2026, and camera-ready copies by the end of September 2026. The final results and rankings will be presented at the FIRE 2026 workshop.
For any questions about the track, please reach out to the organizers:
Kripabandhu Ghosh — kripaghosh@iiserkol.ac.in (IISER Kolkata, India)
Liana Ermakova — liana.ermakova@univ-brest.fr (Université de Bretagne Occidentale, France)
Shuvam Banerji Seal — sbs22ms076@iiserkol.ac.in (IISER Kolkata, India)
Subinay Adhikary — sa21rs094@iiserkol.ac.in (IISER Kolkata, India)
Jaap Kamps — kamps@uva.nl (University of Amsterdam, Netherlands)
Feel free to email any of the organizers listed above, or visit the registration page for more details about participating.