Motivation
Large language models (LLMs) have become commonplace in everyday life. With their versatility in handling information and the sheer volume of content they produce, monitoring output quality has become a critical challenge. LLMs are simultaneously powerful and unreliable, being prone to hallucinations, omissions, and factual errors. Our solution to this issue is a robust evaluation framework.
Human evaluation is the highest-quality method for evaluating LLM output; however, it is entirely unscalable and extremely costly. Traditional NLP metrics like ROUGE, BLEU, and BERTScore reward surface-level similarity while missing qualities like factual grounding, semantic completeness, and practical usefulness, making them poor proxies for human judgment. Our solution: an iterative evaluation system that combines LLM-as-judge scoring with human-guided domain knowledge and a dynamic rubric of metrics.
Our primary stakeholders are researchers and educators who need reliable, interpretable evaluations of AI-generated summaries, and developers building LLM pipelines who require scalable quality checks without manual review. We study this through lecture slide summarization, a controllable, academically relevant domain that lets us isolate where evaluation strategies succeed or fail.
Methodology
Our project introduces an automated evaluation pipeline for Large Language Model (LLM) outputs. The system evaluates generated summaries using structured metrics, domain-aware reasoning, and iterative refinement. Unlike traditional static evaluation systems that judge a single response, our framework continuously analyzes and refines outputs until they begin to plateau. The goal of the system is not to replace summarization models but to provide a scalable framework for evaluating and refining their outputs.
1. Initial Generation
The pipeline begins with an output generated by a primary LLM (for example, a summarization model). This output serves as the candidate summary that will be evaluated. From this point forward, the output passes through the Judge pipeline, which analyzes and refines the response.
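The generate-judge-refine loop described above can be sketched as follows. This is a minimal sketch, not our implementation: `generate`, `judge`, and `refine` are hypothetical callables standing in for the primary LLM call, the Judge module, and the re-prompting step, and the plateau threshold is an assumed default.

```python
def evaluate_pipeline(generate, judge, refine, lecture_text,
                      max_iters=7, plateau_delta=0.01):
    """Generate a candidate summary, then judge and refine it until scores plateau.

    generate/judge/refine are hypothetical placeholders for the primary LLM,
    the Judge module, and the feedback-driven re-prompting step.
    """
    summary = generate(lecture_text)            # initial candidate summary
    history = []                                # (score, summary) per iteration
    for _ in range(max_iters):
        score, feedback = judge(summary, lecture_text)
        history.append((score, summary))
        # stop once the score stops improving meaningfully
        if len(history) >= 2 and history[-1][0] - history[-2][0] < plateau_delta:
            break
        summary = refine(summary, feedback)     # re-prompt with judge feedback
    return max(history)                         # best-scoring iteration wins
```

Returning the best iteration rather than the last one mirrors the best-of-last-k safeguard described later in the Methodology.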
2. Judge Module
The Judge is the core of the evaluation system and consists of several components that analyze the generated output.
Domain-Aware Routing
The first step is a Domain Decision module, which classifies the input and retrieves supporting context from a curated Data Bank. This allows the evaluation criteria to adapt based on the subject domain (for example, biology vs. data science), ensuring that the scoring rubric remains contextually relevant instead of relying on generic evaluation rules.
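A toy sketch of this routing step, assuming a glossary-overlap heuristic in place of the actual Domain Decision classifier; `DATA_BANK` and its fields are illustrative stand-ins for the curated context store.

```python
# Illustrative stand-in for the curated Data Bank (not the real contents).
DATA_BANK = {
    "biology": {"glossary": ["mitosis", "enzyme", "genome"],
                "rubric_focus": "accurate use of biological terminology"},
    "data_science": {"glossary": ["regression", "overfitting", "gradient"],
                     "rubric_focus": "correct use of statistical concepts"},
}

def route_domain(text):
    """Pick the domain whose glossary overlaps the input most, and return
    its supporting context so the rubric can adapt to the subject."""
    words = set(text.lower().split())
    best = max(DATA_BANK,
               key=lambda d: len(words & set(DATA_BANK[d]["glossary"])))
    return best, DATA_BANK[best]
```

In the real pipeline the classification is LLM-driven; the point of the sketch is that the retrieved context, not a generic rule set, parameterizes the downstream rubric.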
Scoring Methodology
The evaluation system combines rubric-based judging with deterministic signals derived from the lecture slides. The LLM judge evaluates summaries across five qualitative dimensions using a 1-5 scale: coverage, faithfulness, organization, clarity, and style. Alongside the rubric, the pipeline computes deterministic signals including length error, section coverage, glossary recall, and suspected hallucination rate.
These signals are blended into two complementary scores. A rubric-based comprehensive score (C) captures qualitative judgment, while a manual weighted baseline score (M) incorporates structural signals and applies an exponential hallucination penalty. These components are combined into a raw quality score.
To reduce the impact of unsupported content, the system applies a domain-aware damping factor based on the hallucination rate. This produces a risk-adjusted score, which becomes the final stored evaluation score.
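The scoring flow above can be condensed into a sketch. The five rubric dimensions, the two-score structure (C and M), the exponential hallucination penalty, and the damping step come from the design; the specific weights, the 50/50 blend, and the damping constant are assumptions for illustration only.

```python
import math

def comprehensive_score(rubric):
    """Rubric-based score C: average of five 1-5 dimensions, rescaled to 0-1."""
    dims = ["coverage", "faithfulness", "organization", "clarity", "style"]
    return sum(rubric[d] for d in dims) / (5 * len(dims))

def manual_score(signals):
    """Manual weighted baseline M: structural signals with an exponential
    hallucination penalty. Weights here are illustrative, not the real ones."""
    base = (0.4 * signals["section_coverage"]
            + 0.3 * signals["glossary_recall"]
            + 0.3 * (1 - signals["length_error"]))
    return base * math.exp(-3.0 * signals["hallucination_rate"])

def risk_adjusted_score(rubric, signals, damping=2.0):
    """Blend C and M into a raw quality score, then apply domain-aware
    damping based on the hallucination rate (constants assumed)."""
    raw = 0.5 * comprehensive_score(rubric) + 0.5 * manual_score(signals)
    return raw / (1 + damping * signals["hallucination_rate"])
```

A clean summary (no hallucinations, full coverage) passes through undamped, while any suspected unsupported content shrinks the final stored score twice: once in M's exponential penalty and once in the damping factor.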
3. Diagnostics Logging
The pipeline logs detailed evaluation metadata for transparency and reproducibility, including:
- refined_summary
- signals
- rubric
- agreement
- comprehensive_scoring
- hybrid_scoring
- leaderboard_scores
- iteration_score_table
- refinement_metadata
- final_score_0to1
- lecture_title
Both raw quality scores and risk-adjusted scores are stored alongside the policy parameters used during evaluation, enabling deeper inspection of scoring behavior.
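A minimal sketch of how such a record might be persisted; the helper name, file layout, and the example values are assumptions, but the field names mirror the list above.

```python
import json
import time
from pathlib import Path

def log_evaluation(record, log_dir="eval_logs"):
    """Append one evaluation record (scores plus policy parameters) as JSON
    for later inspection and reproducibility."""
    Path(log_dir).mkdir(exist_ok=True)
    path = Path(log_dir) / f"{record['lecture_title']}_{int(time.time())}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# Hypothetical record with illustrative values, using fields from the list above.
record = {
    "lecture_title": "lecture_01",
    "final_score_0to1": 0.88,
    "signals": {"hallucination_rate": 0.02},
    "refinement_metadata": {"iterations": 4, "stop_reason": "PASS"},
}
```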
4. Iterative Evaluation Pipeline
The stored data is then used to iteratively refine the initial summary, drawing on rubric feedback and pairwise preference comparisons between candidate revisions. The initial LLM is re-prompted with the judge's suggestions alongside the stored results to produce a new iteration of the output.
A trend-aware stopping controller determines when refinement should end. The controller monitors score trajectories and classifies each iteration into categories such as pass, borderline, stalled, or max iterations, preventing both premature stopping and unnecessary refinement loops.
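The controller's decision logic can be sketched as a pure function of the score trajectory. The PASS / BORDERLINE / STALLED / MAX_ITERS categories come from the design; all thresholds below are illustrative assumptions.

```python
def stopping_decision(scores, pass_threshold=0.85, borderline_margin=0.02,
                      stall_window=3, stall_delta=0.005, max_iters=7):
    """Classify the current iteration from the score trajectory so far.
    Thresholds are assumed values for this sketch, not the real policy."""
    if scores[-1] >= pass_threshold:
        return "PASS"                         # strict threshold satisfied
    if len(scores) >= max_iters:
        return "MAX_ITERS"                    # iteration budget exhausted
    recent = scores[-stall_window:]
    if len(recent) == stall_window and max(recent) - min(recent) < stall_delta:
        return "STALLED"                      # trend has flattened out
    if scores[-1] >= pass_threshold - borderline_margin:
        return "BORDERLINE"                   # close to passing; keep refining
    return "CONTINUE"
```

Conditioning on a window of recent scores rather than the last delta alone is what prevents both premature stopping (one noisy dip) and endless loops (oscillation around a plateau).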
To reduce noise from a weak final rewrite, the system applies a best-of-last-k safeguard, selecting the highest-quality summary among the final iterations before computing the final score.
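The safeguard itself reduces to a small selection step over the final iterations; `k=3` is an illustrative default, not the configured value.

```python
def best_of_last_k(scored_summaries, k=3):
    """Select the highest-scoring (score, summary) pair among the final k
    iterations, so one weak last rewrite cannot drag down the final score."""
    tail = scored_summaries[-k:]
    return max(tail, key=lambda pair: pair[0])
```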
This iterative process ensures that summaries are progressively refined while maintaining faithfulness to the source material and stable evaluation behavior.
End-to-End Automation
From the initial LLM generation to the final evaluation score, the entire workflow operates as a fully automated pipeline. No manual intervention is required, allowing the framework to scale across large evaluation datasets.
Dataset
Our dataset consists of lecture slides from multiple UC San Diego courses across data science, biology, and interdisciplinary STEM fields. The materials are provided as PDF slide decks spanning several academic domains.
Lecture slides present unique summarization challenges. Unlike traditional articles, they are concise, visually structured, and often omit transitional language. This makes automated evaluation more difficult. We extract the text while preserving structural signals such as slide boundaries and section headings.
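A minimal sketch of how slide boundaries can survive extraction: per-slide text (as produced by any PDF extractor, e.g. pypdf's `page.extract_text()`) is joined with explicit markers. The marker format is an assumption of this sketch.

```python
def join_slides(slide_texts):
    """Join per-slide text with explicit boundary markers so downstream
    signals (section coverage, slide structure) remain recoverable."""
    parts = []
    for i, text in enumerate(slide_texts, start=1):
        parts.append(f"=== Slide {i} ===")   # assumed marker format
        parts.append(text.strip())
    return "\n".join(parts)
```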
What we built
Our team implemented the iterative refinement controller, domain-aware rubric routing, hybrid scoring pipeline, dashboard reporting, and experiment framework. We reuse general-purpose LLM APIs as model backbones, but the evaluation logic, stopping controller, and reporting system are our own.
Results
Refinement (Lecture 1)
Below is an example of the initial vs. final iteration of a summary using our framework. An initial summary is evaluated by the judge system, weaknesses are identified, and a revised summary is produced.
Initial Summary

This lecture introduces the foundational concepts of financial accounting, emphasizing its role as the “language of business.” Accounting communicates a company’s financial story by identifying, measuring, and recording economic events and reporting them to decision makers.
Users of accounting information include investors, creditors, suppliers, managers, employees, customers, and regulators. These stakeholders use financial information to evaluate company performance and financial stability.
Financial accounting focuses on three main activities—financing, investing, and operating—which are summarized in four financial statements: the balance sheet, income statement, statement of stockholders’ equity, and statement of cash flows.
The reporting process follows Generally Accepted Accounting Principles (GAAP), overseen by the SEC and implemented by the Financial Accounting Standards Board (FASB), with IFRS providing international alignment.
The balance sheet reports assets, liabilities, and stockholders’ equity at a specific point in time and follows the accounting equation:
Assets = Liabilities + Stockholders’ Equity
Refined Summary

The opening lecture of Principles of Accounting introduces financial accounting as the “language of business.”
Accounting translates operational activities and outcomes into structured financial information used for decision making.
Users of accounting information include investors, creditors, suppliers, managers, employees, customers, and regulators.
For example, investors evaluate company performance, while creditors and suppliers assess a firm's ability to repay obligations.
Financial accounting captures three types of activities: financing, investing, and operating.
These activities are summarized in four financial statements: the balance sheet, income statement, statement of stockholders’ equity, and statement of cash flows.
Reporting follows Generally Accepted Accounting Principles (GAAP), established by the Financial Accounting Standards Board (FASB) under oversight of the Securities and Exchange Commission (SEC).
Public companies communicate this information through regulatory filings such as 10-K annual reports, 10-Q quarterly reports, and 8-K disclosures for major events.
International reporting standards are provided through the International Financial Reporting Standards (IFRS).
The balance sheet reports a firm's financial position at a specific point in time and follows:
Assets = Liabilities + Stockholders’ Equity
Highlighted text indicates improvements introduced during iterative refinement: structural framing (green), expanded conceptual explanation (yellow), and additional technical detail (blue).
The highlighted sections show that refinement primarily improves the organization of the summary, restructuring information for greater clarity and readability. The refined version also introduces technical details that expand on topics mentioned, but not fully explained, in the initial summary.
Next is a visualization that shows how summaries evolve across refinement iterations for each lecture.
Iteration Scores (Lectures 1-7)
Key Findings
Each lecture begins with an initial summary generated by the base model. The judge model evaluates the output and generates refined summaries through multiple iterations, with each version scored using the evaluation rubric.
Across the seven lectures, most runs converged within four iterations. Lectures 2 and 7 required five iterations, while Lecture 3 required seven iterations before the controller determined that improvements had plateaued.
Initial rubric scores were already strong, ranging from 4.30 to 4.56 (out of 5). Iterative refinement produced small but measurable adjustments, generally stabilizing performance rather than dramatically increasing it. The evaluation metrics for each summary vary from iteration to iteration, indicating that each revision actively attempts improvements; the drops and gains in quality suggest that each iteration explores a different avenue of improvement that succeeds or fails depending on what is changed. Overall, each lecture produced at least one refined summary that improved on the initial output.
Quality scores across iterations ranged from approximately 0.74 to 0.88.
- Lecture 2 achieved the highest quality score (0.8843).
- Lecture 3 produced the lowest peak quality score (0.7816).
For each lecture, the system selects the iteration with the highest quality score as the final summary. This selection strategy prevents later iterations with minor regressions from reducing the reported result.
The stopping controller effectively identified when summaries reached a quality plateau. Some runs terminated because strict evaluation thresholds were satisfied (PASS), while others ended when score trends stalled (STALLED). This mechanism avoids unnecessary refinement once summaries stop improving.
Scope and Limitations
Our framework evaluates and refines summaries but does not improve the underlying summarization model. Quality gains come from iterative re-prompting within the evaluation loop rather than model fine-tuning.
The experiments also have several limitations. The dataset is small (seven lectures), so results should be interpreted as descriptive rather than statistically conclusive. We also used GPT-5 models for both generation and evaluation, which may introduce stylistic bias in scoring.
Future Work
Future work will focus on expanding the dataset, testing additional LLMs, and standardizing the evaluation mode across experiments. Expanding the domain knowledge bank should help the framework generalize to more topics, increasing its overall effectiveness. Another potential avenue is to move beyond summarization models and evaluate LLM outputs in general.
Conclusion
We presented an end-to-end framework for evaluating and improving LLM-generated lecture summaries without requiring reference summaries. The system combines iterative refinement with dynamic stopping, domain-aware rubric evaluation, and a hybrid scoring method that blends LLM judgment with deterministic signals.
Across seven lectures, the pipeline consistently produced high rubric scores (4–5 / 5) and typically converged within four refinement iterations, showing that iterative feedback can efficiently stabilize summary quality. Additionally, each summary improved upon its initial output at some point during refinement.
These results suggest that iterative LLM-as-judge pipelines with hybrid scoring can provide a practical approach to improving generated summaries, while highlighting the need for larger datasets.