Motivation
Large language models (LLMs) have become commonplace in everyday life. With their versatility in handling information and the sheer volume of content they produce, monitoring output quality has become a critical challenge. LLMs are simultaneously powerful and unreliable, being prone to hallucinations, omissions, and factual errors. Our solution to this issue is a robust evaluation framework.
Human evaluation is the highest-quality method for evaluating LLM output; however, it is entirely unscalable and extremely costly. Traditional NLP metrics like ROUGE, BLEU, and BERTScore reward surface-level similarity while missing qualities like factual grounding, semantic completeness, and practical usefulness, making them poor proxies for human judgment. Our solution: an iterative evaluation system that combines LLM-as-judge scoring with human-guided domain knowledge and a dynamic rubric of metrics.
Our primary stakeholders are researchers and educators who need reliable, interpretable evaluations of AI-generated summaries, and developers building LLM pipelines who require scalable quality checks without manual review. We study this through lecture slide summarization, a controllable, academically relevant domain that lets us isolate where evaluation strategies succeed or fail.
Methodology
Our project introduces an automated evaluation pipeline for Large Language Model (LLM) outputs. The system evaluates generated summaries using structured metrics, domain-aware reasoning, and iterative refinement. Unlike traditional static evaluation systems that judge a single response, our framework continuously analyzes and refines outputs until they begin to plateau. The goal of the system is not to replace summarization models but to provide a scalable framework for evaluating and refining their outputs.
1. Initial Generation
The pipeline begins with an output generated by a primary LLM (for example, a summarization model). This output serves as the candidate summary that will be evaluated. From this point forward, the output passes through the Judge pipeline, which analyzes and refines the response.
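The generate-judge-refine loop described above can be sketched as follows. This is a minimal sketch, not our implementation: `generate`, `judge`, and `refine` are hypothetical callables standing in for the primary LLM call, the Judge module, and the re-prompting step, and the plateau threshold is an assumed default.

```python
def evaluate_pipeline(generate, judge, refine, lecture_text,
                      max_iters=7, plateau_delta=0.01):
    """Generate a candidate summary, then judge and refine it until scores plateau.

    generate/judge/refine are hypothetical placeholders for the primary LLM,
    the Judge module, and the feedback-driven re-prompting step.
    """
    summary = generate(lecture_text)            # initial candidate summary
    history = []                                # (score, summary) per iteration
    for _ in range(max_iters):
        score, feedback = judge(summary, lecture_text)
        history.append((score, summary))
        # stop once the score stops improving meaningfully
        if len(history) >= 2 and history[-1][0] - history[-2][0] < plateau_delta:
            break
        summary = refine(summary, feedback)     # re-prompt with judge feedback
    return max(history)                         # best-scoring iteration wins
```

Returning the best iteration rather than the last one mirrors the best-of-last-k safeguard described later in the Methodology.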
2. Judge Module
The Judge is the core of the evaluation system and consists of several components that analyze the generated output.
Domain-Aware Routing
The first step is a Domain Decision module, which classifies the input and retrieves supporting context from a curated Data Bank. This allows the evaluation criteria to adapt based on the subject domain (for example, biology vs. data science), ensuring that the scoring rubric remains contextually relevant instead of relying on generic evaluation rules.
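A toy sketch of this routing step, assuming a glossary-overlap heuristic in place of the actual Domain Decision classifier; `DATA_BANK` and its fields are illustrative stand-ins for the curated context store.

```python
# Illustrative stand-in for the curated Data Bank (not the real contents).
DATA_BANK = {
    "biology": {"glossary": ["mitosis", "enzyme", "genome"],
                "rubric_focus": "accurate use of biological terminology"},
    "data_science": {"glossary": ["regression", "overfitting", "gradient"],
                     "rubric_focus": "correct use of statistical concepts"},
}

def route_domain(text):
    """Pick the domain whose glossary overlaps the input most, and return
    its supporting context so the rubric can adapt to the subject."""
    words = set(text.lower().split())
    best = max(DATA_BANK,
               key=lambda d: len(words & set(DATA_BANK[d]["glossary"])))
    return best, DATA_BANK[best]
```

In the real pipeline the classification is LLM-driven; the point of the sketch is that the retrieved context, not a generic rule set, parameterizes the downstream rubric.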
Scoring Methodology
The evaluation system combines rubric-based judging with deterministic signals derived from the lecture slides. The LLM judge evaluates summaries across five qualitative dimensions using a 1-5 scale: coverage, faithfulness, organization, clarity, and style. Alongside the rubric, the pipeline computes deterministic signals including length error, section coverage, glossary recall, and suspected hallucination rate.
These signals are blended into two complementary scores. A rubric-based comprehensive score (C) captures qualitative judgment, while a manual weighted baseline score (M) incorporates structural signals and applies an exponential hallucination penalty. These components are combined into a raw quality score.
To reduce the impact of unsupported content, the system applies a domain-aware damping factor based on the hallucination rate. This produces a risk-adjusted score, which becomes the final stored evaluation score.
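The scoring flow above can be condensed into a sketch. The five rubric dimensions, the two-score structure (C and M), the exponential hallucination penalty, and the damping step come from the design; the specific weights, the 50/50 blend, and the damping constant are assumptions for illustration only.

```python
import math

def comprehensive_score(rubric):
    """Rubric-based score C: average of five 1-5 dimensions, rescaled to 0-1."""
    dims = ["coverage", "faithfulness", "organization", "clarity", "style"]
    return sum(rubric[d] for d in dims) / (5 * len(dims))

def manual_score(signals):
    """Manual weighted baseline M: structural signals with an exponential
    hallucination penalty. Weights here are illustrative, not the real ones."""
    base = (0.4 * signals["section_coverage"]
            + 0.3 * signals["glossary_recall"]
            + 0.3 * (1 - signals["length_error"]))
    return base * math.exp(-3.0 * signals["hallucination_rate"])

def risk_adjusted_score(rubric, signals, damping=2.0):
    """Blend C and M into a raw quality score, then apply domain-aware
    damping based on the hallucination rate (constants assumed)."""
    raw = 0.5 * comprehensive_score(rubric) + 0.5 * manual_score(signals)
    return raw / (1 + damping * signals["hallucination_rate"])
```

A clean summary (no hallucinations, full coverage) passes through undamped, while any suspected unsupported content shrinks the final stored score twice: once in M's exponential penalty and once in the damping factor.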
3. Diagnostics Logging
The pipeline logs detailed evaluation metadata for transparency and reproducibility, including:
- refined_summary
- signals
- rubric
- agreement
- comprehensive_scoring
- hybrid_scoring
- leaderboard_scores
- iteration_score_table
- refinement_metadata
- final_score_0to1
- lecture_title
Both raw quality scores and risk-adjusted scores are stored alongside the policy parameters used during evaluation, enabling deeper inspection of scoring behavior.
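A minimal sketch of how such a record might be persisted; the helper name, file layout, and the example values are assumptions, but the field names mirror the list above.

```python
import json
import time
from pathlib import Path

def log_evaluation(record, log_dir="eval_logs"):
    """Append one evaluation record (scores plus policy parameters) as JSON
    for later inspection and reproducibility."""
    Path(log_dir).mkdir(exist_ok=True)
    path = Path(log_dir) / f"{record['lecture_title']}_{int(time.time())}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# Hypothetical record with illustrative values, using fields from the list above.
record = {
    "lecture_title": "lecture_01",
    "final_score_0to1": 0.88,
    "signals": {"hallucination_rate": 0.02},
    "refinement_metadata": {"iterations": 4, "stop_reason": "PASS"},
}
```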
4. Iterative Evaluation Pipeline
The stored data is then used to iteratively refine the initial summary, drawing on rubric feedback and pairwise preference comparisons between candidate revisions. The initial LLM is re-prompted with the judge's suggestions alongside the stored results to produce a new iteration of the output.
A trend-aware stopping controller determines when refinement should end. The controller monitors score trajectories and classifies each iteration into categories such as pass, borderline, stalled, or max iterations, preventing both premature stopping and unnecessary refinement loops.
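The controller's decision logic can be sketched as a pure function of the score trajectory. The PASS / BORDERLINE / STALLED / MAX_ITERS categories come from the design; all thresholds below are illustrative assumptions.

```python
def stopping_decision(scores, pass_threshold=0.85, borderline_margin=0.02,
                      stall_window=3, stall_delta=0.005, max_iters=7):
    """Classify the current iteration from the score trajectory so far.
    Thresholds are assumed values for this sketch, not the real policy."""
    if scores[-1] >= pass_threshold:
        return "PASS"                         # strict threshold satisfied
    if len(scores) >= max_iters:
        return "MAX_ITERS"                    # iteration budget exhausted
    recent = scores[-stall_window:]
    if len(recent) == stall_window and max(recent) - min(recent) < stall_delta:
        return "STALLED"                      # trend has flattened out
    if scores[-1] >= pass_threshold - borderline_margin:
        return "BORDERLINE"                   # close to passing; keep refining
    return "CONTINUE"
```

Conditioning on a window of recent scores rather than the last delta alone is what prevents both premature stopping (one noisy dip) and endless loops (oscillation around a plateau).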
To reduce noise from a weak final rewrite, the system applies a best-of-last-k safeguard, selecting the highest-quality summary among the final iterations before computing the final score.
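The safeguard itself reduces to a small selection step over the final iterations; `k=3` is an illustrative default, not the configured value.

```python
def best_of_last_k(scored_summaries, k=3):
    """Select the highest-scoring (score, summary) pair among the final k
    iterations, so one weak last rewrite cannot drag down the final score."""
    tail = scored_summaries[-k:]
    return max(tail, key=lambda pair: pair[0])
```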
This iterative process ensures that summaries are progressively refined while maintaining faithfulness to the source material and stable evaluation behavior.
End-to-End Automation
From the initial LLM generation to the final evaluation score, the entire workflow operates as a fully automated pipeline. No manual intervention is required, allowing the framework to scale across large evaluation datasets.
Dataset
Our dataset consists of lecture slides from multiple UC San Diego courses across data science, biology, and interdisciplinary STEM fields. The materials are provided as PDF slide decks spanning several academic domains.
Lecture slides present unique summarization challenges. Unlike traditional articles, they are concise, visually structured, and often omit transitional language. This makes automated evaluation more difficult. We extract the text while preserving structural signals such as slide boundaries and section headings.
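A minimal sketch of how slide boundaries can survive extraction: per-slide text (as produced by any PDF extractor, e.g. pypdf's `page.extract_text()`) is joined with explicit markers. The marker format is an assumption of this sketch.

```python
def join_slides(slide_texts):
    """Join per-slide text with explicit boundary markers so downstream
    signals (section coverage, slide structure) remain recoverable."""
    parts = []
    for i, text in enumerate(slide_texts, start=1):
        parts.append(f"=== Slide {i} ===")   # assumed marker format
        parts.append(text.strip())
    return "\n".join(parts)
```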
What we built
Our team implemented the iterative refinement controller, domain-aware rubric routing, hybrid scoring pipeline, dashboard reporting, and experiment framework. We reuse general-purpose LLM APIs as model backbones, but the evaluation logic, stopping controller, and reporting system are our own.
Results
Refinement (Lecture 1)
Below is an example of the initial vs. final iteration of a summary using our framework. An initial summary is evaluated by the judge system, weaknesses are identified, and a revised summary is produced.
Initial Summary

This lecture introduces the foundational concepts of financial accounting, emphasizing its role as the “language of business.” Accounting communicates a company’s financial story by identifying, measuring, and recording economic events and reporting them to decision makers.
Users of accounting information include investors, creditors, suppliers, managers, employees, customers, and regulators. These stakeholders use financial information to evaluate company performance and financial stability.
Financial accounting focuses on three main activities—financing, investing, and operating—which are summarized in four financial statements: the balance sheet, income statement, statement of stockholders’ equity, and statement of cash flows.
The reporting process follows Generally Accepted Accounting Principles (GAAP), overseen by the SEC and implemented by the Financial Accounting Standards Board (FASB), with IFRS providing international alignment.
The balance sheet reports assets, liabilities, and stockholders’ equity at a specific point in time and follows the accounting equation:
Assets = Liabilities + Stockholders’ Equity
Refined Summary

The opening lecture of Principles of Accounting introduces financial accounting as the “language of business.”
Accounting translates operational activities and outcomes into structured financial information used for decision making.
Users of accounting information include investors, creditors, suppliers, managers, employees, customers, and regulators.
For example, investors evaluate company performance, while creditors and suppliers assess a firm's ability to repay obligations.
Financial accounting captures three types of activities: financing, investing, and operating.
These activities are summarized in four financial statements: the balance sheet, income statement, statement of stockholders’ equity, and statement of cash flows.
Reporting follows Generally Accepted Accounting Principles (GAAP), established by the Financial Accounting Standards Board (FASB) under oversight of the Securities and Exchange Commission (SEC).
Public companies communicate this information through regulatory filings such as 10-K annual reports, 10-Q quarterly reports, and 8-K disclosures for major events.
International reporting standards are provided through the International Financial Reporting Standards (IFRS).
The balance sheet reports a firm's financial position at a specific point in time and follows:
Assets = Liabilities + Stockholders’ Equity
Highlighted text indicates improvements introduced during iterative refinement: structural framing (green), expanded conceptual explanation (yellow), and additional technical detail (blue).
The highlighted sections show that refinement primarily improves the organization of the summary, restructuring information for greater clarity and readability. The refined version also introduces technical details that expand on topics mentioned, but not fully explained, in the initial summary.
Next is a visualization that shows how summaries evolve across refinement iterations for each lecture.
Iteration Scores (Lectures 1-7)
Key Findings
Each lecture begins with an initial summary generated by the base model. The judge model evaluates the output and generates refined summaries through multiple iterations, with each version scored using the evaluation rubric.
Across the seven lectures, most runs converged within four iterations. Lectures 2 and 7 required five iterations, while Lecture 3 required seven iterations before the controller determined that improvements had plateaued.
Initial rubric scores were already strong, ranging from 4.30 to 4.56 (out of 5). Iterative refinement produced small but measurable adjustments, generally stabilizing performance rather than dramatically increasing it. The evaluation metrics for each summary vary from iteration to iteration, indicating that each revision actively attempts improvements; the drops and gains in quality suggest that each iteration explores a different avenue of improvement that succeeds or fails depending on what is changed. Overall, each lecture produced at least one refined summary that improved on the initial output.
Quality scores across iterations ranged from approximately 0.74 to 0.88.
- Lecture 2 achieved the highest quality score (0.8843).
- Lecture 3 produced the lowest peak quality score (0.7816).
For each lecture, the system selects the iteration with the highest quality score as the final summary. This selection strategy prevents later iterations with minor regressions from reducing the reported result.
The stopping controller effectively identified when summaries reached a quality plateau. Some runs terminated because strict evaluation thresholds were satisfied (PASS), while others ended when score trends stalled (STALLED). This mechanism avoids unnecessary refinement once summaries stop improving.
Scope and Limitations
Our framework evaluates and refines summaries but does not improve the underlying summarization model. Quality gains come from iterative re-prompting within the evaluation loop rather than model fine-tuning.
The experiments also have several limitations. The dataset is small (seven lectures), so results should be interpreted as descriptive rather than statistically conclusive. We also used GPT-5 models for both generation and evaluation, which may introduce stylistic bias in scoring.
Future Work
Future work will focus on expanding the dataset, testing additional LLMs, and standardizing the evaluation mode across experiments. Expanding the domain knowledge bank should help the framework generalize to more topics, increasing its overall effectiveness. Another potential avenue is to move beyond summarization models and evaluate LLM outputs in general.
Conclusion
We presented an end-to-end framework for evaluating and improving LLM-generated lecture summaries without requiring reference summaries. The system combines iterative refinement with dynamic stopping, domain-aware rubric evaluation, and a hybrid scoring method that blends LLM judgment with deterministic signals.
Across seven lectures, the pipeline consistently produced high rubric scores (4–5 / 5) and typically converged within four refinement iterations, showing that iterative feedback can efficiently stabilize summary quality. Additionally, each summary improved upon its initial output at some point during refinement.
These results suggest that iterative LLM-as-judge pipelines with hybrid scoring can provide a practical approach to improving generated summaries, while highlighting the need for larger datasets.