Researchers increasingly use ChatGPT’s Deep Research to explore literature and form hypotheses, yet many struggle to craft prompts that balance breadth and depth: broad prompts yield shallow overviews, while narrow ones miss relevant perspectives. We propose an interactive, pre-run visualization that scaffolds prompt refinement for Deep Research. The tool would classify user intent, flag missing constraints (e.g., evidence, scope, comparisons), and preview depth–breadth trade-offs while tracking how edits shift response quality (coherence, depth, relevance). A small open model powers critique and dual prompt rewrites; users then send only the finalized prompt to Deep Research. We will evaluate the system in a within-subjects study (N=10–12), measuring quality gains, run efficiency, and perceived control. Our goal is to make prompt engineering transparent and to help researchers obtain deeper, more targeted results with fewer costly runs.
A screenshot of the interactive visualization prototype for Deep Research prompt refinement.
After you enter a research prompt, the system generates a graph from the prompt with suggestions for improving it. The origin node contains the prompt being refined, and each target node contains a specific piece of feedback. The edge linking the two nodes is labeled with the type of feedback the prompt received, such as scope, evidence, or comparison. On the left side of the screen, additional quality metrics for the prompt, such as depth, breadth, coherence, and relevance, are displayed. A minimal data-model sketch for this graph appears below.
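To make the graph's structure concrete, the following sketch shows one way the prompt node, feedback nodes, labeled edges, and left-panel metrics could be represented. The class and field names are illustrative assumptions for exposition, not the prototype's actual implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class FeedbackType(Enum):
    # Edge labels: the kind of feedback linking the prompt to a suggestion.
    SCOPE = "scope"
    EVIDENCE = "evidence"
    COMPARISON = "comparison"


@dataclass
class PromptNode:
    # Origin node: the prompt currently being refined.
    text: str
    # Quality metrics shown in the left-hand panel (illustrative 0-1 scores).
    depth: float = 0.0
    breadth: float = 0.0
    coherence: float = 0.0
    relevance: float = 0.0


@dataclass
class FeedbackNode:
    # Target node: one specific piece of feedback on the prompt.
    suggestion: str


@dataclass
class FeedbackEdge:
    # Edge from the prompt to a feedback node, labeled by feedback type.
    source: PromptNode
    target: FeedbackNode
    kind: FeedbackType


@dataclass
class PromptGraph:
    prompt: PromptNode
    edges: List[FeedbackEdge] = field(default_factory=list)
```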
What did you try to do? What problem did you try to solve? Articulate your objectives using absolutely no jargon.
We tried to help people get better results when they ask AI tools to do deep research. Many people type in questions that are either too broad or too narrow, which leads to weak or unfocused answers. We wanted to make it easier for them to improve their questions before they run the research tool. Our goal was to build a simple, interactive way for people to see how small changes in their question might change the depth, clarity, and usefulness of the AI’s response. We aimed to guide users step-by-step as they refine their question so they can get stronger, more targeted results without wasting time on repeated runs.
How is it done today, and what are the limits of current practice?
Today, researchers using Deep Research features in tools like ChatGPT, Perplexity, or Gemini typically follow a linear, trial-and-error workflow. A user writes a prompt, submits it, waits several minutes for a full research report, and then revises the prompt if the results are too shallow, unfocused, or irrelevant. This process often repeats multiple times because current systems provide little guidance about why a prompt failed or how to improve it.

Recent work shows clear limits in this workflow. Hayati (2025) demonstrates that Deep Research outputs are highly sensitive to prompt wording: broad prompts often produce surface-level overviews, while overly narrow prompts miss important perspectives. The study notes that researchers struggle to control the desired balance of breadth and depth and often receive irrelevant papers unless they manually specify detailed constraints. It recommends tools that can highlight missing keywords, suggest vocabulary, or provide scaffolds for structuring prompts, none of which are supported in current interfaces.

Existing prompt-engineering tools, such as ChainForge (Arawjo et al., 2024), PromptIDE (Strobelt et al., 2022), and EvalLM (Kim et al., 2024), offer comparison views, scoring, and rapid iteration. However, these systems are designed for fast, lightweight LLM tasks. They assume immediate model responses and therefore do not address the long run-times and limited-use constraints of Deep Research modes. Moreover, these tools provide post-hoc evaluation but no guidance on how prompt edits influence depth, coherence, or relevance in multi-step research reasoning.

In summary, current practice forces researchers to refine Deep Research prompts through slow, unguided iteration. Users receive no actionable feedback about missing constraints, no visualization of how edits alter the reasoning process, and no way to anticipate the depth–breadth trade-offs highlighted in prior literature. This gap motivates the need for interactive, pre-run visualization and prompt scaffolding, which our project aims to address.
Who cares? If you are successful, what difference will it make?
This work matters because many researchers depend on Deep Research tools to explore literature and generate hypotheses, yet as Hayati (2025) shows, the quality of these results is highly sensitive to how the prompt is written. Users often struggle to specify depth, scope, evidence needs, and exclusions, leading to shallow or unfocused reports and repeated costly runs. Prior work such as ROPE (Ma et al., 2025) demonstrates that providing structure and requirement feedback can significantly improve outcomes, but current Deep Research interfaces offer no such support. If successful, our system would give researchers a clear, interactive way to understand how their prompt choices shape the final report before they spend time and credits running it. This would reduce wasted iterations, improve response quality, and make Deep Research more accessible to non-experts. It would also extend insights from visualization tools like ChainForge and EvalLM to a setting where fast iteration isn’t possible, filling an important gap in tools for AI-assisted scholarly work.
What did you do exactly? How did you solve the problem? Why did you think it would be successful? Is anything new in your approach?
We built an interactive, pre-run visualization tool that helps users refine their Deep Research prompts before they run a long, expensive query. The system analyzes a user’s prompt using a lightweight model, identifies missing elements such as unclear scope or absent evidence requirements, and highlights how these gaps may affect depth or relevance. It then shows how small edits change predicted quality dimensions (depth, clarity, coherence), helping users iterate on the prompt without repeatedly running Deep Research. This gives users live, structured feedback instead of relying on slow trial-and-error. We believed this would work because prior studies show that clearer requirements and prompt scaffolding improve outcomes significantly, but current Deep Research tools provide no such support. Our approach is new in that it applies ideas from prompt-engineering systems like ChainForge and EvalLM to the Deep Research setting, where responses take minutes and cannot be rapidly compared. Instead of evaluating after the fact, we provide actionable guidance before a run, filling a gap that existing tools and interfaces do not address.
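As a rough illustration of this pre-run analysis loop, the sketch below shows how a prompt could be sent to a small open model for critique and how edit-by-edit score changes could be tracked. The prompt template, dimension names, and the caller-supplied `generate` function are our own illustrative assumptions, intended to stay agnostic to any particular model-serving library; they are not the system's actual implementation.

```python
import json
from typing import Callable, Dict

# Quality dimensions previewed before a Deep Research run.
DIMENSIONS = ["depth", "breadth", "coherence", "relevance"]

CRITIQUE_TEMPLATE = """You are reviewing a Deep Research prompt before it is run.
Prompt:
{prompt}

Return JSON with:
- "missing": a list of missing constraints (e.g., scope, evidence, comparisons)
- "scores": an object with 0-1 scores for {dims}
- "rewrites": two alternative rewrites, one broader and one deeper
"""


def critique_prompt(prompt: str, generate: Callable[[str], str]) -> Dict:
    """Ask a small open model to flag gaps and predict quality scores.

    `generate` is any function that sends text to a local model and returns
    its raw completion; the caller supplies it so this sketch does not assume
    a specific inference library.
    """
    raw = generate(CRITIQUE_TEMPLATE.format(prompt=prompt, dims=", ".join(DIMENSIONS)))
    return json.loads(raw)  # assumes the model follows the JSON format above


def score_delta(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    # Track how a single edit shifts each predicted quality dimension.
    return {d: round(after.get(d, 0.0) - before.get(d, 0.0), 2) for d in DIMENSIONS}
```

In this loop, only the finalized prompt is ever sent to Deep Research; all critique and score-preview calls go to the lightweight local model.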
What problems did you anticipate? What problems did you encounter? Did the very first thing you tried work?
We anticipated several challenges. Deep Research responses are slow and expensive, so we knew we could not rely on rapid iteration the way prior tools do. We also expected difficulty in predicting depth or relevance from a single prompt, since these qualities depend on complex reasoning chains. Finally, we foresaw usability challenges, since researchers need guidance that feels helpful rather than intrusive or overwhelming.
How did you measure success? What experiments were used? What were the results, both quantitative and qualitative? Did you succeed? Did you fail? Why?
We measured success by testing whether our tool helped researchers create better Deep Research prompts than they could on their own. In a within-subjects study, each participant completed one task using the standard Deep Research interface and another using our visualization tool. We then compared the quality of the resulting research reports using a rubric informed by prior work that evaluates depth, coherence, and relevance, as well as participants’ own ratings of success and clarity. Together, these measures allowed us to evaluate both objective improvements in output quality and subjective improvements in the user experience.
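One possible way to analyze the paired rubric scores from this within-subjects design is sketched below; the use of a Wilcoxon signed-rank test and the function names are illustrative assumptions, not a description of the reported analysis or results.

```python
from typing import Dict, List

from scipy.stats import wilcoxon

# Rubric dimensions scored for each participant's Deep Research report.
RUBRIC = ["depth", "coherence", "relevance"]


def compare_conditions(baseline: List[Dict[str, float]],
                       with_tool: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Paired, per-dimension comparison of rubric scores.

    Each list holds one dict of rubric scores per participant, in the same
    participant order, so scores can be paired within subjects.
    """
    results = {}
    for dim in RUBRIC:
        scores_a = [p[dim] for p in baseline]
        scores_b = [p[dim] for p in with_tool]
        res = wilcoxon(scores_a, scores_b)  # non-parametric paired test
        results[dim] = {"statistic": float(res.statistic), "p_value": float(res.pvalue)}
    return results
```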
How easily can your results be reproduced by others? Did your dataset or annotation affect other people's choice of research or development projects to undertake? Does your work have potential harm or risk to our society? If so, what kinds, and how can you address them? What limitations does your model have? How can you extend your work for future research?