
Slow AI: Designing User Control for Long Tasks

  • Writer: Jakob Nielsen
  • 28 min read
Summary: Batch processing returns as AI agents run for hours or days, eliminating traditional turn-taking interaction. Slow AI creates control and observability problems; users forget context during long waits. UX solutions include upfront clarification, explicit run contracts, checkpointing for recovery, conceptual breadcrumbs showing reasoning, and progressive disclosure of partial results.

 


Interactions that take a long time require special UX interventions to ensure usability. Extremely long run times require even more design work. (Seedream 4)


This is a long article. For a shorter and more entertaining overview of the highlights, watch my music video Slow AI (YouTube, 4 min.).


Waiting hours for AI results creates anxiety: Is it working? Going the right direction? Users need reassurance through visible progress, early glimpses of the AI’s thinking, ability to correct course mid-run, and help remembering what they asked for days ago.


We’re now seeing AI agents that may run for hours or days to complete a workflow, and powerful AI tools (such as Sora 2) that require 10–20 minutes or more to produce complex work products, such as a fully edited video. These slow response times raise a new user interface design problem: how to retain user control and freedom when we no longer have a traditional interaction with frequent turn-taking.


Actually, this is an old problem come back from the dead: the Zombie UX of batch processing is being revived. Batch processing was the very first interaction paradigm for using computers: you would give the data center a stack of punch cards specifying what the computer should do, and you would return the next morning to receive a printout of the results.


(The other two interaction paradigms are the command-based interaction of virtually all computers from 1964–2023 and the intent-based outcome specification of modern AI systems starting in 2023.)



Batch processing is coming back from the dead. (Seedream 4)


Batch processing creates a highly unpleasant user experience, as I know from personally experiencing it in the 1970s when it was in its death throes. (One of the benefits of my very long career in UX is that I have personal experience with a very wide range of interaction styles.)


Whether command-based or intent-based, turn-taking is the preferred way of interacting with computers, but it requires response times of less than 10 seconds for each turn. (Preferably less than a second.)


Unfortunately, early AI products had miserable response times, and it often took more than those 10 seconds to render an image in Midjourney. However, even the slowest AI almost always delivered results within a minute, which — though annoying — allowed users to feel that they were in a dialogue (or “chat” as it’s now called) with the computer.


More powerful AI tools like Deep Research have broken this one-minute barrier big time, often requiring 10 minutes or more to deliver results. The same is true for the better video-generating tools: the wait for even a 15-second video clip in Sora 2 often runs to 10 minutes, and a 3-minute HeyGen avatar clip can take up to half an hour to render during busy workloads.



Waiting is never popular, but the most powerful AI systems are now taking so long to complete tasks that we can’t expect users to just sit there and wait for the AI to finish working. (GPT Image-1)


Reportedly, Claude Sonnet 4.5 has run independently for up to 30 hours to complete certain tasks. Generally, we know that the duration of the most complex tasks that AI models can complete doubles approximately every 7 months. While that 30-hour Sonnet task is the exception for the capabilities of current AI (where 2 hours is a more common limit for long tasks), multi-day tasks will be common in a few years, as part of transformative AI, as AI takes over almost all current human work.


Now for the UX question: fast response times are needed for traditional usability, but since we do want powerful AI that handles highly complex tasks, we’ll sometimes have to endure the wait. Extremely slow response times (hours or days) raise new usability concerns.


Clarification Before Running

A slow AI shouldn’t just set out to produce its best guess as to the user’s intention. Hours of user time and millions of AI tokens will often be wasted due to erroneous assumptions.

OpenAI’s Deep Research showcases one design pattern to address this problem: It analyzes the user’s request and asks clarifying questions before setting out to do its extensive research that often takes half an hour. This has saved me many times.


As we get AI video systems that are capable of generating an entire feature film in one run, it will also be best if the AI shows its direction for user approval before starting the detailed video generation. For example, it could show a storyboard for approval of plot details or rough samples of short video clips for approval of cinematographic style.


After any clarifications, make the upcoming interaction explicit: What will happen, when, and under which constraints. Show a run contract card that includes: (a) an ETA window (not a point estimate) with a confidence band, (b) the cost cap and model mix, (c) the definition of “done,” and (d) what the system will not do (e.g., won’t email external parties without approval). Users can accept this contract or edit the constraints before launching the run.
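
To make this concrete, a run contract could be represented as a small structured record that the UI renders for review and that the user can edit before launch. The sketch below is hypothetical (field names and values are illustrative, not from any shipping product) and uses the marketing-newsletter scenario described in the following example:

```python
from dataclasses import dataclass, field

@dataclass
class RunContract:
    """Hypothetical run-contract record shown to the user before a long run starts."""
    eta_hours_low: float            # optimistic end of the ETA window
    eta_hours_high: float           # pessimistic end of the ETA window
    eta_confidence: float           # e.g., 0.8 = 80% of similar runs finished inside the window
    cost_cap_usd: float             # hard spending ceiling for the run
    model_mix: dict = field(default_factory=dict)            # which model handles which kind of work
    definition_of_done: str = ""                             # what counts as a finished work product
    prohibited_actions: list = field(default_factory=list)   # things the agent must never do without approval

# Example values (invented) matching the newsletter scenario below.
contract = RunContract(
    eta_hours_low=6, eta_hours_high=10, eta_confidence=0.8,
    cost_cap_usd=220,
    model_mix={"drafting": "fast-model", "scoring": "careful-model"},
    definition_of_done="1,500 scored newsletter variants in 5 languages",
    prohibited_actions=["emailing drafts to customers"],
)
```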



A long path, but where does it lead? You want to know before setting out. (Seedream 4)


For example, a marketing team might ask AI to draft and score 1,500 newsletter content variants across five languages. A run contract might show a six‑to‑ten‑hour window, a $220 budget cap, a ban on emailing these draft newsletters to any customers, and a clause that only style guides from the “Brand Standards 2025” folder may be used. The team might want faster results and reduce the scope to 600 content variants, set a $150 spend ceiling, and demand a mid‑run sample pack for approval.


Unless the run contract card can be generated within a one-second response time, the AI should assume that the user may have diverted his or her attention and not wait for approval. The AI should simply start working, at the cost of possibly having to redo some steps if the user later reviews the contract and decides to make changes. The AI shouldn’t perform destructive actions that cannot be undone until after getting the approval, but some wasted tokens from time to time are worth the expense in return for faster task completion every time.


Long Runs Are Harder Than They Look

Failure is common in long-running AI systems. Even with the most carefully engineered pipelines, things go wrong. Networks blink in and out, servers reboot, API rate limits suddenly appear, external services change their responses, and context windows overflow. For a system that needs to operate continuously for 30 hours, any one of these disruptions can be fatal unless the agent has been designed to anticipate and recover gracefully.


A short-lived agent that produces a two-minute summary or a one-time code snippet can rely on fragile assumptions. If it fails, the user can simply retry. But once you stretch execution across dozens of hours, a “just retry it” mentality becomes unsustainable. Imagine running a 29-hour experiment that collapses in the final 10 minutes because the agent lost its state. That is not just frustrating, it’s devastating.


The second challenge is observability decay. Logs roll over. Prompts mutate. External tools change their state. By the time something breaks, you may no longer know why. A failing step that might be diagnosed easily in a five-minute run becomes a forensic nightmare in a thirty-hour one.


Finally, there is the problem of human patience versus machine time. People expect a progress bar or at least a reliable status update within minutes. Machines, on the other hand, can run in the background for days. Designing long-running agents requires reconciling these mismatched time horizons: the system must keep humans informed while still grinding through massive tasks at machine speed.


People have lives; laptops have batteries; network policies have bedtimes. Long runs must support pause and resume gracefully. Pausing should preserve state deterministically and show a succinct “what changes on resume” diff. If resumption requires a different model tier or a new quota bucket, the system should propose a re‑baselined plan and quantify the impact.


Learning from Mainframes and Factories

To cope with these difficulties, it helps to borrow principles from two older worlds: mainframe batch computing and factory production lines. Both domains grappled with longevity decades ago, and the lessons remain relevant.



Mainframe computers were never quite as steampunk as this rendering, but they might as well have been, as far as most present-day UX designers are concerned. Disregarding the past is wrong, however, since many lessons from the mainframe era apply to the next generation of AI systems. (Seedream 4)


The first principle is to establish a Single Source of Truth (SSOT). In mainframe computing, this usually meant a central transaction log or append-only ledger that captured every action. In factories, it meant a kanban board or production register that no one could bypass. For AI agents, an SSOT should consist of a durable state record: an append-only event log that records every attempted action, paired with periodic snapshots of the agent’s working state.


For example, if an agent is tasked with analyzing thousands of research papers, the SSOT would contain entries such as “Downloaded paper #245” and “Extracted summary of section 3.2.” If the process fails at step 1,276, we can restart from that point instead of starting from zero.
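
A minimal sketch of such an append-only event log, assuming a simple JSON-lines file as the durable store (a production system would use a transactional database or ledger):

```python
import json, time
from pathlib import Path

LOG_PATH = Path("run_events.jsonl")  # hypothetical location of the single source of truth

def append_event(step_id: str, action: str, status: str, detail: dict | None = None) -> None:
    """Append one immutable event record; the log is never edited, only extended."""
    record = {
        "ts": time.time(),
        "step_id": step_id,      # e.g., "download-paper-245"
        "action": action,        # e.g., "download", "extract-summary"
        "status": status,        # "started", "succeeded", "failed"
        "detail": detail or {},
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def last_completed_step() -> str | None:
    """Find the most recent successfully completed step so a crashed run can resume there."""
    last = None
    if LOG_PATH.exists():
        for line in LOG_PATH.read_text(encoding="utf-8").splitlines():
            event = json.loads(line)
            if event["status"] == "succeeded":
                last = event["step_id"]
    return last
```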


The second principle is idempotence, a term borrowed from mathematics and computing. An idempotent operation is one that can be repeated without changing the final result. In practice, this means designing every agent action so that it can be retried after a crash without introducing corruption. If the agent has already downloaded a file, downloading it again should not break anything. If it has already generated a summary, regenerating should either produce the same output or clearly indicate that the step is complete. This sounds obvious but is frequently ignored, leading to duplication, inconsistencies, or runaway loops.
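
A sketch of what idempotence can look like in practice, assuming downloaded artifacts are cached on disk under a hash of their URL (the cache location and naming are illustrative):

```python
import hashlib, urllib.request
from pathlib import Path

CACHE_DIR = Path("artifact_cache")  # hypothetical artifact store

def download_once(url: str) -> Path:
    """Idempotent download: repeating the call after a crash returns the cached file
    instead of re-fetching or creating a duplicate."""
    CACHE_DIR.mkdir(exist_ok=True)
    target = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if target.exists():                        # step already done: retrying changes nothing
        return target
    tmp = target.with_suffix(".partial")
    urllib.request.urlretrieve(url, tmp)       # write to a temporary file first
    tmp.rename(target)                         # atomic rename marks the step complete
    return target
```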


The third principle is to break tasks into small, restartable work units. Just as factories rely on workstations that each complete a short, well-defined operation, agents should divide their 30-hour journeys into narrow tasks lasting 5–15 minutes each. A plan to analyze 10,000 documents should not be a single monolithic “analyze everything” instruction. It should be a directed acyclic graph (DAG) of smaller nodes: fetch document, parse metadata, extract text, summarize sections, and so forth. Each node should produce explicit inputs and outputs so that failures can be isolated and retried.
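
A minimal sketch of such a plan expressed as a DAG of narrow nodes; node names and time estimates are illustrative:

```python
# Each node is a short, restartable unit with explicit dependencies (illustrative names).
PLAN = {
    "fetch_document":  {"deps": [],                             "est_minutes": 5},
    "parse_metadata":  {"deps": ["fetch_document"],             "est_minutes": 5},
    "extract_text":    {"deps": ["fetch_document"],             "est_minutes": 10},
    "summarize":       {"deps": ["extract_text"],               "est_minutes": 15},
    "final_report":    {"deps": ["parse_metadata", "summarize"], "est_minutes": 10},
}

def runnable_nodes(plan: dict, completed: set[str]) -> list[str]:
    """Nodes whose dependencies are all done and which have not run yet."""
    return [
        name for name, node in plan.items()
        if name not in completed and all(dep in completed for dep in node["deps"])
    ]

# Example: after fetch_document finishes, parse_metadata and extract_text become runnable.
print(runnable_nodes(PLAN, completed={"fetch_document"}))
```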


This leads to the fourth principle: deterministic plans with adaptive tactics. The plan itself should remain fixed throughout execution, providing stability. But within each node, the agent should adapt to conditions. For example, if the primary summarization model is rate-limited, the agent might fall back to a smaller one. What matters is that the structure of the plan remains immutable, while the tactics used to execute individual steps can vary.


Defining True Progress

Users need to know how far along the system has come. Unfortunately, most AI agents today report progress in vague or misleading ways. A message like “Almost done…” after 12 hours is not helpful. Instead, progress should be calculated in terms of planned tasks versus completed tasks, weighted by their contribution to the final goal.



A traditional percent-done indicator is useful for response times that are slower than 10 seconds but still faster than 10 minutes. For a 10-hour AI task, 5% progress (as shown here) would take half an hour: too long to sit and stare at the bar while wondering whether the AI is on the right track. (Seedream 4)


For example, if a DAG contains 100 nodes but 80 of them are trivial and the remaining 20 are computationally heavy, then finishing 80% of the nodes does not mean 80% progress. True progress should account for the critical path: the sequence of dependent tasks that determines total completion time.


A well-designed progress indicator for slow AI should show three layers:


  1. Overall completion percentage: how much of the total plan is done, using estimated time rather than the number of steps. Also show an estimate of when the full task will be completed, so that the user doesn’t have to calculate this manually.

  2. Critical path status: what step the agent is currently working on that gates overall progress.

  3. Current step ETA: how long until the current node is expected to finish, bounded by service-level agreements (SLAs) for that type of task.
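
As a rough illustration of the first layer, here is a sketch that weights completion by estimated time rather than node count and derives a naive finish estimate; it reuses the hypothetical PLAN structure from the DAG sketch above. (Layers 2 and 3 would additionally report the node currently gating the critical path and that node’s own ETA.)

```python
def progress_report(plan: dict, completed: set, started_at: float, now: float) -> dict:
    """Time-weighted completion plus a naive finish estimate (a sketch, not calibrated)."""
    total = sum(node["est_minutes"] for node in plan.values())
    done = sum(plan[name]["est_minutes"] for name in completed)
    fraction = done / total if total else 1.0
    elapsed_min = (now - started_at) / 60
    # Crude extrapolation: assume remaining work proceeds at the pace observed so far.
    eta_min = (elapsed_min / fraction - elapsed_min) if fraction > 0 else None
    return {
        "percent_complete": round(100 * fraction, 1),
        "estimated_minutes_remaining": None if eta_min is None else round(eta_min),
    }
```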


In addition, the agent should surface blocking conditions explicitly. If it is waiting for human approval, the status should say so. If it is in a retry loop due to API rate limits, the dashboard should read “retrying in 12 minutes.”



Most AI providers are seriously under-resourced with compute, because AI has proven so popular with users that they constantly increase their demand for more AI services. (OpenAI is in a particularly bad spot.) This inevitably means that extensive tasks become rate limited, especially if run during peak load hours. (GPT Image-1)


A useful mental model is the operator dashboard. At a glance, an operator should be able to see the plan version, the critical path node, recent events, budget consumption, checkpoint age, and estimated time to completion.


Early estimates are guesses with a tie on. As the system proceeds through the task and clears unknowns, those estimates should mature from a wide cone to a narrow one that cites historical calibration. By the second hour of a complex video‑synthesis pipeline, the agent might narrow its forecast from “four to twelve hours” to “6.3–7.1 hours,” while showing that 78% of similar runs finished within this band. Forecasts that admit their weakness age better than bravado. Operators gain a realistic sense of pace, and stakeholders stop asking, “Are we there yet?” every twenty minutes because the cone makes the uncertainty of the answer visible.
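
One way the widening-and-narrowing forecast band could be derived is from the durations of historically similar runs. The sketch below uses invented history and ignores conditioning on the current run’s observed pace:

```python
def forecast_band(similar_run_hours: list, coverage: float = 0.8) -> tuple:
    """Return the central band of durations that covers `coverage` of historically similar runs."""
    durations = sorted(similar_run_hours)
    lo_idx = int(len(durations) * (1 - coverage) / 2)
    hi_idx = int(len(durations) * (1 + coverage) / 2) - 1
    return durations[lo_idx], durations[max(hi_idx, lo_idx)]

# Example with invented history: 80% of comparable runs finished between these bounds.
history = [5.8, 6.1, 6.3, 6.4, 6.6, 6.8, 6.9, 7.0, 7.1, 8.4]
print(forecast_band(history))   # (6.1, 7.1)
```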


If compute capacity constraints drive the schedule, hiding the queue is hostile. Show customers their position, the moving‑average throughput, the reason for throttling, and the options to mitigate. At 10:04 a.m., a user might see, “Position 47 of 280; average start in eight minutes; GPU pool saturated due to Monday morning surge.” The UI can propose deferring to off‑peak hours for faster completion and reduced token pricing, splitting the run into smaller jobs, or paying for a reserved accelerator. Capacity limitations are not a scandal; the current AI secrecy is.



AI should make capacity limits explicit and offer reduced pricing during off-peak hours, allowing users to prioritize what they need now and what can wait. (GPT Image-1)


The Need for Conceptual Observability

Technical progress indicators, such as the percentage of completed nodes in a DAG or lists of files processed, address the quantitative aspect of a long task. However, they often fail to address the qualitative aspect: what the AI is actually thinking and achieving conceptually.


When a user waits for days, anxiety stems not just from the wait, but from the uncertainty of whether the AI is pursuing a fruitful direction. Traditional batch processing often produced predictable outcomes because the instructions were explicit. AI agents, operating on intent-based outcome specification, are inherently less predictable. A status report like “Downloaded paper 245” is technically accurate but conceptually opaque. It tells the user that the AI is busy, but not that it is being useful.


To maintain user control and build trust, slow AI must provide Conceptual Breadcrumbs. These are short, synthesized summaries of the insights, decisions, or intermediate conclusions reached by the AI at key milestones.


For example, instead of only reporting technical steps in a research task, the agent should report conceptual milestones:


  • “Identified three main arguments against the initial hypothesis in the first 50 papers.” (Followed by a one-click option to summarize these three main arguments.)

  • “Discovered an unexpected correlation between variable X and Y; shifting focus to investigate this link.”

  • “Rejected the initial design direction for the video’s second act due to inconsistencies in character motivation.”


These breadcrumbs allow the user to quickly assess the AI’s line of reasoning without diving into detailed logs or artifacts. If the AI reports a conclusion that the user knows is fundamentally flawed, the user can intervene immediately. The UI should present these Conceptual Breadcrumbs prominently in the operator dashboard, distinct from the detailed event log.


Support Context Switching

When an AI task takes 30 hours to complete, users won’t sit and watch. They will context switch by moving to other projects, attending meetings, perhaps even going home for the night. This creates a cognitive discontinuity problem that doesn’t exist with fast interactions.


This introduces a severe usability penalty: the cost of context switching. We know from decades of human factors research that resuming an interrupted task takes time and increases the likelihood of errors. For slow AI, the interruption isn’t seconds; it’s days. This magnifies the cognitive load required to re-engage.


The challenge is twofold. First, when the AI needs user input after 8 hours of autonomous work, the user must reconstruct their mental model of what they asked for and why. I’ve experienced this with Deep Research: returning to a finished research task after a day of other work, I sometimes struggle to remember my original intent or the specific angle I wanted to explore.


Second, when the AI completes its work, the user must re-engage with enough context to evaluate the output meaningfully. A notification saying “Your video is ready” after 12 hours is insufficient, especially when the user has many balls in the air. By then, the user will have trouble identifying which of these “balls” (videos) the notification concerns, or may have forgotten what specific client deliverable this was meant for.



Supporting users when they return from a long absence requires UX design for regaining context that has been forgotten. (Seedream 4)


This raises the critical need for Context Reboarding. The system must help the user quickly regain situational awareness. This adheres to the usability heuristic of recognition rather than recall; the user should not have to remember the state of the task.


When the user returns to the system, whether to check progress, provide input, or view the final result, the UI must clearly present a Resumption Summary:


  • The Original Intent: The initial prompt and any clarifications provided by the user before the run started.

  • Key Decisions Made: A summary of human approvals or interventions already provided during the run, as well as the most important Conceptual Breadcrumbs.

  • Current Status and Cost: How long the system has worked, the resources consumed, and the estimated time to completion.

  • What happens if no input is given, so users can safely defer if they’re in the middle of something else.

If the user has to dig through logs or decipher complex output to understand the current state, the interface has failed. The AI must onboard the user back into the context they left days before.


For completed tasks, the results page should show a reminder card at the top: “You asked for X, I focused on Y approach because of Z, here's what I produced.” This 30-second re-orientation can mean the difference between useful evaluation and confused abandonment.
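
A sketch of how such a reminder card might be assembled from records the system already keeps; it borrows the hypothetical contract and breadcrumb names used in the earlier sketches:

```python
def resumption_summary(contract, breadcrumbs: list, decisions: list,
                       hours_elapsed: float, cost_so_far: float, eta_hours: float) -> str:
    """Assemble a short reminder card so a returning user can re-engage without digging through logs."""
    lines = [
        f"You asked for: {contract.definition_of_done}",
        f"Key decisions so far: {'; '.join(decisions) or 'none'}",
        f"Most important findings: {'; '.join(breadcrumbs[-3:]) or 'none yet'}",
        f"Status: {hours_elapsed:.1f} h elapsed, ${cost_so_far:.0f} of ${contract.cost_cap_usd:.0f} spent, "
        f"about {eta_hours:.1f} h remaining",
        "If you do nothing: the run continues within its contract and will notify you on completion.",
    ]
    return "\n".join(lines)
```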


Notification Design for Asynchronous Intelligence

Long AI tasks raise a question that traditional software rarely faces: how do you notify someone about something they asked for last week?


Email notifications (like HeyGen uses now) are the obvious solution, but they’re also terrible. An email saying “Your AI task needs attention” gets buried among 200 other messages, or arrives while the user is in a four-hour meeting. By the time they see it, the urgency has passed, or they’ve mentally moved on.


Push notifications are better for immediacy but worse for interruption. A smartphone buzz about a completed AI analysis might arrive at 3 AM if the user is traveling, or during a critical presentation. The interruption cost can exceed the value of the notification.


Slow AI requires attention management that is timely and context-aware. Users do not need a notification every time a minor sub-task completes. They need notifications for critical events: Intervention Required, Milestones Reached, Anomalies Detected, and Completion.


The design pattern that works best is tiered, context-aware notifications:


Tier 1 - Critical blocks: If the AI is completely blocked and cannot proceed without user input, send an immediate push notification, followed by email after 30 minutes, and SMS after 2 hours. But make these rare. Most user-input-needed situations are not truly blocking.


Tier 2 - Decisions that affect quality: If the AI can proceed with a default choice but user input would improve the outcome, show an in-app notification the next time the user opens the application, and send a low-priority email after 4 hours. The key is that the AI continues working while awaiting input, then applies any user guidance retroactively if needed.


Tier 3 - Completion notices: When the task finishes, use the user’s stated preference: some want immediate notifications, others prefer a daily digest of completed tasks. The system should learn from user behavior: if someone consistently ignores notifications until evening, stop bothering them during the workday.
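
A minimal sketch of how these tiers might map to channels and escalation delays; the timings mirror the tiers above, and the channel names are illustrative:

```python
import enum

class Tier(enum.Enum):
    CRITICAL_BLOCK = 1      # run cannot proceed without user input
    QUALITY_DECISION = 2    # run continues with a default; input would improve it
    COMPLETION = 3          # task finished

# Escalation plan per tier: (channel, minutes to wait before using it).
ESCALATION = {
    Tier.CRITICAL_BLOCK:   [("push", 0), ("email", 30), ("sms", 120)],
    Tier.QUALITY_DECISION: [("in_app", 0), ("email", 240)],
    Tier.COMPLETION:       [("user_preference", 0)],   # immediate ping or daily digest
}

def notifications_due(tier: Tier, minutes_unacknowledged: float) -> list:
    """Channels that should have fired by now, given how long the event has gone unacknowledged."""
    return [channel for channel, delay in ESCALATION[tier] if minutes_unacknowledged >= delay]

# Example: a critical block ignored for 45 minutes has escalated from push to email, not yet SMS.
print(notifications_due(Tier.CRITICAL_BLOCK, 45))   # ['push', 'email']
```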



To avoid notification overload, the loudness of the virtual “ping” must be scaled according to 3 tiers of notification importance. (Seedream 4)


Notifications must also be context-rich. A message saying “Task 45B failed” is useless. A better message is: “Deep Research Task: The analysis of competitor pricing failed due to API authentication errors. Click here to re-authenticate.”


Additionally, respect do-not-disturb contexts. Integration with calendar systems allows the AI to know when a user is in meetings, out of office, or in a different timezone’s sleep hours. A research task that completes at 2 AM should queue its notification until a reasonable hour unless the user has explicitly requested otherwise.


Finally, provide a status dashboard that users can check proactively without waiting for notifications. Many users prefer to pull information rather than be pushed to it. The dashboard should show all active tasks, estimated completion times, and any pending decisions, allowing users to engage on their own schedule.


Building Reliable Checkpoints

Even the best-designed long-running agent will encounter crashes. The key is ensuring that checkpoints allow for reliable resumption. A checkpoint is essentially a saved state of the system that can be restored later.


What should be serialized? At minimum, every checkpoint should include:


  • The DAG plan and its current execution frontier.

  • Tool configurations and API cursors.

  • Prompt templates and their resolved variables.

  • Working artifacts with content hashes.

  • Random seeds and model parameters to ensure reproducibility.

  • References to external sources stored with digests, not just URLs, so that they cannot change silently.
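
A sketch of what serializing such a checkpoint could look like, with content hashes so silently changed artifacts are caught on resume (all field names are illustrative, and a real system would checkpoint more state):

```python
import hashlib, json, time
from pathlib import Path

def file_digest(path: Path) -> str:
    """Content hash so a resumed run can verify the artifact has not changed silently."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def write_checkpoint(plan: dict, completed: set, artifacts: list,
                     seed: int, model_params: dict, out: Path = Path("checkpoint.json")) -> None:
    checkpoint = {
        "ts": time.time(),
        "plan": plan,                                   # the DAG and, implicitly, its execution frontier
        "completed_nodes": sorted(completed),
        "artifacts": {str(p): file_digest(p) for p in artifacts},
        "random_seed": seed,                            # for reproducibility
        "model_params": model_params,
    }
    tmp = out.with_suffix(".tmp")
    tmp.write_text(json.dumps(checkpoint, indent=2), encoding="utf-8")
    tmp.rename(out)                                     # atomic swap: never leave a half-written checkpoint
```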


It is also useful to distinguish between warm resumes and cold resumes. A warm resume occurs when the environment is still intact, for example, if the process simply restarted but still has access to open streams or queues. A cold resume assumes nothing: a new environment must reconstruct the entire run from the SSOT and artifacts. Designing for cold resumes ensures true robustness.


Quality Checks and Course Correction

It should be possible for users to see work in progress to assess whether the AI is on the right track. If not, the user should always have the option to stop the process and either start over or roll back to a checkpoint where the AI was still operating as desired.



There should be a simple user control to stop the line before we proceed to produce a bunch of useless stuff. (GPT Image-1)


In either case, the user should also be able to guide the AI about what changes to make, if the user doesn’t want the AI to simply try again with the same prompt. Sometimes, seeing the AI go down the wrong path will cause the user to rewrite the prompt to avoid the problem. To help the user identify why the prompt went wrong, AI should employ aided prompt understanding, including ways of explaining how the (misguided) work in progress was derived from the original (misstated) prompt.


Frequently, rather than starting over with a rewritten prompt and hoping that it will be interpreted better in the next (slow) run, it’s better if the user has the option to tell the AI specifically what it did wrong and what incremental changes the user wants it to implement going forward.


For example, I was recently using ChatGPT 5 to make a series of possible thumbnails for my video about latent affordances. Because the GPT Image-1 model is excruciatingly slow, getting a set of 10 design ideas from which to pick the desired direction can often take half an hour.



Bad design direction for a series of possible YouTube thumbnails for a video about latent affordances. The concept of “threading the needle” makes sense as a line in the script in the context of a larger explanation, but doesn’t have anything to do with latent affordances. (GPT Image-1)


Luckily, GPT renders its ideas one at a time, so the user can check in from time to time to see what design directions it is exploring. In this case, the AI had become fixated on one line in the video manuscript, where the narrator says that latent affordances “thread the needle” between too-loud persistent perceived affordances for everything and too-hidden lack of affordances for an empty prompt box where you can type anything. Threading the needle is an obvious candidate for visualization in a thumbnail, and the AI was progressing through a large number of different designs with various ways of showing needles and threads.


Unfortunately, this metaphor is useless for a thumbnail (which is seen in isolation, before clicking through to the video), even though it makes sense in the context of watching the full video.


In this case, I caught the AI about halfway through its run rendering useless thumbnails. I stopped it and told it to draw new thumbnails that avoided the thread-and-needle concept.

One of the biggest psychological challenges with slow AI is the feeling of wasted time when results aren’t quite right. Waiting 30 minutes for a video only to discover it’s unusable is far more frustrating than a 3-second query that misses the mark.


The solution is progressive disclosure of partial results. Rather than treating long-running tasks as all-or-nothing, AI systems should deliver useful outputs along the way.



Progressive disclosure requires ways of showing the work product at various stages of completion so that the user can estimate whether the process is on the right track. In this example, if I want a still life showing a cluster of grapes in a bowl, I can stop the AI after seeing the leftmost rough sketch. This is easy for visuals, but harder for more conceptual tasks. (Seedream 4)


For research tasks, this means surfacing key findings as they’re discovered rather than holding everything until the final report.


For generative tasks like video or document creation, the pattern is different. The AI should produce increasingly refined versions: rough cut before final edit, outline before full prose, wireframe before polished design. Each intermediate artifact has standalone value and provides an opportunity for course correction without starting over.


The key is to label these partial results explicitly as incomplete. Users need to understand what they’re seeing: “Preliminary findings based on 40% of sources analyzed” or “Rough cut in 480p resolution.” Without clear labeling, partial results create confusion about whether the AI is finished or still working.


Additionally, give users the option to stop and keep partial results. If the AI is 60% through a 2-hour task and has already produced something useful enough, let users say “that’s good enough, stop here” rather than forcing them to wait for 100% completion or cancel and lose everything. This respects the user’s time and acknowledges that perfect is often the enemy of good enough.



In 1772, French author Voltaire said that “perfect is the enemy of good” (“Le mieux est l’ennemi du bien”). This insight was as true in the 18th century as it is in the 21st. (Seedream 4)


For some projects, half the answer now can be worth double a complete answer tomorrow. Long agents should offer safe “salvage” points where the user can export finished artifacts and stop with full disclosure of what would be forfeited. A competitor‑scan scheduled for nine hours might, at hour three, already cover the top quartile with high‑quality summaries and citations. A product manager facing a board meeting can stop there, export the current package, and schedule a continuation for the weekend. This respects the time‑value of information and prevents the sunk‑cost trance that keeps bad runs alive.


Combating the Sunk Cost Fallacy

Long-running tasks significantly exacerbate the sunk cost fallacy. This cognitive bias causes users to continue an endeavor as a result of previously invested resources (time, money, or effort). If an AI agent has been running for 20 hours, a user will be extremely reluctant to stop it, even if the intermediate results look poor. They feel they must “get something” for the time and expense already incurred.



Sunk cost fallacy: When you have spent significant resources digging a hole, you are more focused on getting some value from the bottom of that hole instead of climbing out and grabbing the much higher value that may lie outside your hole. (Seedream 4)


This leads to bad outcomes: users accept substandard work because the cost of iterating (waiting another 20 hours) feels too high.


The user interface must actively counteract this fallacy by minimizing the penalty for course correction and emphasizing the value that can be salvaged.


  1. Visualize Salvage Value: When a user considers stopping a process, the UI should not simply offer a “Cancel and Delete” option. It must explicitly indicate which artifacts and intermediate results can be reused in a subsequent run. For example, if an AI spent 15 hours collecting data and 5 hours analyzing it poorly, the user should be able to stop the analysis but keep the collected dataset. If 80% of the work is reusable, the cost of iteration is not 20 hours, but only 4 hours. Showing this “Salvage Value” makes the decision to correct errors less painful.

  2. Frictionless Restarts: Restarting from a previous checkpoint or launching a new task with salvaged data must be extremely easy: ideally a one-click operation. If restarting is complex, slow, or hidden, users will avoid it.

  3. Explicit Cost/Benefit Analysis: When a user inspects an intermediate result, the system should present not just the cost incurred, but also the projected remaining cost to completion. Visualizing the future expenditure required to finish a potentially flawed run helps the user make a rational decision to stop the process.


User control means nothing if the psychological cost of exercising that control is too high.


Handling Errors Like Adults

Errors come in many forms, and lumping them all together as “failures” is a recipe for chaos. Instead, errors should be classified into types:


  • Transient errors, such as network timeouts, can be retried with backoff.

  • Semantic errors, where the tool produced nonsense, require input correction or a retry with a modified approach.

  • Policy errors, such as security violations, need human intervention.

  • Fatal errors, such as a missing dependency that cannot be resolved, must halt the run and escalate.


Retries should not be infinite. Each error class should have a retry budget, and retries should use exponential backoff with jitter to avoid overwhelming external systems. Circuit breakers should detect “flappy” services and pause before retrying.
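
A sketch of a bounded retry helper with exponential backoff and jitter, as described above; the retry budget, delays, and the exception treated as “transient” are placeholders:

```python
import random, time

def retry_transient(operation, max_attempts: int = 5, base_delay_s: float = 2.0, max_delay_s: float = 300.0):
    """Retry a transient failure with exponential backoff plus jitter; give up when the budget is spent."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:                       # stand-in for whatever "transient" means for this tool
            if attempt == max_attempts:
                raise                              # budget exhausted: escalate instead of looping forever
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))   # jitter avoids synchronized retry storms
```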


In some cases, compensation steps are needed. If a step had side effects (e.g., sending an email or updating a database), simply retrying might cause duplication. Compensation patterns (undo or roll-forward) are necessary to keep state consistent.


Keeping Humans in the Loop

Despite the push toward autonomy, many tasks require human-in-the-loop checkpoints. The challenge is enabling human input without stalling the system indefinitely.


The solution is to design pausable nodes. When the agent reaches a step requiring approval, it queues the request and waits with a timeout. If approval is not given in time, the system should fall back to a safe behavior, such as skipping the step with a stub output.
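
A sketch of such a pausable approval node; `request_approval` is a hypothetical hook that returns the human’s decision, or None while nobody has responded:

```python
import time

def await_approval(request_approval, timeout_s: float = 4 * 3600, poll_s: float = 60,
                   fallback: str = "skip-with-stub"):
    """Queue an approval request, wait with a timeout, and fall back to a safe behavior."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        decision = request_approval()
        if decision is not None:
            return decision                # human answered in time
        time.sleep(poll_s)
    return fallback                        # timed out: proceed safely, log, and flag for later review
```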


Crucially, humans need diff views: clear comparisons of planned versus proposed changes, labeled with risks. Instead of showing raw prompts, the UI should highlight what the agent is about to do that differs from the original plan.


Overrides must also be safe. A single click should let a human edit inputs, reduce scope, or skip a node. Without such controls, operators either trust blindly or derail the system with heavy interventions.


Recovery from User Unavailability

Slow AI creates an awkward temporal mismatch: the AI might need user input on Tuesday, but the user is on vacation until Friday. Traditional software handles this poorly since tasks simply stall indefinitely, or worse, time out and fail.


The problem is exacerbated when the user is unavailable not by choice but by circumstance: hospitalization, family emergency, or simply being in a different timezone. An AI that requires immediate human approval at 3 AM is fundamentally broken.


The solution requires graceful degradation with substitute decision-making. When user input is needed, the system should:


First, check if the input is truly required or if there’s a safe default. Many times, the AI can make a reasonable choice and document it for later review. “I needed to choose between approaches A and B; you were unavailable, so I proceeded with A because it matched your previous preferences. You can review this decision in the audit log.” Offer a one-click option for backtracking to that checkpoint and rerunning the task with a different decision.


Second, delegate to backup decision-makers for organizational tasks. If I’m out of office, my colleague or manager might be authorized to make certain decisions on my behalf. The system should support delegation chains: “If I don’t respond within 4 hours, ask Maria; if she doesn’t respond within 8 hours, use the conservative default.”


Third, implement maximum wait times with smart defaults. Don't wait indefinitely. After a reasonable timeout (which should be context-dependent, such as 4 hours during business days, 24 hours over weekends), proceed with the option that minimizes risk and cost. Always log what was decided automatically so users can review later.


Finally, provide a “vacation mode” where users can tell the system they’ll be unavailable for a specified period. During this time, either auto-approve low-stakes decisions, defer non-urgent tasks, or route approvals to designated alternates. This is standard practice in email systems and expense approval workflows; it should be standard in AI systems too.


The worst possible outcome is an AI that waits indefinitely for a user who won’t return for days, wasting both time and resources. Build systems that can keep working, safely, even when users can’t respond immediately.


Telemetry and Audits

Every long-running agent should act as if an accident investigation will eventually occur. That means capturing telemetry and supporting audits.


Each step should emit a structured record: timestamp, node ID, input digest, output digest, tool version, model version, execution cost, token count, and error flags. These metrics should be stored separately from logs: metrics for system health, logs for forensic detail.
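
A sketch of the per-step telemetry record described above, with storage simplified to a JSON-lines file (a real deployment would route metrics and logs to separate systems, as the text notes):

```python
import json, time

def emit_step_record(node_id: str, input_digest: str, output_digest: str,
                     tool_version: str, model_version: str,
                     cost_usd: float, tokens: int, error: str | None = None,
                     sink: str = "telemetry.jsonl") -> None:
    """Emit one structured telemetry record per executed step (sketch, not a metrics pipeline)."""
    record = {
        "timestamp": time.time(),
        "node_id": node_id,
        "input_digest": input_digest,
        "output_digest": output_digest,
        "tool_version": tool_version,
        "model_version": model_version,
        "cost_usd": cost_usd,
        "token_count": tokens,
        "error": error,                     # None when the step succeeded
    }
    with open(sink, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```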

A replay tool is essential. It should allow investigators to reconstruct the entire run from the event log, stepping through decisions to understand why the system behaved as it did. Without such tooling, postmortems devolve into speculation.


Cost and Quota Control

Even if an agent survives technically, it can still fail economically. Running a large language model for 30 hours straight can easily exceed budget. Thus, every long-running agent must incorporate cost and quota controls.



Video generation currently costs about $5 per 10-second clip with leading models like Sora 2 and Veo 3, meaning that a request for 4 alternate takes that could take 10 minutes to complete will cost $20. With current video UI, you have no idea whether that money is being spent well until the very end. Long AI runs might cost thousands of dollars, meaning that budget controls must be exposed in the user interface. (GPT Image-1)


At the planning stage, assign a budget: maximum tokens, maximum dollars, and maximum wait-time. As execution proceeds, forecast whether the critical path will exceed the budget. If so, issue early warnings: “Critical path will exceed token budget by 18%.”


Graceful degradation is vital. When costs rise too high, the system can fall back to smaller models, smaller context windows, or cheaper approximations. Without such controls, an agent might succeed technically but bankrupt the user.
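
A sketch of the budget forecast and degradation decision described in the last two paragraphs; the linear projection and the thresholds are deliberate simplifications:

```python
def budget_check(tokens_used: int, token_budget: int, fraction_of_plan_done: float) -> dict:
    """Project token spend at completion and suggest a reaction (sketch; thresholds are arbitrary)."""
    if fraction_of_plan_done <= 0:
        return {"status": "unknown"}
    projected_total = tokens_used / fraction_of_plan_done     # naive linear projection
    overrun = projected_total / token_budget - 1.0
    if overrun <= 0:
        return {"status": "on-budget"}
    if overrun < 0.25:
        return {"status": "warn",
                "message": f"Critical path will exceed token budget by {overrun:.0%}",
                "action": "fall back to a smaller model for remaining low-stakes nodes"}
    return {"status": "stop",
            "message": f"Projected overrun of {overrun:.0%} exceeds the agreed cap",
            "action": "pause the run and ask the user to raise the cap or reduce scope"}

# Example: 590k tokens at 50% done projects to 1.18M against a 1.0M budget, warning of an 18% overrun.
print(budget_check(590_000, 1_000_000, 0.5))
```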


Managing Multiple Long-Running Tasks

As slow AI becomes more common, users will inevitably have multiple long-running tasks active simultaneously. This creates a resource allocation problem: if I have three video generations queued, each requiring 20 minutes, should they run sequentially or in parallel? What if I realize the first one is more urgent?


Current AI systems handle this poorly or not at all. There’s no visibility into task queues, no way to reorder priorities, and no understanding of resource contention. Users are left guessing whether starting a new task will slow down existing ones.


This introduces the problem of Portfolio Management. The user experience breaks down if the user has to manually check the status of ten different tasks, each running in its own siloed interface or browser tab.


A proper task queue manager should be a standard component of any slow AI system. The interface needs three core features:


Priority management: Users should see all their active and queued tasks in one view, with the ability to drag-and-drop to reorder them. If I suddenly realize I need the research report before tomorrow's meeting, I should be able to promote it ahead of the less urgent video generation. The system should show the impact: “Promoting this task will delay Task B by approximately 45 minutes.”
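
A sketch of how the promotion impact could be estimated under the simplifying assumption of a single shared worker running tasks sequentially; task names and durations are invented:

```python
def promotion_impact(queue: list, promoted: str) -> dict:
    """Estimate how much each queued task's start is delayed when `promoted` jumps to the front."""
    before, t = {}, 0
    for task in queue:
        before[task["name"]] = t
        t += task["minutes"]
    reordered = [task for task in queue if task["name"] == promoted] + \
                [task for task in queue if task["name"] != promoted]
    after, t = {}, 0
    for task in reordered:
        after[task["name"]] = t
        t += task["minutes"]
    return {name: after[name] - before[name] for name in before if name != promoted}

# Example: promoting the 45-minute research report delays both video jobs by 45 minutes.
queue = [{"name": "video A", "minutes": 20}, {"name": "video B", "minutes": 20},
         {"name": "research report", "minutes": 45}]
print(promotion_impact(queue, "research report"))   # {'video A': 45, 'video B': 45}
```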


Resource allocation visibility: Show users what resources are being consumed and why. If my video generation is taking longer than expected because the service is at capacity, tell me. If I could get faster results by upgrading to a premium tier, show me the cost-benefit tradeoff. Transparency builds trust and helps users make informed decisions about whether to wait or pay for priority.


Pause and resume controls: Users should be able to pause any non-critical task to free up resources for something more urgent. This is common in download managers but rare in AI systems. The pause should be clean (checkpointing the current state and releasing resources) with a one-click resume when the user is ready.


For organizations with shared AI resources, a collaborative queue view becomes essential. Team members need to see who has tasks running, avoid conflicts, and potentially yield resources when a colleague has something urgent. This is the AI equivalent of booking time on a shared microscope or wind tunnel: resource scheduling that we solved decades ago in other domains but must now reinvent for AI.


User Experience Design Patterns

Several UI patterns can bring these principles to life:


  • A flight strip view shows upcoming nodes on the left, the current node in focus, and completed nodes checked off.

  • A black box recorder panel summarizes the last decision, its inputs, outputs, and rationale.

  • A frozen receipt allows users to download a run summary for external reporting.

  • A traffic light system indicates overall health: green for on-track, amber for at-risk, red for blocked.


These design elements help human operators maintain trust in a system that might otherwise feel inscrutable.



The “black box” is a metaphor from airplane safety investigations: replaying what took place in the time leading up to a critical event can help us understand what possibly went wrong and how to do better next time. (GPT Image-1)


Multi‑Party Collaboration and Handoffs Without Friction

In an enterprise setting, tasks that run for multiple days introduce a challenge of continuity. The person who initiated the task may not be the person who reviews the results or handles errors. Shifts change, employees go on vacation, or priorities are reassigned.


Slow AI systems must be designed for collaboration and seamless handoff. A long-running task cannot be treated as a private endeavor; it is an organizational asset.


The interface must support shared workspaces where team members can monitor progress. Handoff requires more than shared access; it requires shared understanding. Users need the ability to leave annotated notes directly on the task timeline or the operator dashboard. For example, an operator going off-shift might note: “Rate limiting occurred at 14:00; I increased the backoff parameter. Please monitor closely.”



“Why did my husband order a flamingo sculpture for our front yard?” If you don’t know, you can’t assess whether the drone delivery has gone wrong. (Seedream 4)


Users must also be able to explicitly delegate oversight. If I start a process on Friday, I should be able to assign a colleague to receive notifications and make decisions over the weekend.


Crucially, if the AI requires human intervention, it must be capable of notifying the appropriate person, not just the original initiator. If the primary contact or delegate does not respond within a defined timeframe (the timeout mentioned earlier), the system must escalate the issue. This requires integrating the AI system with organizational structures, such as on-call rotations or supervisors. A multi-day process must not stall simply because one person is unavailable.


Failure Drills

No system should be trusted until it has survived failure testing. Common drills include:

  1. Killing the process mid-node to test resumption.

  2. Revoking credentials to see if the agent halts gracefully.

  3. Corrupting artifacts to check integrity validation.

  4. Injecting bad tool output to test semantic error handling.

  5. Triggering rate limits to observe backoff behavior.


Running such drills uncovers flaws before they occur in real production.

Don’t just harden the system; train the human for critical tasks. Offer a safe “Chaos Mode” that simulates common failures (rate limits, network flaps, tool nonsense) on a sample run so teams can practice approvals, rollbacks, and overrides. It’s cheaper to rehearse disaster than to improvise it at 3 AM.


Super-Long AI Tasks

In this article, I discussed the UX concerns with AI tasks that run for hours or days. But what about even bigger AI tasks? AI’s ability to handle complex tasks doubles every 7 months when measured in the time it takes humans to perform those tasks, and the current frontier models usually max out at two-hour tasks.


If this trend continues, then in 9 years, AI will be able to perform tasks that take humans 7 years to complete. An example might be an entire research program to investigate something completely new.


What UX elements will be needed for human direction of such extended AI tasks? Hard to say, but the design will almost certainly be different than what we currently need for hour-long tasks. Luckily, we have a few years left where we can gain experience with the usability of hour-long and day-long AI tasks before we’ll have to worry about designing for year-long AI tasks.


Conclusion

Long AI tasks create real stress: you might wait 20 hours only to discover the output is useless. You forget what you asked for, feel trapped by sunk costs, and lose precious time. Good design helps through upfront clarity, showing the AI’s reasoning along the way, letting you stop and salvage work, and gently rebuilding your context when you return.



Waiting can be deadly for usability. It’s time to take time seriously in the AI user experience. (Seedream 4)


Designing an AI agent that can run for days is not simply a matter of stringing together prompts. It is a systems engineering challenge, requiring lessons from mainframes, manufacturing, cloud infrastructure, and aviation. Success comes from anticipating failure, structuring tasks for resumability, maintaining observability, enforcing budgets, and designing humane interfaces for human oversight.


In short: treat the agent not as a magic box but as a long-haul airplane. It needs pre-flight checks, mid-air telemetry, fuel budgets, black box recorders, and trained pilots ready to intervene. Only then can we trust it to stay airborne for hours or days.


This was a long article. For a shorter and more entertaining overview of the highlights, watch my music video Slow AI (YouTube, 4 min.).

 
