Josephine Waliman

TL;DR

Nimbus turns a number into layers.

Nimbus translates weather conditions into a temperature-adjusted outfit guide, grounded in thermal science and explained in plain language. Built for Seattle microclimates, it accounts for when wind changes what a number feels like and when yesterday’s layers weren’t enough.

Role: Solo Product Designer + Developer
Timeline: Feb – Mar 2026 (3 weeks)
Context: Foundations of A.I. Design, MHCI+D
Tools: Figma · Cursor · Vercel · Claude Code

Problem

Every weather app stops at the forecast.

Weather apps report an accurate number and leave the decision of what to wear to the user. When the day feels wrong, users blame the app even when the number was right. A study by the Royal Meteorological Society shows users judge accuracy by experience, not the number alone.

Seattle makes the gap impossible to ignore. Every forecast collapses the city into one number, but wind off the water and block-to-block microclimates decide whether that number keeps you comfortable or leaves you underdressed for the rest of the day.

Context

62% of King County residents weren’t born in Washington (U.S. Census 2024). My cohort was the proof: warm-climate and international transplants dressing for a number that never counted the wind.

To better understand the gap, I mapped what people already use to get dressed.

Market Category	Optimizes for	Limit
Weather apps	Forecast accuracy	Stops at data: no outfit decision
LLM chat advice	Flexible language	No memory loop, inconsistent warmth logic
Outdoor layering guides	Technical warmth	Not built for daily commute routines
Closet-inventory tools	Wardrobe truth	High setup cost, weak weather translation

Process

The loop worked. Nobody could see it.

V1 shipped the full loop. Usability testing showed the loop was invisible.

01 · Every garment gets a warmth score.

Before designing screens, I built the thermal foundation: a garment reference grounded in ISO 9920 insulation values (CLO). Every item Nimbus can suggest has a known warmth.
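A minimal sketch of what such a garment reference could look like. The item names and clo values below are illustrative placeholders, not figures quoted from ISO 9920 or from Nimbus, and summing per-garment values is a common simplification rather than the standard's full ensemble method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Garment:
    name: str
    layer: str   # "base", "mid", or "shell"
    clo: float   # insulation in clo units (placeholder value)


# Hypothetical reference table; real values would come from the ISO 9920 tables.
GARMENTS = {
    g.name: g
    for g in (
        Garment("merino base", "base", 0.20),
        Garment("fleece", "mid", 0.35),
        Garment("rain shell", "shell", 0.25),
    )
}


def outfit_clo(names):
    """Approximate ensemble warmth as the sum of per-garment clo values."""
    # An unknown name raises KeyError, so nothing unscored can be suggested.
    return round(sum(GARMENTS[n].clo for n in names), 2)


print(outfit_clo(["merino base", "rain shell"]))
```

Because every item carries a number, two outfits can be compared by warmth before any screen or copy exists.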

02 · Everyone checked in. Nobody knew why.

The failure wasn’t interaction. The check-in and the outcome it affected were on two separate screens. Users couldn’t infer a causal relationship that the interface didn’t make visible.

Method

Goal

Does the feedback loop feel legible? Do users understand that check-in shapes tomorrow’s suggestion?

Tasks

Build an outfit for today · Complete the evening check-in

Participants

5 Seattle Transplants (5 mo – 1 yr)

“I thought I was just, like, rating the day. I didn’t know it was actually doing something.”
— P1

Finding	Signal	Solution
Invisible Feedback Loop	4/5 didn’t connect check-in → tomorrow; 3/5 said “no effect” in debrief	Inline survey on outfit tile · Today/Tomorrow tabs · loading: “Adjusting for your note…”
Data Disclaimer ≠ Trust	3/5 paused on “calibrated for you” until I explained it	Dropped label · Profile + how your data shapes suggestions
Static Tone	1-yr users found newcomer tone patronizing	Tenure-aware system prompt

03 · Make the cause visible.

V2 narrowed the AI’s job: a deterministic thermal engine owns warmth; the model owns language only. The check-in moved inline, and the loop became visible.

Area	V1	V2
Check-in	Slide-up panel, separate screen	Inline on the outfit tile
Loop visibility	Tomorrow updated after close	Today | Tomorrow tabs, loading replays your note
Trust signal	“Calibrated for you” label	Profile data explainer · honest limits in onboarding
Architecture	LLM picks garments + writes voice	Engine picks warmth · the model writes copy only
Tone	Fixed newcomer register	Tenure-aware — less hand-holding as users acclimate

Warmth is physics, not a prompt.

The engine exists because of one V2 decision: warmth had to stop being a model output. Warmth is physics. It’s testable, so it runs through deterministic code: a UTCI stress score maps to an insulation tier, garment lookup applies rain, wind, and damp-cold overrides, and your check-ins shift the band before lookup. Same weather, same history, same outfit. Every recommendation is traceable.
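The pipeline described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Nimbus engine: the tier names, UTCI band edges, wind threshold, and outfit table are all hypothetical, and the damp-cold override is left out for brevity.

```python
from dataclasses import dataclass

# Warmth tiers, warmest first. Names and edges are placeholders.
TIERS = ["heavy", "warm", "medium", "light", "minimal"]
# Upper UTCI edge (deg C) for each tier except the last; hypothetical values.
TIER_EDGES = [0.0, 9.0, 18.0, 26.0]

# Hypothetical closed lookup from tier to a base outfit.
OUTFITS = {
    "heavy": ["merino base", "fleece", "insulated jacket"],
    "warm": ["merino base", "fleece"],
    "medium": ["long-sleeve tee", "light jacket"],
    "light": ["tee", "overshirt"],
    "minimal": ["tee"],
}

WINDY_KMH = 30.0  # hypothetical wind threshold


@dataclass(frozen=True)
class Conditions:
    utci_c: float
    raining: bool = False
    wind_kmh: float = 0.0


def base_tier(utci_c: float) -> int:
    """Index into TIERS for a UTCI value."""
    for i, edge in enumerate(TIER_EDGES):
        if utci_c < edge:
            return i
    return len(TIERS) - 1


def recommend(cond: Conditions, checkin_shift: int = 0):
    """checkin_shift > 0 means the user has been running cold, so dress warmer."""
    # Check-ins shift the band before lookup, clamped to the valid tiers.
    i = max(0, min(len(TIERS) - 1, base_tier(cond.utci_c) - checkin_shift))
    tier = TIERS[i]
    outfit = list(OUTFITS[tier])
    # Hard overrides run after lookup and cannot be skipped.
    if cond.raining:
        outfit.append("rain shell")
    elif cond.wind_kmh >= WINDY_KMH:
        outfit.append("wind shell")
    return tier, outfit


print(recommend(Conditions(utci_c=5.0, raining=True)))
print(recommend(Conditions(utci_c=5.0, raining=True), checkin_shift=1))
```

Since `recommend` is a pure function of conditions and check-in history, the same inputs always return the same outfit, which is what makes each recommendation traceable and unit-testable.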

Language is where non-determinism earns its keep. The model writes the voice on top of decisions the engine already made. Nimbus also dresses you for exposure: the coldest hour you’re actually outside on your commute, not the daily high.
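Dressing for exposure can be sketched as picking the coldest forecast hour among the hours the user is actually outside. The function name, the hour-keyed forecast shape, and the sample values are assumptions made for illustration.

```python
def exposure_temp(hourly_feels_like, outdoor_hours):
    """Return (hour, value) for the coldest hour the user is outside.

    hourly_feels_like maps hour of day (0-23) to a felt temperature in deg C.
    outdoor_hours lists the hours spent outdoors, for example a commute.
    """
    exposed = {h: hourly_feels_like[h] for h in outdoor_hours if h in hourly_feels_like}
    if not exposed:
        raise ValueError("no forecast data for the outdoor hours")
    hour = min(exposed, key=exposed.get)
    return hour, exposed[hour]


# Made-up forecast: the daily high is 13.0 at 15:00, but the commute is colder.
forecast = {7: 4.0, 8: 5.5, 12: 11.0, 15: 13.0, 18: 8.0, 19: 6.5}
print(exposure_temp(forecast, [8, 18]))  # (8, 5.5)
```

In this made-up day the outfit is sized for 5.5° at 8:00, not for the 13.0° afternoon high the user never stands in.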

The model writes. It doesn’t decide.

My first prompt was a flat feature list. Every decision I hadn’t made explicitly, the model made on its own. Under-specification became a design failure. The rewrite is a behavioral contract: nine sections, thirty-plus subsections, with hard boundaries on what the model can and cannot decide.

Final product

One day in the loop.

Step through a single day: morning forecast, outfit lock-in, evening check-in, tomorrow’s adjustment.

What’s today like?

Today’s outfit and forecast on Home — no label claiming personalization, just the suggestion and the data behind it.

Week 2

“44° and drizzling — classic Seattle greeting. Your merino base and rain shell will handle it.”

Month 7

“42°, light rain. Shell, merino, boots. You ran warm last Thursday, trimmed the tier.”

The rules the loop never breaks

Closed vocabulary

Model picks from a fixed item set with known CLO values. It cannot invent garments that break the thermal math.

Hard weather overrides

Rain means a rain shell. The model doesn’t negotiate with weather.

Try Another

Variety within the same warmth band. A different outfit, not a worse one.

Sick-day escape hatch

Unusual days don’t train the thermal offset. The system learns only from representative days.
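Two of these rules, Try Another and the sick-day escape hatch, lend themselves to a short sketch. Everything named here is hypothetical: the outfit options, the vote scale of -1 (ran cold), 0, and +1 (ran warm), and the use of a simple average as the offset.

```python
from statistics import mean

# Hypothetical closed set of outfits per warmth band.
BAND_OUTFITS = {
    "warm": [
        ("merino base", "fleece"),
        ("thermal tee", "wool sweater"),
        ("long-sleeve tee", "puffer vest"),
    ],
}


def try_another(band, current):
    """Cycle to the next outfit in the same band; the band itself never changes."""
    options = BAND_OUTFITS[band]
    i = options.index(current)
    return options[(i + 1) % len(options)]


def thermal_offset(checkins):
    """Average the check-in votes, skipping days flagged as unusual.

    Each check-in is a (vote, unusual) pair.
    """
    votes = [vote for vote, unusual in checkins if not unusual]
    return mean(votes) if votes else 0.0


print(try_another("warm", ("merino base", "fleece")))
print(thermal_offset([(-1, False), (-1, True), (0, False)]))
```

The flagged day in the second call is dropped before averaging, so a sick day cannot pull the offset colder.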

Impact

They asked to download it.

Nobody asked what the check-in was for. They asked if they could download the app.

“I wish I had this when I first moved here. It would’ve made the winter more bearable.”
— MHCI+D Cohort Member

Reflection

Find the line before trust breaks.

Using AI everywhere is not always the right call. Nimbus only worked once I found the line between what the model should own and where it was breaking trust.

Where AI broke trust

“Calibrated for you” claimed personalization without showing proof. Users felt watched, not helped. Trust didn’t come from smarter copy. It came from making the feedback loop visible: see your check-in change tomorrow’s suggestion.

Where AI earned its place

V1 let the LLM pick garments and write the voice. V2 moved warmth to deterministic CLO math and limited the model to language. Both layers got better. The model stopped making calls it couldn’t defend.

What’s Next?

Longitudinal · 4–6 weeks

Test whether personalization improves comfort over time, not just first-session comprehension.

Automated + human eval

Automated and human review of tenure-aware tone outputs, checked against fixed outfit decisions.
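The automated half of that review could start as a check that generated copy names exactly the garments the engine decided on. This is a sketch under assumptions: the keyword vocabulary and the `check_copy` helper are hypothetical, and keyword matching is a deliberately crude stand-in for a real evaluation. Tone itself would still need human review.

```python
import re

# Hypothetical garment keywords drawn from a closed vocabulary.
VOCAB = {"merino", "fleece", "shell", "boots", "puffer"}


def check_copy(copy, decided):
    """Compare garments mentioned in the copy with the engine's fixed decision."""
    words = set(re.findall(r"[a-z]+", copy.lower()))
    mentioned = words & VOCAB
    return {
        "missing": sorted(set(decided) - mentioned),  # decided but not named
        "extra": sorted(mentioned - set(decided)),    # named but not decided
    }


def passes(copy, decided):
    result = check_copy(copy, decided)
    return not result["missing"] and not result["extra"]


sample = "42°, light rain. Shell, merino, boots. You ran warm last Thursday, trimmed the tier."
print(passes(sample, {"shell", "merino", "boots"}))
print(check_copy("Grab a fleece today.", {"shell"}))
```

Running the same fixed decision through several tenure registers and asserting `passes` on each would catch copy that drifts from the outfit while the voice changes.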