The data we pull
Every run collects three independent streams:
- NOAA tide predictions for the Tacoma reference station, datum-corrected to Redondo Beach observations.
- Forecasts from five global weather models (GFS, ECMWF, ICON, GEM, and UKMO), used together as a multi-model ensemble: pressure, wind vectors, and surface roughness for the next 72 hours.
- A local observation snapshot at forecast time: recent pressure trend, wind persistence, and water-level anomaly at Tacoma.
The classifier
A deterministic rules layer turns the raw numbers into the four signals that actually drive Redondo floods: astronomical tide height, pressure anomaly, onshore wind stress, and antecedent water level. Each signal contributes to a base score, and that score determines the flood category.
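The write-up does not publish the weights or cut points. What follows is only a minimal sketch of how four signals could feed one score and one category. It makes a hypothetical assumption that the score is an effective water level in feet, built from four parts:

- the astronomical tide plus the antecedent anomaly;
- an inverse-barometer term (roughly 1 cm of rise per 1 hPa of pressure drop);
- a wind-setup term.

`WIND_SETUP_FT`, `CUTS`, and the `Signals` field names are illustrative placeholders, not the system's actual values.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Inverse-barometer rule of thumb: roughly 1 cm of sea-level rise per 1 hPa
# of pressure drop, expressed here in feet.
FT_PER_HPA = 0.0328

# Hypothetical wind-setup factor: feet of rise per unit of normalized onshore stress.
WIND_SETUP_FT = 0.8

# Hypothetical score cut points (ft) separating Cat 1 through Cat 7.
CUTS = [12.0, 13.0, 13.5, 14.0, 14.5, 15.0]


@dataclass
class Signals:
    tide_ft: float                # predicted astronomical tide height
    pressure_anomaly_hpa: float   # forecast minus normal; negative means a low
    onshore_wind_stress: float    # normalized 0..1 (hypothetical scale)
    antecedent_anomaly_ft: float  # water-level anomaly already present


def base_score(s: Signals) -> float:
    """Sum the four contributions into one effective water level (ft)."""
    barometric = -s.pressure_anomaly_hpa * FT_PER_HPA  # lower pressure raises water
    wind = s.onshore_wind_stress * WIND_SETUP_FT
    return s.tide_ft + barometric + wind + s.antecedent_anomaly_ft


def category(score: float) -> int:
    """Map a score to Cat 1..7: one category step per cut point reached."""
    return bisect_right(CUTS, score) + 1
```

With these placeholder numbers, a 13.0 ft tide under a 10 hPa low scores about 13.33 and lands in Cat 3. Keeping the score in physical units (feet of water) is one way to keep the rules layer auditable, since every term can be checked against observations.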
The five weather forecasts rarely agree on the details. One might call for slightly lower pressure, another for a stronger wind shift. Rather than pick one and hope, the system runs the classifier hundreds of times — each run using a slightly different combination of what the five forecasts are telling us. The category that comes up most often becomes the final answer. If most runs land in the same category, confidence is high. If they disagree, the report says so (e.g., “Cat 3 most likely, but roughly a 15% chance of Cat 4 if pressure drops further than the midpoint forecast suggests”).
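One way to read "a slightly different combination of what the five forecasts are telling us" is random weighting. Each run blends the five models with fresh random weights and classifies the blend. The tally of categories then gives both the answer and its confidence. The sketch below assumes that interpretation. It uses a stand-in `toy_category` with placeholder thresholds rather than the real rules layer, and the per-model input keys are hypothetical.

```python
import random
from collections import Counter

MODELS = ["GFS", "ECMWF", "ICON", "GEM", "UKMO"]


def toy_category(tide_ft: float, pressure_anomaly_hpa: float, wind_stress: float) -> int:
    """Stand-in for the rules layer, with placeholder weights and cut points."""
    score = tide_ft - 0.0328 * pressure_anomaly_hpa + 0.8 * wind_stress
    cuts = [12.0, 13.0, 13.5, 14.0, 14.5, 15.0]
    return 1 + sum(score >= c for c in cuts)


def ensemble_vote(tide_ft: float, forecasts: dict, runs: int = 500, seed: int = 0):
    """Classify `runs` random blends of the five models.

    Returns the modal category and the share of runs in each category.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        # Normalized Exp(1) draws give uniformly random convex weights
        # (a flat Dirichlet distribution over the five models).
        raw = [rng.expovariate(1.0) for _ in MODELS]
        total = sum(raw)
        weights = [r / total for r in raw]
        pressure = sum(w * forecasts[m]["pressure_anomaly_hpa"] for w, m in zip(weights, MODELS))
        wind = sum(w * forecasts[m]["wind_stress"] for w, m in zip(weights, MODELS))
        counts[toy_category(tide_ft, pressure, wind)] += 1
    modal = counts.most_common(1)[0][0]
    shares = {cat: n / runs for cat, n in sorted(counts.items())}
    return modal, shares
```

Consider a result shaped like `(3, {3: 0.85, 4: 0.15})`. It would map onto the "Cat 3 most likely, but roughly a 15% chance of Cat 4" phrasing above. If all five models agree, every blend is identical and the share collapses to 1.0 for one category.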
The categories
| Cat | Meaning |
|---|---|
| 1–2 | Ordinary high tides. No action. |
| 3 | First level where we notify subscribers. Minor ponding possible. |
| 4 | Meaningful shoreline flooding likely. |
| 5 | December 2022 territory — street-level flooding. |
| 6–7 | Rare, extreme stacking. Seen only in the historical record. |
The self-audit
After every event, a separate job compares the prediction against what actually happened — NOAA water-level observations, NWS alerts, and community reports. The classifier’s running hit rate and false-alarm rate are tracked and used to tune future thresholds.
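The write-up does not define how an event counts as a hit or a false alarm. This minimal sketch assumes each event is reduced to a predicted and an observed category, and that "flood" means Cat 3 or higher (the notification threshold in the table above). It then tallies a standard 2×2 contingency table.

- "Hit rate" is taken to be the probability of detection, hits / (hits + misses).
- The false-alarm figure is the false alarm ratio, false alarms / (hits + false alarms).

Verification literature also uses a different "false alarm rate" (false alarms / observed non-events). Which one the audit job actually tracks is an assumption here.

```python
from dataclasses import dataclass

ALERT_CAT = 3  # assumed: Cat 3+ counts as a flood, both forecast and observed


@dataclass
class Verification:
    hits: int = 0
    misses: int = 0
    false_alarms: int = 0
    correct_negatives: int = 0

    def record(self, predicted_cat: int, observed_cat: int) -> None:
        """Add one event to the contingency table."""
        forecast_yes = predicted_cat >= ALERT_CAT
        observed_yes = observed_cat >= ALERT_CAT
        if forecast_yes and observed_yes:
            self.hits += 1
        elif forecast_yes:
            self.false_alarms += 1
        elif observed_yes:
            self.misses += 1
        else:
            self.correct_negatives += 1

    @property
    def hit_rate(self) -> float:
        """Probability of detection: hits / (hits + misses)."""
        n = self.hits + self.misses
        return self.hits / n if n else float("nan")

    @property
    def false_alarm_ratio(self) -> float:
        """Share of flood forecasts that did not verify: FA / (hits + FA)."""
        n = self.hits + self.false_alarms
        return self.false_alarms / n if n else float("nan")
```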
The LLM’s role
The system uses OpenAI’s o1 reasoning model as its primary language model, with gpt-4-1106-preview for second-opinion reviews. The LLM does not predict the flood; that is the job of the deterministic classifier described above. The classifier is fully auditable and scored against every known Redondo flood event. Anyone with the same weather data can reproduce its math.
What the LLM does
- Writes the narrative reasoning in each report. Given the classifier’s inputs (tide, pressure, wind, antecedent water level) and its output category, the LLM drafts the plain-language “why”: how tonight’s conditions add up to that category. This is the part of the report a human actually reads.
- Compares to past events. Given a shortlist of analog flood events from the retrieval layer, the LLM picks the most representative one and writes a one-paragraph comparison — “tonight’s setup most resembles December 14, 2022, but with a lower pressure drop and weaker wind persistence.”
- Provides a second-opinion review. Before an alert is sent, a second model re-reads the reasoning with fresh context and flags obvious inconsistencies. If the review disagrees strongly enough with the primary output, the run logs an error and flags the disagreement in the audit log for later inspection.
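The section does not say how the retrieval layer builds its analog shortlist. A common approach, sketched here as an assumption, is nearest-neighbor search over the same four signals the classifier uses. Each feature is standardized so that feet and hectopascals are comparable. The field names and `k` are hypothetical, and the LLM would then choose among the returned events.

```python
import math
from typing import Sequence

# Hypothetical feature keys, mirroring the classifier's four signals.
FEATURES = ("tide_ft", "pressure_anomaly_hpa", "wind_stress", "antecedent_ft")


def shortlist_analogs(tonight: dict, events: Sequence[dict], k: int = 3) -> list:
    """Return the k past events closest to tonight's setup.

    Each feature is divided by its standard deviation across the event
    archive so that no single unit dominates the distance.
    """
    scales = {}
    for f in FEATURES:
        vals = [e[f] for e in events]
        mean = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        scales[f] = sd or 1.0  # constant feature: avoid dividing by zero

    def distance(e: dict) -> float:
        return math.sqrt(sum(((e[f] - tonight[f]) / scales[f]) ** 2 for f in FEATURES))

    return sorted(events, key=distance)[:k]
```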
Why a reasoning model?
Flood-narrative writing has to thread a chain of causation: high astronomical tide + dropping pressure + persistent onshore wind → water stacks up against the shoreline → ponding and minor structural flooding. Earlier GPT-4 variants would sometimes skip a step or conflate magnitudes. o1 works through its reasoning internally before committing to text, which has been noticeably more reliable in testing. The trade-off is a higher cost per report than a standard chat model; that cost is budgeted and monitored.
What the LLM is not allowed to do
- Change the category. The category comes from the deterministic classifier. The LLM can describe it, not move it.
- Decide who gets alerted. Alert rules are hard-coded: Category 3+ triggers an alert, and each subscriber’s preference determines the channel (e.g., SMS).
- Access the subscriber list. The classifier, LLM, and messaging paths are entirely separate. Names and phone numbers never enter the model context.
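The separation in the list above can be enforced structurally. In this hypothetical sketch, the classifier's output is a frozen dataclass. The step that attaches LLM text therefore cannot change the category, and the alert decision reads only that category. None of these types carry subscriber data, mirroring the separate messaging path. All names are illustrative, not the system's actual code.

```python
from dataclasses import dataclass

ALERT_THRESHOLD = 3  # Category 3+ triggers an alert, per the rules above


@dataclass(frozen=True)
class Forecast:
    category: int       # set only by the deterministic classifier
    peak_tide_ft: float
    peak_time: str      # local time string, e.g. "07:42"


@dataclass(frozen=True)
class Report:
    forecast: Forecast
    narrative: str      # LLM-written text; may be empty


def attach_narrative(forecast: Forecast, narrative: str) -> Report:
    """Pair classifier output with LLM text; the frozen forecast can't be edited."""
    return Report(forecast=forecast, narrative=narrative)


def should_alert(report: Report) -> bool:
    """The alert decision reads the classifier's category, never the narrative."""
    return report.forecast.category >= ALERT_THRESHOLD
```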
Put another way: if OpenAI went dark tomorrow, the classifier would still produce a category and the alert pipeline would still send a text — subscribers would just lose the plain-language reasoning and get a shorter “Cat X forecast, peak tide Y ft at Z:ZZ” message. The system degrades gracefully.
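Here is a minimal sketch of that fallback. It assumes the narrative comes from a hypothetical `narrative_fn` callable that may be absent or may raise. The short template is built first from classifier output alone, and any failure in the LLM call returns it unchanged instead of blocking the alert.

```python
from typing import Callable, Optional


def compose_message(category: int, peak_tide_ft: float, peak_time: str,
                    narrative_fn: Optional[Callable[[], str]] = None) -> str:
    """Build alert text from classifier output, adding LLM narrative when available."""
    # The short form needs nothing from the LLM, so it is always available.
    short = f"Cat {category} forecast, peak tide {peak_tide_ft:.1f} ft at {peak_time}"
    if narrative_fn is None:
        return short
    try:
        narrative = narrative_fn()
    except Exception:
        # Any API or model failure degrades to the short message.
        return short
    return f"{short}. {narrative}" if narrative else short
```

Catching every exception is deliberate in this sketch. The alert must go out regardless of why the narrative step failed, so the failure should be logged elsewhere rather than allowed to propagate.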