Model evaluation

The Prophet Python package has a built-in diagnostics module that allows users to evaluate a prediction horizon against the inputs from within a time series for an individual station. While the outputs of the diagnostics are useful, we were also interested in how the models compare to each other across the region. To that end, we generated a series of Python-based Jupyter notebooks that used root-mean-square errors, maps, visual inspection of time-series fits, and an ice-transition analysis to assess how well the Prophet models fit the inputs, whether there are regional trends in the errors, and the extent of the model limitations. The results of those notebooks are summarized here.

How well does the Prophet model match up with the extracted results from CFSv2 at the given locations?

Evaluating fits via root-mean-square errors

The sea ice concentration root-mean-square errors (RMSE) over all stations, derived from a cross-validation horizon of 365 days based on 8 years of training data, had a mean value of 0.138, a minimum of 0.008, and a maximum of 0.434. The station with the highest average RMSE (N73W145) had an average RMSE of 0.26. Generally, the highest errors correspond to a phase difference between the timing of ice growth and retreat in the inputs and that predicted by the model.
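The per-station RMSE summary can be reproduced from cross-validation output with a simple reduction; a minimal sketch with illustrative values (the station labels and concentration series below are made up, not from the study):

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    assert len(predicted) == len(observed)
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)
    )

# Illustrative sea ice concentration series (not real station data).
station_errors = {
    "A": rmse([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]),
    "B": rmse([0.0, 0.0, 0.1], [0.0, 0.1, 0.1]),
}
mean_rmse = sum(station_errors.values()) / len(station_errors)
```

The same reduction applied over every station's cross-validation results yields the mean, minimum, and maximum RMSE values reported above.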

How well does Prophet fit ice transition points (ice in or ice out)?

We looked at a number of stations to qualitatively determine how well Prophet modeled inflection points. Using the common threshold of 15%, we turned the station time-series results into a binary of has ice (True) and doesn’t have ice (False) to examine how well the models reproduced the timing of ice transitions.
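The thresholding step can be sketched as follows; the 15% cutoff is from the text, while the concentration values are illustrative:

```python
ICE_THRESHOLD = 0.15  # 15% concentration, the common ice-in/ice-out cutoff

def has_ice(concentrations, threshold=ICE_THRESHOLD):
    """Convert a sea ice concentration series to a binary has-ice series."""
    return [c >= threshold for c in concentrations]

# Illustrative weekly concentrations around an ice-out transition.
series = [0.9, 0.6, 0.2, 0.1, 0.0]
flags = has_ice(series)  # [True, True, True, False, False]
```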

An ideal example of this analysis is station PAOT (Kotzebue), which for the past nine years has had just one transition from ice in to ice out each year.

Year | CFSv2 ice out | Prophet ice out | CFSv2 ice in | Prophet ice in

For this case, it's easy to compare the CFSv2 data to the Prophet model at Kotzebue to find a mean ice-out error of 10.8 days and a mean ice-in error of -5.3 days. The Prophet model was generated from weekly data, so an error of one to two data points is quite good.
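The per-year transition error is just a signed date difference; a minimal sketch, where the dates are illustrative rather than the actual Kotzebue values:

```python
from datetime import date

def transition_error_days(cfsv2_date, prophet_date):
    """Signed error in days: positive means Prophet is later than CFSv2."""
    return (prophet_date - cfsv2_date).days

# Illustrative ice-out dates for one year (not real station values).
err = transition_error_days(date(2020, 5, 20), date(2020, 5, 31))  # 11
```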

We processed the data to find every station in the study that showed one (and only one) transition from ice to no ice (and vice versa) in any year of its time series. 178 of the 300 stations had clear ice-in or ice-out transitions whose timing could be compared to that derived from the Prophet model. The stations included in this subgroup were widely distributed along the seasonal ice edge, from the northern extent of the study region in the Chukchi Sea (76.9 N) to as far south as Cape Newenham (58.4 N).
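The single-transition filter can be sketched by counting sign changes in the binary has-ice series; the function names and example years below are our own, for illustration:

```python
def count_transitions(flags):
    """Count changes between has-ice and no-ice in a binary series."""
    return sum(1 for a, b in zip(flags, flags[1:]) if a != b)

def single_transition(flags):
    """True when a year shows exactly one ice-to-no-ice (or reverse) change."""
    return count_transitions(flags) == 1

# Illustrative years: one clean ice-out vs. a flickering ice edge.
clean = [True, True, False, False]   # one transition -> usable
noisy = [True, False, True, False]   # three transitions -> excluded
```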

By analyzing the dates for all of the Prophet models at these stations, we found that, on average, the Prophet model predicted the dates of ice-out transitions to be 11.4 days later than the CFSv2 inputs, and the dates of the ice-in transitions to be 6.5 days earlier than the inputs. In other words, the Prophet model consistently showed the ice season arriving about one week earlier and lasting about one week longer than the inputs. 1245 ice-out transitions and 1132 ice-in transitions (2377 total) were compared in this analysis.

  • Ice In Mean Difference: -6.5 days
  • Ice In Mean Absolute Difference: 8.4 days
  • Ice Out Mean Difference: 11.4 days
  • Ice Out Mean Absolute Difference: 12.5 days
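The distinction between the mean difference (directional bias) and the mean absolute difference (typical magnitude) reported above can be sketched as follows; the day differences are illustrative, not the study's values:

```python
def mean_difference(diffs):
    """Signed mean of day differences (sign shows early vs. late bias)."""
    return sum(diffs) / len(diffs)

def mean_absolute_difference(diffs):
    """Mean magnitude of the day differences, ignoring direction."""
    return sum(abs(d) for d in diffs) / len(diffs)

# Illustrative ice-in differences in days (negative = Prophet earlier).
diffs = [-10, 3, -8, -11]
bias = mean_difference(diffs)             # -6.5
spread = mean_absolute_difference(diffs)  # 8.0
```

Note that a near-zero mean difference can hide large errors that cancel, which is why both statistics are reported.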

Which stations have the best or worst fits?

We generated a map of all the modeled stations to determine if there was a spatial component to the model accuracy. Regions in the southern Bering Sea had expectedly low root-mean-square errors (RMSE), because sea ice only occasionally reaches into those regions. Areas in the variable region near the Bering Strait had moderate RMSE; the area west of St. Lawrence Island showed the most error-prone models, while Norton and Kotzebue Sounds were slightly better. Models near the north coast of Alaska in the Beaufort Sea were modeled with surprisingly low errors, whereas areas offshore saw higher RMSE values, perhaps because of quickly changing conditions there.

We inspected models produced for individual points, and the following six plots represent a selection of models across the region.

In general, lower overall ice concentrations (e.g., at St. George) lead to very low RMSE values. This makes sense, but it also suggests that RMSE must be used carefully when comparing models from different areas. In the cases above, the Beaufort Sea models have some of the best “looking” fits despite their RMSE, with the exception of 2020, which had more ice in CFSv2 than was predicted by the Prophet model. Meanwhile, the St. Lawrence and St. Matthew Island models show lower RMSE values but visually have a worse fit, with input values well above what the Prophet model predicted. In particular, the St. Lawrence Island model would have significantly underreported the ice that was seen there.

How many years of CFSv2 data are necessary to get “good” results from the Prophet model?

The Prophet model runs best with at least three years of historical inputs.

Adding Regressors

How strong is the relationship between modeled temperature and modeled sea ice?

We used the surface temperature and ice concentration variables and plotted their relationship against each other. Individual stations generally show a strong relationship between temperature and ice concentration (e.g., PAOT, -162.60624E, 66.88576N, near Kotzebue).

Out of interest, we generated a 2D histogram of all 305 stations to determine if the temperature and ice concentration relationship was consistent. As expected, ice concentration is related to the surface temperature, though the relationship appears to grow more complicated as temperature increases. At -1.0 C, nearly every ice concentration value between 0.0 and 0.8 is represented at some point in time. Similarly, ice concentrations of 0 can occur at nearly every temperature >1.5 C. The 2D histogram suggests a typical error of +/- 0.3 ice concentration for predictions pooled across the entire region, though models at individual stations may fare significantly better.
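The pooled 2D histogram can be sketched with numpy; the synthetic temperature/ice relationship below is a crude stand-in for the CFSv2 data, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative pooled samples (the study pooled all stations' time series).
temps = rng.uniform(-15.0, 10.0, size=5000)          # surface temperature, C
ice = np.clip(1.0 - (temps + 5.0) / 10.0, 0.0, 1.0)  # crude inverse relation
ice = np.clip(ice + rng.normal(0.0, 0.1, size=ice.shape), 0.0, 1.0)

# Bin all (temperature, concentration) pairs into a 2D histogram.
counts, temp_edges, ice_edges = np.histogram2d(
    temps, ice, bins=(50, 20), range=[[-15.0, 10.0], [0.0, 1.0]]
)
```

Each column of `counts` then shows the spread of ice concentrations observed at a given temperature, which is what reveals the widening scatter near the freezing point.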

If we add regressors to drive the Prophet model (e.g., air temperature), how does that change the outcomes and statistics?

Regressors are used to tie the predictions to one or more driving variables. Crucially, if regressors are used to train the model, they must be provided for the duration of the predicted future. For example, if we include surface temperature as a driver of ice concentration in the Arctic, Prophet can only be used to predict futures as far out as we have temperature predictions. This makes sense, but can also feel somewhat circular. In this case, we trained a model based on CFSv2 with CPC sea ice concentrations. If we use CFSv2 surface temperature to drive the model and can only predict into the future as far as CFSv2 goes, the Prophet sea ice concentrations become redundant with CFSv2's own sea ice output.
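The constraint that regressor values must cover the full forecast window can be sketched as a simple horizon check; a stdlib sketch, where the function name and dates are our own illustration, not Prophet's API:

```python
from datetime import date

def usable_horizon(last_training_day, regressor_coverage_end, requested_days):
    """Days we can actually predict: capped by regressor availability.

    With a regressor (e.g., CFSv2 surface temperature), Prophet can only
    predict as far out as the regressor itself is available.
    """
    available = (regressor_coverage_end - last_training_day).days
    return min(requested_days, max(available, 0))

# Illustrative dates: a 365-day request, but temperature only runs 270 days out.
horizon = usable_horizon(date(2021, 1, 1), date(2021, 9, 28), 365)  # 270
```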

The value in adding regressors is to generate a far more accurate, physically driven model for the case where the regressors are well modeled but the metric being predicted is not. For example, a long-term air temperature forecast might not include ice, but the ice metrics could be trained against temperature from historical satellite data. In the case of this project, we produced models (1) trained only on the historical sea ice concentration values and (2) using surface temperature from CFSv2 as a regressor; the latter was useful to prove the method but doesn't provide additional predictions beyond CFSv2. The published models only include models without regressors, so they can be used to model futures of arbitrary length.


For this project, we focused on the open-source time-series prediction library Prophet (Taylor and Letham, 2017). Prophet has been shown to be accurate for time series that have known seasonality and at least a few years of data. This fit the needs of this project, which had ~9 years of historical inputs from CFSv2, and sea ice generation has a strong seasonality tied largely to the calendar year. Prophet is generally straightforward to set up and use, fits trends with a piecewise linear or logistic growth curve, and handles outliers and missing data well.

However, as an easy-to-use model, Prophet also has some drawbacks. First, it doesn't appear to do very well predicting outlier years, i.e., the predicted futures often have a response that is more toward “normal” than is sometimes warranted. Second, if outlier years fall within the last three years of training inputs, they can strongly affect the resulting futures. For example, the 2017-2018 and 2018-2019 winter seasons at station BRST10 (Bering Strait) were modeled by CFSv2 to have relatively low ice concentrations, followed by two winters where sea ice returned to higher concentration values closer to the “normal” trend. The initial Prophet model prediction we used (daily data with no regressors) took this return to normal as a sign that sea ice would be increasing ad infinitum into the future.

Lastly, although Prophet accepts values for a carrying capacity and a saturating minimum (1 and 0, respectively, for sea ice concentration), the model allows predictions to go well beyond these values.
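One pragmatic workaround is to clip the forecast to the physical bounds after prediction; a minimal stdlib sketch of post-processing we suggest, not a Prophet feature:

```python
def clip_concentration(values, floor=0.0, cap=1.0):
    """Clamp predicted sea ice concentrations to their physical bounds."""
    return [min(max(v, floor), cap) for v in values]

# Illustrative Prophet outputs that overshoot the saturating bounds.
raw = [-0.07, 0.12, 0.85, 1.13]
clipped = clip_concentration(raw)  # [0.0, 0.12, 0.85, 1.0]
```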