Last updated: 2021-03-24

Checks: 5 1

Knit directory: thesis/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20210321) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Tracking code development and connecting the code version to the results is critical for reproducibility. To start using Git, open the Terminal and type git init in your project directory.

This project is not being versioned with Git. To obtain the full reproducibility benefits of using workflowr, please see ?wflow_start.

# 1 Results and Discussion

## 1.1 Hyperparameter Tuning

The results of the BO process are reported in Table 1.1. For each of the predictor sets, an optimization was conducted for both the adm and bas units. Except for the SV-bas model, a double-layered CNN was found to deliver the best results. Concerning the activation function $$a_{cnn}$$, the picture is less clear, but softmax was selected in $$\frac{8}{24}$$ of the cases followed by the hard_sigmoid activation function ($$\frac{7}{24}$$). The kernel numbers $$k_{cnn}$$ are selected only twice above a value of 120, approximating the maximum possible value of 128. The size of the network generally seems sufficient to capture the data structure. The selected kernel widths $$\beta_{cnn}$$ for the CH-bas model are at the maximum of 24 months for both the 50 and 100 km branch, suggesting that for this variable configuration, there could be a performance improvement with bigger kernel widths. For the remaining configurations, the kernel widths are mainly between five to 22 months. For parameter $$pool_1$$, in about half the cases maximum pooling is selected with $$pool\_size$$ between 3 and 24 and only in $$\frac{6}{24}$$ cases the pool size is below 12. Generally, information beyond a period of 12 months seems to represent the signal for conflicts better. As pooling operation $$pool_2$$, maximum pooling is selected in $$\frac{17}{24}$$ cases, indicating that maximizing the values through the time series is beneficial for the conflict prediction task. It is of interest that average pooling was only chosen for 100 and 200 km inputs. The signal from these inputs seems to be better represented by an average layer.

For most variable configurations, the selection of two LSTM-layers yielded the best results, except for the CH-bas and EV-adm configurations. Similar to the CNN part of the model, the results suggest that the sizes of $$n_{1:3}$$ generally seem to be sufficient for the data complexity. Only in $$\frac{7}{72}$$ cases more than 120 neurons are selected for a LSTM layer. Instead, the number of neurons is frequently below 100, indicating that smaller network sizes are able to learn a pattern from the data. There is no clear pattern concerning the number of neurons across variable configurations or between layers. Also, dropout rates $$d_{1:3}$$ vary substantially, with the lowest rate of 1 % to the highest of 50 % dropout. As the activation function $$a_{dense}$$ for the fully connected model, softplus and relu both are selected twice. The number of neurons $$n_{dense}$$ is between 21 and 97 neurons. The maximum of 128 is not approximated by any of the model configurations.

files <- list.files("../output/bayes/", ".rds$", full.names = T) results <- lapply(files, function(x){ data <- readRDS(x) if(str_detect(x, "baseline")) type <- "baseline" if(str_detect(x, "structural")) type <- "structural" if(str_detect(x, "environmental")) type <- "environmental" if(str_detect(x, "basins")) unit <- "bas" if(str_detect(x, "states")) unit <- "adm" data[str_detect(data, "average")] = "avg." paras <- tibble( '$double\\_cnn$' = NA, '$a_{cnn}$' = NA, '$k_{cnn}$' = NA, '$\\beta_{cnn}$' = NA, "$pool_1$" = NA, "$pool\\_size$" = NA, '$pool_2$' = NA, '$lstm\\_layers$' = NA, '$n_1$' = NA, '$d_1$' = NA, '$n_2$' = NA, '$d_2$' = NA, '$n_3$' = NA, '$d_3$' = NA, '$a_{dense}$' = NA, '$n_{dense}$' = NA, '$a_{out}$' = NA, '$\\pi$' = NA, '$\\alpha$' = NA, '$\\gamma$' = NA, '$opti$' = NA, '$lr$' = NA) double_cnns = data[rep(grep("double_cnn", names(data)),4)] paras[,1] = paste(if_else(double_cnns == TRUE, "$Yes$", "$No$"), collapse = "/") paras[,2] = paste(data[grep("cnn_activation", names(data))], collapse = "/") paras[,3] = paste(data[grep("cnn_filters", names(data))], collapse = "/") paras[,4] = paste(data[grep("cnn_kernel", names(data))], collapse = "/") paras[,5] = paste(data[grep("cnn_pooling", names(data))], collapse = "/") paras[,6] = paste(data[grep("pool_size", names(data))], collapse = "/") paras[,7] = paste(data[grep("global_pooling", names(data))], collapse = "/") paras[,8] = paste(rep(data[grep("lstm_layers", names(data))], 4), collapse = "/") paras[,9] = paste(data[grep("lstm_neurons_1", names(data))], collapse = "/") paras[,10] = paste(round(unlist(data[grep("lstm_dropout_1", names(data))]), 2), collapse = "/") inputs = c("inputUnit", "inputB50", "inputB100", "inputB200") neurons_2 = unlist(lapply(inputs, function(i){ tmp = unlist(data[paste0(i,"_lstm_neurons_2")]) if(is.null(tmp)) return("-") tmp })) neurons_2 = paste(neurons_2, collapse = "/") dropout_2 = unlist(lapply(inputs, function(i){ tmp = unlist(data[paste0(i,"_lstm_dropout_2")]) if(is.null(tmp)) return("-") round(tmp,2) })) dropout_2 = paste(dropout_2, collapse = "/") paras[,11] = neurons_2 paras[,12] = dropout_2 neurons_3 = unlist(lapply(inputs, function(i){ tmp = unlist(data[paste0(i,"_lstm_neurons_3")]) if(is.null(tmp)) return("-") tmp })) neurons_3 = paste(neurons_3, collapse = "/") dropout_3 = unlist(lapply(inputs, function(i){ tmp = unlist(data[paste0(i,"_lstm_dropout_3")]) if(is.null(tmp)) return("-") round(tmp,3) })) dropout_3 = paste(dropout_3, collapse = "/") paras[,13] = neurons_3 paras[,14] = dropout_3 paras[,15] = data$dense_activation
paras[,16] = data$dense_units paras[,17] = data$out_activation
paras[,18] = round(data$pi, 4) paras[,19] = round(data$alpha, 4)
paras[,20] = round(data$gamma, 4) paras[,21] = data$optimizer
paras[,22] = round(data$lr, 4) paras$unit = unit
paras$type = type paras }) results = do.call(rbind, results) results$unit = factor(results$unit, levels = c("adm", "bas"), labels = c("$_adm_$", "$_bas_$")) results$type = factor(results$type, levels = c("baseline", "structural", "environmental"), labels = c("Conflict History", "Structural Variables", "Environmental Variables")) results["$a_{cnn}$"] = paste("$",sub("(/.*?)/(.*)", "\\1\\\\\\\\\\2", unlist(results["$a_{cnn}$"])), "$", sep = "") results["$a_{cnn}$"] = str_replace_all(unlist(results["$a_{cnn}$"]), "_", "\\\\_") results["$pool_1$"] = paste("$",unlist(results["$pool_1$"]), "$", sep = "") results["$pool_2$"] = paste("$",unlist(results["$pool_2$"]), "$", sep = "") results["$a_{dense}$"] = paste("$",str_replace(unlist(results["$a_{dense}$"]), "_", "\\\\_"), "$", sep = "") results["$a_{out}$"] = paste("$",str_replace(unlist(results["$a_{out}$"]), "_", "\\\\_"), "$", sep = "") results["$opti$"] = paste("$",str_replace(unlist(results["$opti$"]), "_", "\\\\_"), "$", sep = "") results %>% arrange(type, unit) %>% mutate_all(as.character) %>% #mutate('$double\\_cnn$' = str_replace( .[,2],pattern = "TRUE", "$Yes")) %>% pivot_longer(cols = 1:22, names_to = "Parameter") %>% pivot_wider(id_cols = c(Parameter), names_from = c(type, unit)) %>% # slice(1:17) %>% thesis_kable( align = c("lcccccc"), col.names = NULL, linesep = c(""), escape = F, caption = c("Results of the Bayesian hyperparameter optimization.") ) %>% kable_styling(latex_options = "HOLD_position", font_size = 8) %>% add_header_above(c(" ", "adm" = 1, "bas" = 1, "adm" = 1, "bas"= 1, "adm" = 1, "bas" = 1), italic = T) %>% add_header_above(c(" ", "Conflict History (CH)" = 2, "Structural Variables (SV)" = 2, "Environmental Variables (EV)" = 2), bold = T) %>% footnote(general = "Multiple values indicate the results for the input branch of buffer size 0/50/100/200 km, respecitvley.", threeparttable = T, escape = FALSE, general_title = "General:", footnote_as_chunk = T) Conflict History (CH) Structural Variables (SV) Environmental Variables (EV) adm bas adm bas adm bas Table 1.1: Results of the Bayesian hyperparameter optimization. $$double\_cnn$$ $$Yes$$/$$Yes$$/$$Yes$$/$$Yes$$ $$Yes$$/$$Yes$$/$$Yes$$/$$Yes$$ $$Yes$$/$$Yes$$/$$Yes$$/$$Yes$$ $$No$$/$$No$$/$$No$$/$$No$$ $$Yes$$/$$Yes$$/$$Yes$$/$$Yes$$ $$Yes$$/$$Yes$$/$$Yes$$/$$Yes$$ $$a_{cnn}$$ $$softsign/hard\_sigmoid\\softsign/softmax$$ $$softsign/hard\_sigmoid\\sigmoid/softsign$$ $$softmax/softmax\\hard\_sigmoid/softmax$$ $$sigmoid/softplus\\softplus/softmax$$ $$hard\_sigmoid/softmax\\hard\_sigmoid/hard\_sigmoid$$ $$softmax/softmax\\hard\_sigmoid/softmax$$ $$k_{cnn}$$ 106/125/45/78 86/99/39/128 95/70/63/42 41/102/67/43 19/41/94/99 92/76/66/42 $$\beta_{cnn}$$ 5/10/22/8 19/24/24/6 10/9/9/16 22/12/14/5 12/6/13/14 8/10/10/17 $$pool_1$$ $$max/max/avg./max$$ $$avg./max/max/max$$ $$max/avg./avg./max$$ $$avg./max/avg./avg.$$ $$max/avg./max/avg.$$ $$max/avg./avg./max$$ $$pool\_size$$ 15/23/3/14 24/19/10/21 12/18/17/18 16/6/10/9 12/16/9/22 12/18/17/16 $$pool_2$$ $$max/max/max/avg.$$ $$max/max/avg./avg.$$ $$max/max/max/avg.$$ $$max/max/avg./max$$ $$max/max/max/avg.$$ $$max/max/max/avg.$$ $$lstm\_layers$$ 2/2/2/2 3/3/3/3 2/2/2/2 2/2/2/2 3/3/3/3 2/2/2/2 $$n_1$$ 114/83/114/106 97/12/12/78 109/128/43/23 23/59/38/92 88/28/85/122 107/127/38/18 $$d_1$$ 0.22/0.08/0.14/0.49 0.23/0.2/0.38/0.25 0.02/0.09/0.07/0.06 0.22/0.14/0.02/0.2 0.2/0.1/0.09/0.31 0.02/0.07/0.05/0.06 $$n_2$$ 79/53/62/123 128/34/84/65 83/82/128/21 37/114/83/56 111/19/13/104 76/80/128/21 $$d_2$$ 0.05/0.28/0.37/0.08 0.5/0.24/0.05/0.16 0.34/0.03/0/0.29 0.26/0.12/0.46/0.47 0.23/0.04/0.09/0.49 0.34/0.02/0.01/0.31 $$n_3$$ -/-/-/- 85/92/96/57 -/-/-/- -/-/-/- 54/109/106/13 -/-/-/- $$d_3$$ -/-/-/- 0.437/0.231/0.255/0.172 -/-/-/- -/-/-/- 0.202/0.477/0.284/0.285 -/-/-/- $$a_{dense}$$ $$softplus$$ $$elu$$ $$relu$$ $$softplus$$ $$selu$$ $$relu$$ $$n_{dense}$$ 21 97 32 38 95 29 $$a_{out}$$ $$sigmoid$$ $$sigmoid$$ $$sigmoid$$ $$sigmoid$$ $$sigmoid$$ $$hard\_sigmoid$$ $$\pi$$ 0.4529 0.6534 0.2404 0.3856 0.5697 0.275 $$\alpha$$ 0.9245 0.7928 0.8056 0.9087 0.722 0.8544 $$\gamma$$ 6.2896 5.8844 7.8011 1.1974 4.8053 8.1588 $$opti$$ $$adagrad$$ $$adam$$ $$adamax$$ $$adagrad$$ $$adamax$$ $$adamax$$ $$lr$$ 0.0244 0.0116 0.0246 0.0258 0.0117 0.0226 General: Multiple values indicate the results for the input branch of buffer size 0/50/100/200 km, respecitvley. For the output layer activation $$a_{out}$$, sigmoid is selected in $$\frac{5}{6}$$ cases, hard_sigmoid is selected only once. This can be expected since sigmoid is usually the choice for probabilistic model output whereas hard_sigmoid is faster to compute but also less accurate. The results of the bias initialization show that the lowest initialization value selected is $$\pi = 0.2404$$. This result seems to be counter-intuitive since the high class imbalance in the data set the layer should be initialized with relatively small values (Lin et al., 2018). However, during training, the cut-off threshold to differentiate between conflict and peace district-months was held constant at a value of 0.5, thus $$\frac{4}{6}$$ models are still initialized with a bias towards the background class. The weight factor $$\alpha$$ is comparable between model configurations between 0.722 to 0.925, close to the highest possible value of 1. It thus effectively downweighs the contribution of the frequent peace observations during the calculation of the focal loss. The parameter $$\gamma$$ takes comparable values between 4.8053 and 8.1588 except for the variable configuration SV-bas. Here, with $$\gamma = 1.1974$$, an exceptionally low value is selected, leading to higher contributions of correctly identified observations to the overall loss compared to the other models. As an optimizer, adam or derivation of it are selected. In $$\frac{3}{6}$$ cases, it is the adamax optimizer. This indicates that optimizers that adapt the learning rate during a training process and keep information on prior gradient calculations, called momentum, are beneficial for the problem at hand. Initiating the learning rate between values of 0.0116 to 0.0258 delivers the best results across all model configurations. Overall, BO seems to be able to determine well performing model configurations for different representations of the data. It should be kept in mind that there are some limitations in the selected approach because BO was only conducted once on the outcome variable cb. However, especially the parameters of the focal loss function that substantially influence a model’s ability to learn from the data are expected to be sensitive to the specific class distribution per outcome class. It can be expected that repeating the BO for each outcome class would lead to better performance. BO and hyper-parameter tuning in general are not frequently conducted in conflict prediction studies. The main reason is that linear models seldom require hyper-parameters to be tuned. But even for more complex model algorithms, tuning is often omitted. Hegre et al. (2019) do not optimize the hyper-parameters of their Random Forest models but hold them static at values found empirically that produce accurate predictions. This is also true for the model of Kuzma et al. (2020), where the selection of hyper-parameters for their Random Forest model was guided by manual search. Schellens and Belyazid (2020) report for both their neural network and the Random Forest model that hyper-parameters were delineated using “rules-of-thumb” (Schellens and Belyazid, 2020, p. 4), i.e., the parameters were set in advance based on the size of the training data set, and no search was implemented. For Random Forest, the tuneable parameters are only a few, comprising the number of trees and the number of variables to be considered at node splits (Breiman, 2001). It should be noted that Random Forest can deliver robust results with low sensitivity to its hyper-parameters and that e.g., increasing the number of trees is not always beneficial, especially for classification tasks (Probst and Boulesteix, 2018). For DL models, the number of tuneable parameters is substantially higher. Experimental studies from other domains have shown that model performance is very sensitive to their initialization rendering hyper-parameter optimization a necessity (Cooney et al., 2020; Zhang and Wallace, 2016). ## 1.2 Global Performance The following considerations are focused on the $$F_2$$-score, the AUPR metric, precision and sensitivity for reasons of brevity. Visualizations of additional accuracy metrics as well as a comprehensive table comparing these metrics for model specifications are listed in the Appendix (Figure ?? and Table ??). The $$F_2$$-score serves as the primary performance metric in this thesis. From Figure 1.1, it becomes evident that framing the detection problem as a time series and applying a DL framework substantially increases the $$F_2$$-score compared to the LR baseline. However, the LR baseline based on the bas districts achieves higher scores for all conflict classes compared to adm-LR, showing that aggregation on the basis bas of districts is beneficial for the prediction. The score gains from the LR baseline to the CH theme are considerable for both aggregation units. For the outcome classes sb and ns, both aggregation units show an increased variance for the 10 repeats in the CH set, but for bas districts this variance is substantially larger compared to the adm districts. The results suggest that the history of recent conflicts as the sole predictor already is able to achieve better performance compared to the fully specified LR baseline. However, the performance seems not to be very stable but dependent on the random initialization of model parameters, especially for bas models. The performance on the outcome classes cb and sb increases with more complex predictor sets for both aggregation units. In general, the absolute gains in performance between the DL models are only marginal. For the outcome class cb the $$F_2$$-score raises from 0.590 (0.664) with the CH predictor set, to 0.616 (0.649) with the SV set to to 0.618 (0.689) with the EV predictor set for the adm (bas) districts, resulting in an absolute gain of 0.028 (0.025) points (Table ??). #files <- list.files("../output/acc/", pattern = ".rds", full.names = T) files <- list.files("../output/test-results/", pattern = "test-results", full.names = T) # files <- files[-grep("test-results-regression-basins", files)] acc_data <- lapply(files, function(x){ out <- vars_detect(x) tmp = readRDS(x) tmp %>% filter(month == 0) %>% group_by(name) %>% summarise(mean = mean(score, na.rm = T), median = median(score, na.rm = TRUE), sd = sd(score, na.rm = T), type = outtype,
unit = out$unit, var = out$var) %>%
ungroup()
})
acc_data = do.call(rbind, acc_data)

acc_data %<>%
mutate(var = factor(var,
levels = c("all", "sb", "ns", "os"),
labels = c("cb", "sb", "ns", "os")),
type = factor(type,
labels = c("LR", "CH", "SV", "EV"),
levels = c("regression", "baseline", "structural", "environmental")),
unit = if_else(unit == "basins", "bas", "adm"),

pd <- position_dodge(0.1)
score = "median"
labs.unit = c("_adm_", "_bas_")
labs.name = c("", "")
names(labs.name) = c("f2", "f2")

plt_f2 = acc_data %>%
dplyr::filter(name == "f2") %>%
dplyr::select(type, unit, var, name, score = !!score, sd) %>%
ggplot(aes(color=var, x=type, shape=var)) +
geom_point(aes(y=score), position = position_dodge2(width = .5)) +
geom_errorbar(aes(ymin=score-sd, ymax=score+sd, x=type), width=.5, position = position_dodge2(width = .5)) +
scale_color_brewer(palette = "Dark2") +
labs(x="", y = expression(paste(F[2],"-score")), color = "Conflict Class", shape = "Conflict Class") +
ylim(0,1) +
facet_wrap(~unit+name, labeller = labeller(unit = labs.unit, name = labs.name)) +
guides(color=guide_legend(ncol=4)) +
my_theme

plt_aucpr = acc_data %>%
filter(name == "aucpr") %>%
dplyr::select(type, unit, var, name, score = !!score, sd) %>%
ggplot(aes(color=var, x=type, shape=var)) +
geom_point(aes(y=score), position = position_dodge2(width = .5)) +
geom_errorbar(aes(ymin=score-sd, ymax=score+sd, x=type), width=.5, position = position_dodge2(width = .5)) +
scale_color_brewer(palette = "Dark2") +
labs(x="Model type", y = "AUCPR", color = "Conflict Class", shape = "Conflict Class") +
ylim(0,1) +
facet_wrap(~unit+name, labeller = labeller(unit = labs.unit, name = labs.name)) +
guides(color=guide_legend(ncol=4)) +
my_theme

if(knitr::is_latex_output()){
ggarrange(plt_f2, plt_aucpr, ncol = 1, common.legend = T, legend = "bottom")
} else {
require(plotly)
subplot(plt_f2, plt_aucpr, nrows = 2) %>%
layout(legend = list(orientation = "h",
xanchor = "center",
x = 0.5, y = -.1))
}

Figure 1.1: Global performance of the F2-score (top) and AUCPR metric (bottom). (LR: Logistic Regression, CH: Conflict History, SV: Structural Variables, EV: Environmental Variables)

For the two other conflict classes, ns and os, the picture is quite different. The first observation is that the absolute scores achieved for these classes are substantially lower. Partly, this can be explained with the lower occurrence of these classes in the data set (Table ??). As a second observation, there is evidence that for bas districts for both classes the performance is improving with more complex predictors (Figure 1.1). Considering the ns class, the $$F_2$$-score raises about 0.112 points from the CH to the EV theme. For the os conflict class, the increase is less pronounced with 0.053 points. In comparison, the performance for adm districts is decreasing with more complex predictor sets for outcome class ns. The difference between the CH and EV set is -0.029. For the os outcome variable, there is an increase in the $$F_2$$-score. With an absolute gain of 0.022 points from the CH to EV theme, this gain is only about half the size observed with bas units. These findings indicate that for the ns and os conflict classes, environmental variables based on bas districts play a major role in improving the predictive performance of the DL networks. An attempt to explain this could be the nature of these conflicts. Conflicts between non-state actors might include two groups with clashing socio-economic interests contesting over the available (natural) resources (Sundberg et al., 2012) while os events of violence often are associated with singular terrorist acts against the state or civilians (Eck and Hultman, 2007). Both types of conflict come with high opportunity costs for the actors inducing the violence. In contrast, state-based violence in an already conflict-prone area shows lower opportunity costs for the state actor. In its own logic, a state has to defend itself against rebels or insurgencies to maintain its legitimate power position (Collier, 1998). Also, the availability of resources to engage in warfare is usually higher for the state than for other combating actors, further reducing opportunity costs. Thus, ns and os conflict classes might show a stronger relationship to natural resources. Changes in resource availability substantially reduce the opportunity costs for deprived individuals to get involved in violent actions when their livelihoods are seriously endangered.

Considering the AUPR values, the finding that DL models substantially increase performance over the LR baseline is confirmed (Figure 1.1). The larger variance of the CH-bas model becomes even more evident compared to the results of the $$F_2$$-score. Also, the tendency for bas districts to achieve higher performances with more complex predictor sets is more pronounced for all outcome variables. On the other hand, decreasing or stable AUPR values are observed for adm districts. For both, ns and os conflict classes, a substantial decline from the CH set to the EV set is observed, accounting for -0.064 and -0.041 points, respectively. In contrast, for bas districts and the os outcome class, the increase in AUPR from the CH to the EV theme amounts to 0.08 points. For the ns class, an increase of 0.039 points is observed. The increases for the cb and sb class are even more substantial (0.066 and 0.098, respectively). This underlines the finding that more complex predictor sets do not substantially benefit the performance of DL models when combined with adm districts. The selected predictors, however, do increase the performance in combination with bas districts, indicating that the environmental processes happening on the ground are better captured at the sub-basin watershed level.

Comparing the performance to other conflict prediction studies is complicated because (i) some studies do not directly report selected metrics, and (ii) the data sets and definitions used to identify conflicts can be different. For example, the $$F_2$$-score is seldom reported, but it can be directly calculated for cases where the sensitivity and precision metrics are given. It should be noted that differences in the definition of conflict classes can not be accounted for directly. Hegre et al. (2019) reach an $$F_2$$-score of about 0.86 for the sb outcome variable on the African continent based on a country-month analysis which is substantially higher than the maximum $$F_2$$-score of 0.67 for the sb class achieved by the SV-adm model in the present analysis. They also model conflict on a regular grid with a cell size of 0.5 x 0.5 decimal degrees. For this representation, the $$F_2$$-score only reaches about 0.36, underlining the rigorous impact different aggregation units have on model performance. The $$F_2$$-scores in this study lie in between the values achieved by Hegre et al. (2019) on large and small-scale aggregation units but deliver greater spatial detail than their country-month analysis. The comparison reveals a trade-off between predictive performance and spatial detail, which needs to be considered when evaluating conflict prediction models.

Kuzma et al. (2020), on the other hand, use adm districts across Africa and Asia for their prediction model. They use a subtle difference in their definition of conflict by applying a moving-window of 12 months into the future and setting a threshold of 10 or more casualties to consider a given district-month as conflict. This way, they are able to obtain a $$F_2$$-score of 0.70 for African districts (0.74 overall). Given that the cb outcome class achieves a maximum $$F_2$$-score of 0.69 and that similar aggregation districts were used, the present study’s overall performance is comparable. Halkia et al. (2020) are able to achieve an $$F_2$$-score of 0.79 based on a global country-month data set, indicating that higher performances can be achieved with simple linear regression models for a loss in spatial detail. They also use a slightly different definition of conflict by combining different types of conflicts into two classes representing sub-national conflicts and national power conflicts. While they do not report AUPR values, they are able to achieve AUC values of 0.94 for both classes. This performance is consistent with the models presented here, where AUC values between 0.944 and 0.963 are achieved for different conflict classes (Table ??). Overall AUPR scores for the complete study domain are reported by Kuzma et al. (2020) accounting for a score of 0.42. Hegre et al. (2019) report an AUPR score of 0.869 for sb conflicts based on country-months and of 0.277 based on the grid representation. In the current study, for the sb outcome class, a maximum AUPR score of 0.65 is achieved, while for the cb, the score is 0.639. In terms of AUPR, this study achieves better performances compared to Kuzma et al. (2020) who use similar aggregation units. In comparison to Hegre et al. (2019), the results are again in between the different aggregation units used in their study, indicating losses in performance for increased spatial detail.

plt_prec = acc_data %>%
filter(name == "precision") %>%
dplyr::select(type, unit, var, name, score = !!score, sd) %>%
ggplot(aes(color=var, x=type, shape=var)) +
geom_point(aes(y=score), position = position_dodge2(width = .5)) +
geom_errorbar(aes(ymin=score-sd, ymax=score+sd, x=type), width=.5, position = position_dodge2(width = .5)) +
scale_color_brewer(palette = "Dark2") +
labs(x="", y = "Precision", color = "Conflict Class", shape = "Conflict Class") +
ylim(0,1) +
facet_wrap(~unit+name, labeller = labeller(unit = labs.unit, name = labs.name)) +
guides(color=guide_legend(ncol=4)) +
my_theme

plt_sens = acc_data %>%
filter(name == "sensitivity") %>%
dplyr::select(type, unit, var, name, score = !!score, sd) %>%
ggplot(aes(color=var, x=type, shape=var)) +
geom_point(aes(y=score), position = position_dodge2(width = .5)) +
geom_errorbar(aes(ymin=score-sd, ymax=score+sd, x=type), width=.5, position = position_dodge2(width = .5)) +
scale_color_brewer(palette = "Dark2") +
labs(x="Model type", y = "Sensitivity", color = "Conflict Class", shape = "Conflict Class") +
ylim(0,1) +
facet_wrap(~unit+name, labeller = labeller(unit = labs.unit, name = labs.name)) +
guides(color=guide_legend(ncol=4)) +
my_theme

if(knitr::is_latex_output()){
ggarrange(plt_f2, plt_aucpr, ncol = 1, common.legend = T, legend = "bottom")
} else {
require(plotly)
subplot(plt_f2, plt_aucpr, nrows = 2) %>%
layout(legend = list(orientation = "h",
xanchor = "center",
x = 0.5, y = -.1))
}

Figure 1.2: Global performance of precision (top) and sensitivity (bottom). (LR: Logistic Regression, CH: Conflict History, SV: Structural Variables, EV: Environmental Variables)

Both of the performance metrics previously discussed are a combination of precision and sensitivity into a single metric. The $$F_2$$-score puts more weight on sensitivity while AUPR provides a more balanced assessment. As presented in Section ??, precision and sensitivity behave in a fragile balance to each other. Given a specific prediction model with fitted parameters, increases in precision will decrease sensitivity and vice versa. In Figure 1.2 both of these metrics are presented. For adm districts, there is a substantial increase in precision from the CH to SV set. Simultaneously, this increase is associated with a decrease in sensitivity most pronounced for the ns outcome class. Moving to the EV set, precision is slightly reduced, and some moderate sensitivity gains are observed. This behavior indicates the EV set achieves a more balanced distribution compared to other predictor sets compromising between precision and sensitivity. In general, one would expect a simultaneous increase of both performance metrics for a model with greater predictive ability. This is not found in the data at hand, rather more complex predictor sets achieve a more sensible balance between the metrics. The same observation holds for bas districts, except that the relation between sensitivity and precision is reversed. Moving from the CH to the SV set, sensitivity is increased at the cost of precision. The EV set again seems to settle on a sensible balance with slight increases in precision and decreases in precision compared to the SV set. Comparing the EV set for both aggregation units reveals that while the precision is comparable, bas districts are associated with higher sensitivity values compared to adm districts except for outcome variable os. This indicates that on the basis of a comparable precision rate, the models based on bas districts are more able to flag observed conflict district-months. Concerning the standard deviation, for adm districts the variance is merely decreased with more complex predictor sets. However, bas models seem capable to substantially reduce the variance with more complex predictor sets. The increased stability of prediction performance for bas districts indicates that the influence of the random initialization of the neural network’s parameters does not effect the predictive performance for these models to the same degree, overall increasing the credibility of the predictions based on bas units.

In general, a precision greater 0.5 is only achieved once across all model configurations. This indicates that most combinations are characterized by a high false-positive rate well above 50 %. Concerning sensitivity, most models achieve performances close to a value of 0.75, indicating that about 75 % of observed conflict district-months are detected. The maximum value of 0.876 is achieved for the combination of structural variables with bas units for the outcome class cb associated with a comparatively low precision value of 0.32. Concerning the ability of a model to correctly identify peaceful district-months, the specificity, for almost all outcome variables, maximum values $$>0.93$$ are achieved (not reported here - see Table ?? in Appendix).

Hegre et al. (2019) do not report sensitivity and precision specifically in their main article. However, for the sb conflict class, these metrics are given in the Supplementary Materials of their article. The final ensemble model achieves a precision (sensitivity) of 0.608 (0.963) for the country-month representation and 0.214 (0.443) for the grid-based aggregation. Considering the EV predictor set, for the adm representation a precision (sensitivity) of 0.39 (0.78) is achieved, and for the bas representation, it is 0.42 (0.79). From the comparison, it becomes evident that both metrics achieve substantially lower performance than the country-month theme by Hegre et al. (2019). However, while the losses compared to the country-month analysis amount to approximately 1/3, the performance is nearly doubled compared to the grid representation. In the following the cb outcome variable’s performance is considered because it most closely matches the distinct conflict definitions. In the present study precision (sensitivity) of 0.40 (0.72) is achieved for the adm districts, while for bas districts 0.42 (0.82) is achieved. Kuzma et al. (2020) report precision (sensitivity) values of 0.40 (0.85) for the African continent using similar aggregation units. While the precision is comparable, the performance of adm districts measured by sensitivity is substantially lower. However, for bas districts, the achieved performance is comparable to the results of Kuzma et al. (2020), indicating that with the selected predictor variables the bas representation more closely achieves comparable performance to related studies. Halkia et al. (2020) are able to achieve values of 0.577 (0.866) based on country-months. While their sensitivity is comparable to the one achieved by bas districts, precision is substantially higher. The comparison reveals that overall in terms of precision, the current model configurations require improvement in relation to other studies.

files <- list.files("../output/acc/", pattern = ".rds", full.names = T)

pr_data <- lapply(files, function(x){
out <- vars_detect(x)

df_pr <- lapply(1:length(tmp), function(i){
rep = tmp[[i]]$rep recall = tmp[[i]]$aucpr$curve[,1] precision = tmp[[i]]$aucpr$curve[,2] recall = recall[ c(rep(FALSE, 3), TRUE)] # reduce to every 4th entry to reduce file size precision = precision[ c(rep(FALSE, 3), TRUE)] # reduce to every 4th entry to reduce file size pr <- tibble(type = out$type,
unit = out$unit, var = out$var,
rep = rep,
recall = recall,
precision = precision)
})
df_pr = do.call(rbind, df_pr)
})

pr_data <- do.call(rbind, pr_data)

pr_data %<>%
mutate(var = factor(var,
labels = c("__cb__", "__sb__", "__ns__", "__os__"),
levels = c("all", "sb", "ns", "os")),
type = factor(type,
labels = c("CH", "SV", "EV"),
levels = c("baseline", "structural", "environmental")),
unit = factor(unit, levels = c("states", "basins"), labels = c("adm", "bas")))

Precision-Recall curves (PR-cureves) are used to compare the performances of distinct models for varying values of precision and sensitivity more directly. For reasons of brevity, the discussion will focus on the PR-curves, but ROC curves are found in the Appendix (Figures ?? and ??). Figure 1.4 depicts the PR-curves for the adm representation. Note that the bold lines indicate the averaged performance across 10 repeats, while faint lines represent the curves of individual models. LR baseline is not included because of the different training process which made it mandatory to train one model for each month in the prediction horizon. For the conflict class cb, it becomes evident that CH and SV predictor sets can maintain a higher precision for sensitivity values up to 0.3. However, with the EV predictor set, higher precision values are obtained for sensitivity above 0.6. This shows that the EV set is superior to the other predictor sets in the range of high sensitivity values. A similar pattern emerges for the sb conflict class, where the EV predictor set firstly achieves lower sensitivity values at high precision rates. When sensitivity increases over a value of 0.75, the EV set is able to maintain higher precision values.

pr_data %>%
mutate(group = paste(type,var,rep,sep="-")) %>%
select(-unit) %>%
arrange(type,var,rep) %>%
ggplot(aes(x=recall, y=precision, linetype=type, color=type)) +
geom_smooth(aes(), alpha=1, size=.5, se=FALSE) +
geom_line(aes(group=group), size=.1, alpha = .5) +
scale_color_manual(values = RColorBrewer::brewer.pal(n=4, name  = "Dark2")[2:4]) +
facet_wrap(~var, ncol = 2) +
labs(color = "Model", y = "Precision", x = "Sensitivity", linetype = "Model", subtitle = "adm") +
guides(color=guide_legend(ncol=3, alpha=1)) +
my_theme +
theme(
plot.subtitle =  element_markdown(size = 10, face =  "italic")
) 

Concerning the ns class, the picture is quite different. The CH set outperforms both of the other sets almost over the complete value range. The EV set shows the lowest precision values at high rates of sensitivity, indicating that for this outcome class SV and EV predictors do not improve model performance. For the os class, the CH and EV sets almost perform identically. However, the EV set constantly achieves slightly higher precision values when sensitivity is above 0.5. The SV set performs substantially worse than the other predictor sets for sensitivity values between 0 and 0.7. Considering the bas representation for the outcome variable cb, CH and EV virtually show an identical performance (Figure 1.3). The SV predictor set obtains lower precision values for sensitivities between 0.3 to 0.9. Concerning the sb outcome variable, SV achieves higher precision than the other sets for low sensitivity values. Towards higher sensitivity, the CH set can maintain the highest precision while only slight differences exist between the SV and EV sets. Concerning the ns outcome variable, the pattern is similar to the one found for adm districts, indicating that the prediction of the ns outcome class does not benefit from more complex predictor sets. For os conflicts, similar to the adm representation, the EV set outperforms the other two sets almost over the complete value range. However, the difference to the CH set is only marginal, while SV achieves the lowest precision for increasing sensitivity values.

pr_data %>%
filter(unit == "bas") %>%
mutate(group = paste(type,var,rep,sep="-")) %>%
select(-unit) %>%
arrange(type,var,rep) %>%
ggplot(aes(x=recall, y=precision, linetype=type, color=type)) +
geom_smooth(aes(), alpha=1, size=.5, se=FALSE) +
geom_line(aes(group=group), size=.1, alpha = .5) +
scale_color_manual(values = RColorBrewer::brewer.pal(n=4, name  = "Dark2")[2:4]) +
facet_wrap(~var, ncol = 2) +
theme_classic() +
labs(color = "Model", linetype = "Model", y = "Precision", x = "Sensitivity", subtitle = "bas") +
guides(color=guide_legend(ncol=3, alpha=1)) +
my_theme +
theme(
plot.subtitle =  element_markdown(size = 10, face =  "italic")
) 

To summarize these findings, the EV set outperforms other model configurations for adm units for cb, sb, and os conflict classes. For the ns class, irrespective of the aggregation units, the CH set outperforms the other model configurations. Concerning the bas units, the EV set shows slightly better performance for the os class, but similar or worse performance to the CH set for all other classes. These findings do not support previous findings of a systematically increased performance of conflict prediction when environmental variables are included. The difference to models based only on conflict history is marginal, and the most simple model frequently outperforms complex variable configurations. However, the specific conflict class seems to play a role in how structural and environmental variables change conflict prediction performance. Specifically, the EV seems to improve the performance on the os outcome class as well as on cb and sb outcome classes for adm districts. It should be noted that the smoothed PR-curves do not represent the performance of a specific model, but the general tendency of a predictor set’s performance. The variance between the 10 repeats of training, however, can be quite high as indicated with previous results.

Despite their capability to easily compare different model configurations, PR-curves are not reported very often in related studies. Halkia et al. (2020) report distinct PR-curves for their 10-fold cross-validation strategy. Their results indicate that their model is more capable of achieving relatively high precision scores for increasing sensitivity which is in agreement with previous findings on the global performance. It underlines that the presented DL methods show weaker performance in terms of precision compared to related studies. The desirable feature of a model to maintain precision while increasing sensitivity is not observed at an equal level. Partly, this can be explained by the aggregation to country-months and differences in the definition of the conflict classes. However, comparing to existing conflict prediction tools reveals that it is possible to achieve a better balance between precision and sensitivity. Based on the selected predictor variables in this study, the models ability to reduce the False Positive Rate is limited compared to other studies. The deliberate exclusion of predictor variables delivering more context on the political and cultural compilation of a district could be a reason for the evident reduced performance in terms of precision of the proposed method. The results also illustrate the complexities in the field of conflict prediction. The focus of a machine learning procedure on one metric can adversely affect other performance metrics. It is essential to carefully select a metric for optimization, which represents the overall goal of the prediction task most closely. Because the costs associated with missing out on a conflict district-month are expected to be higher as the costs of wrongly predicting a conflict for a given district-month, the focus on the $$F_2$$-score mirrors this assumption. Optimizing towards it means putting more value on sensitivity which explains the reduced performance in terms of precision from a methodological point of view.

## 1.3 Temporal Performance

#files <- list.files("../output/acc/", pattern = ".rds", full.names = T)
files <- list.files("../output/test-results/", pattern = "test-results", full.names = T)
# files <- files[-grep("test-results-regression-basins", files)]
acc_data <- lapply(files, function(x){
out <- vars_detect(x)
tmp$type = out$type

tmp %>%
filter(month != 0)
})
acc_data = do.call(rbind, acc_data)

acc_data %<>%
mutate(var = factor(var, labels = c("all", "sb", "ns", "os"),
levels = c("all", "sb", "ns", "os")),
type = factor(type,
labels = c("LR", "CH", "SV", "EV"),
levels = c("regression", "baseline", "structural", "environmental")),
unit = if_else(unit == "basins", "bas", "adm"),

labs.var = c("__cb__", "__sb__", "__ns__", "__os__")
names(labs.var) = c("all", "sb", "ns", "os")

The occurrence of conflict is predicted 12 months into the future using a single threshold. Indications for the performance of different model configurations to distinguish between conflict and peace district-months irrespective of the time-dimension are given as density plots in the Appendix (Figures ?? to ??). Despite using a single cut-off value, the performance of the evaluated model is different along the months in the prediction horizon. Figure 1.5 depicts the obtained $$F_2$$-scores for both aggregation units considering each month in the prediction horizon individually. In general, the performance is relatively stable for most model configurations. Variations over time remain mostly subtle, but changes in both directions towards lower and higher scores are observed for the DL frameworks. The performance of the LR baseline deteriorates with an increasing prediction horizon for almost all conflict classes and is substantially lower compared to the DL models. For the outcome variable cb, a better performance of the SV and EV sets compared to the CH theme is observed for adm districts, though they almost perform identically. For bas districts, EV slightly outperforms all other themes over the entire horizon, but SV shows lower performance than CH. Better performances of the more complex predictor sets compared to the CH theme become more evident for the sb conflict class.

acc_data %>%
filter(name == "f2") %>%
dplyr::select(name, score, month, var, unit, type) %>%
ggplot(aes(x=month, y=score, color=type, linetype=type))+
stat_smooth(geom='line', alpha=.6, size=.5, se=FALSE, lwd=1)+
facet_grid(unit~var, labeller = labeller(unit = labs.unit, var = labs.var)) +
scale_x_continuous(breaks=c(1:12)) +
ylim(0,1) +
theme_classic() +
scale_color_brewer(palette="Dark2") +
labs(subtitle = "F2-score", x = "Prediction window (in months)", y = "Score", color = "Model", linetype = "Model") +
guides(color=guide_legend(ncol=4)) +
my_theme -> plt_out
plot_output(plt_out)

Figure 1.5: Time dependent performance of the F2-score for adm (top) and bas (bottom) districts. Single lines are obtained by smoothing over the individual scores obtained for 10 repeats.

For both aggregation units, the difference to the CH theme is more substantial over the complete prediction horizon. The pattern of the SV and EV theme over the prediction horizon is very similar and they perform equally well. Differences between adm and bas districts become evident for the ns class. Here the bas models almost show a monotonic increase in prediction performance over the horizon. The EV set overall achieves higher scores, especially after five months into the horizon. For adm districts, the CH theme outperforms the other predictor sets over the first half and towards the end of the horizon. Only for eight and nine months into the future, SV and EV achieve higher $$F_2$$-scores and then experience a substantial decrease in performance. For the os outcome class, the DL models perform equally well for the adm representation with a slightly better performance of the EV set. For bas districts, the increased performance of the EV is more pronounced, but towards the end of the horizon the difference compared to the SV set is reduced.

acc_data %>%
filter(name == "aucpr") %>%
dplyr::select(name, score, month, var, unit, type) %>%
ggplot(aes(x=month, y=score, color=type, linetype=type))+
stat_smooth(geom='line', alpha=.6, size=.5, se=FALSE, lwd=1)+
facet_grid(unit~var, labeller = labeller(unit = labs.unit, var = labs.var)) +
scale_x_continuous(breaks=c(1:12)) +
ylim(0,1) +
theme_classic() +
scale_color_brewer(palette="Dark2") +
labs(subtitle = "F2-score", x = "Prediction window (in months)", y = "Score", color = "Model", linetype = "Model") +
guides(color=guide_legend(ncol=4)) +
my_theme -> plt_out
plot_output(plt_out)

Figure 1.6: Time dependent performance of the AUPR metric. Single lines are obtained by smoothing over the individual scores obtained for 10 repeats.

In general, for the adm representation, the differences between the SV and EV sets are only marginal for all conflict class except ns, suggesting that environmental variables do not further increase the prediction performance. For the ns class, CH clearly shows some advantages over the other predictor sets for the most part of the prediction horizon, which is in agreement with previous findings. Considering the bas representation, EV outperforms other predictor sets except of the sb outcome class where the difference to the SV set is marginal. In contrast to adm districts, the performance compared to the CH theme is substantial, except for outcome class cb, suggesting that the selected predictor variables have a greater effect on prediction performance when they are aggregated on the bas districts.

acc_data %>%
filter(name == "sensitivity") %>%
dplyr::select(name, score, month, var, unit, type) %>%
ggplot(aes(x=month, y=score, color=type, linetype=type))+
stat_smooth(geom='line', alpha=.6, size=.5, se=FALSE, lwd=1)+
facet_grid(unit~var, labeller = labeller(unit = labs.unit, var = labs.var)) +
scale_x_continuous(breaks=c(1:12)) +
ylim(0,1) +
theme_classic() +
scale_color_brewer(palette="Dark2") +
labs(subtitle = "F2-score", x = "Prediction window (in months)", y = "Score", color = "Model", linetype = "Model") +
guides(color=guide_legend(ncol=4)) +
my_theme -> plt_out
plot_output(plt_out)

Figure 1.7: Time dependent performance of sensitivity. Single lines are obtained by smoothing over the individual scores obtained for 10 repeats.

Concerning the temporal performance measured by the AUPR, the general pattern is comparable to the one found for the $$F_2$$-score (Figure 1.5). We observe almost identical patterns for adm districts and outcome variables cb and sb. For outcome variables ns and os the higher performance of the CH is even more pronounced. Concerning the bas districts, the pattern is reversed and the differences of the SV and EV sets to the CH set is more strongly evident. For outcome variables cb and os, the EV set shows slightly higher performances than the SV set for the complete prediction horizon. The monotonic increase in performance of the ns outcome variable is not as clear, reaching a plateau after five months for the CH set and after eight months for the SV and EV sets.

Directly comparing model performances in terms of precision and sensitivity delivers an imperfect picture of the comparative performances between different model configurations. Rather than the absolute precision or sensitivity values, the balance between these metrics as expressed in the $$F_2$$-score or the AUPR better captures information on their mutual dependency. An alternative perspective to compare model performance in terms of precision and sensitivity is to analyze the “cost” a gain in one metric is associated with a loss in the other. An insightful distinction between adm and bas districts is found in this regard (Figure 1.7 & Figure 1.8). While the CH set initially delivers comparatively high values of sensitivity for adm districts, it is associated with higher precision values for bas districts. Increasing the complexity of the predictor variables with the SV set leads to a substantial decrease in sensitivity for adm districts but simultaneously increases the precision. For bas districts, moving to the SV set has the opposite effect of increasing the sensitivity but decreasing precision. Finally, the EV set achieves sensitivity and precision rates in between the CH and SV sets. These findings are in agreement with the results of the global performances. The EV theme is better capable of mediating between precision and recall for both aggregation districts. A distinction between the aggregation units is the direction from which this balance is achieved. Solely based on the CH set, adm districts seem beneficial to reach high sensitivity values. However, the predictions are associated with a high rate of false alarms because the precision is low. Using bas districts, CH delivers relatively high precision values, but the rate observed conflicts are detected remains low. This contrasting behavior could be harnessed to build early-warning systems with a specialized focus on precision or sensitivity. Some applications might emphasize the importance of detecting all future conflict district-months over the precision to only select relevant cases. The results suggest that such a case would benefit from a focus on adm districts. For cases where high precision is more relevant than selecting all future conflicts, a model based on bas districts could prove a valuable choice.

Concerning related studies, only Hegre et al. (2019) report time-dependent performance of their models. The prediction window differs from the one presented here, as they predict the occurrence of conflict for up to 36 months into the future. Additionally, they restrict their time-dependent performance analysis to the AUC and AUPR metrics. For the first 12 months, similar to the results presented here, the AUPR metric shows great stability for the country-month representation but with values above 0.90. For the grid representation, there is less stability and values as low as 0.35 are observed. Consistent with previous findings, the country-month data set can achieve substantially higher scores than the presented DL models. However, because the prediction window in this study has been set to 12 months only, the long-term performance can not be compared.

acc_data %>%
filter(name == "precision") %>%
dplyr::select(name, score, month, var, unit, type) %>%
ggplot(aes(x=month, y=score, color=type, linetype=type))+
stat_smooth(geom='line', alpha=.6, size=.5, se=FALSE, lwd=1)+
facet_grid(unit~var, labeller = labeller(unit = labs.unit, var = labs.var)) +
scale_x_continuous(breaks=c(1:12)) +
ylim(0,1) +
theme_classic() +
scale_color_brewer(palette="Dark2") +
labs(subtitle = "F2-score", x = "Prediction window (in months)", y = "Score", color = "Model", linetype = "Model") +
guides(color=guide_legend(ncol=4)) +
my_theme -> plt_out
plot_output(plt_out)