Child exposure to diseases caused by suboptimal breastfeeding

Theory

We are interested in the number of newborns exposed to suboptimal breastfeeding. Unfortunately, GBD does not disclose any exposure data. Instead, they offer cause prevalence data (individuals affected by a disease measured in number, YLD, YLL or DALY) and risk-specific cause data (individuals affected by a disease caused by a risk-factor measured in YLD, YLL or DALY). We will attempt to approximate exposure (measured in number of individuals exposed) from these data.

First, we calculate risk-specific prevalences, i.e. the number of disease cases that can be attributed to a specific risk factor. Because GBD offers different measures of prevalence, we have three options:

$ P_{c,r} = P_{c} * \dfrac{YLD_{c,r}}{YLD_{c}} $

$ P_{c,r} = P_{c} * \dfrac{YLL_{c,r}}{YLL_{c}} $

$ P_{c,r} = P_{c} * \dfrac{DALY_{c,r}}{DALY_{c}} $

where $P_{c}$ is the prevalence of disease c (in GBD language: cause c), $YLD_c$ is the number of YLD caused by cause c, and $YLD_{c,r}$ is the number of YLD caused by cause c attributable to risk factor r. The notations for YLL and DALY are analogous to YLD.
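As a minimal sketch of the three formulae, the helper below scales a cause prevalence by the attributable share of one burden measure. The function name `risk_specific_prevalence` and all numbers are made up for illustration and are not GBD values.

```python
def risk_specific_prevalence(p_c: float, measure_cr: float, measure_c: float) -> float:
    """Scale the cause prevalence P_c by the share of a burden measure
    (YLD, YLL or DALY) that is attributable to risk factor r."""
    if measure_c == 0:
        # The attributable share is undefined when the cause has no burden.
        return float("nan")
    return p_c * measure_cr / measure_c


# Made-up numbers for illustration only (not GBD values):
p_c = 1000.0    # prevalent cases of cause c
yld_c = 50.0    # YLD caused by cause c
yld_cr = 10.0   # YLD caused by cause c and attributable to risk r

p_cr = risk_specific_prevalence(p_c, yld_cr, yld_c)
print(p_cr)  # 200.0
```

The same function applies unchanged to YLL and DALY, since only the ratio of the risk-specific measure to the total measure enters the formula.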

Data download

Cause data

Cause prevalence data ($P_c$, $YLL_c$, $YLD_c$, $DALY_c$) are available from the GBD website. We will download the data automatically using the Python library ddf_utils. The GBD codes that comprise the query can be looked up here.

Risk-specific data

Risk-specific data ($YLD_{c,r}$, $YLL_{c,r}$, $DALY_{c,r}$) are available from the same API. Unfortunately, the ddf_utils library contains a bug that we need to fix before we can download the data.

Load and clean data

Cause data

Let's drop the id columns for a better overview (except the location ids, which we need to differentiate between locations of the same name). Also, let's rename the other columns to something nicer.
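A minimal pandas sketch of this clean-up step follows. The column names (`measure_id`, `location_name`, and so on), the id values and the numbers are assumptions for illustration; the downloaded table may be laid out differently.

```python
import pandas as pd

# Tiny stand-in for the downloaded cause table.
# Column names, ids and values are assumed / made up for illustration.
cause_raw = pd.DataFrame({
    "measure_id": [1, 2],
    "measure_name": ["Prevalence", "YLD"],
    "location_id": [1, 1],
    "location_name": ["Global", "Global"],
    "age_id": [10, 10],
    "age_name": ["Post Neonatal", "Post Neonatal"],
    "cause_id": [100, 100],
    "cause_name": ["Diarrheal diseases", "Diarrheal diseases"],
    "val": [100.0, 10.0],
    "lower": [80.0, 8.0],
    "upper": [120.0, 12.0],
})

# Drop every *_id column except location_id, which disambiguates
# locations that share a name.
id_cols = [c for c in cause_raw.columns if c.endswith("_id") and c != "location_id"]
cause = cause_raw.drop(columns=id_cols)

# Give the remaining columns shorter names.
cause = cause.rename(columns={
    "measure_name": "measure",
    "location_name": "location",
    "age_name": "age_group",
    "cause_name": "cause",
})
print(list(cause.columns))
```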

Next, let's melt the 'val' / 'lower' / 'upper' columns into an 'estimate' and a 'value' column.
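The reshaping can be sketched with `DataFrame.melt`. The input table and its column names are again assumed for illustration.

```python
import pandas as pd

# One made-up row in wide form: three estimate columns per record.
cause = pd.DataFrame({
    "location_id": [1],
    "location": ["Global"],
    "age_group": ["Post Neonatal"],
    "cause": ["Diarrheal diseases"],
    "measure": ["Prevalence"],
    "val": [100.0],
    "lower": [80.0],
    "upper": [120.0],
})

estimate_cols = ["val", "lower", "upper"]
id_vars = [c for c in cause.columns if c not in estimate_cols]

# Long form: one row per (record, estimate) pair.
cause_long = cause.melt(
    id_vars=id_vars,
    value_vars=estimate_cols,
    var_name="estimate",
    value_name="value",
)
print(cause_long[["measure", "estimate", "value"]])
```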

Lastly, let's separate the prevalence numbers ($P_c$) from the other measures ($YLL_c$, $YLD_c$, $DALY_c$).
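One way to do the split is with a boolean mask, as sketched below. The measure labels ("Prevalence", "YLD", ...) and values are assumed; the labels in the downloaded data may be longer.

```python
import pandas as pd

# Made-up long-form cause table with assumed measure labels.
cause_long = pd.DataFrame({
    "measure": ["Prevalence", "YLD", "YLL", "DALY"],
    "cause": ["Diarrheal diseases"] * 4,
    "estimate": ["val"] * 4,
    "value": [100.0, 10.0, 30.0, 40.0],
})

is_prevalence = cause_long["measure"] == "Prevalence"

# P_c: the measure column is now redundant, so drop it.
prevalence = (
    cause_long[is_prevalence]
    .drop(columns="measure")
    .rename(columns={"value": "prevalence"})
)

# YLL_c, YLD_c, DALY_c stay together in one table.
burden = cause_long[~is_prevalence].reset_index(drop=True)
```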

Risk data

We repeat the same procedure for the risk data.

Calculating risk-specific prevalence

Before we can apply the formulae from the introduction, we need to match the location- and age group-specific cause- and risk-data. First, we match $YLL_c$, $YLD_c$, $DALY_c$ with $YLL_{c,r}$, $YLD_{c,r}$, $DALY_{c,r}$:

Next, we match the prevalence numbers $P_c$:

Now we can apply the formulae. Note that the distinction between the measures (YLD, YLL, DALY) is implicit in the table rows, so we do not need to account for it separately.
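The two matching steps and the formula can be sketched with `DataFrame.merge`. Table layouts, column names and values are assumptions for illustration, and the sketch uses 'val' rows only, pairing estimates like with like.

```python
import pandas as pd

base = {
    "location_id": [1, 1, 1],
    "age_group": ["Post Neonatal"] * 3,
    "cause": ["Diarrheal diseases"] * 3,
    "measure": ["YLD", "YLL", "DALY"],
    "estimate": ["val"] * 3,
}
# Made-up values: total burden per cause and risk-attributable burden.
burden_c = pd.DataFrame({**base, "value": [10.0, 30.0, 40.0]})
burden_cr = pd.DataFrame({
    **base,
    "risk": ["Discontinued breastfeeding"] * 3,
    "value": [2.0, 3.0, 5.0],
})
prevalence = pd.DataFrame({
    "location_id": [1],
    "age_group": ["Post Neonatal"],
    "cause": ["Diarrheal diseases"],
    "estimate": ["val"],
    "prevalence": [100.0],
})

# Step 1: match risk-specific burden with total burden of the same measure.
keys = ["location_id", "age_group", "cause", "measure", "estimate"]
merged = burden_cr.merge(burden_c, on=keys, suffixes=("_cr", "_c"), validate="many_to_one")

# Step 2: attach the cause prevalence (it has no measure column).
prev_keys = ["location_id", "age_group", "cause", "estimate"]
merged = merged.merge(prevalence, on=prev_keys, validate="many_to_one")

# Step 3: one formula covers YLD, YLL and DALY because each row
# already carries its own measure.
merged["rsp"] = merged["prevalence"] * merged["value_cr"] / merged["value_c"]
print(merged[["measure", "rsp"]])
```

Passing `validate="many_to_one"` makes pandas raise an error if the cause table contains duplicate keys, which would otherwise silently multiply rows.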

YLD vs. YLL vs. DALY

Let's have a look at the result. We filter for the 'Global' prevalence of 'Diarrheal diseases' within the 'Post Neonatal' age group attributed to the risk factor 'Discontinued breastfeeding'. To keep it simple, we inspect 'val' estimates only.
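Such a filter can be sketched with a combined boolean mask. The table `rsp` and its values are made up; only the filter labels are taken from the text.

```python
import pandas as pd

# Made-up result table; only the labels mirror the text.
rsp = pd.DataFrame({
    "location": ["Global"] * 4,
    "age_group": ["Post Neonatal"] * 4,
    "cause": ["Diarrheal diseases"] * 4,
    "risk": ["Discontinued breastfeeding"] * 4,
    "measure": ["YLD", "YLL", "DALY", "YLD"],
    "estimate": ["val", "val", "val", "lower"],
    "rsp": [17.0, 10.0, 10.5, 12.0],
})

mask = (
    (rsp["location"] == "Global")
    & (rsp["cause"] == "Diarrheal diseases")
    & (rsp["age_group"] == "Post Neonatal")
    & (rsp["risk"] == "Discontinued breastfeeding")
    & (rsp["estimate"] == "val")
)
selection = rsp[mask]
print(selection[["measure", "rsp"]])
```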

Corresponding to the three formulae, we get one estimate for each measure YLD, YLL and DALY. While YLL and DALY yield a similar value, using YLD yields an estimate that is about 70% higher. It is not known to the authors why the three measures yield different estimates. However, we can accept the differences if their variance is smaller than the variance already contained in the dataset (e.g. in the form of the estimates val, lower, upper). To confirm this, let's compare the distribution of risk-specific prevalences (our calculation result) to that of the unspecific cause prevalences (downloaded data). Note that both distributions are log-normal, so we take the log10 before plotting and calculating their variance:
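The variance comparison on the log10 scale can be sketched as follows. The two series are made-up stand-ins for the real columns, and the helper name `log10_stats` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two columns being compared.
cause_prev = pd.Series([1e3, 1e4, 1e5, 1e6])   # unspecific cause prevalences
rsp = pd.Series([1e2, 5e2, 2e3, 1e4])          # risk-specific prevalences


def log10_stats(values: pd.Series) -> tuple:
    """Return (mean, variance) of log10(values), ignoring non-positive entries."""
    positive = values[values > 0]   # log10 is undefined for zero or negative counts
    logs = np.log10(positive)
    return float(logs.mean()), float(logs.var())


cause_mean, cause_var = log10_stats(cause_prev)
rsp_mean, rsp_var = log10_stats(rsp)
print(f"cause: mean={cause_mean:.2f}, var={cause_var:.2f}")
print(f"rsp:   mean={rsp_mean:.2f}, var={rsp_var:.2f}")
```

Dropping non-positive values before taking the logarithm is a design choice of this sketch; whether the real data contain zeros is not stated here.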

As expected, total cause numbers are on average higher than risk-specific cause numbers. Apart from this, the variance in the risk-specific data is lower than in the cause data. This means our calculation did not introduce additional uncertainty (at least on average).

We can use our approach to estimate risk-specific cause prevalences without introducing additional uncertainty. But which of the three measures (YLD, YLL, DALY) should we use? Let's compare the three approaches:

All three measures yield the same distribution of risk-specific prevalences. This means, on average, they yield the same result. Nevertheless, values can vary significantly for individual locations, age groups or risk factors, as we saw before. To understand these individual differences better, let's look at the distribution of the relative deviations ($\dfrac{P_{c,r,YLD}-P_{c,r,YLL}}{P_{c,r,YLD}}$, $\dfrac{P_{c,r,YLD}-P_{c,r,DALY}}{P_{c,r,DALY}}$ and $\dfrac{P_{c,r,YLL}-P_{c,r,DALY}}{P_{c,r,YLL}}$) over all data points:
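A relative deviation of this kind can be sketched by pivoting the measures into columns. The sketch computes only the YLD versus YLL deviation, with made-up values and assumed column names.

```python
import pandas as pd

# Made-up risk-specific prevalences for two locations.
rsp_long = pd.DataFrame({
    "location_id": [1, 1, 1, 2, 2, 2],
    "measure": ["YLD", "YLL", "DALY"] * 2,
    "rsp": [20.0, 10.0, 12.5, 8.0, 8.0, 10.0],
})

# One row per data point, one column per measure.
wide = rsp_long.pivot(index="location_id", columns="measure", values="rsp")

# Relative deviation of the YLL estimate from the YLD estimate.
wide["dev_yld_yll"] = (wide["YLD"] - wide["YLL"]) / wide["YLD"]

# Summary statistics (mean, quartiles, extremes) over all data points.
print(wide["dev_yld_yll"].describe())
```

With the real data the pivot index would need to include every identifying column (location, age group, cause, risk factor, estimate), not just the location.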

The histogram and the summary statistics show that YLL and DALY yield very similar estimates. The average deviation between both approaches is 0.4%. Half of all data points show a deviation within ±1%. The maximum deviation belongs to a YLL estimate (we will use this as shorthand for "risk-specific prevalence estimate that is calculated using YLL as shown in the formulae above") that is 3 times higher than the corresponding DALY estimate.

Because YLL and DALY yield similar results, it is sufficient to compare YLD to YLL estimates. Here, deviations are larger and YLD yields 2% higher estimates on average. Half of the YLL values lie within a ±7% error band around the YLD estimates. The maximum deviation belongs to a YLD estimate that is more than 8 times larger than the corresponding YLL estimate. This result is consistent with our previous observation that the log-mean of the YLD data is higher than that of the YLL data.

In conclusion, using YLD to estimate risk-specific prevalence yields slightly higher estimates than using YLL or DALY. However, the uncertainty (mean variance) introduced by this deviation is still smaller than the uncertainty contained in the GBD data (in the form of val/lower/upper estimates).

Calculating exposure

Our ultimate goal is to calculate the number of infants exposed to diseases caused by suboptimal breastfeeding - i.e. exposure data. What we have so far are risk-specific prevalences (RSPs) - i.e. how many disease cases can be attributed to a certain risk factor. RSP data are different from exposure data in two ways:

1) RSP data do not include infants that are exposed to risk factors without getting sick. This results in an underestimation of exposure.

2) RSP data do not account for co-morbidity: A child exposed to non-exclusive breastfeeding may suffer from diarrhea and respiratory infection at the same time. Using the presented approach, such cases will show up as two prevalence counts and will result in an overestimation of exposure.

We can see that both effects work in opposite directions. Without further data it is not possible to determine which effect dominates. For now, we will assume that the second effect is slightly stronger. We account for this by using YLL data instead of YLD data, as it yields slightly lower estimates for risk-specific prevalence.

$ E_{r} = \sum_c P_{c,r} = \sum_c P_{c} * \dfrac{YLL_{c,r}}{YLL_{c}} $

For each of the three measures in our formula ($P_c$, $YLL_c$, $YLL_{c,r}$) we have three estimates in the GBD data (lower, upper and val). Consequently, we have 27 exposure estimates for each (location, age group, risk factor) tuple. To keep things simple, let's reduce these 27 values to three: min, mean and max.
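The 27 combinations and their reduction can be sketched with `itertools.product` for a single (location, age group, risk factor) tuple. All numbers are made up, and applying the same combination of estimate labels to every cause before summing is an assumption of this sketch.

```python
from itertools import product

ESTIMATES = ("lower", "val", "upper")

# cause -> measure -> estimate -> value (made-up numbers, not GBD values)
data = {
    "Diarrheal diseases": {
        "P": {"lower": 80.0, "val": 100.0, "upper": 120.0},
        "YLL_cr": {"lower": 2.0, "val": 3.0, "upper": 4.0},
        "YLL_c": {"lower": 25.0, "val": 30.0, "upper": 40.0},
    },
    "Lower respiratory infections": {
        "P": {"lower": 40.0, "val": 50.0, "upper": 60.0},
        "YLL_cr": {"lower": 1.0, "val": 2.0, "upper": 3.0},
        "YLL_c": {"lower": 10.0, "val": 20.0, "upper": 25.0},
    },
}

# 3 x 3 x 3 = 27 combinations of estimate labels for (P_c, YLL_cr, YLL_c).
exposures = []
for e_p, e_cr, e_c in product(ESTIMATES, repeat=3):
    # E_r = sum over causes of P_c * YLL_cr / YLL_c for this combination.
    total = sum(m["P"][e_p] * m["YLL_cr"][e_cr] / m["YLL_c"][e_c] for m in data.values())
    exposures.append(total)

exposure = {
    "min": min(exposures),
    "mean": sum(exposures) / len(exposures),
    "max": max(exposures),
}
print(len(exposures), exposure)
```

The minimum arises from pairing the lower prevalence and lower attributable YLL with the upper total YLL, and the maximum from the opposite pairing, so the resulting band is wider than any single estimate column suggests.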

At the end of our calculation, we have 2700 unique (location, age group, risk factor) combinations and three exposure estimates (min, mean, max) for each of them.

(Note: All values are for both sexes and for the year 2011. The two corresponding columns contain redundant data, which we kept for clarity.)