[R-bloggers] An introduction to Causal inference (and 2 more aRticles) |
- An introduction to Causal inference
- How You Measure Months Matters — A Lot. A Look At Two Implementations of KDA
- GRNN with Small Samples
An introduction to Causal inference Posted: 30 Nov 2019 04:00 AM PST [This article was first published on Fabian Dablander, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewell Wright, Judea Pearl, and others. We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the "ladder of causal inference", from association (seeing) to intervention (doing) to counterfactuals (imagining). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the do-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let's dive in!1 Correlation and CausationMesserli (2012) published a paper entitled "Chocolate Consumption, Cognitive Function, and Nobel Laureates" in The New England Journal of Medicine showing a strong positive relationship between chocolate consumption and the number of Nobel Laureates. I have found an even stronger relationship using updated data2, as visualized in the figure below.
Now, except for people in the chocolate business, it would be quite a stretch to suggest that increasing chocolate consumption would increase the number Nobel Laureates. Correlation does not imply causation because it does not constrain the possible causal relations enough. Hans Reichenbach (1956) formulated the common cause principle which speaks to this fact:
An in principle straightforward way to break this uncertainty is to conduct an experiment: we could, for example, force the citizens of Austria to consume more chocolate, and study whether this increases the number of Nobel laureates in the following years. Such experiments are clearly unfeasible, but even in less extreme settings it is frequently unethical, impractical, or impossible — think of smoking and lung cancer — to study an association experimentally. Causal inference provides us with tools that license causal statements even in the absence of a true experiment. This comes with strong assumptions. In the next section, we discuss the "causal hierarchy". The Causal HierarchyPearl (2019a) introduces a causal hierarchy with three levels — association, intervention, and counterfactuals — as well as three prototypical actions corresponding to each level — seeing, doing, and imagining. In the remainder of this blog post, we will tackle each level in turn. SeeingAssociation is on the most basic level; it makes us see that two or more things are somehow related. Importantly, we need to distinguish between marginal associations and conditional associations. The latter are the key building block of causal inference. The figure below illustrates these two concepts. If we look at the whole, aggregated data on the left we see that the continuous variables $X$ and $Y$ are positively correlated: an increase in values for $X$ co-occurs with an increase in values for $Y$. This relation describes the marginal association of $X$ and $Y$ because we do not care whether $Z = 0$ or $Z = 1$. On the other hand, if we condition on the binary variable $Z$, we find that there is no relation: $X \perp Y \mid Z$. (For more on marginal and conditional associations in the case of Gaussian distributions, see this blog post). In the next section, we discuss a powerful tool that allows us to visualize such dependencies. Directed Acyclic GraphsWe can visualize the statistical dependencies between the three variables using a graph. A graph is a mathematical object that consists of nodes and edges. In the case of Directed Acyclic Graphs (DAGs), these edges are directed. We take our variables $(X, Y, Z$) to be nodes in such DAG and we draw (or omit) edges between these nodes so that the conditional (in)dependence structure in the data is reflected in the graph. We will explain this more formally shortly. For now, let's focus on the relationship between the three variables. We have seen that $X$ and $Y$ are marginally dependent but conditionally independent given $Z$. It turns out that we can draw three DAGs that encode this fact; these are the first three DAGs in the figure below. $X$ and $Y$ are dependent through $Z$ in these graphs, and conditioning on $Z$ blocks the path between $X$ and $Y$. We state this more formally shortly. While it is natural to interpret the arrows causally, we do not do so here. For now, the arrows are merely tools that help us describe associations between variables. The figure above also shows a fourth DAG, which encodes a different set of conditional (in)dependence relations between $X$, $Y$, and $Z$. The figure below illustrates this: looking at the aggregated data we do not find a relation between $X$ and $Y$ — they are marginally independent — but we do find one when looking at the disaggregated data — $X$ and $Y$ are conditionally dependent given $Z$. A real-world example might help build intuition: Looking at people who are single and who are in a relationship as a separate group, being attractive ($X$) and being intelligent ($Y$) are two independent traits. This is what we see in the left panel in the figure above. Let's make the reasonable assumption that both being attractive and being intelligent are positively related with being in a relationship. What does this imply? First, it implies that, on average, single people are less attractive and less intelligent (see red data points). Second, and perhaps counter-intuitively, it implies that in the population of single people (and people in a relationship, respectively), being attractive and being intelligent are negatively correlated. After all, if the handsome person you met at the bar were also intelligent, then he would most likely be in a relationship! In this example, visualized in the fourth DAG, $Z$ is commonly called a collider. Suppose we want to estimate the association between $X$ and $Y$ in the whole population. Conditioning on a collider (for example, by only analyzing data from people who are not in a relationship) while computing the association between $X$ and $Y$ will lead to a different estimate, and the induced bias is known as collider bias. It is a serious issue not only in dating, but also for example in medicine. The simple graphs shown above are the building blocks of more complicated graphs. In the next section, we describe a tool that can help us find (conditional) independencies between sets of variables.3 $d$-separationFor large graphs, it is not obvious how to conclude that two nodes are (conditionally) independent. d-separation is a tool that allows us to check this algorithmically. We need to define some concepts:
With these definitions out of the way, we call two nodes $X$ and $Y$ $d$-separated by $\mathcal{L}$ if conditioning on all members in $\mathcal{L}$ blocks all paths between the two nodes. If this is your first encounter with $d$-separation, then this is a lot to wrap your head around. To get some practice, look at the graph on the left side. First, note that there are no marginal dependencies; this means that without conditioning or blocking nodes, any two nodes are connected by a path. For example, there is a path going from $X$ to $Y$ through $Z$, and there is a path from $V$ to $U$ going through $Y$ and $W$. However, there are a number of conditional independencies. For example, $X$ and $Y$ are conditionally independent given $Z$. Why? There are two paths from $X$ to $Y$: one through $Z$ and one through $W$. However, since $W$ is a collider on the path from $X$ to $Y$, the path is already blocked. The only unblocked path from $X$ to $Y$ is through $Z$, and conditioning on it therefore blocks all remaining open paths. Additionally conditioning on $W$ would unblock one path, and $X$ and $Y$ would again be associated. So far, we have implicitly assumed that conditional (in)dependencies in the graph correspond to conditional (in)dependencies between variables. We make this assumption explicit now. In particular, note that d-separation provides us with an independence model $\perp_{\mathcal{G}}$ defined on graphs. To connect this to our standard probabilistic independence model $\perp_{\mathcal{P}}$ defined on random variables, we assume the following Markov property: In words, we assume that if the nodes $X$ and $Y$ are d-separated by $Z$ in the graph $\mathcal{G}$, the corresponding random variables $X$ and $Y$ are conditionally independent given $Z$. This implies that all conditional independencies in the data are represented in the graph.4 Moreover, the statement above implies (and is implied by) the following factorization: where $\text{pa}^{\mathcal{G}}(X_i)$ denotes the parents of the node $X_i$ in graph $\mathcal{G}$ (see Peters, Janzing, & Schölkopf, p. 101). A node is a parent of another node if it has an outgoing arrow to that node; for example, $X$ is a parent of $Z$ and $W$ in the graph above. The above factorization implies that a node $X$ is independent of its non-descendants given its parents. d-separation is an extremely powerful tool. Until now, however, we have only looked at DAGs to visualize (conditional) independencies. In the next section, we go beyond seeing to doing. DoingWe do not merely want to see the world, but also change it. From this section on, we are willing to interpret DAGs causally. As Dawid (2009a) warns, this is a serious step. In merely describing conditional independencies — seeing — the arrows in the DAG played a somewhat minor role, being nothing but "incidental construction features supporting the $d$-separation semantics" (Dawid, 2009a, p. 66). In this section, we endow the DAG with a causal meaning and interpret the arrows as denoting direct causal effects. What is a causal effect? Following Pearl and others, we take an interventionist position and say that a variable $X$ has a causal influence on $Y$ if changing $X$ leads to changes in $Y$. This position is a very useful one in practice, but not everybody agrees with it (e.g., Cartwright, 2007). The figure below shows the observational DAGs from above (top row) as well as the manipulated DAGs (bottom row) where we have intervened on the variable $X$, that is, set the value of the random variable $X$ to a constant $x$. Note that setting the value of $X$ cuts all incoming causal arrows since its value is thereby determined only by the intervention, not by any other factors. As is easily verified with $d$-separation, the first three graphs in the top row encode the same conditional independence structure. This implies that we cannot distinguish them using only observational data. Interpreting the edges causally, however, we see that the DAGs have a starkly different interpretation. The bottom row makes this apparent by showing the result of an intervention on $X$. In the leftmost causal DAG, $Z$ is on the causal path from $X$ to $Y$, and intervening on $X$ therefore influences $Y$ through $Z$. In the DAG next, to it $Z$ is on the causal path from $Y$ to $X$, and so intervening on $X$ does not influence $Y$. In the third DAG, $Z$ is a common cause and — since there is no other path from $X$ to $Y$ — intervening on $X$ does not influence $Y$. For the collider structure in the rightmost DAG, intervening on $X$ does not influence $Y$ because there is no unblocked path from $X$ to $Y$. To make the distinction between seeing and doing, Pearl introduced the do-operator. While $p(Y \mid X = x)$ denotes the observational distribution, which corresponds to the process of seeing, $p(Y \mid do(X = x))$ corresponds to the interventional distribution, which corresponds to the process of doing. The former describes what values $Y$ would likely take on when $X$ happened to be $x$, while the latter describes what values $Y$ would likely take on when $X$ would be set to $x$. Computing causal effects$P(Y \mid do(X = x))$ describes the causal effect of $X$ on $Y$, but how do we compute it? Actually doing the intervention might be unfeasible or unethical — side-stepping actual interventions and still getting at causal effects is the whole point of this approach to causal inference. We want to learn causal effects from observational data, and so all we have is the observational DAG. The causal quantity, however, is defined on the manipulated DAG. We need to build a bridge between the observational DAG and the manipulated DAG, and we do this by making two assumptions. First, we assume that interventions are local. This means that if I set $X = x$, then this only influences the variable $X$, with no other direct influence on any other variable. Of course, intervening on $X$ will influence other variables, but only through $X$, not directly through us intervening. In colloquial terms, we do not have a "fat hand", but act like a surgeon precisely targeting only a very specific part of the DAG; we say that the DAG is composed of modular parts. We can encode this using the factorization property above: which we now interpret causally. The factors in the product are sometimes called causal Markov kernels; they constitute the modular parts of the system. Second, we assume that the mechanism by which variables interact do not change through interventions; that is, the mechanism by which a cause brings about its effects does not change whether this occurs naturally or by intervention (see e.g., Pearl, Glymour, & Jewell, p. 56). With these two assumptions in hand, further note that $p(Y \mid do(X = x))$ can be understood as the observational distribution in the manipulated DAG — $p_m(Y \mid X = x)$ — that is, the DAG where we set $X = x$. This is because after doing the intervention (which catapults us into the manipulated DAG), all that is left for us to do is to see its effect. Observe that the leftmost and rightmost DAG above remain the same under intervention on $X$, and so the interventional distribution $p(Y \mid do(X = x))$ is just the conditional distribution $p(Y \mid X = x)$. The middle DAGs require a bit more work: The first equality follows by definition. The second and third equality follow from the sum and product rule of probability. The last line follows from the assumption that the mechanism through which $X$ influences $Y$ is independent of whether we set $X$ or whether $X$ naturally occurs, that is, $p_{m}(Y = y \mid X = x, Z = z) = p(Y = y \mid X = x, Z = z)$, and the assumption that interventions are local, that is, $p_m(Z = z) = p(Z = z)$. Thus, the interventional distribution we care about is equal to the conditional distribution of $Y$ given $X$ when we adjust for $Z$. Graphically speaking, this blocks the path $X \leftarrow Z \leftarrow Y$ in the left middle graph and the path $X \leftarrow Z \rightarrow Y$ in the right middle graph. If there were a path $X \rightarrow Y$ in these two latter graphs, and if we would not adjust for $Z$, then the causal effect of $X$ on $Y$ would be confounded. For these simple DAGs, however, it is already clear from the fact that $X$ is independent of $Y$ given $Z$ that $X$ cannot have a causal effect on $Y$. In the next section, study a more complicated graph and look at confounding more closely. ConfoundingConfounding has been given various definitions over the decades, but usually denotes the situation where a (possibly unobserved) common cause obscures the causal relationship between two or more variables. Here, we are slightly more precise and call a causal effect of $X$ on $Y$ confounded if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$, which also implies that collider bias is a type of confounding. This occured in the middle two DAGs in the example above, as well as in the chocolate consumption and Nobel Laureates example at the beginning of the blog post. Confounding is the bane of observational data analysis. Helpfully, causal DAGs provide us with a tool to describe multivariate relations between variables. Once we have stated our assumptions clearly, the do-calculus further provides us with a means to know what variables we need to adjust for so that causal effects are unconfounded. We follow Pearl, Glymour, & Jewell (2016, p. 61) and define the backdoor criterion:
The key observation is that this (a) blocks all spurious, that is, non-causal paths between $X$ and $Y$, (b) leaves all directed paths from $X$ to $Y$ unblocked, and (c) creates no spurious paths. To see this action, let's again look at the DAG on the left. The causal effect of $Z$ on $U$ is confounded by $X$, because in addition to the legitimate causal path $Z \rightarrow Y \rightarrow W \rightarrow U$, there is also an unblocked path $Z \leftarrow X \rightarrow W \rightarrow U$ which confounds the causal effect. The backdoor criterion would have us condition on $X$, which blocks the spurious path and renders the causal effect of $Z$ on $U$ unconfounded. Note that conditioning on $W$ would also block this spurious path; however, it would also block the causal path $Z \rightarrow Y \rightarrow W \rightarrow U$. Before moving on, let's catch a quick breath. We have already discussed a number of very important concepts. At the lowest level of the causal hierarchy — association — we have discovered DAGs and $d$-separation as a powerful tool to reason about conditional (in)dependencies between variables. Moving to intervention, the second level of the causal hierarchy, we have satisfied our need to interpret the arrows in a DAG causally. Doing so required strong assumptions, but it allowed us to go beyond seeing and model the outcome of interventions. This hopefully clarified the notion of confounding. In particular, collider bias is a type of confounding, which has important practical implications: we should not blindly enter all variables into a regression in order to "control" for them, but think carefully about what the underlying causal DAG could look like. Otherwise, we might induce spurious associations. The concepts from causal inference can help us understand methodological phenomena that have been discussed for decades. In the next section, we apply the concepts we have seen so far to make sense of one such phenomenon: Simpson's Paradox. Example Application: Simpson's ParadoxThis section follows the example given in Pearl, Glymour, & Jewell (2016, Ch. 1) with slight modifications. Suppose you observe $N = 700$ patients who either choose to take a drug or not; note that this is not a randomized control trial. The table below shows the number of recovered patients split across sex. We observe that more men as well as more women recover when taking the drug (93% and 73%) compared to when not taking the drug (87% and 69%). And yet, when taken together, fewer patients who took the drug recovered (78%) compared to patients who did not take the drug (83%). This is puzzling — should a doctor prescribe the drug or not? To answer this question, we need to compute the causal effect that taking the drug has on the probability of recovery. As a first step, we draw the causal DAG. Suppose we know that women are more likely to take the drug, that being a woman has an effect on recovery more generally, and that the drug has an effect on recovery. Moreover, we know that the treatment cannot cause sex. This is a trivial yet crucial observation — it is impossible to express this in purely statistical language. Causal DAGs provide us with a tool to make such an assumption explicit; the graph below makes explicit that sex ($S$) is a common cause of both drug taking ($D$) and recovery ($R$). We denote $S = 1$ as being female, $D = 1$ as having chosen the drug, and $R = 1$ as having recovered. The left DAG is observational, while the right DAG indicates the intervention $do(D = d)$, where $d$ denotes taking the drug ($d = 0$) or not taking the drug ($d = 1$). We are interested in the probability of recovery if we would force everybody to take, or not take, the drug; we call the difference between these two probabilities the average causal effect. This is key: the do-operator is about populations, not individuals. Using it, we cannot make statements that pertain to the recovery of an individual patient; we can only refer to the probability of recovery as defined on populations of patients. We will discuss individual causal effects in the section on counterfactuals at the end of the blog post. Computing the average causal effect requires knowledge about the interventional distributions $p(R \mid do(D = 0))$ and $p(R \mid do(D = 1))$. As discussed above, these correspond to the conditional distribution in the manipulated DAG which is shown above on the right. The backdoor criterion tells us that the conditional distribution in the observational DAG will correspond to the interventional distribution when blocking the spurious path $D \leftarrow S \rightarrow R$. Using the adjustment formula we have derived above, we expand: In words, we first compute the benefit of taking the drug separately for men and women, and then we average the result by weighting it with the fraction of men and women in the population. This tells us that, if we force everybody to take the drug, about $82\%$ of people will recover. We can similarly compute the probability of recovery given we force all patients to not choose the drug: Therefore, taking the drug does indeed have a positive effect on recovery on average, and the doctor should prescribe the drug. Note that this conclusion heavily depended on the causal graph. While graphs are wonderful tools in that they make our assumptions explicit, these assumptions are — of course — not at all guaranteed to be correct. These assumptions are strong, stating that the graph must encode all causal relations between variables, and that there is no unmeasured confounding, something we can never guarantee in observational data. Let's look at a different example but with the exact same data. In particular, instead of the variable sex we look at the post-treatment variable blood pressure. This means we have measured blood pressure after the patients have taken the drug. Should a doctor prescribe the drug or not? Since blood pressure is a post-treatment variable, it cannot influence a patient's decision to take the drug or not. We draw the following causal DAG, which makes clear that the drug has an indirect effect on recovery through blood pressure, in addition to having a direct causal effect.5 From this DAG, we find that the causal effect of $D$ on $R$ is unconfounded. Therefore, the two causal quantities of interest are given by: This means that the drug is indeed harmful. In the general population (combined data), the drug has a negative effect. Suppose that the drug has a direct positive effect on recovery, but an indirect negative effect through blood pressure. If we only look at patients with a particular blood pressure, then only the drug's positive effect on recovery remains. However, since the drug does influence recovery negatively through blood pressure, it would be misleading to take the association between $D$ and $R$ conditional on $Z$ as our estimate for the causal effect. In contrast to the previous example, using the aggregate data is the correct way to analyze these data in order to estimate the average causal effect. So far, our treatment has been entirely model-agnostic. In the next section, we discuss Structural Causal Models (SCM) as the fundamental building block of causal inference. This will unify the previous two levels of the causal hierarchy — seeing and doing — as well as open up the third and final level: counterfactuals. Structural Causal ModelsIn this section, we discuss Structural Causal Models (SCM) as the fundamental building block of causal inference. SCMs relate causal and probabilistic statements. As an example, we specify: $X$ is a direct cause of $Y$ which it influences through the function $f()$, and the noise variables $\epsilon_X$ and $\epsilon_Y$ are assumed to be independent. In a SCM, we take each equation to be a causal statement, and we stress this by using the assignment symbol $:=$ instead of the equality sign $=$. Note that this is in stark contrast to standard regression models; here, we explicitly state our causal assumptions. As we will see below, Structural Causal Models imply observational distributions (seeing), interventional distributions (doing), as well as counterfactuals (imagining). Thus, they can be seen as the fundamental building block of this approach to causal inference. In the following, we restrict the class of Structural Causal Models by allowing only linear relationships between variables and assuming independent Gaussian error terms.6 As an example, take the following SCM (Peters, Janzing, & Schölkopf, 2017, p. 90): where $\epsilon_X, \epsilon_Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1)$ and $\epsilon_Z \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 0.1)$. Again, each line explicates the causal link variables. For example, we assume that $X$ has a direct causal effect on $Y$, that this effect is linear, and that it is obscured by independent Gaussian noise. The assumption of Gaussian errors induces a multivariate Gaussian distribution on $(X, Y, Z)$ whose independence structure is visualized in the leftmost DAG below. The middle DAG shows an intervention on $Z$, while the rightmost DAG shows an intervention on $X$. Recall that, as discussed above, intervening on a variable cuts all incoming arrows. At the first level of the causal hierarchy — association — we might ask ourselves: does $X$ or $Z$ predict $Y$ better? To illustrate the answer for our example, we simulate $n = 1000$ observations from the Structural Causal model: The figure below shows that $Y$ has a much stronger association with $Z$ than with $X$; this is because the standard deviation of the error $\epsilon_X$ is only a tenth of the standard deviation of the error $\epsilon_Z$. For prediction, therefore, $Z$ is the more relevant variable. But does $Z$ actually have a causal effect on $Y$? This is a question about intervention, which is squarely located at the second level of the causal hierarchy. With the knowledge of the underlying Structural Causal Model, we can easily simulate interventions in R and visualize their outcomes: The leftmost histogram below shows the marginal distribution of $Y$ when no intervention takes place. The histogram in the middle shows the marginal distribution of $Y$ in the manipulated DAG where we set $Z = 2$. Observe that, as indicated by the causal graph, $Z$ does not have a causal effect on $Y$ such that $p(Y \mid do(Z = 2)) = p(Y)$. The histogram on the right shows the marginal distribution of $Y$ in the manipulated DAG where we set $X = 2$. Clearly, then, $X$ has a causal effect on $Y$. While we have touched on it already when discussing Simpson's paradox, we now formally define the Average Causal Effect: which in our case equals one, as can also be seen from the Structural Causal Model. Thus, SCMs allow us to model the outcome of interventions.7 However, note again that this is strictly about populations, not individuals. In the next section, we see how SCMs can allow us to climb up to the final level of the causal hierarchy, moving beyond the average to define individual causal effects. CounterfactualsIn the Unbearable Lightness of Being, Milan Kundera has Tomáš ask himself:
To which he answers:
Kundera is describing, as Holland (1986, p. 947) put it, the "fundamental problem of causal inference", namely that we only ever observe one realization. If Tomáš chooses to stay with Tereza, then he cannot not choose to stay with Tereza. He cannot go back in time and revert his decision, living instead "everything as it comes, without warning". This does not mean, however, that Tomáš cannot assess afterwards whether his choice has been wise. In contrast, humans constantly evaluate mutually exclusive options, only one of which ever comes true; that is, humans reason counterfactually. To do this formally requires strong assumptions. The do-operator, introduced above, is too weak to model counterfactuals. This is because it operates on distributions that are defined on populations, not on individuals. We can define an average causal effect using the do-operator, but — unsurprisingly — it only ever refers to averages. Structural Causal Models allow counterfactual reasoning on the level of the individual. To see this, we use a simple example. Suppose we want to study the causal effect of grandma's treatment for the common cold ($T$) on the speed of recovery ($R$). Usually, people recover from the common cold in seven to ten days, but grandma swears she can do better with a simple intervention — we agree on doing an experiment. Assume we have the following SCM: where $\mu$ is an intercept, $\epsilon_T \sim \text{Bern}(0.50)$ indicates random assignment to either receive the treatment ($T = 1$) or not receive it ($T = 0$), and $\epsilon \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma)$. The SCM tells us that the direct causal effect of the treatment on how quickly patients recover from the common cold is $\beta$. This causal effect is obscured by individual error terms for each patient $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_N)$, which is are aggregates term for all the things left unmodelled (see this blog post for some history). In particular, $\epsilon_k$ summarizes all the things that have an effect on the speed of recovery for patient $k$. Once we have collected the data, suppose we find that $\mu = 7$, $\beta = -2$, and $\sigma = 2$. This does speak for grandma's treatment, since it shortens the recovery time by 2 days on average: Given the value for $\epsilon_k$, the Structural Causal Model is fully determined, and we may write $R(\epsilon_k)$ for the speed of recovery for patient $k$. To make this example more concrete, we simulate some data in R: We see that the first patient did not receive the treatment ($T = 0$), took about $R = 5.80$ days to recover from the common cold, and has a unique value $\epsilon_1 = 0.78$. Would this particular patient have recovered more quickly if we had given him grandma's treatment even though we did not? We denote this quantity of interest as $R_{T = 1}(\epsilon_1)$ to contrast it with the actually observed $R_{T = 0}(\epsilon_1)$. To compute this seemingly otherworldly quantity, we simply plug the value $T = 1$ and $\epsilon_1 = 0.78$ into our Structural Causal Model, which yields:
Using this, we can define the individual causal effect as: which in this example is equal to the average causal effect due to the linearity of the underlying SCM (Pearl, Glymour, & Jewell 2016, p. 106). In general, individual causal effects are not identified, and we have to resort to average causal effects.8
Answering the question of whether a particular patient would have recovered more quickly had we given him the treatment even though we did not give him the treatment seems almost fantastical. It is a cross-world statement: given what we have observed, we ask about what would have been if things had turned out different. It may strike you as a bit eerie to speak about different worlds. Peters, Janzing, & Schölkopf (2017, p. 106) state that it is "debatable whether this additional [counterfactual] information [encoded in the SCM] is useful." It certainly requires strong assumptions. More broadly, Dawid (2000) argues in favour of causal inference without counterfactuals, and he does not seem to have shifted his position in recent years. Yet if we want to design machines that can achieve human level reasoning, we need to endow them with counterfactual thinking (Pearl, 2019a). Moreover, many concepts that a relevant in legal and ethical domains, such as fairness (Kusner et al., 2017), require counterfactuals. Before we end, note that the graphical approach to causal inference outlined in this blog post is not the only game in town. The potential outcome framework for causal inference developed by Donald Rubin and others avoids graphical models and takes counterfactual quantities as primary. However, although starting from counterfactual statements that are defined at the individual level, it is my understand that most work that uses potential outcomes focuses on average causal effects. As outlined above, this only requires the second level of the causal hierarchy — doing — and are therefore much less contentious than individual causal effects, which sit at the top of the causal hierarchy. The graphical approach outlined in this blog post and the potential outcome framework are logically equivalent (Peters, Janzing, & Schölkopf, 2017, p. 125), and although there is quite some debate surrounding the two approaches, it is probably wise to be pragmatic and simply choose the tool that works best for a particular application. As Lauritzen (2004, p. 189) put it, he sees the
For further reading, I wholeheartedly recommend the textbooks by Pearl, Glymour, & Jewell (2016) as well as Peters, Janzing, & Schölkopf (2017). For bedtime reading, I can recommend Pearl & McKenzie (2018). Miguel Hernán teaches an excellent introductory online course on causal diagrams here. ConclusionIn this blog post, we have touched on several key concepts of causal inference. We have started with the puzzling observation that chocolate consumption and the number of Nobel Laureates are strongly positively related. At the lowest level of the causal ladder — association — we have seen how directed acyclic graphs can help us visualize conditional independencies, and how d-separation provides us with an algorithmic tool to check such independencies. Moving up to the second level — intervention — we have seen how the do-operator models populations under interventions. This helped us define confounding — the bane of observational data analysis — as occuring when $p(Y \mid X = x) \neq p(Y \mid do(X = x))$. This comes with the important observation that entering all variables into a regression in order to "control" for them is misguided; rather, we need to carefully think about the underlying causal relations lest we want to introduce bias by for example conditioning on a collider. The backdoor criterion provided us with a graphical way to assess whether an effect is confounded or not. Finally, we have seen that Structural Causal Models (SCMs) provide the building block from which observational and interventional distributions follow. SCMs further imply counterfactual statements, which sit at the top of the causal hierarchy. These allow us to move beyond the do-operator and average causal effects: they enable us to answer questions about what would have been if things had been different. I would like to thank Oisín Ryan and Sophia Crüwell for very helpful comments on this blog. References
Footnotes
To leave a comment for the author, please follow the link and comment on their blog: Fabian Dablander. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. This posting includes an audio/video/photo media file: Download Now |
How You Measure Months Matters — A Lot. A Look At Two Implementations of KDA Posted: 29 Nov 2019 09:03 PM PST [This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. This post will detail a rather important finding I found while implementing a generalized framework for momentum asset allocation backtests. Namely, that when computing momentum (and other financial measures for use in asset allocation, such as volatility and correlations), measuring formal months, from start to end, has a large effect on strategy performance. So, first off, I am in the job market, and am actively looking for a full-time role (preferably in New York City, or remotely), or a long-term contract. Here is my resume, and here is my LinkedIn profile. Furthermore, I've been iterating on my volatility strategy, and given that I've seen other services with large drawdowns, or less favorable risk/reward profiles charge $50/month, I think following my trades can be a reasonable portfolio diversification tool. Read about it and subscribe here. I believe that my body of work on this blog speaks to the viability of employing me, though I am also learning Python to try and port over my R skills over there, as everyone seems to want Python, and R much less so, hence the difficulty transferring between opportunities. Anyhow, one thing I am working on is a generalized framework for tactical asset allocation (TAA) backtests. Namely, those that take the form of "sort universe by momentum, apply diversification weighting scheme"–namely, the kinds of strategies that the folks over at AllocateSmartly deal in. I am also working on this framework and am happy to announce that as of the time of this writing, I will happily work with individuals that want more customized TAA backtests, as the AllocateSmartly FAQs state that AllocateSmartly themselves do not do custom backtests. The framework I am currently in the process of implementing is designed to do just that. However, after going through some painstaking efforts to compare apples to apples, I came across a very important artifact. Namely, that there is a fairly large gulf in performance between measuring months from their formal endpoints, as opposed to simply approximating months with 21-day chunks (E.G. 21 days for 1 month, 63 for 3, and so on). Here's the code I've been developing recently–the long story short, is that the default options essentially default to Adaptive Asset Allocation, but depending on the parameters one inputs, it's possible to get to something as simple as dual momentum (3 assets, invest in top 1), or as complex as KDA, with options to fine-tune it even further, such as to account for the luck-based timing that Corey Hoffstein at Newfound Research loves to write about (speaking of whom, he and the awesome folks at ReSolve Asset Management have launched a new ETF called ROMO–Robust Momentum–I recently bought a bunch in my IRA because a buy-it-and-forget-it TAA ETF is pretty fantastic as far as buy-and-hold investments go). Again, I set a bunch of defaults in the parameters so that most of them can be ignored. require(PerformanceAnalytics) require(quantmod) require(tseries) stratStats <- function(rets) { stats <- rbind(table.AnnualizedReturns(rets), maxDrawdown(rets)) stats[5,] <- stats[1,]/stats[4,] stats[6,] <- stats[1,]/UlcerIndex(rets) rownames(stats)[4] <- "Worst Drawdown" rownames(stats)[5] <- "Calmar Ratio" rownames(stats)[6] <- "Ulcer Performance Index" return(stats) } getYahooReturns <- function(symbols, return_column = "Ad") { returns <- list() for(symbol in symbols) { getSymbols(symbol, from = '1990-01-01', adjustOHLC = TRUE) if(return_column == "Ad") { return <- Return.calculate(Ad(get(symbol))) colnames(return) <- gsub("\\.Adjusted", "", colnames(return)) } else { return <- Return.calculate(Op(get(symbol))) colnames(return) <- gsub("\\.Open", "", colnames(return)) } returns[[symbol]] <- return } returns <- na.omit(do.call(cbind, returns)) return(returns) } symbols <- c("SPY", "VGK", "EWJ", "EEM", "VNQ", "RWX", "IEF", "TLT", "DBC", "GLD") returns <- getYahooReturns(symbols) canary <- getYahooReturns(c("VWO", "BND")) # offsets endpoints by a certain amount of days (I.E. 1-21) dailyOffset <- function(ep, offset = 0) { ep <- ep + offset ep[ep < 1] <- 1 ep[ep > nrow(returns)] <- nrow(returns) ep <- unique(ep) epDiff <- diff(ep) if(last(epDiff)==1) { # if the last period only has one observation, remove it ep <- ep[-length(ep)] } return(ep) } # computes total weighted momentum and penalizes new assets (if desired) compute_total_momentum <- function(yearly_subset, momentum_lookbacks, momentum_weights, old_weights, new_asset_mom_penalty) { empty_vec <- data.frame(t(rep(0, ncol(yearly_subset)))) colnames(empty_vec) <- colnames(yearly_subset) total_momentum <- empty_vec for(j in 1:length(momentum_lookbacks)) { momentum_subset <- tail(yearly_subset, momentum_lookbacks[j]) total_momentum <- total_momentum + Return.cumulative(momentum_subset) * momentum_weights[j] } # if asset returns are negative, penalize by *increasing* negative momentum # this algorithm assumes we go long only total_momentum[old_weights == 0] <- total_momentum[old_weights==0] * (1-new_asset_mom_penalty * sign(total_momentum[old_weights==0])) return(total_momentum) } # compute weighted correlation matrix compute_total_correlation <- function(data, cor_lookbacks, cor_weights) { # compute total correlation matrix total_cor <- matrix(nrow=ncol(data), ncol=ncol(data), 0) rownames(total_cor) <- colnames(total_cor) <- colnames(data) for(j in 1:length(cor_lookbacks)) { total_cor = total_cor + cor(tail(data, cor_lookbacks[j])) * cor_weights[j] } return(total_cor) } # computes total weighted volatility compute_total_volatility <- function(data, vol_lookbacks, vol_weights) { empty_vec <- data.frame(t(rep(0, ncol(data)))) colnames(empty_vec) <- colnames(data) # normalize weights if not already normalized if(sum(vol_weights) != 1) { vol_weights <- vol_weights/sum(vol_weights) } # compute total volrelation matrix total_vol <- empty_vec for(j in 1:length(vol_lookbacks)) { total_vol = total_vol + StdDev.annualized(tail(data, vol_lookbacks[j])) * vol_weights[j] } return(total_vol) } check_valid_parameters() { if(length(mom_weights) != length(mom_lookbacks)) { stop("Momentum weight length must be equal to momentum lookback length.") } if(length(cor_weights) != length(cor_lookbacks)) { stop("Correlation weight length must be equal to correlation lookback length.") } if(length(vol_weights) != length(vol_lookbacks)) { stop("Volatility weight length must be equal to volatility lookback length.") } } # computes weights as a function proportional to the inverse of total variance invVar <- function(returns, lookbacks, lookback_weights) { var <- compute_total_volatility(returns, lookbacks, lookback_weights)^2 invVar <- 1/var return(invVar/sum(invVar)) } # computes weights as a function proportional to the inverse of total volatility invVol <- function(returns, lookbacks, lookback_weights) { vol <- compute_total_volatility(returns, lookbacks, lookback_weights) invVol <- 1/vol return(invVol/sum(invVol)) } # computes equal weight portfolio ew <- function(returns) { return(StdDev(returns)/(StdDev(returns)*ncol(returns))) } # computes minimum minVol <- function(returns, cor_lookbacks, cor_weights, vol_lookbacks, vol_weights) { vols <- compute_total_volatility(returns, vol_lookbacks, vol_weights) cors <- compute_total_correlation(returns, cor_lookbacks, cor_weights) covs <- t(vols) %*% as.numeric(vols) * cors min_vol_rets <- t(matrix(rep(1, ncol(covs)))) min_vol_wt <- portfolio.optim(x=min_vol_rets, covmat = covs)$pw names(min_vol_wt) <- rownames(covs) return(min_vol_wt) } asset_allocator <- function(returns, canary_returns = NULL, # canary assets for KDA algorithm and similar mom_threshold = 0, # threshold momentum must exceed mom_lookbacks = 126, # momentum lookbacks for custom weights (EG 1-3-6-12) # weights on various momentum lookbacks (EG 12/19, 4/19, 2/19, 1/19) mom_weights = rep(1/length(mom_lookbacks), length(mom_lookbacks)), # repeat for correlation weights cor_lookbacks = mom_lookbacks, # correlation lookback cor_weights = rep(1/length(mom_lookbacks), length(mom_lookbacks)), vol_lookbacks = 20, # volatility lookback vol_weights = rep(1/length(vol_lookbacks), length(vol_lookbacks)), # number of assets to hold (if all above threshold) top_n = floor(ncol(returns)/2), # diversification weight scheme (ew, invVol, invVar, minVol, etc.) weight_scheme = "minVol", # how often holdings rebalance rebalance_on = "months", # how many days to offset rebalance period from end of month/quarter/year offset = 0, # penalize new asset mom to reduce turnover new_asset_mom_penalty = 0, # run Return.Portfolio, or just return weights? # for use in robust momentum type portfolios compute_portfolio_returns = TRUE, verbose = FALSE, # crash protection asset crash_asset = NULL, ... ) { # normalize weights mom_weights <- mom_weights/sum(mom_weights) cor_weights <- cor_weights/sum(cor_weights) vol_weights <- vol_weights/sum(vol_weights) # if we have canary returns (I.E. KDA strat), align both time periods if(!is.null(canary_returns)) { smush <- na.omit(cbind(returns, canary_returns)) returns <- smush[,1:ncol(returns)] canary_returns <- smush[,-c(1:ncol(returns))] empty_canary_vec <- data.frame(t(rep(0, ncol(canary_returns)))) colnames(empty_canary_vec) <- colnames(canary_returns) } # get endpoints and offset them ep <- endpoints(returns, on = rebalance_on) ep <- dailyOffset(ep, offset = offset) # initialize vector holding zeroes for assets empty_vec <- data.frame(t(rep(0, ncol(returns)))) colnames(empty_vec) <- colnames(returns) weights <- empty_vec # initialize list to hold all our weights all_weights <- list() # get number of periods per year switch(rebalance_on, "months" = { yearly_periods = 12}, "quarters" = { yearly_periods = 4}, "years" = { yearly_periods = 1}) for(i in 1:(length(ep) - yearly_periods)) { # remember old weights for the purposes of penalizing momentum of new assets old_weights <- weights # subset one year of returns, leave off first day return_subset <- returns[c((ep[i]+1):ep[(i+yearly_periods)]),] # compute total weighted momentum, penalize potential new assets if desired momentums <- compute_total_momentum(return_subset, momentum_lookbacks = mom_lookbacks, momentum_weights = mom_weights, old_weights = old_weights, new_asset_mom_penalty = new_asset_mom_penalty) # rank negative momentum so that best asset is ranked 1 and so on momentum_ranks <- rank(-momentums) selected_assets <- momentum_ranks <= top_n & momentums > mom_threshold selected_subset <- return_subset[, selected_assets] # case of 0 valid assets if(sum(selected_assets)==0) { weights <- empty_vec } else if (sum(selected_assets)==1) { # case of only 1 valid asset -- invest everything into it weights <- empty_vec + selected_assets } else { # apply a user-selected weighting algorithm # modify this portion to select more weighting schemes if (weight_scheme == "ew") { weights <- ew(selected_subset) } else if (weight_scheme == "invVol") { weights <- invVol(selected_subset, vol_lookbacks, vol_weights) } else if (weight_scheme == "invVar"){ weights <- invVar(selected_subset, vol_lookbacks, vol_weights) } else if (weight_scheme == "minVol") { weights <- minVol(selected_subset, cor_lookbacks, cor_weights, vol_lookbacks, vol_weights) } } # include all assets wt_names <- names(weights) if(is.null(wt_names)){wt_names <- colnames(weights)} zero_weights <- empty_vec zero_weights[wt_names] <- weights weights <- zero_weights weights <- xts(weights, order.by=last(index(return_subset))) # if there's a canary universe, modify weights by fraction with positive momentum # if there's a safety asset, allocate the crash protection modifier to it. if(!is.null(canary_returns)) { canary_subset <- canary_returns[c(ep[i]:ep[(i+yearly_periods)]),] canary_subset <- canary_subset[-1,] canary_mom <- compute_total_momentum(canary_subset, mom_lookbacks, mom_weights, empty_canary_vec, 0) canary_mod <- mean(canary_mom > 0) weights <- weights * canary_mod if(!is.null(crash_asset)) { if(momentums[crash_asset] > mom_threshold) { weights[,crash_asset] <- weights[,crash_asset] + (1-canary_mod) } } } all_weights[[i]] <- weights } # combine weights all_weights <- do.call(rbind, all_weights) if(compute_portfolio_returns) { strategy_returns <- Return.portfolio(R = returns, weights = all_weights, verbose = verbose) return(list(all_weights, strategy_returns)) } return(all_weights) } #out <- asset_allocator(returns, offset = 0) kda <- asset_allocator(returns = returns, canary_returns = canary, mom_lookbacks = c(21, 63, 126, 252), mom_weights = c(12, 4, 2, 1), cor_lookbacks = c(21, 63, 126, 252), cor_weights = c(12, 4, 2, 1), vol_lookbacks = 21, weight_scheme = "minVol", crash_asset = "IEF") The one thing that I'd like to focus on, however, are the lookback parameters. Essentially, assuming daily data, they're set using a *daily lookback*, as that's what AllocateSmartly did when testing my own KDA Asset Allocation algorithm. Namely, the salient line is this: "For all assets across all three universes, at the close on the last trading day of the month, calculate a "momentum score" as follows:(12 * (p0 / p21 – 1)) + (4 * (p0 / p63 – 1)) + (2 * (p0 / p126 – 1)) + (p0 / p252 – 1)Where p0 = the asset's price at today's close, p1 = the asset's price at the close of the previous trading day and so on. 21, 63, 126 and 252 days correspond to 1, 3, 6 and 12 months." So, to make sure I had apples to apples when trying to generalize KDA asset allocation, I compared the output of my new algorithm, asset_allocator (or should I call it allocate_smartly ?=] ) to my formal KDA asset allocation algorithm. Here are the results: KDA_algo KDA_approximated_months Annualized Return 0.10190000 0.08640000 Annualized Std Dev 0.09030000 0.09040000 Annualized Sharpe (Rf=0%) 1.12790000 0.95520000 Worst Drawdown 0.07920336 0.09774612 Calmar Ratio 1.28656163 0.88392257 Ulcer Performance Index 3.78648873 2.62691398 Essentially, the long and short of it is that I modified my original KDA algorithm until I got identical output to my asset_allocator algorithm, then went back to the original KDA algorithm. The salient difference is this part: # computes total weighted momentum and penalizes new assets (if desired) compute_total_momentum <- function(yearly_subset, momentum_lookbacks, momentum_weights, old_weights, new_asset_mom_penalty) { empty_vec <- data.frame(t(rep(0, ncol(yearly_subset)))) colnames(empty_vec) <- colnames(yearly_subset) total_momentum <- empty_vec for(j in 1:length(momentum_lookbacks)) { momentum_subset <- tail(yearly_subset, momentum_lookbacks[j]) total_momentum <- total_momentum + Return.cumulative(momentum_subset) * momentum_weights[j] } # if asset returns are negative, penalize by *increasing* negative momentum # this algorithm assumes we go long only total_momentum[old_weights == 0] <- total_momentum[old_weights==0] * (1-new_asset_mom_penalty * sign(total_momentum[old_weights==0])) return(total_momentum) } Namely, the part that further subsets the yearly subset by the lookback period, in terms of days, rather than monthly endpoints. Essentially, the difference in the exact measurement of momentum–that is, the measurement that explicitly selects *which* instruments the algorithm will allocate to in a particular period, unsurprisingly, has a large impact on the performance of the algorithm. And lest anyone think that this phenomena no longer applies, here's a yearly performance comparison. KDA_algo KDA_approximated_months 2008-12-31 0.1578348930 0.062776766 2009-12-31 0.1816957178 0.166017499 2010-12-31 0.1779839604 0.160781537 2011-12-30 0.1722014474 0.149143148 2012-12-31 0.1303019332 0.103579674 2013-12-31 0.1269207487 0.134197066 2014-12-31 0.0402888320 0.071784979 2015-12-31 -0.0119459453 -0.028090873 2016-12-30 0.0125302658 0.002996917 2017-12-29 0.1507895287 0.133514924 2018-12-31 0.0747520266 0.062544709 2019-11-27 0.0002062636 0.008798310 Of note: the variant that formally measures momentum from monthly endpoints consistently outperforms the one using synthetic monthly measurements. So, that will do it for this post. I hope to have a more thorough walk-through of the asset_allocator function in the very near future before moving onto Python-related matters (hopefully), but I thought that this artifact, and just how much it affects outcomes, was too important not to share. An iteration of the algorithm capable of measuring momentum with proper monthly endpoints should be available in the near future. Thanks for reading. To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. |
Posted: 29 Nov 2019 10:06 AM PST [This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. After a bank launches a new product or acquires a new portfolio, the risk modeling team would often be faced with a challenge of how to estimate the corresponding performance, e.g. risk or loss, with a limited number of data points conditional on business drivers or macro-economic indicators. For instance, it is required to project the 9-quarter loss in CCAR, regardless of the portfolio age. In such cases, the prevalent practice based upon conventional regression models might not be applicable given the requirement for a sufficient number of samples in order to draw the statistical inference. As a result, we would have to rely on the input of SME (Subject Matter Expert), to gauge the performance based on similar products and portfolios, or to fall back on simple statistical metrics such as Average or Median that can't be intuitively related to predictors. With the GRNN implemented in the YAGeR project (https://github.com/statcompute/yager), it is however technically feasible to project the expected performance conditional on predictors due to the fact that the projected Y_i of a future case is determined by the distance between the predictor vector X_i and each X vector in the training sample, subject to a smoothing parameter namely Sigma. While more samples in the training data are certainly helpful to estimate a generalizable model, a couple data points, e.g. even only one or two data points in the extreme case, are also conceptually sufficient to form a GRNN that is able to generate sensible projections without violating statistical assumptions. Following are a couple practical considerations.
Below is an example of using 100 data points as the training sample to predict LGD within the unity interval of 1,000 cases with both GLM and GRNN. Out of 100 trials, while the GLM only outperformed the simple average for 32 times, the GRNN was able to do better for 76 times. To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. This posting includes an audio/video/photo media file: Download Now |
You are subscribed to email updates from R-bloggers. To stop receiving these emails, you may unsubscribe now. | Email delivery powered by Google |
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, United States |
Comments
Post a Comment