Abstract

Measuring the causal effects of digital advertising remains challenging despite the availability of granular data. Unobservable factors make exposure endogenous, and advertising’s effect on outcomes tends to be small. In principle, these concerns could be addressed using randomized controlled trials (RCTs). In practice, few online ad campaigns rely on RCTs, and instead use observational methods to estimate ad effects. We assess empirically whether the variation in data typically available in the advertising industry enables observational methods to recover the causal effects of online advertising. This analysis is of particular interest because of recent, large improvements in observational methods for causal inference (Imbens and Rubin 2015). Using data from 15 US advertising experiments at Facebook comprising 500 million user-experiment observations and 1.6 billion ad impressions, we contrast the experimental results to those obtained from multiple observational models. The observational methods often fail to produce the same effects as the randomized experiments, even after conditioning on extensive demographic and behavioral variables. We also characterize the incremental explanatory power our data would require to enable observational methods to successfully measure advertising effects. Our findings suggest that commonly used observational approaches based on the data usually available in the industry often fail to accurately measure the true effect of advertising.
Keywords: Digital Advertising, Field Experiments, Causal Inference, Observational Methods, Advertising Measurement.

* To maintain privacy, no data contained personally identifiable information that could identify consumers or advertisers. We thank Daniel Slotwiner, Gabrielle Gibbs, Joseph Davin, Brian d’Alessandro, and Fangfang Tan at Facebook. We are grateful to Garrett Johnson, Randall Lewis, Daniel Zantedeschi, and seminar participants at Bocconi, CKGSB, Columbia, eBay, ESMT, Facebook, FTC, HBS, LBS, Northwestern, QME, Temple, UC Berkeley, UCL, NBER Digitization, NYU Big Data Conference, and ZEW for helpful comments and suggestions. We particularly thank Meghan Busse for extensive comments and editing suggestions. Gordon and Zettelmeyer have no financial interest in Facebook and were not compensated in any way by Facebook or its affiliated companies for engaging in this research. E-mail addresses for correspondence: b-gordon@kellogg.northwestern.edu, f-zettelmeyer@kellogg.northwestern.edu, nehab@fb.com, chapsky@fb.com.
1 Introduction
Digital advertising spending exceeded television ad spending for the first time in 2017.^1 Advertising is a critical funding source for internet content and services (Benady 2016). As advertisers have shifted more of their ad expenditures online, demand has grown for online ad effectiveness measurement: advertisers routinely access granular data that link ad exposures, clicks, page visits, online purchases, and even offline purchases (Bond 2017). However, even with these data, measuring the causal effect of advertising remains challenging for at least two reasons. First, individual-level outcomes are volatile relative to ad spending per customer, such that advertising explains only a small amount of the variation in outcomes (Lewis and Reiley 2014, Lewis and Rao 2015). Second, even small amounts of advertising endogeneity (e.g., likely buyers are more likely to be exposed to the ad) can severely bias causal estimates of its effectiveness (Lewis, Rao, and Reiley 2011).

In principle, using large-scale randomized controlled trials (RCTs) to evaluate advertising effectiveness could address these concerns.^2 In practice, however, few online ad campaigns rely on RCTs (Lavrakas 2010). Reasons range from the technical difficulty of implementing experimentation in ad-targeting engines to the commonly held view that such experimentation is expensive and often unnecessary relative to alternative methods (Gluck 2011). Thus, many advertisers and leading ad-measurement companies rely on observational methods to estimate advertising’s causal effect (Abraham 2008, comScore 2010, Klein and Wood 2013, Berkovich and Wood 2016).

Here, we assess empirically whether the variation in data typically available in the advertising industry enables observational methods to recover the causal effects of online advertising. To do so, we use a collection of 15 large-scale advertising campaigns conducted on Facebook as RCTs in 2015. We use this dataset to implement a variety of matching and regression-based methods and compare their results with those obtained from the RCTs. Earlier work to evaluate such observational models had limited individual-level data and considered a narrow set of models (Lewis, Rao, and Reiley 2011, Blake, Nosko, and Tadelis 2015).

A fundamental assumption underlying observational models is unconfoundedness: conditional on observables, treatment and (potential) outcomes are independent. Whether this assumption is true depends on the data-generating process, and in particular on the requirement that some random variation exists after conditioning on observables. In our context, (quasi-)random variation in exposure has at least three sources: user-level variation in visits to Facebook, variation in Facebook’s pacing of ad delivery over a campaign’s pre-defined window, and variation due to unrelated advertisers’ bids. All three forces induce randomness in the ad auction outcomes. However,

^1 https://www.recode.net/2017/12/4/16733460/2017-digital-ad-spend-advertising-beat-tv, accessed on April 7, 2018.
^2 A growing literature focuses on measuring digital ad effectiveness using randomized experiments. See, for example, Lewis and Reiley (2014), Johnson, Lewis, and Reiley (2016), Johnson, Lewis, and Reiley (2017), Kalyanam, McAteer, Marek, Hodges, and Lin (2018), Johnson, Lewis, and Nubbemeyer (2017a), Johnson, Lewis, and Nubbemeyer (2017b), Sahni (2015), Sahni and Nair (2016), and Goldfarb and Tucker (2011). See Lewis, Rao, and Reiley (2015) for a recent review.
advertising effects is that we do not observe all the data that Facebook uses to run its advertising platform. Motivated by this possibility, we conducted the following thought experiment: “Assuming ‘better’ data exist, how much better would that data need to be to eliminate the bias between the observational and RCT estimates?” This analysis, extending work by Rosenbaum and Rubin (1983a) and Ichino, Mealli, and Nannicini (2008), begins by simulating an unobservable that eliminates bias in the observational method. Next, we compare the explanatory power of this (simulated) unobservable with the explanatory power of our observables. Our results show that for some studies, we would have to obtain additional covariates that exceed the explanatory power of our full set of observables to recover the RCT estimates. These results represent the second contribution of our paper, which is to characterize the nature of the unobservable needed to use observational methods successfully to estimate ad effectiveness.

The third contribution of our paper is to the literature on observational versus experimental approaches to causal measurement. In his seminal paper, LaLonde (1986) compares observational methods with randomized experiments in the context of the economic benefits of employment and training programs. He concludes that “many of the econometric procedures do not replicate the experimentally determined results” (p. 604). Since then, we have seen significant improvements in observational methods for causal inference (Imbens and Rubin 2015). In fact, Imbens (2015) shows that an application of these improved methods to the LaLonde (1986) dataset manages to replicate the experimental results. In the job-training setting in LaLonde (1986), observational methods needed to adjust for the fact that the characteristics of trainees differed from those of a comparison group drawn from the population. Because of targeting, the endogeneity problems associated with digital advertising are potentially more severe: advertising exposure is determined by a sophisticated machine-learning algorithm using detailed data on individual user behavior. We explore whether the improvements in observational methods for causal inference, paired with large-sample, individual-level data, are sufficient to replicate experimental results in a large industry that relies on such methods.

We are not the first to attempt to estimate the performance of observational methods in gauging digital advertising effectiveness.^4 Lewis, Rao, and Reiley (2011) is the first paper to compare RCT estimates with results obtained using observational methods (comparing exposed versus unexposed users and regression). They faced the challenge of finding a valid control group of unexposed users: their experiment exposed 95% of all US-based traffic to the focal ad, leading them to use a matched sample of unexposed international users. Blake, Nosko, and Tadelis (2015) documents that non-experimental measurement can lead to highly suboptimal spending decisions for online search ads. However, in contrast to our paper, Blake, Nosko, and Tadelis (2015) use a difference-in-differences approach based on randomization at the level of 210 media markets as the experimental benchmark and therefore cannot implement individual-level causal inference methods.

This paper proceeds as follows.
We first describe the experimental design of the 15 advertising RCTs we analyze: how advertising works at Facebook, how Facebook implements RCTs, and what determines advertising exposure. In section 3, we introduce the potential-outcomes notation now standard for causal inference and relate it to the design of our RCTs. In section 4, we explain the set of observational methods we analyze. Section 5 presents the data generated by the 15 RCTs. Section 6 discusses identification and estimation issues and presents diagnostics. Section 7 shows the results for one example ad campaign in detail and summarizes findings for all remaining ad campaigns. Section 8 assesses the role of unobservables in reducing bias. Section 9 offers concluding remarks.

^4 Beyond digital advertising, other work assesses the effectiveness of marketing messages using both observational and experimental methods in the context of voter mobilization (Arceneaux, Gerber, and Green 2010) and water-usage reduction (Ferraro and Miranda 2014, Ferraro and Miranda 2017).
2 Experimental Design
Here we describe how Facebook conducts advertising campaign experiments. Facebook enables advertisers to run experiments to measure marketing-campaign effectiveness, test out different marketing tactics, and make more informed budgeting decisions.^5 We define the central measurement question, discuss how users are assigned to the test group, and highlight the endogenous sources of exposure to an ad.
We focus exclusively on campaigns in which the advertiser had a particular “direct response” outcome in mind, for example, to increase sales of a new product.^6 The industry refers to these as “conversion outcomes.” In each study, the advertiser measured conversion outcomes using a piece of Facebook-provided code (“conversion pixel”) embedded on the advertiser’s web pages, indicating whether a user visited that page.^7 Different placement of the pixels can measure different conversion outcomes. A conversion pixel embedded on a checkout-confirmation page, for example, measures a purchase outcome. A conversion pixel on a registration-confirmation page measures a registration outcome, and so on. These pixels allow the advertiser (and Facebook) to record conversions for users in both the control and test group and do not require the user to click on the ad to measure conversion outcomes.

Facebook’s ability to track users via a “single-user login” across devices and sessions represents a significant measurement advantage over more common cookie-based approaches. First, this approach helps ensure the integrity of the random assignment mechanism because a user’s assignment can be maintained persistently throughout the campaign and prevents control users from being inadvertently shown an ad. Second, Facebook can associate all exposures and conversions across

^5 Facebook refers to these ad tests as “conversion lift” tests (https://www.facebook.com/business/a/conversion-lift, accessed on April 7, 2018). Facebook provides this experimental platform as a free service to qualifying advertisers.
^6 We excluded brand-building campaigns in which outcomes are measured through consumer surveys.
^7 A “conversion pixel” refers to two types of pixels used by Facebook. One is traditionally called a “conversion pixel,” and the other is known as a “Facebook pixel.” The studies analyzed in this paper use both types, and they are equivalent for our purposes (https://www.facebook.com/business/help/460491677335370, accessed on April 7, 2018).
Figure 2: Example of three display ads for one campaign
Source: https://www.facebook.com/business/ads-guide
An experiment begins with the advertiser deciding which consumers to target with a marketing campaign, such as all women between 18 and 54. These targeting rules define the relevant set of users in the study. Each user is randomly assigned to the control or test group based on a proportion selected by the advertiser, in consultation with Facebook. Control-group members are never exposed to campaign ads during the study; those in the test group are eligible to see the campaign’s ads. Facebook avoids contaminating the control group with exposed users, due to its single-user login feature. Whether test-group users are ultimately exposed to the ads depends on factors such as whether the user accessed Facebook during the study period (we discuss these factors and their implications in the next subsection). Thus, we observe three user groups: control-unexposed, test-unexposed, and test-exposed.

Next, we consider what ads the control group should be shown in place of the advertiser’s campaign. This choice defines the counterfactual of interest. To evaluate campaign effectiveness, an advertiser requires the control condition to estimate the outcomes that would have occurred without the campaign. Thus, the control-condition ads should be the ads that would have been served if the advertiser’s campaign had not been run on Facebook. We illustrate this process using a hypothetical, stylized example in Figure 3. Consider two users in the test and control groups. Suppose that at a given moment, Jasper’s Market wins the auction to display an impression for the test-group user, as seen in Figure 3a. Imagine the control-group user, who occupies a parallel world to that of the test user, would have been served the same ad had this user been in the test group. However, the platform, recognizing the user’s assignment to the control group, prevents the focal ad from appearing. As Figure 3b shows, the auction’s second-place ad is instead served to the control user because that user would have won the auction if the focal ad had not existed.
Figure 3: Determination of control ads in Facebook experiments
(a) Step 1: Determine that a user in the control would have been served the focal ad.
(b) Step 2: Serve the next ad in the auction.
Targeting-induced endogeneity
The targeting criteria for the campaign determine the pool of potential users who may be assigned to the test or control group at the start of the campaign. Although these criteria do not change once the campaign begins, modern advertising delivery systems optimize which users are shown ads. Multiple targeting objectives exist, with the most common being to maximize the number of impressions, the click-through rate, or purchases. As a campaign progresses, the delivery system learns which types of users are most likely to meet the objective, and gradually the system starts to favor showing ads to users it expects are most likely to meet the objective. To implement this, the delivery system upweights or downweights the auction bids of different types of users within the target group. As a result, conditional on the advertiser’s bid, the probability of exposure increases or decreases for different users. Assessing ad effectiveness by comparing exposed versus unexposed consumers will, therefore, overstate the effectiveness of advertising because exposed users were specifically chosen based on their higher conversion rates. In general, this mechanism will lead to upwardly biased ad effects, but there are cases where the bias could run in the opposite direction. One example is if the ad campaign is set to optimize for clicks but the advertiser still tracks purchases. Users who are more likely to click on an ad (so-called “clicky users”) may also be less likely to purchase the product.

Note that the implementation of this system at Facebook does not invalidate experimentation, because the upweighting or downweighting of bids is applied equally to users in the test and control group. Some users in the test group may become more likely to see the ad if the system observes similar users converting in the early stages of the campaign. The key point is that the same process occurs for users in the control group: the focal ad will receive more weight in the auction for these users and might win the auction more frequently—except that, for members of the control group, the focal ad is replaced “at the last moment” by the runner-up and is thus never shown. As a result, the control group remains a valid counterfactual for outcomes in the treatment group, even under ad-targeting optimization.
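A minimal simulation makes the direction of this selection effect concrete. The delivery rule below is a stylized stand-in, not Facebook’s actual system, and every parameter is invented for illustration: exposure probability rises with a latent conversion propensity, the ad has no causal effect at all, and yet the naive exposed-versus-unexposed comparison comes out positive.

```python
# Stylized sketch (not Facebook's system): delivery that favors likely
# converters makes the naive exposed-vs-unexposed comparison overstate
# a true effect of zero.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Latent conversion propensity, observed by the hypothetical delivery system.
propensity_to_convert = rng.beta(2, 20, size=n)

# Exposure probability rises with conversion propensity (targeting).
p_exposed = np.clip(5 * propensity_to_convert, 0, 1)
exposed = rng.random(n) < p_exposed

# The ad has NO causal effect: conversion depends only on the latent propensity.
converted = rng.random(n) < propensity_to_convert

naive_diff = converted[exposed].mean() - converted[~exposed].mean()
print(f"True effect: 0.0000, naive estimate: {naive_diff:.4f}")
# The naive estimate is positive purely because of targeting-induced selection.
```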
Competition-induced endogeneity
Ads are delivered if the advertiser wins the auction for a particular impression. Winning the auction implies the advertiser outbid other advertisers competing for the same impression. Therefore, an advertiser’s ads are more likely to be shown to users the advertiser values highly, most often those with a higher expected conversion probability. Even if an advertiser’s actions do not produce any selection bias, the advertiser can nevertheless end up with selection bias in exposures because of what another advertiser does. For example, if, during the campaign period, another advertiser bids high on 18-54-year-old women who are also mothers, the likelihood increases that mothers will not be exposed to the focal campaign. A case that could lead to downward bias is when other firms sell complementary products and target the same users as a focal advertiser. If these firms win impressions at the expense of the focal advertiser, and obtain some conversions as a result, the resulting set of unexposed users may now be more likely to buy the focal firm’s product.
In the RCT, we address potential selection bias by leveraging the random-assignment mechanism and information on whether a user receives treatment. For the observational models, we discard the randomized control group and address the selection bias by relying solely on the treatment status and observables in the test group.
3 Analysis of the RCT
We use the potential-outcomes notation now standard in the literature on experimental and non-experimental program evaluation. Our exposition in this section and the next draws heavily on material in Imbens (2004), Imbens and Wooldridge (2009), and Imbens and Rubin (2015).
Each ad study contains $N$ individuals (units) indexed by $i = 1, \ldots, N$ drawn from an infinite population of interest. Individuals are randomly assigned to test or control conditions through $Z_i \in \{0, 1\}$. Exposure to ads is given by the indicator $W_i(Z_i) \in \{0, 1\}$. Users assigned to the control condition are never exposed to any ads from the study, $W_i(Z_i = 0) = 0$. However, assignment to the test condition does not guarantee a user is exposed, such that $W_i(Z_i = 1) \in \{0, 1\}$ is an endogenous outcome. We observe a set of covariates $X_i \in \mathcal{X} \subset \mathbb{R}^P$ for each user that are unaffected by the experiment. We do not index any variable by a study-specific subscript, because all analysis takes place within a study. Given an assignment $Z_i$ and a treatment $W_i(Z_i)$, the potential outcomes are $Y_i(Z_i, W_i(Z_i)) \in \{0, 1\}$. Under one-sided noncompliance, the observed outcome is
$$
Y_i^{\text{obs}} = Y_i(Z_i, W_i^{\text{obs}}) = Y_i(Z_i, W_i(Z_i)) =
\begin{cases}
Y_i(0, 0), & \text{if } Z_i = 0,\; W_i^{\text{obs}} = 0 \\
Y_i(1, 0), & \text{if } Z_i = 1,\; W_i^{\text{obs}} = 0 \\
Y_i(1, 1), & \text{if } Z_i = 1,\; W_i^{\text{obs}} = 1
\end{cases}
$$
We designate the observed values $Y_i^{\text{obs}}$ and $W_i^{\text{obs}}$ to help distinguish them from their potential outcomes.

Valid inference requires several standard assumptions. First, a user can receive only one version of the treatment, and a user’s treatment assignment does not interfere with another user’s outcomes. This pair of conditions is commonly known as the Stable Unit Treatment Value Assumption (SUTVA), a term coined in Rubin (1978). Our setting likely satisfies both conditions. Facebook’s ability to track individuals prevents the platform from inadvertently showing the wrong treatment to a given user. Non-interference could be violated if, for example, users in the test group share ads with users in the control group. However, users are unaware of both the existence of the experiment and their assignment status. Moreover, if test users shared ads with control users on Facebook, we would be able to observe those impressions.^9

^9 If test users showed control users the ads, the treatment-effect estimates would be conservative because it might inflate the conversion rate in the control group.
Note that the ATT is inherently conditional on the set of users who end up being exposed (or treated) in a particular experiment. As different experiments target individuals using different $X$'s, the interpretation of the ATT varies across experiments. Imbens and Angrist (1994) show the ATT can be expressed in an IV framework, relying on the exclusion restriction. The ATT is the ITT effect on the outcome, divided by the ITT effect on the receipt of treatment:

$$\tau = \frac{ITT_Y}{ITT_W}.$$
With full compliance in the control, such that $W_i(0) = 0$ for all users, and complete randomization of $Z_i$, the denominator simplifies to $ITT_W = E[W(1)]$, or the proportion in the test group who take up the treatment. In summary, we go from ITT to ATT by using the (exogenous) treatment assignment $Z$ as an instrument for (endogenous) exposure $W$.

An intuitive way to derive the relationship between the ITT and the ATT is to decompose the ITT outcome effect for the entire sample as the weighted average of the effects for two groups of users: compliers and noncompliers. Compliers are users assigned to the test condition who receive the treatment, $W_i(1) = 1$, and noncompliers are users assigned to the test condition who do not receive the treatment, $W_i(1) = 0$. The overall ITT effect can be expressed as

$$ITT_Y = ITT_{Y,co} \cdot \pi_{co} + ITT_{Y,nc} \cdot (1 - \pi_{co}), \qquad (6)$$

where $\pi_{co} = E[W(1)]$ is the share of compliers. The exclusion restriction assumes unexposed users have the same outcomes, regardless of whether they were in treatment or control, $Y_i(1, 0) = Y_i(0, 0)$. This implies $ITT_{Y,nc} = E[Y(1, 0) - Y(0, 0)] = 0$. Thus, $ITT_{Y,co}$ can be expressed as the ITT effect divided by the share of compliers,

$$\tau \equiv ATT \equiv ITT_{Y,co} = \frac{ITT_Y}{\pi_{co}}.$$

In a sense, scaling $ITT_Y$ by the inverse of $\pi_{co}$ “undilutes” the ITT effect according to the share of users who actually received treatment in the test group (the compliers). Imbens and Angrist (1994) refer to this quantity as the local average treatment effect (LATE) and demonstrate its relationship to IV with heterogeneous treatment effects. If the sample contains no “always-takers” and no “defiers,” which is true in our experimental design with one-sided noncompliance, the LATE is equal to the ATT.
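The scaling from ITT to ATT is easy to verify numerically. The sketch below simulates an RCT with one-sided noncompliance, with all parameters invented for illustration, and recovers the ATT as $ITT_Y / ITT_W$:

```python
# Sketch: recover the ATT from a simulated RCT with one-sided noncompliance,
# using tau = ITT_Y / ITT_W. All quantities are illustrative inventions.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

z = rng.random(n) < 0.5            # random assignment to the test group
complier = rng.random(n) < 0.3     # would be exposed if assigned to test
w = z & complier                   # one-sided noncompliance: W(0) = 0

# Potential outcomes: 2% baseline conversion, +1pp causal effect if exposed.
y = np.where(w, rng.random(n) < 0.03, rng.random(n) < 0.02)

itt_y = y[z].mean() - y[~z].mean() # ITT effect on the outcome
itt_w = w[z].mean()                # pi_co, the share of compliers
print(f"ATT = ITT_Y / ITT_W = {itt_y / itt_w:.4f} (truth: 0.0100)")
```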
To help summarize outcomes across advertising studies, we report most results in terms of lift, the incremental conversion rate among treated users expressed as a percentage:
$$\tau_{\ell} = \frac{\Delta\,\text{Conversion rate due to ads in the treated group}}{\text{Conversion rate of the treated group if they had not been treated}} = \frac{\tau}{E[Y^{\text{obs}} \mid Z = 1, W^{\text{obs}} = 1] - \tau}$$
The denominator is the estimated conversion rate of the treated group if they had not actually been treated. Reporting the lift facilitates comparison of advertising effects across studies because it normalizes the results according to the treated group’s baseline conversion rate, which can vary significantly with study characteristics (e.g., advertiser’s identity, outcome of interest). One downside of using lift is that differences between methods can seem large when the treated group’s baseline conversion rate is small. Other papers have compared advertising effectiveness across campaigns by calculating advertising ROI (Lewis and Rao 2015), but we lack the data on profit margins from sales to calculate ROI.^11

^11 Although ROI is a monotone transformation of lift, measuring the ROI in addition to lift would be useful because managerial decisions may rely on cutoff rules that involve ROI.
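Given an ATT estimate and the treated group’s observed conversion rate, the lift calculation is a one-liner; the numbers below are made up for illustration:

```python
# Lift = tau / (treated conversion rate - tau); illustrative numbers only.
tau = 0.0010                # estimated ATT (incremental conversion rate)
conv_treated = 0.0060       # E[Y_obs | Z = 1, W_obs = 1]
lift = tau / (conv_treated - tau)
print(f"Lift: {lift:.1%}")  # 0.0010 / 0.0050 -> 20.0%
```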
4 Observational Approaches
Here we present the observational methods we compare with estimates from the RCT. The following thought experiment motivates our analysis. Rather than conducting an RCT, an advertiser (or a third party acting on the advertiser’s behalf) followed customary practice by choosing a target sample and making all users eligible to see the ad. Although all users in the sample are eligible to see the ad, only a subsample is eventually exposed. To estimate the treatment effect, the advertiser compares the outcomes in the exposed group with the outcomes in the unexposed group. This approach is equivalent to creating a test sample without a control group held out.

We employ a set of methods that impose various degrees of structure to recover the treatment effects. Our goal is twofold: to ensure we cover the range of observational methods commonly used by academics and practitioners and to understand the extent to which more sophisticated techniques are potentially better at reducing the bias of estimates compared with RCT estimates. The observational methods we use rely on a combination of approaches: matching, stratification, and regression.^12

Both academics and practitioners rely on the methods we implement. In the context of measuring advertising effectiveness, matching methods appear in a variety of related academic work, such as comparing the efficacy of internet and TV ads for brand building (Draganska, Hartmann, and Stanglein 2014), measuring the effects of firm-generated social media on customer metrics (Kumar, Bezawada, Rishika, Janakiraman, and Kannan 2016), assessing whether access to digital video recorders (DVRs) affects sales of advertised products (Bronnenberg, Dubé, and Mela 2010), evaluating the effectiveness of pharmaceutical detailing (Rubin and Waterman 2007), estimating the impact of mutual fund name changes on subsequent investment inflows (Cooper, Gulen, and Rau 2005), and evaluating alcohol advertising targeted at adolescents (Ellickson, Collins, Hambarsoomians, and McCaffrey 2005). Industry-measurement vendors, such as comScore, Nielsen, and Nielsen Catalina Solutions, all rely on matching and regression methods to evaluate various marketing programs

^12 Researchers have recently developed more sophisticated methods for estimating causal effects (Imai and Ratkovic 2014), including those that blend insights from operations research (Zubizarreta 2012, Zubizarreta 2015) and machine learning (Athey, Imbens, and Wager forthcoming). We leave to future work to explore how these methods perform in recovering experimental ad effects.
which states that, conditional on $X_i$, potential outcomes are independent of treatment status. Alternatively, this assumption posits that no unobserved characteristics of individuals associated with the treatment and potential outcomes exist. This particular assumption is considered the most controversial and is untestable without an experiment. The second assumption is overlap, which requires a positive probability of receiving treatment for all values of the observables, such that
$$0 < \Pr(W_i = 1 \mid X_i) < 1, \quad \forall X_i \in \mathcal{X}.$$
Overlap can be assessed before and after adjustments are made to each group. Rosenbaum and Rubin (1983b) refer to the combination of the unconfoundedness and overlap assumptions as strong ignorability. The conditional probability of treatment given observables $X_i$ is known as the propensity score,

$$e(x) \equiv \Pr(W_i = 1 \mid X_i = x) \qquad (15)$$
Under strong ignorability, Rosenbaum and Rubin (1983b) establish that treatment assignment and the potential outcomes are independent, conditional on the propensity score,
$$(Y_i(0), Y_i(1)) \perp\!\!\!\perp W_i \mid e(X_i). \qquad (16)$$
Given two individuals with the same propensity scores, exposure status is as good as random. Thus, adjusting for the propensity score eliminates the bias associated with differences in the observables between treated and untreated individuals. This result is central to many of the observational methods widely employed in the literature.
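As a concrete illustration of the propensity score and the overlap check, the sketch below, on simulated data with an invented exposure rule, fits a logistic regression for $e(x)$ and compares the score distributions of exposed and unexposed users:

```python
# Sketch: estimate propensity scores with logistic regression and inspect
# overlap between exposed and unexposed users. Data are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 50_000, 5
X = rng.normal(size=(n, p))                           # user observables
w = rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] - 1)))  # exposure depends on X

e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]

# Overlap check: both groups should span a common range of e(x).
for label, grp in [("exposed", w), ("unexposed", ~w)]:
    q = np.quantile(e_hat[grp], [0.01, 0.5, 0.99])
    print(f"{label:9s} e(x) at 1%/50%/99%: {q.round(3)}")
```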
Exact matching (EM)
Matching is an intuitive method for estimating treatment effects under strong ignorability. To estimate the ATT, matching methods find untreated individuals similar to the treated individuals and use the outcomes from the untreated individuals to impute the missing potential outcomes for the treated individuals. The difference between the actual outcome and the imputed potential outcome is an estimate of the individual-level treatment effect, and averaging over treated individuals yields the ATT. This calculation highlights an appealing aspect of matching methods: they do not assume a particular form for an outcome model.

The simplest approach is to compare treated and untreated individuals who match exactly on a set of observables $X^{em} \subset X$. To estimate the treatment effect, for each exposed user $i$, we find the set of unexposed users $\mathcal{M}_i^c$ for whom $X_i^{em} = X_j^{em}$, $j \in \mathcal{M}_i^c$. For an exposed user, we observe $Y_i^{\text{obs}} = Y_i(1)$ and require an estimate of the potential outcome $Y_i(0)$. An estimate of this potential outcome is
$$\hat{Y}_i(0) = \frac{1}{|\mathcal{M}_i^c|} \sum_{j \in \mathcal{M}_i^c} Y_j^{\text{obs}}. \qquad (17)$$
The exact matching estimator for the ATT is

$$\hat{\tau}^{em} = \frac{1}{N_e} \sum_{i=1}^{N} W_i \left[ Y_i(1) - \hat{Y}_i(0) \right],$$

where $N_e = \sum_i W_i$ is the number of exposed users (in the test group).
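A minimal sketch of this estimator on simulated data, where a single confounder (an invented age bucket) drives both exposure and baseline conversion: the naive exposed-versus-unexposed comparison is biased, but matching within covariate cells recovers the built-in effect.

```python
# Sketch of exact matching: impute Y_i(0) for each exposed user from
# unexposed users in the same covariate cell. Simulated, illustrative data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200_000
df = pd.DataFrame({
    "age_bucket": rng.integers(0, 5, n),
    "gender": rng.integers(0, 2, n),
})
# Confounding: exposure and baseline conversion both rise with age_bucket.
df["w"] = rng.random(n) < 0.1 + 0.1 * df["age_bucket"]
base = 0.01 + 0.005 * df["age_bucket"]
df["y"] = (rng.random(n) < base + 0.01 * df["w"]).astype(int)

naive = df.loc[df["w"], "y"].mean() - df.loc[~df["w"], "y"].mean()

cells = ["age_bucket", "gender"]
y0_hat = df[~df["w"]].groupby(cells)["y"].mean().rename("y0_hat")
treated = df[df["w"]].join(y0_hat, on=cells)
att_em = (treated["y"] - treated["y0_hat"]).mean()

print(f"Naive: {naive:.4f}, exact-matching ATT: {att_em:.4f} (truth: 0.0100)")
```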
Propensity score matching (PSM)
Exact matching is only feasible using a small set of discrete observables. Generalizing to all the observables requires a similarity metric to compare treated and untreated individuals. One may match on all the observables $X_i$ using a distance metric, such as Mahalanobis distance (Rosenbaum and Rubin 1985). But this metric may not work well with a large number of covariates (Gu and Rosenbaum 1993) and can be computationally demanding. To overcome this limitation, perhaps the most common matching approach is based on the propensity score (Dehejia and Wahba 2002, Caliendo and Kopeinig 2008, Stuart 2010). Let $e(x; \phi)$ denote the model for the propensity score parameterized by $\phi$, the estimation of which we discuss in section 6.2. We match on the (estimated) log-odds ratio
$$\ell(x; \phi) = \ln\!\left( \frac{e(x; \phi)}{1 - e(x; \phi)} \right)$$
This transformation linearizes values on the unit interval and can improve estimation (Rubin 2001). To estimate the treatment effect, we find the $M$ unexposed users with the closest propensity scores to each exposed user. Matching is done with replacement because it can reduce bias, does not depend on the sort order of the data, and is less computationally burdensome. Let $m_{i,k}^c$ be the index of the (unexposed) control user that is the $k$th closest to exposed user $i$ based on $|e(x_{m_{i,k}^c}; \phi) - e(x_i; \phi)|$. The set $\mathcal{M}_i^c = \{m_{i,1}^c, m_{i,2}^c, \ldots, m_{i,M}^c\}$ contains the $M$ closest observations for user $i$. For an exposed user, we observe $Y_i^{\text{obs}} = Y_i(1)$ and require an estimate of the potential outcome $Y_i(0)$. An estimate of this potential outcome is

$$\hat{Y}_i(0) = \frac{1}{M} \sum_{j \in \mathcal{M}_i^c} Y_j^{\text{obs}}. \qquad (19)$$
The propensity score matching estimator for the ATT is

$$\hat{\tau}^{psm} = \frac{1}{N_e} \sum_{i=1}^{N} W_i \left[ Y_i(1) - \hat{Y}_i(0) \right].$$
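The sketch below implements one-nearest-neighbor matching with replacement on the estimated log-odds $\ell(x; \phi)$, i.e., $M = 1$; the data-generating process is simulated and the true ATT is built in for reference:

```python
# Sketch of propensity score matching on the log-odds ratio with M = 1,
# matching with replacement. Simulated, illustrative data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
n, p = 100_000, 5
X = rng.normal(size=(n, p))
w = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))       # exposure depends on X
y = rng.random(n) < np.clip(0.02 + 0.003 * X[:, 0] + 0.01 * w, 0, 1)

e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
log_odds = np.log(e_hat / (1 - e_hat))               # linearized score

# For each exposed user, find the closest unexposed user on the log-odds.
nn = NearestNeighbors(n_neighbors=1).fit(log_odds[~w].reshape(-1, 1))
_, idx = nn.kneighbors(log_odds[w].reshape(-1, 1))
y0_hat = y[~w][idx.ravel()]

att_psm = y[w].mean() - y0_hat.mean()
print(f"PSM ATT: {att_psm:.4f} (truth: 0.0100)")
```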
Stratification (STRAT)
The computational burden of matching on the propensity score can be further reduced by stratification on the estimated propensity score (also known as subclassification or blocking). After estimating the propensity score, the sample is divided into strata (or blocks) such that within each stratum, the estimated propensity scores are approximately constant.
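The paper’s exact blocking procedure (equation (23)) is not reproduced in this excerpt, so the helper below follows a standard recipe under that assumption: equal-frequency bins on the estimated propensity score, within-stratum mean differences, weighted by each stratum’s share of treated users. It can be applied to the `e_hat`, `w`, and `y` arrays from the previous sketch.

```python
# Sketch of stratification on the propensity score (a standard recipe;
# the paper's recursive blocking in equation (23) is not shown here).
import numpy as np

def stratified_att(e_hat, w, y, n_strata=10):
    """ATT from equal-frequency propensity-score strata."""
    edges = np.quantile(e_hat, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, e_hat, side="right") - 1,
                     0, n_strata - 1)
    att, n_treated = 0.0, w.sum()
    for j in range(n_strata):
        t = (strata == j) & w
        c = (strata == j) & ~w
        if t.any() and c.any():
            # Weight the within-stratum difference by the treated share.
            att += (y[t].mean() - y[c].mean()) * t.sum() / n_treated
    return att

# e.g., with the arrays from the PSM sketch: stratified_att(e_hat, w, y)
```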
Separate models could be estimated for each treatment level. Given our focus on the ATT, we estimate only $\mu_0(x)$ to predict counterfactual outcomes for the treated users. The most common approach is a linear model of the form $\mu_w(X_i; \beta_w) = \beta_w' X_i$, with flexible functions of $X_i$. Given some estimator $\mu_0(X_i; \hat{\beta}_0)$, the regression-adjustment (RA) estimate for the ATT is obtained through
$$\hat{\tau}^{ra} = \frac{1}{N_e} \sum_{i=1}^{N} W_i \left[ Y_i^{\text{obs}} - \mu_0(X_i; \hat{\beta}_0) \right]. \qquad (26)$$
Note the accuracy of this method depends on how well the covariate distribution for untreated users overlaps the covariate distribution for treated users. If the treated users have significantly different observables compared to untreated users, $\mu_0(X_i; \hat{\beta}_0)$ relies heavily on extrapolation, which is likely to produce biased estimates of the treatment effect in equation (26).
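A sketch of the RA estimator in equation (26): fit $\mu_0$ on unexposed users only, then average $Y_i^{\text{obs}} - \hat{\mu}_0(X_i)$ over exposed users. The data and coefficients below are simulated for illustration.

```python
# Sketch of regression adjustment (equation (26)): outcome model fit on
# unexposed users, residuals averaged over exposed users. Simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n, p = 100_000, 5
X = rng.normal(size=(n, p))
w = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))
y = (rng.random(n) < np.clip(0.02 + 0.003 * X[:, 0] + 0.01 * w, 0, 1)).astype(float)

mu0 = LinearRegression().fit(X[~w], y[~w])     # outcome model for untreated
att_ra = (y[w] - mu0.predict(X[w])).mean()     # eq. (26)
print(f"RA ATT: {att_ra:.4f} (truth: 0.0100)")
```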
Inverse-probability-weighted regression adjustment (IPWRA)
A variant of the RA estimator incorporates information in the propensity scores, borrowing from the insights found in inverse-probability-weighted estimators (Hirano, Imbens, and Ridder 2003). The estimated propensity scores are used to form weights to help control for correlation between treatment status and the covariates. This method belongs to a class of procedures that have the “doubly robust” property (Robins and Ritov 1997), which means the estimator is consistent even if one of the underlying models—either the propensity model or the outcome model—turns out to be misspecified. The inverse-probability-weighted regression adjustment (IPWRA) model estimates the exposure and outcome models simultaneously:
$$\min_{\{\phi, \beta_0\}} \; \sum_{i=1}^{N} (1 - W_i) \, \frac{\left( Y_i - \mu_0(X_i; \beta_0) \right)^2}{1 - e(X_i; \phi)}$$
Given the estimate $\hat{\beta}_0$ from the outcome model, equation (26) is once again used to calculate the treatment effect, $\hat{\tau}^{ipwra}$. In practice, the exposure model, outcome model, and ATT are estimated simultaneously using two-step GMM to obtain efficient estimates and robust standard errors (Wooldridge 2007).
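The joint GMM estimation is beyond a short sketch, so the following two-step approximation conveys the idea: estimate $e(x)$, reweight unexposed users toward the exposed covariate distribution (the $e/(1-e)$ odds weighting commonly used for the ATT, a deliberate substitution here rather than the paper’s exact objective), fit the outcome model by weighted least squares, and plug into equation (26). All data are simulated.

```python
# Sketch of the doubly robust idea behind IPWRA, as a two-step
# approximation of the joint estimation described above. Simulated data;
# the e/(1-e) weighting is a common ATT choice, not the paper's objective.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(6)
n, p = 100_000, 5
X = rng.normal(size=(n, p))
w = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))
y = (rng.random(n) < np.clip(0.02 + 0.003 * X[:, 0] + 0.01 * w, 0, 1)).astype(float)

e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
weights = e_hat[~w] / (1 - e_hat[~w])      # reweight controls toward treated
mu0 = LinearRegression().fit(X[~w], y[~w], sample_weight=weights)

att_ipwra = (y[w] - mu0.predict(X[w])).mean()
print(f"IPWRA-style ATT: {att_ipwra:.4f} (truth: 0.0100)")
```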
Stratification and Regression (STRATREG)
One problem with regression estimators, even those that weight by the inverse propensity scores, is that treatment effects can be sensitive to differences in the covariate distributions for the treated and untreated groups. If these distributions differ, these estimators rely heavily on extrapolation. A particularly flexible approach, advocated by Imbens (2015) and Eckles and Bakshy (2017), is to combine regression with stratification on the estimated propensity score. After estimating the propensity score, the sample is divided into strata with approximately constant estimated propensity scores. Regression on the outcome is used within each stratum to estimate the treatment effect and to correct for any remaining imbalance. The idea is that the covariate distribution within a stratum should be relatively balanced, so the within-stratum regression is less prone to extrapolate. Stratification follows the recursive procedure outlined after equation (23), with a regression within each stratum $j$ to estimate the stratum-specific ATT:
$$Y_i = \alpha_j + \tau_j^{stratreg} \cdot W_i + \beta_j' X_i + \varepsilon_i. \qquad (27)$$
As in equation (23), this method produces a set of $J$ estimates that can be averaged appropriately to calculate the ATT:

$$\hat{\tau}^{stratreg} = \sum_{j=1}^{J} \frac{N_j^T}{N^T} \cdot \tau_j^{stratreg}. \qquad (28)$$
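A sketch combining the two steps in equations (27) and (28): stratify on the estimated propensity score, run a within-stratum regression of the outcome on exposure and covariates, and average the stratum-level coefficients by treated share. Data and parameters are simulated for illustration.

```python
# Sketch of stratification plus within-stratum regression (eqs. (27)-(28)).
# Simulated, illustrative data; ten equal-frequency strata.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(7)
n, p = 100_000, 5
X = rng.normal(size=(n, p))
w = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))
y = (rng.random(n) < np.clip(0.02 + 0.003 * X[:, 0] + 0.01 * w, 0, 1)).astype(float)

e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
edges = np.quantile(e_hat, np.linspace(0, 1, 11))
strata = np.clip(np.searchsorted(edges, e_hat, side="right") - 1, 0, 9)

att, n_treated = 0.0, w.sum()
for j in range(10):
    s = strata == j
    if w[s].any() and (~w)[s].any():
        # Within-stratum regression of Y on W and X; W's coefficient is tau_j.
        Z = np.column_stack([w[s].astype(float), X[s]])
        tau_j = LinearRegression().fit(Z, y[s]).coef_[0]
        att += tau_j * w[s].sum() / n_treated
print(f"STRATREG ATT: {att:.4f} (truth: 0.0100)")
```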
The goal of each observational method we have discussed is to find and isolate the random variation that exists in the data, while conditioning on the endogenous variation. The latter is accomplished by matching on covariates (directly or via a propensity score), by controlling for covariates in an outcome model, or both.

A critique of these observational methods is that sophisticated ad-targeting systems aim for ad exposure that is deterministic and based on a machine-learning algorithm. In the limit, such ad-targeting systems would completely eliminate any random variation in exposure, in which case the observational methods we have discussed in section 4.2 would fail. As an example, consider propensity scoring. If we observed the exact data and structure used by the ad-targeting system, the propensity score distribution would collapse to discrete masses at 0 and 1. This is not surprising, because a deterministic exposure system implies that common support in observables between treated and untreated observations cannot exist. As a result, any matching method would fail, as would any regression approach that requires common support on observables.

If ad-targeting systems were completely deterministic, identification of causal effects would have to rely on alternative observational methods, for example, regression discontinuity (RD). If the ad-targeting rules were known, an RD design would identify users whose observables are very similar but who ended up triggering a different exposure decision by the ad-targeting system. In practice, implementing such an RD approach would require extensive collaboration with the advertising platform, because the advertiser would need to know the full data and structure used by the ad-targeting system. Given that advertisers avoid RCTs partially because RCTs require the collaboration of the platform, RD-type observational methods would be unlikely to become more popular. Moreover, RD-type observational methods are unlikely to overcome the problem that some platforms cannot implement RCTs: if a platform had the sophistication to run an RD design, it would probably also have the sophistication to implement RCTs.

As of now, ad-targeting systems have not eliminated all exogenous reasons a given person would be exposed to an ad campaign whereas a probabilistically equivalent person would not. As we discuss in detail in section 6.1, in our context, quasi-random variation in exposure has at