library(readr)
library(stargazer)
library(car)
df <- read_csv("KAG_conversion_data.csv")
Digital marketing is a field that did not exist until fairly recently. Today it is regarded as one of the industries with the greatest potential to generate revenue and reach large audiences. Have you ever wondered whether digital marketing, influencer marketing and other new-age marketing tactics are worth the hype and the money? This research project will analyze the mechanics of Facebook ads and dig into ad clicks, impressions, conversions, ad budgeting and the statistics behind those numbers.
A single post on social media can be powerful. This research project aims to quantify how powerful it can be in terms of visits, conversions, amount spent and more.
The primary objective of this research project is to analyze the factors influencing the performance of Facebook ads using multiple linear regression. The study aims to identify undiscovered patterns that might help ad creators and revenue generators tailor the ad posting mechanism to improve ad performance.
This study will try to answer questions including, but not limited to:
How is the ad spending budget related to the conversion rate, and which other factors affect spending on ads?
What factors contribute most significantly to the conversion rate in advertising?
This research project will use multiple linear regression, logistic regression, hypothesis testing and other models as and if necessary to analyze the dataset to its full potential.
The dataset has been taken from Kaggle.com, an open repository of datasets. It contains three different ID variables (for the ad, the company campaign and the Facebook campaign), along with age, gender, interest, impressions, clicks, amount spent and conversions from the ad.
Please note that research on this topic is ongoing; this section will be updated with a literature review.
Appendix:
https://www.kaggle.com/code/mansimeena/facebook-ad-campaigns-analysis-sales-prediction/input
In the ever-evolving landscape of digital marketing, where the potential for revenue generation is unprecedented, understanding the effectiveness of, and the return on, every dollar invested in various strategies becomes crucial. This research project focuses on the impact of Facebook ads on sales, brand awareness and the business overall.
According to Google, as of 2023, there are 3.03 billion monthly active users of Facebook in the world. This research project aims to employ multiple linear regression to analyze the factors influencing ad performance. Various scholars have studied the patterns of online marketing while considering other factors and measuring their impact on sales and overall business. To contextualize this study, I have gathered and reviewed relevant literature on digital marketing, social media advertising and the use of statistical tools for analyzing advertising effectiveness.
In the first study reviewed, numerous marketing strategies were suggested, among which YouTube proved to be the second most effective marketing medium. The authors used sales in thousands of dollars as the response and the YouTube advertising budget as the predictor.
A Simple Linear Regression Model is used to explain the relationship between YouTube advertising and sales.
Y = β0 + β1 * YouTube + ε.
That gave:
Y = 4.84708 + 0.04802 * YouTube + ε.
Interpretation of the model:
When nothing is spent on YouTube advertising, the expected sales equal the intercept (β0), that is 4.84708 * 1000 = 4847 dollars.
The slope of the model is 0.04802, which means that each one-unit increase in the YouTube advertising budget is associated with an increase of 0.04802 * 1000 = 48 dollars in sales. For example, at a YouTube budget of 1000 units, the expected sales are 4.84708 + 0.04802 * 1000 = 52.86708, representing sales of 52,867 dollars.
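The interpretation above can be reproduced as a small calculation. The sketch below only re-uses the two coefficients reported for the YouTube model; the helper name `predict_sales` and the budget values are hypothetical and chosen for illustration.

```r
# Coefficients as reported for the YouTube model (sales are in thousands of dollars).
b0 <- 4.84708
b1 <- 0.04802

# Hypothetical helper: expected sales (in thousands of dollars) for a given budget.
predict_sales <- function(budget) b0 + b1 * budget

# Illustrative budget values, expressed in the paper's budget units.
budgets <- c(0, 1, 100, 1000)

data.frame(
  budget          = budgets,
  sales_thousands = predict_sales(budgets),
  sales_dollars   = predict_sales(budgets) * 1000
)
```

The first row corresponds to the intercept interpretation and the last row to the 1000-unit example in the text.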
The paper also conducts a hypothesis test, formulated as follows:
H_0: YouTube advertising has no significant relationship with sales.
H_1: YouTube advertising has a significant relationship with sales.
Numerical results of SLRM based on t-test.
Adv. media   RCs   Estimated values   SE        t-statistic   Pr(>|t|)
YouTube      β0    4.84707            0.39901   12.14700      2e-16
             β1    0.04801            0.00482   9.95900       2e-16
The t-statistic on the regression coefficient β1 is 9.95900, which indicates a significant deviation from zero. The p-value < 0.05 further strengthens the evidence against the null hypothesis, affirming that Y is associated with X in the context of YouTube advertising. Consequently, there is sufficient evidence to reject the null hypothesis.
The F-test yields a significant F-statistic of 99.18, which confirms that the model with YouTube advertising fits better than an intercept-only model; the positive sign of the slope indicates that the effect on sales is positive. The R-square value of 0.4366 indicates that 43.66% of the variation in sales can be explained by the linear relationship with YouTube advertising.
Therefore, statistically, YouTube advertising has a positive impact on sales. Our research will follow the same approach and methodology. That study used a simple linear regression model; our research will need a multiple linear regression approach, because we attempt to examine other factors affecting sales along with the amount spent on digital marketing.
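In a simple linear regression with one predictor, the F-statistic equals the square of the slope's t-statistic, so the two values reported for the YouTube model can be checked against each other. The sketch below does that and then illustrates the same identity on simulated data; the simulated values are assumptions made for the example and are not the paper's data.

```r
# Reported t-statistic for the YouTube slope; its square should match the reported F.
t_reported <- 9.959
f_implied  <- t_reported^2

# The same identity on simulated data (all values below are illustrative).
set.seed(1)
x <- runif(100, min = 0, max = 300)
y <- 5 + 0.05 * x + rnorm(100, sd = 3)

fit   <- summary(lm(y ~ x))
t_sim <- fit$coefficients["x", "t value"]
f_sim <- unname(fit$fstatistic["value"])

c(f_implied = f_implied, t_sim_squared = t_sim^2, f_sim = f_sim)
```

Because of this identity, the t-test on the slope and the overall F-test carry the same information when there is only one predictor; they differ only once more predictors are added.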
The second paper examines the relationship between celebrity endorsement in online advertisements and consumers' purchase intentions.
Celebrity endorsement is a type of marketing campaign in which a well-known figure with a large following and public influence, such as a movie star, entertainer or athlete, endorses a product.
The study has examined the effect of attractiveness, trustworthiness, expertise and product fit of the celebrity endorsement towards the consumer purchase intention.
The sample is obtained from a total of 200 respondents from Malaysia. The essential analysis involved are reliability analysis, descriptive statistics analysis, correlation analysis, multiple regression analysis and hypothesis testing.
The research employed a quantitative method, using a questionnaire to collect data and opinions from the respondents on the effectiveness of celebrity endorsement in online advertisements. The questionnaire consisted of close-ended questions answered on a five-point scale, in which 1 was strongly disagree and 5 was strongly agree.
In the correlation analysis, attractiveness, trustworthiness and product fit show strong relationships with purchase intention, with correlation coefficients of 0.859, 0.832 and 0.849 respectively. In addition, the correlations between attractiveness and trustworthiness, attractiveness and expertise, and trustworthiness and product fit underscore how interconnected these factors are in influencing consumer purchase intention.
Furthermore, the R-square value of 0.83 indicates that 83% of the variability in purchase intention is explained by the independent variables, leaving 17% to other unexplored factors. The ANOVA supports the model, with an F-value of 237.553 and a reported p-value of 0.00, indicating that the independent variables are jointly significant predictors of purchase intention.
Hypothesis testing using regression analysis reveals that attractiveness and trustworthiness significantly impact consumer purchase intention, while expertise does not show statistical significance. Product fit emerges as a strong predictor, supporting the hypothesis.
In the third paper, the authors examine the effect of Twitter ads on sales. A simple linear regression model is used to test the significance and usefulness of Twitter advertising for sales. Statistical tests such as the t-test and a correlation test are used to test the hypothesis of the “impact of Twitter advertising on sales”. The dataset contains one dependent variable, sales in thousands of dollars, and one independent variable, the Twitter ad budget.
The Simple Linear Regression Model is used to explain the relationship between Twitter Advertising and sales.
Y = β0 + β1 * Twitter + ε.
That gave:
Sales = 5.621 + 0.193 * Twitter + ε.
Interpretation of the Model:
The intercept is 5.621: when nothing is spent on Twitter ads, the expected sales are 5.621 * 1000 = 5621 dollars.
The slope (the regression coefficient) is 0.193, so each one-unit increase in the Twitter ad budget is associated with an increase of 0.193 * 1000 = 193 dollars in sales. For example, at a Twitter ad budget of 1000 units, the estimated sales are 5.621 + 0.193 * 1000 = 198.621, representing sales of 198.621 * 1000 = 198621 dollars.
The hypothesis test comparing the absence of a relationship between twitter advertising and sales versus the existence of one yields a significant result, indicating a positive relationship between sales and twitter advertising. The Spearman rank correlation test further supports the findings, with a correlation coefficient of 0.161 and a p-value of less than 0.05, leading to the rejection of the null hypothesis and confirming a significant relationship between Twitter advertising and sales.
The thesis focuses on accurately predicting a company’s website visits and estimating prediction uncertainty using a dynamic regression model with an ARIMA error term. The models demonstrated good predictive accuracy, with error measurements comparably small in relation to the standard deviation of the session variable.
The study also evaluated prediction intervals using bootstrap and normality methods. The normal prediction intervals were considered too large due to the long tails of the residual distribution, potentially influenced by extreme observations.
The thesis concluded that there was some evidence of an effect of advertising on website visits; however, due to the lack of data, normality could not be properly established. The dataset used in that study covered about one year. Had the data been collected over multiple years, these time-related effects could have been estimated with yearly seasonal effects.
Conclusion:
Several studies support the view that online and/or offline advertising has a positive effect on sales. However, sales are affected by many factors, not only the amount spent on ads. Therefore, this research will attempt to examine factors including, but not limited to, the amount spent on advertising, and study the statistical change in the Y variable (conversions and/or sales and/or revenue) over time, with Facebook as the advertising medium.
The dataset used for this research assignment has been taken from an open data source website, Kaggle.com. It contains 1143 observations of Facebook ad campaigns from an anonymous organisation.
The dataset contains the following variables:
ad_id: It is a unique ID for each ad. It has 6-8 digit numeric value which keeps a track of each ad posted.
xyzcampaignid: It is an ID associated with each ad campaign of xxx company. It has 3-4 digit numeric value.
fbcampaignid: It is an ID associated with how Facebook keeps track of each campaign that is active and running on the platform. It is a 6 digit numeric value.
Age: The age bracket of the audience viewing the ad, recorded as a range (for example, 30-34).
Gender: 0 = Male, 1 = Female, the gender of the person to whom the ad is shown.
Interest: It is a 2 digit numeric code specifying the category to which the person’s interest belongs. These numbers are based on the interest mentioned on their Facebook Public Profile.
Impressions: The number of times the ad was shown to the users of Facebook.
Clicks: The number of clicks on the particular ad.
Spent: The amount paid by the company to Facebook to show the ad (the ad budget).
Total_Conversion: The total number of people who inquired about the product after seeing the ad, base-level interaction.
Approved_Conversion: The total number of people who bought the product after seeing the ad.
In summary, this description covers all the pieces of information required in the project. In the next step, this research will start digging into how these pieces of information relate to each other and what patterns can be recognized. The goal is to find useful insights that help ad managers, marketing executives, creators and entrepreneurs use Facebook ads and their mechanism to their full potential.
This research study will make an attempt to examine the dataset and analyze the patterns which would be helpful in predicting total number of conversions, spending budget on ads and interests of the audience based on which ad posting mechanism can be built.
A variety of analytical methods will be used to comprehensively explore the dynamics of Facebook ad performance.
Descriptive Statistics will help us understand the central tendencies and variability of key variables such as impressions, clicks and conversions.
In the next step, to identify trends, outliers and potential relationships between variables, we will plot the data on scatterplots alongside summary statistics.
Next, we will examine the correlations between pairs of variables to uncover linear relationships and identify potential predictors of ad performance among ad budget, interests, impressions and clicks. This process will guide us in selecting the most relevant variables for further analysis.
Regression models are the main tool of this research paper, because they show how one variable changes with the others. The choice of y variable is still open, and several combinations of x variables are possible. To explore the dataset to its full potential, it is essential to build the candidate regression models and select the best-fitting one among them.
The models listed below will be used in the research paper; models may be added and/or removed later as the research progresses.
Model 1:
Dependent Variable: Approved Conversion. Independent Variable: Impressions
Y = β0 + β1 * Impressions
Model 2:
Dependent Variable: Approved Conversion. Independent Variable: Clicks
Y = β0 + β1 * Clicks
Model 3:
Dependent Variable: Approved Conversion. Independent Variable: Total Conversion.
Y = β0 + β1 * Total Conversion
Model 4:
Dependent Variable: Approved Conversion. Independent Variable: Impressions, Clicks
Y = β0 + β1 * Impressions + β2 * Clicks
In this research paper, we will also conduct hypothesis tests to examine the relationship between the dependent and independent variables, that is, how significant the independent variables are in predicting the dependent variable:
H_0: There is no significant relationship between the Y and X variables. H_1: There is a significant relationship between those variables.
We will study the data and gather evidence to decide whether the null hypothesis can be rejected in favor of the alternative.
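As a minimal sketch of how such a test can be read off a fitted model, the example below fits a simple regression and compares the slope's p-value with a significance level. The data are simulated stand-ins rather than the Kaggle data, and the 0.05 level is an assumed choice.

```r
set.seed(42)
n <- 200

# Simulated stand-ins for an ad predictor and a conversion outcome (not the Kaggle data).
x <- rexp(n, rate = 1 / 50)
y <- 0.3 + 0.02 * x + rnorm(n)

fit   <- lm(y ~ x)
coefs <- summary(fit)$coefficients

# p-value of the t-test for H_0: slope = 0.
p_slope <- coefs["x", "Pr(>|t|)"]

alpha    <- 0.05   # assumed significance level
decision <- if (p_slope < alpha) "reject H_0" else "fail to reject H_0"
decision
```

The same pattern applies to each coefficient row of `summary(fit)$coefficients` in a multiple regression.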
Conclusion:
Through rigorous analysis, this research paper will examine the patterns in the dataset and uncover facts that might allow advertisers to use the ad mechanism to its full potential.
# Dropping unnecessary columns.
df <- subset(df, select = -c(1, 2, 3))
head(df)
## # A tibble: 6 × 8
## age gender interest Impressions Clicks Spent Total_Conversion
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 30-34 1 15 7350 1 1.43 2
## 2 30-34 1 16 17861 2 1.82 2
## 3 30-34 1 20 693 0 0 1
## 4 30-34 1 28 4259 1 1.25 1
## 5 30-34 1 28 4133 1 1.29 1
## 6 30-34 1 29 1915 0 0 1
## # ℹ 1 more variable: Approved_Conversion <dbl>
df$age <- as.factor(df$age)
This research paper attempts to show the effect of the independent variables on the dependent variable, which is Approved_Conversion in this case, and to show the collinearity among the independent variables.
Approved_Conversion is the number of users who made a purchase after interacting with the ad on their Facebook wall. The three ID variables have no use in the regression models, so those columns are dropped here. The variable “gender”, followed by “age”, is expected to affect the number of conversions less than the other factors. In the next steps, this research paper will build a model including these variables for more accurate results.
# Units of Impressions column = thousand units
df$Impressions <- df$Impressions / 1000
To improve interpretability, the independent variable “Impressions,” representing the number of views on the Facebook ad, underwent a transformation. Due to the initially small and challenging-to-interpret slope coefficient, the entire column was divided by 1000. This adjustment aimed to provide a clearer and more meaningful interpretation of the variable’s impact in the regression analysis.
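Dividing a predictor by 1000 changes only the scale of its coefficient, not the fit. The sketch below illustrates this on simulated impression counts (the data and the variable names `impressions_raw` and `impressions_k` are assumptions for the example): the slope is multiplied by 1000 while R-squared stays the same.

```r
set.seed(7)

# Simulated raw impression counts and conversions (illustrative, not the Kaggle data).
impressions_raw <- round(rexp(300, rate = 1 / 150000))
conversions     <- 0.2 + 0.000004 * impressions_raw + rnorm(300, sd = 0.5)

# Fit once on raw counts and once on counts expressed in thousands.
fit_raw       <- lm(conversions ~ impressions_raw)
impressions_k <- impressions_raw / 1000
fit_k         <- lm(conversions ~ impressions_k)

slope_raw <- unname(coef(fit_raw)[2])
slope_k   <- unname(coef(fit_k)[2])

# The rescaled slope is 1000 times the raw slope; the fit itself is unchanged.
c(slope_raw = slope_raw, slope_k = slope_k, ratio = slope_k / slope_raw)
c(r2_raw = summary(fit_raw)$r.squared, r2_k = summary(fit_k)$r.squared)
```

This is why a coefficient on the rescaled Impressions column is read as the change per 1,000 impressions.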
plot(Approved_Conversion ~ Impressions, data = df, main = "Approved Conversion Vs. Impressions", xlab = "Impressions", ylab = "Approved Conversion")
The scatterplot for the first model shows a positive association between Approved_Conversion and Impressions: ads with more impressions tend to record more approved conversions. To examine this relationship more closely, a later step of the analysis will apply a log transformation, which is expected to spread out the many points clustered near the origin and make the linearity of the relationship easier to assess.
plot(Approved_Conversion ~ Clicks, data = df, main = "Approved Conversion Vs. Clicks", xlab = "Clicks", ylab = "Approved Conversion")
For the second model, the scatterplot of Approved Conversions against Clicks shows a positive, roughly linear trend: ads that receive more clicks tend to record more purchases. To improve the clarity of the plot and assess the linearity of the relationship more effectively, a log transformation will be applied in subsequent steps.
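The planned log transformation could look like the sketch below. It uses `log1p()`, which computes log(1 + x), because both Clicks and Approved_Conversion have a minimum of 0 in this dataset and `log(0)` is not finite. The data here are simulated count data, and the choice of `log1p` is an assumption rather than the method the analysis has committed to.

```r
set.seed(3)

# Simulated, right-skewed count data standing in for Clicks and Approved_Conversion.
clicks      <- rpois(200, lambda = rexp(200, rate = 1 / 30))
conversions <- rpois(200, lambda = 0.3 + 0.02 * clicks)

# log1p(x) = log(1 + x) keeps zero counts finite (log1p(0) is 0).
log_clicks <- log1p(clicks)
log_conv   <- log1p(conversions)

fit_log <- lm(log_conv ~ log_clicks)

# Side-by-side view of the raw and transformed relationships.
par(mfrow = c(1, 2))
plot(conversions ~ clicks, main = "Raw scale",
     xlab = "Clicks", ylab = "Approved Conversion")
plot(log_conv ~ log_clicks, main = "log1p scale",
     xlab = "log(1 + Clicks)", ylab = "log(1 + Approved Conversion)")
abline(fit_log)
```

On the transformed scale the slope describes an approximate relative change rather than a change in counts, so it is not directly comparable with the coefficients of the untransformed models.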
plot(Approved_Conversion ~ Total_Conversion, data = df, main = "Approved Conversion Vs. Total Conversions", xlab = "Total Conversion", ylab = "Approved Conversion")
There exists a robust correlation between Approved Conversion and Total Conversion, indicating that as the number of individuals engaging with the ad increases, so does the likelihood of conversions. Essentially, the conversion rate is heavily influenced by total interactions with the ad, as those who engage with the ad are more likely to make a purchase, emphasizing the direct relationship between interaction levels and conversion rates.
# Summary of Approved_Conversion variable
summary(df$Approved_Conversion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 0.944 1.000 21.000
The maximum number of conversion is 21 and the minimum is 0 with a median of 1 and mean of 0.944.
# Summary of Impressions variable
summary(df$Impressions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.087 6.503 51.509 186.732 221.769 3052.003
The summary above is in thousands of impressions. In raw counts, the maximum number of Impressions is 3052003 and the minimum is 87, with a median of 51509 and a mean of 186732.
# Summary of Clicks variable
summary(df$Clicks)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 8.00 33.39 37.50 421.00
The maximum number of Clicks is 421 and the minimum is 0 with a median of 8 and mean of 33.39.
# Summary of Total_Conversion variable
summary(df$Total_Conversion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 2.856 3.000 60.000
The maximum number of Total conversion is 60 and the minimum is 0 with a median of 1 and mean of 2.856.
# Correlation between variables
cor(df[,-c(1,2)])
## interest Impressions Clicks Spent
## interest 1.00000000 0.1019733 0.08870606 0.07022597
## Impressions 0.10197326 1.0000000 0.94851414 0.97038617
## Clicks 0.08870606 0.9485141 1.00000000 0.99290634
## Spent 0.07022597 0.9703862 0.99290634 1.00000000
## Total_Conversion 0.12026967 0.8128376 0.69463235 0.72537945
## Approved_Conversion 0.05835320 0.6842485 0.55952579 0.59317782
## Total_Conversion Approved_Conversion
## interest 0.1202697 0.0583532
## Impressions 0.8128376 0.6842485
## Clicks 0.6946324 0.5595258
## Spent 0.7253794 0.5931778
## Total_Conversion 1.0000000 0.8640338
## Approved_Conversion 0.8640338 1.0000000
Correlation tests in MLR are vital for detecting multicollinearity, which supports model reliability and interpretation and guides variable selection for improved predictive accuracy.
By assessing relationships between the predictors and the dependent variable, correlation tests help identify influential variables, contributing to a simpler model and more accurate results.
The correlation between “interest” and “Impressions” is very weak (0.10197326), suggesting that there is little to no linear relationship between these two variables.
The correlation between “Clicks” and “Impressions” is strong (0.94851414), indicating that there is a positive linear relationship between the number of clicks and impressions. As the number of impressions increases, the number of clicks tends to increase proportionally.
The correlation between “Spent” and “Impressions” is very strong (0.97038617), indicating a positive linear relationship between the amount spent on advertising and the number of impressions. This suggests that as the amount spent on advertising increases, the number of impressions also tends to increase proportionally.
The correlation between “Total_Conversion” and “Impressions” is strong (0.8128376), indicating a positive linear relationship between the total number of conversions and the number of impressions. This suggests that as the number of impressions increases, the total number of conversions also tends to increase proportionally.
The correlation between “Total_Conversion” and “Approved_Conversion” is strong (0.8640338), indicating a positive linear relationship between the total number of conversions and the total number of approved conversions. This suggests that as the total number of conversions increases, the total number of approved conversions also tends to increase proportionally.
# Model-1 : Approved_Conversion ~ Impressions
mod1 <- lm(Approved_Conversion ~ Impressions, data = df)
summary(mod1)
##
## Call:
## lm(formula = Approved_Conversion ~ Impressions, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6645 -0.4463 -0.2425 0.6678 12.8558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2341098 0.0436790 5.36 1.01e-07 ***
## Impressions 0.0038017 0.0001199 31.69 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.268 on 1141 degrees of freedom
## Multiple R-squared: 0.4682, Adjusted R-squared: 0.4677
## F-statistic: 1005 on 1 and 1141 DF, p-value: < 2.2e-16
This linear regression model predicts the variable “Approved_Conversion” based on the predictor variable “Impressions”. The intercept term suggests that when the number of impressions is zero, the estimated value of approved conversions is approximately 0.234. Because Impressions is measured in thousands, the coefficient indicates that for every additional 1,000 impressions, the estimated value of approved conversions increases by approximately 0.0038. The model is statistically significant (p-value < 2.2e-16), so an association this strong would be very unlikely if impressions and approved conversions were unrelated. The multiple R-squared value of 0.4682 indicates that approximately 46.82% of the variability in approved conversions can be explained by the linear relationship with impressions.
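In a regression with a single predictor, R-squared equals the squared Pearson correlation between the predictor and the response. The sketch below applies this to the correlations with Approved_Conversion taken from the correlation matrix above; the results can be compared with the R-squared values printed by the corresponding single-predictor fits.

```r
# Correlations with Approved_Conversion, copied from the correlation matrix.
r <- c(Impressions      = 0.6842485,
       Clicks           = 0.5595258,
       Total_Conversion = 0.8640338)

# For a one-predictor linear model, R-squared is the squared correlation.
r_squared <- r^2
round(r_squared, 4)
```

The value for Impressions agrees with the multiple R-squared of 0.4682 reported for this model.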
# Model-2 : Approved_Conversion ~ Clicks
mod2 <- lm(Approved_Conversion ~ Clicks, data = df)
summary(mod2)
##
## Call:
## lm(formula = Approved_Conversion ~ Clicks, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6454 -0.4930 -0.3734 0.5668 17.1744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3733681 0.0494222 7.555 8.58e-14 ***
## Clicks 0.0170900 0.0007494 22.804 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.441 on 1141 degrees of freedom
## Multiple R-squared: 0.3131, Adjusted R-squared: 0.3125
## F-statistic: 520 on 1 and 1141 DF, p-value: < 2.2e-16
This linear regression model predicts the variable “Approved_Conversion” based on the predictor variable “Clicks”. The intercept term suggests that when the number of clicks is zero, the estimated value of approved conversions is approximately 0.373. The coefficient for “Clicks” indicates that for every one-unit increase in clicks, the estimated value of approved conversions increases by approximately 0.0171. The model is statistically significant (p-value < 2.2e-16), indicating that the relationship between clicks and approved conversions is not due to random chance. The multiple R-squared value of 0.3131 indicates that approximately 31.31% of the variability in approved conversions can be explained by the linear relationship with clicks.
# Model-3 : Approved_Conversion ~ Total_Conversion
mod3 <- lm(Approved_Conversion ~ Total_Conversion, data = df)
summary(mod3)
##
## Call:
## lm(formula = Approved_Conversion ~ Total_Conversion, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0201 -0.3226 -0.3226 0.6774 7.6173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.012273 0.030696 -0.40 0.689
## Total_Conversion 0.334874 0.005776 57.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8752 on 1141 degrees of freedom
## Multiple R-squared: 0.7466, Adjusted R-squared: 0.7463
## F-statistic: 3361 on 1 and 1141 DF, p-value: < 2.2e-16
This linear regression model predicts the variable “Approved_Conversion” based on the predictor variable “Total_Conversion”. The intercept term suggests that when the total number of conversions is zero, the estimated value of approved conversions is approximately -0.012. However, this intercept is not statistically significant (p-value = 0.689). The coefficient for “Total_Conversion” indicates that for every one-unit increase in total conversions, the estimated value of approved conversions increases by approximately 0.3349. The model is highly statistically significant (p-value < 2.2e-16), suggesting that the relationship between total conversions and approved conversions is not due to random chance. The multiple R-squared value of 0.7466 indicates that approximately 74.66% of the variability in approved conversions can be explained by the linear relationship with total conversions.
# Multilinear model
mod4 <- lm(Approved_Conversion ~ Impressions + Clicks, data = df)
summary(mod4)
##
## Call:
## lm(formula = Approved_Conversion ~ Impressions + Clicks, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7316 -0.4351 -0.2704 0.6423 10.3954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2660397 0.0403477 6.594 6.55e-11 ***
## Impressions 0.0085029 0.0003493 24.344 < 2e-16 ***
## Clicks -0.0272472 0.0019201 -14.190 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.169 on 1140 degrees of freedom
## Multiple R-squared: 0.548, Adjusted R-squared: 0.5472
## F-statistic: 691.1 on 2 and 1140 DF, p-value: < 2.2e-16
This multiple linear regression model predicts the variable “Approved_Conversion” based on the predictor variables “Impressions” and “Clicks”. The intercept term suggests that when both impressions and clicks are zero, the estimated value of approved conversions is approximately 0.266. With clicks held fixed, every additional 1,000 impressions is associated with an increase of approximately 0.0085 in the estimated value of approved conversions; with impressions held fixed, every additional click is associated with a decrease of approximately 0.0272. The model is highly statistically significant (p-value < 2.2e-16), so the joint relationship between impressions, clicks and approved conversions is very unlikely to be due to chance alone. The multiple R-squared value of 0.548 indicates that approximately 54.8% of the variability in approved conversions can be explained by the linear relationship with impressions and clicks.
In addition, the negative coefficient on “Clicks” is likely a consequence of multicollinearity between the two independent variables (Impressions and Clicks). According to the correlation matrix, the correlation between them is 0.9485, so the two predictors carry almost the same information. Multicollinearity inflates the standard errors of the coefficients and can make their signs and magnitudes unstable, which makes it hard to separate the individual effect of each variable on the dependent variable.
vif(mod4)
## Impressions Clicks
## 9.968009 9.968009
Therefore, this research paper uses Variance Inflation Factor (VIF) tests to assess multicollinearity between predictors and to decide whether variables should be dropped. The VIF measures how strongly each predictor is correlated with the others, which helps identify redundant variables that may lead to unstable or unreliable regression results. Conducting VIF tests and making informed decisions about variable inclusion supports the robustness and validity of the findings in the presence of multicollinearity.
The VIF for both “Impressions” and “Clicks” is approximately 9.97. The rule of thumb is that a regressor produces high collinearity if its VIF is greater than 10 (some authors use 15), so these values sit right at the edge of the stricter threshold and indicate that multicollinearity between these predictors is a serious concern.
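With exactly two predictors, the VIF of each one reduces to 1 / (1 - r^2), where r is the correlation between them. The sketch below applies that formula to the Impressions and Clicks correlation from the correlation matrix, as a hand check on the `vif()` output; the variable names are chosen for illustration.

```r
# Correlation between Impressions and Clicks, copied from the correlation matrix.
r_imp_clicks <- 0.94851414

# With only two predictors, both share the same VIF: 1 / (1 - r^2).
vif_two_predictors <- 1 / (1 - r_imp_clicks^2)
vif_two_predictors
```

The result agrees with the value of about 9.97 printed by `vif(mod4)`, which also explains why the two predictors report identical VIFs.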
# Multilinear model
mod5 <- lm(Approved_Conversion ~ Impressions + Clicks + Total_Conversion, data = df)
summary(mod5)
##
## Call:
## lm(formula = Approved_Conversion ~ Impressions + Clicks + Total_Conversion,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9391 -0.3546 -0.3221 0.6453 6.6996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0181876 0.0309486 0.588 0.55687
## Impressions 0.0012325 0.0003508 3.513 0.00046 ***
## Clicks -0.0074265 0.0015617 -4.755 2.24e-06 ***
## Total_Conversion 0.3304456 0.0107757 30.666 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8658 on 1139 degrees of freedom
## Multiple R-squared: 0.7524, Adjusted R-squared: 0.7518
## F-statistic: 1154 on 3 and 1139 DF, p-value: < 2.2e-16
vif(mod5)
## Impressions Clicks Total_Conversion
## 18.345219 12.028234 3.556444
The multiple linear regression model predicts “Approved_Conversion” based on “Impressions”, “Clicks”, and “Total_Conversion”. The intercept term suggests that when all predictor variables are zero, the estimated value of approved conversions is approximately 0.018. The coefficients indicate that for every one-unit increase in “Impressions”, “Clicks”, and “Total_Conversion”, the estimated value of approved conversions increases by approximately 0.00123, decreases by approximately 0.00743, and increases by approximately 0.33045, respectively. The model is highly statistically significant (p-value < 2.2e-16), suggesting that the relationship between the predictors and approved conversions is not due to random chance. The multiple R-squared value of 0.7524 indicates that approximately 75.24% of the variability in approved conversions can be explained by the linear relationship with the predictors.
According to the correlation test, the correlation between “Clicks” and “Total_Conversion” is 0.69, a moderately strong positive linear association.
In addition, the VIF values for “Impressions”, “Clicks”, and “Total_Conversion” are 18.345, 12.028, and 3.556, respectively. The values for “Impressions” and “Clicks” exceed the threshold of 10, indicating that these two predictors are highly collinear, while “Total_Conversion” shows only moderate inflation. It is therefore not advisable to include all three variables simultaneously in a regression model.
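To make the VIF figures less of a black box, the sketch below computes variance inflation factors by hand using the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. It runs on simulated data, not on the Facebook ads dataset; the data frame `sim`, its columns, and the helper `manual_vif` are hypothetical names introduced only for illustration.

```r
set.seed(42)
n <- 500

# Simulated predictors: clicks is built to be strongly tied to impressions,
# while total_conv is only weakly related to clicks.
impressions <- rnorm(n, mean = 100, sd = 30)
clicks      <- 0.2 * impressions + rnorm(n, sd = 2)
total_conv  <- 0.05 * clicks + rnorm(n, sd = 1)
sim <- data.frame(impressions, clicks, total_conv)

# VIF for each column: regress it on all other columns and use 1 / (1 - R^2).
manual_vif <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- manual_vif(sim)
print(round(vifs, 2))
```

In this simulated setup the two deliberately collinear columns receive large VIFs while the third stays close to 1, which mirrors the pattern reported for mod5.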
# Multilinear model
mod6 <- lm(Approved_Conversion ~ Impressions + Total_Conversion, data = df)
summary(mod6)
##
## Call:
## lm(formula = Approved_Conversion ~ Impressions + Total_Conversion,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2487 -0.3460 -0.3175 0.6538 7.5545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0049307 0.0308527 -0.160 0.8731
## Impressions -0.0002959 0.0001420 -2.085 0.0373 *
## Total_Conversion 0.3516523 0.0099021 35.513 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8739 on 1140 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7471
## F-statistic: 1688 on 2 and 1140 DF, p-value: < 2.2e-16
(vif(mod6))
## Impressions Total_Conversion
## 2.947287 2.947287
The linear regression model predicts “Approved_Conversion” based on “Impressions” and “Total_Conversion”. The intercept term suggests that when both predictors are zero, the estimated value of approved conversions is approximately -0.00493, although this intercept is not statistically significant (p-value = 0.8731). The coefficient for “Impressions” indicates that for every one-unit increase in impressions, the estimated value of approved conversions decreases by approximately 0.000296, which is statistically significant (p-value = 0.0373). Conversely, the coefficient for “Total_Conversion” suggests that for every one-unit increase in total conversions, the estimated value of approved conversions increases by approximately 0.35165, which is highly statistically significant (p-value < 2.2e-16). The model as a whole is highly significant (p-value < 2.2e-16), with a multiple R-squared value of 0.7475 indicating that approximately 74.75% of the variability in approved conversions can be explained by the linear relationship with the predictors.
The VIF values for “Impressions” and “Total_Conversion” are approximately 2.95 each, indicating low collinearity between the two predictors.
# Multilinear model
mod7 <- lm(Approved_Conversion ~ Clicks + Total_Conversion, data = df)
summary(mod7)
##
## Call:
## lm(formula = Approved_Conversion ~ Clicks + Total_Conversion,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3016 -0.3635 -0.3155 0.6365 7.2362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0074537 0.0309502 0.241 0.809732
## Clicks -0.0023999 0.0006291 -3.815 0.000144 ***
## Total_Conversion 0.3560271 0.0079825 44.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8701 on 1140 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7493
## F-statistic: 1708 on 2 and 1140 DF, p-value: < 2.2e-16
(vif(mod7))
## Clicks Total_Conversion
## 1.93242 1.93242
The linear regression model predicts “Approved_Conversion” based on “Clicks” and “Total_Conversion”. The coefficient for “Clicks” suggests that for every one-unit increase in clicks, the estimated value of approved conversions decreases by approximately 0.0024, which is statistically significant (p-value = 0.000144). Conversely, the coefficient for “Total_Conversion” indicates that for every one-unit increase in total conversions, the estimated value of approved conversions increases by approximately 0.356, which is highly statistically significant (p-value < 2.2e-16).
The VIF value of 1.93 for each of the two independent variables suggests low collinearity between them.
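In a model with exactly two predictors, both share the same VIF, because VIF = 1 / (1 - r^2), where r is the correlation between the two predictors. The sketch below inverts that relationship for the two-predictor models above to recover the implied absolute correlation. The vector names are illustrative, and the VIF values are copied from the printed output.

```r
# VIFs reported for the two-predictor models.
vif_two <- c(mod6 = 2.947287, mod7 = 1.93242)

# Invert VIF = 1 / (1 - r^2) to get the implied squared correlation.
r_squared <- 1 - 1 / vif_two

# The sign of r cannot be recovered from the VIF, only its magnitude.
abs_r <- sqrt(r_squared)

print(round(data.frame(vif = vif_two, r_squared, abs_r), 3))
```

For mod7 the implied magnitude is consistent with the correlation of 0.69 between “Clicks” and “Total_Conversion” reported earlier.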
# Simple linear model
mod8 <- lm(Approved_Conversion ~ Total_Conversion, data = df)
summary(mod8)
##
## Call:
## lm(formula = Approved_Conversion ~ Total_Conversion, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0201 -0.3226 -0.3226 0.6774 7.6173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.012273 0.030696 -0.40 0.689
## Total_Conversion 0.334874 0.005776 57.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8752 on 1141 degrees of freedom
## Multiple R-squared: 0.7466, Adjusted R-squared: 0.7463
## F-statistic: 3361 on 1 and 1141 DF, p-value: < 2.2e-16
(plot(Approved_Conversion ~ Total_Conversion, data = df))
## NULL
(predict(mod8, newdata = data.frame(Total_Conversion = c(0))))
## 1
## -0.01227337
The linear regression model predicts “Approved_Conversion” based solely on “Total_Conversion”. The coefficient for “Total_Conversion” indicates that for every one-unit increase in total conversions, the estimated value of approved conversions increases by approximately 0.335, which is highly statistically significant (p-value < 2e-16). Including “Clicks”, “Impressions” and “Spent” in the model resulted in high collinearity between the predictors, which inflates standard errors and makes the coefficient estimates unstable and unreliable, so dropping them was a prudent decision to ensure the model’s reliability and validity.
# Stargazer of all the models.
stargazer(mod4, mod6, mod7, type = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------
## Approved_Conversion
## (1) (2) (3)
## --------------------------------------------------------------------
## Impressions 0.009*** -0.0003**
## (0.0003) (0.0001)
##
## Clicks -0.027*** -0.002***
## (0.002) (0.001)
##
## Total_Conversion 0.352*** 0.356***
## (0.010) (0.008)
##
## Constant 0.266*** -0.005 0.007
## (0.040) (0.031) (0.031)
##
## --------------------------------------------------------------------
## Observations 1,143 1,143 1,143
## R2 0.548 0.748 0.750
## Adjusted R2 0.547 0.747 0.749
## Residual Std. Error (df = 1140) 1.169 0.874 0.870
## F Statistic (df = 2; 1140) 691.148*** 1,687.577*** 1,707.715***
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The table displays three regression models investigating the relationship between the dependent variable “Approved_Conversion” and independent variables. Model (1) includes “Impressions” and “Clicks”, Model (2) incorporates “Impressions” and “Total_Conversion”, and Model (3) encompasses “Clicks” and “Total_Conversion”. The adjusted R-squared values, which denote the proportion of variance explained by the independent variables after adjusting for the number of predictors, identify Model (3) as the best fit with an adjusted R-squared of 0.749. Model (2) is close behind at 0.747, while Model (1), which omits “Total_Conversion”, is much lower at 0.547.
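The adjusted R-squared values in the table follow from the formula 1 - (1 - R^2)(n - 1) / (n - k - 1), where n is the number of observations and k the number of predictors. The sketch below applies it to the R-squared figures printed for these models. The function `adj_r2` and the data frame `models` are hypothetical names, and the inputs are rounded values from the output, so agreement is only expected to about three decimal places.

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count k.
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

n <- 1143  # observations reported in the table

models <- data.frame(
  model    = c("mod4", "mod6", "mod7"),
  r2       = c(0.548, 0.7475, 0.7497),  # R-squared as printed in the output
  k        = c(2, 2, 2),                # number of predictors in each model
  reported = c(0.547, 0.7471, 0.7493)   # adjusted R-squared as printed
)

models$computed <- adj_r2(models$r2, n, models$k)
print(models)
```

Because all three models use the same number of predictors on the same data, ranking them by adjusted R-squared gives the same order as ranking by plain R-squared.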
# Stargazer of all the models.
stargazer(mod1, mod2, mod3, type = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------
## Approved_Conversion
## (1) (2) (3)
## --------------------------------------------------------------------
## Impressions 0.004***
## (0.0001)
##
## Clicks 0.017***
## (0.001)
##
## Total_Conversion 0.335***
## (0.006)
##
## Constant 0.234*** 0.373*** -0.012
## (0.044) (0.049) (0.031)
##
## --------------------------------------------------------------------
## Observations 1,143 1,143 1,143
## R2 0.468 0.313 0.747
## Adjusted R2 0.468 0.312 0.746
## Residual Std. Error (df = 1141) 1.268 1.441 0.875
## F Statistic (df = 1; 1141) 1,004.527*** 520.011*** 3,360.953***
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The table presents three regression models examining the relationship between the dependent variable “Approved_Conversion” and various independent variables. Model (1) includes “Impressions”, Model (2) includes “Clicks”, and Model (3) incorporates “Total_Conversion”.
The adjusted R-squared values indicate the proportion of variance in “Approved_Conversion” explained by the independent variables. Model (3) has the highest adjusted R-squared value of 0.746, suggesting it provides the best overall fit among the models.
In the next steps, this research paper will attempt to include age and gender as interaction terms in mod8, which has Approved_Conversion as the dependent variable and Total_Conversion as the independent variable.
This research paper will proceed to explore log transformation in the regression model to refine its accuracy and mitigate biases, with the objective of attaining more robust and unbiased results. Additionally, efforts will be made to eliminate multicollinearity from the model to enhance its reliability and effectiveness.
Last Update: The intercept is the predicted value of the dependent variable when the independent variable equals zero. In an effort to remove multicollinearity from the model, we examined the summary results of mod8, where Approved_Conversion is the Y variable and Total_Conversion is the X variable. The intercept is -0.012, which is interpreted as: when Total_Conversion is zero, the predicted Approved_Conversion is -0.012 units (limitation of the dataset: no specific units are given). A negative number of conversions is not possible in the real world, although this intercept is not statistically different from zero (p-value = 0.689). Therefore, if Total_Conversion is included in any regression model with Approved_Conversion, the results will be biased, for multiple reasons.
Therefore, in order to get unbiased and reliable results, we are dropping the variable “Total_Conversion”.
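For context on the mod8 intercept discussed above, the following sketch builds an approximate 95% confidence interval for it from the estimate, standard error, and residual degrees of freedom printed in the summary. The variable names are illustrative, and the calculation assumes the usual t-based interval for an ordinary least squares coefficient.

```r
# Values copied from the summary(mod8) output.
estimate <- -0.012273
std_err  <- 0.030696
resid_df <- 1141

# Critical value of the t distribution for a two-sided 95% interval.
t_crit <- qt(0.975, df = resid_df)

ci <- estimate + c(-1, 1) * t_crit * std_err
names(ci) <- c("lower", "upper")
print(round(ci, 4))

# TRUE when the interval spans zero.
contains_zero <- ci[["lower"]] < 0 && ci[["upper"]] > 0
print(contains_zero)
```

If the model object is available, `confint(mod8)` would report the same kind of interval directly.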
The Poisson regression model fitted here aims to predict the number of ‘Impressions’ from the predictors ‘age’ and ‘gender’. The coefficients are interpreted as follows:
Intercept (5.176893): This is the expected log count of ‘Impressions’ for the reference categories of ‘age’ and ‘gender’, since both predictors are categorical.
age35-39 (0.066457), age40-44 (0.157294), age45-49 (0.418635): These are the coefficients for the age groups 35-39, 40-44, and 45-49, respectively. Each represents the expected difference in the log count of ‘Impressions’ between that age group and the baseline category (presumably the youngest age group), holding gender constant.
gender (-0.203384): This coefficient represents the expected difference in the log count of ‘Impressions’ between the two gender categories (assuming a binary coding, e.g., male = 0, female = 1), holding age constant. Since it is negative, it suggests that gender ‘1’ (e.g., female) is associated with a lower expected log count of ‘Impressions’ than gender ‘0’ (e.g., male).
The significance codes (‘***’, ‘**’, ‘*’, etc.) indicate the level of significance of each coefficient. All coefficients have very low p-values (< 0.05), indicating that they are statistically significant predictors of ‘Impressions’.
The deviance statistics assess the goodness-of-fit of the model. The residual deviance (398839) is smaller than the null deviance (407629), suggesting that the model explains some of the deviance in the data, although the reduction is small relative to the total.
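Poisson coefficients are on the log scale, so they are often easier to read after exponentiation, which turns them into multiplicative effects (incidence rate ratios) on the expected count. The sketch below applies this to the coefficients quoted above. The vector names are illustrative, and the category labels are taken as quoted in the text, not from a fitted model object.

```r
# Coefficients quoted in the interpretation above (log scale).
coefs <- c(
  "age35-39" = 0.066457,
  "age40-44" = 0.157294,
  "age45-49" = 0.418635,
  "gender"   = -0.203384
)

# Incidence rate ratio: multiplicative change in the expected count
# relative to the reference category.
irr <- exp(coefs)

# The same effect expressed as a percentage change.
pct_change <- (irr - 1) * 100

print(round(data.frame(coef = coefs, irr = irr, pct_change = pct_change), 3))
```

A ratio above 1 means a higher expected count than the reference category, and a ratio below 1 means a lower one.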
mod1_1 <- glm(Approved_Conversion ~ Impressions, df, family = poisson)
summary(mod1_1)
##
## Call:
## glm(formula = Approved_Conversion ~ Impressions, family = poisson,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.349e-01 3.776e-02 -14.17 <2e-16 ***
## Impressions 1.474e-03 3.848e-05 38.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2199.3 on 1142 degrees of freedom
## Residual deviance: 1382.9 on 1141 degrees of freedom
## AIC: 2741.8
##
## Number of Fisher Scoring iterations: 5
mod2_1 <- glm(Approved_Conversion ~ Clicks, df, family = poisson)
summary(mod2_1)
##
## Call:
## glm(formula = Approved_Conversion ~ Clicks, family = poisson,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5378170 0.0395466 -13.60 <2e-16 ***
## Clicks 0.0085331 0.0002746 31.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2199.3 on 1142 degrees of freedom
## Residual deviance: 1542.6 on 1141 degrees of freedom
## AIC: 2901.5
##
## Number of Fisher Scoring iterations: 5
mod3_1 <- glm(Approved_Conversion ~ Clicks + Impressions, df, family = poisson)
summary(mod3_1)
##
## Call:
## glm(formula = Approved_Conversion ~ Clicks + Impressions, family = poisson,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5367098 0.0393924 -13.625 <2e-16 ***
## Clicks 0.0001214 0.0007500 0.162 0.871
## Impressions 0.0014582 0.0001068 13.653 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2199.3 on 1142 degrees of freedom
## Residual deviance: 1382.9 on 1140 degrees of freedom
## AIC: 2743.8
##
## Number of Fisher Scoring iterations: 5
stargazer(mod1_1, mod2_1, mod3_1, type = "text")
##
## ==================================================
## Dependent variable:
## --------------------------------
## Approved_Conversion
## (1) (2) (3)
## --------------------------------------------------
## Impressions 0.001*** 0.001***
## (0.00004) (0.0001)
##
## Clicks 0.009*** 0.0001
## (0.0003) (0.001)
##
## Constant -0.535*** -0.538*** -0.537***
## (0.038) (0.040) (0.039)
##
## --------------------------------------------------
## Observations 1,143 1,143 1,143
## Log Likelihood -1,368.889 -1,448.747 -1,368.876
## Akaike Inf. Crit. 2,741.779 2,901.495 2,743.753
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
These Poisson regression models aim to predict the count of ‘Approved_Conversion’ from the predictors ‘Impressions’ and ‘Clicks’. The coefficients and fit statistics are interpreted as follows:
The constant term in a Poisson regression model represents the log of the expected count of ‘Approved_Conversion’ when all predictors, such as ‘Impressions’ and ‘Clicks’, are zero; exponentiating it gives the baseline count itself (for Model (1), exp(-0.535) ≈ 0.59). Its standard error conveys the uncertainty associated with this baseline.
Observations, Log Likelihood, and AIC metrics offer insights into the model’s fit to the data. The number of observations indicates the size of the dataset used to build the model. Log likelihood is the logarithm of the probability of observing the data given the fitted model, so values closer to zero indicate a better fit. A lower AIC value suggests a better balance between model complexity and goodness-of-fit, aiding in model selection. By this criterion Model (1), with the lowest AIC (2,741.779), is preferred.
Overall, ‘Impressions’ and ‘Clicks’ are each highly significant when entered alone (Models (1) and (2)). When both are included (Model (3)), ‘Impressions’ remains significant while ‘Clicks’ does not (p-value = 0.871), which is consistent with the collinearity between the two predictors. The constant term determines the expected count of ‘Approved_Conversion’ when all predictors are zero.
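The AIC values in the table can be reproduced from the log likelihoods using AIC = 2k - 2 log L, where k is the number of estimated parameters (the intercept plus one per predictor). The sketch below does this with the figures printed in the stargazer output. The vector names are illustrative, and the inputs are rounded to three decimals.

```r
# Log likelihoods copied from the stargazer table.
log_lik <- c(mod1_1 = -1368.889, mod2_1 = -1448.747, mod3_1 = -1368.876)

# Estimated parameters per model: intercept plus one per predictor.
n_params <- c(mod1_1 = 2, mod2_1 = 2, mod3_1 = 3)

# AIC = 2k - 2 * logLik; lower is better.
aic <- 2 * n_params - 2 * log_lik
print(round(aic, 3))

# Model with the smallest AIC.
best <- names(which.min(aic))
print(best)
```

Adding ‘Clicks’ to the ‘Impressions’ model barely changes the log likelihood, so the penalty for the extra parameter leaves the simpler model with the lower AIC.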
mod10 <- lm(Spent ~ Approved_Conversion + Impressions + Clicks, df)
summary(mod10)
##
## Call:
## lm(formula = Spent ~ Approved_Conversion + Impressions + Clicks,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.243 -1.068 0.044 1.053 60.162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.103050 0.235007 -0.438 0.661
## Approved_Conversion -0.680407 0.169310 -4.019 6.24e-05 ***
## Impressions 0.085004 0.002462 34.533 < 2e-16 ***
## Clicks 1.085138 0.011907 91.138 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.684 on 1139 degrees of freedom
## Multiple R-squared: 0.9941, Adjusted R-squared: 0.9941
## F-statistic: 6.397e+04 on 3 and 1139 DF, p-value: < 2.2e-16
plot(df$Impressions, df$Spent)
m <- lm(Spent ~ Impressions, data = df)
summary(m)
##
## Call:
## lm(formula = Spent ~ Impressions, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -201.900 -4.153 -1.220 1.849 155.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.009289 0.723611 1.395 0.163
## Impressions 0.269645 0.001987 135.695 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21 on 1141 degrees of freedom
## Multiple R-squared: 0.9416, Adjusted R-squared: 0.9416
## F-statistic: 1.841e+04 on 1 and 1141 DF, p-value: < 2.2e-16
Regression Analysis Interpretation:
mod10:
The linear regression model we’ve fitted aims to predict ‘Spent’ based on ‘Approved_Conversion’, ‘Impressions’, and ‘Clicks’. Let’s interpret the coefficients and model summary:
Coefficients:
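Intercept (-0.103050): The predicted value of ‘Spent’ when all three predictors are zero. It is not statistically significant (p-value = 0.661).
Approved_Conversion (-0.680407): Holding ‘Impressions’ and ‘Clicks’ constant, each additional approved conversion is associated with a decrease of approximately 0.68 in ‘Spent’ (p-value = 6.24e-05).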
Nastišin, Ľ. (Year). Facebook and Instagram analysis of CPC and CTR: Analysis by region and industry. Journal Name, Volume(Issue), Page range.
The paper by Nastišin (Year) investigates the analysis of CPC and CTR on Facebook and Instagram, examining variations across industries and regions. The study highlights the importance of understanding these metrics as a baseline for advertising budget calculations. It emphasizes the need for risk assessment in campaign planning, considering differing CTRs between industries and regions. Additionally, the paper identifies a significant discrepancy in advertising costs between campaigns aimed at conversion objectives versus brand awareness objectives. Notably, it underscores the growing importance of Instagram in advertising and the prevalence of Link Click ads on this platform compared to Facebook. The findings suggest a lack of research on the phenomenon where increased conversions lead to decreased ad spending, as well as the correlation between higher click rates and increased costs, particularly on Facebook ads.
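Impressions (0.085004): Holding the other predictors constant, a one-unit increase in ‘Impressions’ is associated with an increase of approximately 0.085 in ‘Spent’ (p-value < 2e-16).
Clicks (1.085138): Holding the other predictors constant, each additional click is associated with an increase of approximately 1.085 in ‘Spent’ (p-value < 2e-16).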
The summary of the model provides key information about its performance and significance:
Residuals: Represent the differences between observed and predicted values of ‘Spent’. Summary statistics (min, 1Q, median, 3Q, max) describe the distribution of these differences.
Residual standard error: Estimates the standard deviation of the residuals, indicating the typical amount by which the model’s predictions deviate from the actual values of ‘Spent’.
Multiple R-squared: Measures how well the model explains the variability in ‘Spent’. It indicates that approximately 99.41% of the variability in ‘Spent’ is accounted for by the predictors (‘Approved_Conversion’, ‘Impressions’, ‘Clicks’).
Adjusted R-squared: Similar to Multiple R-squared, but adjusted for the number of predictors. It provides a more accurate measure of model fit when there are multiple predictors.
F-statistic: Assesses the overall significance of the model. The high F-statistic value (6.397e+04) and associated p-value (< 2.2e-16) indicate that the model as a whole is highly significant, meaning that at least one of ‘Approved_Conversion’, ‘Impressions’, and ‘Clicks’ is a significant predictor of ‘Spent’. The individual t-tests in the coefficient table show that all three are.
plot(mod10)
The residual plot for mod10 shows a roughly straight line, suggesting a very strong linear relationship between the independent variables (Approved_Conversion, Impressions, Clicks) and the dependent variable (Spent). The absence of discernible patterns indicates that the model adequately captures the underlying variability, affirming its reliability in predicting the dependent variable.
With an R-squared value of 0.9941, and an adjusted R-squared value of the same magnitude, the model explains approximately 99.41% of the variability in the dependent variable, affirming its strong explanatory power and suitability for the observed data.
This research paper has conducted a hypothesis test on the relationship between the variables Spent, Approved_Conversion, Impressions, and Clicks, utilizing the dataset denoted as df.
Null hypothesis (H0): β1 = β2 = β3 = 0
Alternative hypothesis (H1): at least one of β1, β2, β3 is not equal to 0
The null hypothesis (H0) posited no relationship between the independent variables and the dependent variable, while the alternative hypothesis (H1) suggested the presence of some relationship.
With a p-value below 2.2e-16, effectively zero, we reject the null hypothesis, providing substantial evidence in support of the alternative hypothesis. Additionally, the goodness of fit statistics, including an R-squared value of 0.9941 and an adjusted R-squared value of 0.9941, underscore the robustness of the model, signifying a high level of explanatory power and suitability for the observed data.
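The overall F test used here can be reconstructed from the R-squared value alone, since F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for a model with k predictors. The sketch below applies this to the mod10 figures. The variable names are illustrative, and because R-squared is rounded to four decimals the result only approximates the reported statistic.

```r
# Values copied from the summary(mod10) output.
r2       <- 0.9941
k        <- 3      # number of predictors
resid_df <- 1139   # residual degrees of freedom (n - k - 1)

# Overall F statistic for H0: all slope coefficients are zero.
f_stat <- (r2 / k) / ((1 - r2) / resid_df)

# Upper-tail probability under the F distribution with (k, resid_df) df.
p_value <- pf(f_stat, df1 = k, df2 = resid_df, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

The computed statistic lands close to the reported 6.397e+04, and the p-value is far below any conventional significance level.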
In conclusion, this research project delved into the realm of Facebook ads analysis using multiple linear regression, aiming to uncover insights into the factors influencing ad performance. By scrutinizing metrics such as conversions, impressions, clicks, and ad spending, the study sought to illuminate the dynamics of digital marketing strategies. Utilizing a dataset sourced from Kaggle.com, encompassing variables such as ad IDs, demographic factors, and ad performance metrics, the analysis employed various statistical models including multiple linear regression and hypothesis testing.
The findings, encapsulated in the final model summary, indicate a robust linear relationship between the independent variables and ad spending. The high R-squared value of 0.9941 underscores the model’s efficacy in explaining approximately 99.41% of the variability in ad spending, reinforcing its reliability and relevance in guiding marketing decisions. This research contributes to a deeper understanding of digital advertising dynamics and provides valuable insights for advertisers and revenue generators to optimize their ad strategies effectively.
Based on the analysis conducted: