Semester Project

Author

Kyle Matheson

Introduction

The dataset simulates visits by users to a retail website across the United Kingdom. The users are split into two groups, A and B, which serve as the control group and the treatment group, respectively.

In this scenario, the color ‘White’ is assigned to Group A. It is the website's default background color and represents the control group. The color ‘Black’ is assigned to Group B. It is the newer setting being tested and represents the treatment group.

Data Set Includes:

User ID: Serves as an identifier for each user.

Group: Indicates whether the user belongs to the control group (A) or the treatment group (B).

Page Views: Number of pages the user viewed during their session.

Time Spent: The total amount of time, in seconds, that the user spent on the site during the session.

Conversion: Indicates whether a user has completed a desired action (Yes/No).

Device: Type of device used to access the website.

Location: The country within the UK where the user is based.

We will be exploring Page Views and Time Spent by Group.
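As a rough illustration of the layout described above, the sketch below builds a tiny stand-in data frame with the same column names. The rows, and the particular Device and Location values, are hypothetical and only show the expected shape. They are not taken from the real file.

```r
# Hypothetical stand-in rows: values are made up to illustrate the column layout only
ab_sketch <- data.frame(
  `User ID`    = c(1L, 2L, 3L, 4L),
  Group        = c("A", "B", "A", "B"),
  `Page Views` = c(3L, 9L, 12L, 5L),
  `Time Spent` = c(120L, 300L, 45L, 410L),   # seconds
  Conversion   = c("No", "Yes", "No", "No"),
  Device       = c("Mobile", "Desktop", "Desktop", "Mobile"),            # assumed labels
  Location     = c("England", "Scotland", "Wales", "Northern Ireland"),  # assumed labels
  check.names  = FALSE   # keep the spaces in the column names
)

# Every user should belong to exactly one of the two groups
stopifnot(all(ab_sketch$Group %in% c("A", "B")))

# The two engagement measures explored in this report, split by group
by_group <- split(ab_sketch[, c("Page Views", "Time Spent")], ab_sketch$Group)
by_group
```

Because the column names contain spaces, they have to be wrapped in backticks whenever they are used in a formula or with `$`.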

Design the Study

The purpose of the study is to test whether a new theme, such as black, makes a difference in user engagement. We use two measurements of engagement: the time each user spends on the site and the number of pages each user views. Users are separated by theme into two groups: Group A sees the white theme and Group B sees the black theme. The main goal is to understand whether the new Black theme brings a significant improvement in page views and time spent.

Collect the Data

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(mosaic)

Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'

The following object is masked from 'package:Matrix':

    mean

The following objects are masked from 'package:dplyr':

    count, do, tally

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    stat

The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var

The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum

library(rio)


Attaching package: 'rio'

The following object is masked from 'package:mosaic':

    factorize

library(ggplot2)
library(car)

Loading required package: carData

Attaching package: 'car'

The following objects are masked from 'package:mosaic':

    deltaMethod, logit

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

library(dplyr)

# Import the A/B testing data set

data <- import('~/Desktop/Math/W1&2/Student_Work/6-Semester_Project/ab_testing.csv')

Describe/Summarize the Data

Time Spent on Site

favstats(data$`Time Spent`~data$Group)

  data$Group min    Q1 median  Q3 max     mean       sd    n missing
1          A  40 137.5    241 343 449 241.7332 117.3400 2519       0
2          B  40 136.0    244 348 449 243.3039 119.1936 2481       0

ggplot(data, aes(x=`Group`, y=`Time Spent`)) +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  theme_minimal() +
  labs(title = "Group A vs B Time Spent", y = "Time (sec)")

Page Views

favstats(data$`Page Views`~data$Group)

  data$Group min Q1 median Q3 max     mean       sd    n missing
1          A   1  4      8 11  14 7.581580 4.080066 2519       0
2          B   1  4      8 11  14 7.492946 3.963448 2481       0

ggplot(data, aes(x=`Group`, y=`Page Views`)) +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  theme_minimal() +
  labs(title = "Group A vs B Page Views", y = "Page Views")

Summary

After exploring the data, it appears unlikely that there is a statistically significant difference between Groups A and B: the summary statistics and boxplots above are nearly identical for the two groups.

Make Inference

Independent Two-sample T-test

The null hypothesis for the independent two-sample test is:

\[H_0: \mu_A = \mu_B\]

The alternative hypothesis depends on the context of the research question. Here we only ask whether the two groups differ, so it is two-sided:

\[H_a: \mu_A \neq \mu_B\]

\[ \alpha = 0.05\]
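For reference, `t.test()` does not assume equal variances by default, so it runs the Welch version of the test. The statistic and its approximate degrees of freedom can be written as below, where \(\bar{x}\), \(s\), and \(n\) denote each group's sample mean, standard deviation, and size. This notation is introduced here for illustration.

\[t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}\]

\[df \approx \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{\left(s_A^2/n_A\right)^2}{n_A - 1} + \dfrac{\left(s_B^2/n_B\right)^2}{n_B - 1}}\]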

Time Spent on Site

# Perform the t-test
time_t_test <- t.test(data$`Time Spent`~data$Group, alternative = "two.sided")

time_t_test


    Welch Two Sample t-test

data:  data$`Time Spent` by data$Group
t = -0.46949, df = 4993.2, p-value = 0.6387
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
 -8.129309  4.987944
sample estimates:
mean in group A mean in group B 
       241.7332        243.3039

T: -0.46949

P-Value: 0.6387

DF: 4993.2

t.test(data$`Time Spent`~data$Group, conf.level = .95)$conf.int

[1] -8.129309  4.987944
attr(,"conf.level")
[1] 0.95

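I am 95% confident that the true difference in mean time spent (Group A minus Group B) is between -8.129309 and 4.987944 seconds. Because this interval contains 0, it is consistent with no difference between the groups.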
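As a cross-check, the Welch statistic, degrees of freedom, p-value, and confidence interval for Time Spent can be recomputed by hand from the `favstats()` summary alone. The sketch below uses only base R. The variable names are chosen for illustration, and the results should agree with the `t.test()` output up to rounding of the summary statistics.

```r
# Summary statistics copied from the favstats() output for Time Spent
mean_a <- 241.7332; sd_a <- 117.3400; n_a <- 2519
mean_b <- 243.3039; sd_b <- 119.1936; n_b <- 2481

# Variance of each group's sample mean
v_a <- sd_a^2 / n_a
v_b <- sd_b^2 / n_b

# Standard error of the difference in means and the Welch t statistic
se_diff <- sqrt(v_a + v_b)
t_stat  <- (mean_a - mean_b) / se_diff

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_a + v_b)^2 / (v_a^2 / (n_a - 1) + v_b^2 / (n_b - 1))

# Two-sided p-value and 95% confidence interval for the difference (A minus B)
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean_a - mean_b) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

round(c(t = t_stat, df = df_welch, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

The same arithmetic applies to Page Views by swapping in the means, standard deviations, and group sizes from its `favstats()` table.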

Page Views

# Perform the t-test
page_t_test <- t.test(data$`Page Views`~data$Group, alternative = "two.sided")

page_t_test


    Welch Two Sample t-test

data:  data$`Page Views` by data$Group
t = 0.77916, df = 4997, p-value = 0.4359
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
 -0.1343765  0.3116437
sample estimates:
mean in group A mean in group B 
       7.581580        7.492946

T: 0.77916

P-Value: 0.4359

DF: 4997

t.test(data$`Page Views`~data$Group, conf.level = .95)$conf.int

[1] -0.1343765  0.3116437
attr(,"conf.level")
[1] 0.95

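I am 95% confident that the true difference in mean page views (Group A minus Group B) is between -0.1343765 and 0.3116437 pages. Because this interval contains 0, it is consistent with no difference between the groups.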

Summary

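From the t-tests above, there is no statistically significant difference between Group A and Group B in either time spent on site or page views. We fail to reject the null hypothesis in both cases because each p-value (0.6387 for time spent and 0.4359 for page views) is greater than the alpha of 0.05. Failing to reject the null hypothesis does not prove the groups are equal; it only means the data show no evidence of a difference.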
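The decision rule used here can be sketched in a few lines of base R. The p-values are copied from the two `t.test()` outputs above. The object names are illustrative.

```r
alpha <- 0.05

# p-values reported by the two Welch t-tests above
p_values <- c(`Time Spent` = 0.6387, `Page Views` = 0.4359)

# Reject H0 only when the p-value falls below alpha
decision <- ifelse(p_values < alpha, "reject H0", "fail to reject H0")
decision
```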

Check Normality

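By the central limit theorem, the sampling distribution of each group's mean is approximately normal when the sample is large, even if the raw data are not normally distributed. Each group here has far more than 30 observations (2,519 in Group A and 2,481 in Group B), so the normality requirement of the t-test is satisfied and the results can be trusted.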
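A small simulation can illustrate this. The sketch below assumes, purely for illustration, that page views are drawn uniformly from the whole numbers 1 to 14. That population is clearly not normal, and it is a stand-in rather than the real data. The code repeatedly draws samples of about the same size as each group and checks how the sample means behave.

```r
set.seed(1)

n    <- 2500   # roughly the number of users in each group
reps <- 2000   # number of simulated samples

# Assumed stand-in population: whole numbers 1..14, all equally likely
sample_means <- replicate(reps, mean(sample.int(14, n, replace = TRUE)))

# Theoretical mean and standard error for this stand-in population
pop_mean       <- (1 + 14) / 2
theoretical_se <- sqrt((14^2 - 1) / 12 / n)

# If the sample means are roughly normal, about 95% of them should fall
# within 1.96 standard errors of the population mean
coverage <- mean(abs(sample_means - pop_mean) <= 1.96 * theoretical_se)

c(mean_of_means  = mean(sample_means),
  sd_of_means    = sd(sample_means),
  theoretical_se = theoretical_se,
  coverage       = coverage)
```

If the normal approximation holds, the spread of the simulated means should sit close to the theoretical standard error and the coverage should be near 0.95, even though the individual values are far from bell-shaped.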

Conclusion

Based on the analysis above, we conclude that there is no statistically significant difference between Group A and Group B. In other words, the White and Black themes do not differ significantly in user engagement as measured by time spent and page views. For a company, this suggests that investing time in an additional theme is unlikely to increase time spent or page views. However, this does not rule out that an additional theme, such as a black theme, could benefit other aspects of the business or other measures of user engagement.