KNN Model

Predictions ABV & IBU

Case Study 01 - Beers and Breweries - Budweiser

1. How many breweries are present in each state?

## 
##  AK  AL  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY  LA  MA 
##   7   3   2  11  39  47   8   1   2  15   7   4   5   5  18  22   3   4   5  23 
##  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK  OR  PA  RI 
##   7   9  32  12   9   2   9  19   1   5   3   3   4   2  16  15   6  29  25   5 
##  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##   4   1   3  28   4  16  10  23  20   1   4
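
The code behind the counts above is not shown, so here is a minimal sketch of how such a table can be produced. It assumes the breweries data frame is called `Brew` (the name used in the merge below) and has a `State` column. The rows are made up for illustration and are not the real data.

```r
# Toy stand-in for the breweries data (hypothetical rows, not the real file)
Brew <- data.frame(
  Brew_ID = 1:6,
  State   = c(" MN", " CO", " CO", " TX", " CO", " TX"),
  stringsAsFactors = FALSE
)

# The tibble output later in this report shows state codes such as " CO",
# so we assume a leading space and trim it before counting
Brew$State <- trimws(Brew$State)

# Number of breweries per state
state_counts <- table(Brew$State)
print(state_counts)

# Same counts, sorted from most to fewest breweries
print(sort(state_counts, decreasing = TRUE))
```

Sorting the table makes it easier to see which states dominate than the alphabetical layout does.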

2. The first 6 and last 6 observations of the merged Beers and Breweries data

#Change Column Name
colnames(Beers)[5]= "Brew_ID"

#Merge Beer and Breweries together
brewbeer = merge(Beers,Brew, by="Brew_ID")

# First Six Observations
head(brewbeer)
##   Brew_ID        Name.x Beer_ID   ABV IBU                               Style
## 1       1  Get Together    2692 0.045  50                        American IPA
## 2       1 Maggie's Leap    2691 0.049  26                  Milk / Sweet Stout
## 3       1    Wall's End    2690 0.048  19                   English Brown Ale
## 4       1       Pumpion    2689 0.060  38                         Pumpkin Ale
## 5       1    Stronghold    2688 0.060  25                     American Porter
## 6       1   Parapet ESB    2687 0.056  47 Extra Special / Strong Bitter (ESB)
##   Ounces             Name.y        City State
## 1     16 NorthGate Brewing  Minneapolis    MN
## 2     16 NorthGate Brewing  Minneapolis    MN
## 3     16 NorthGate Brewing  Minneapolis    MN
## 4     16 NorthGate Brewing  Minneapolis    MN
## 5     16 NorthGate Brewing  Minneapolis    MN
## 6     16 NorthGate Brewing  Minneapolis    MN
# Last Six Observations
tail(brewbeer)
##      Brew_ID                    Name.x Beer_ID   ABV IBU
## 2405     556             Pilsner Ukiah      98 0.055  NA
## 2406     557  Heinnieweisse Weissebier      52 0.049  NA
## 2407     557           Snapperhead IPA      51 0.068  NA
## 2408     557         Moo Thunder Stout      50 0.049  NA
## 2409     557         Porkslap Pale Ale      49 0.043  NA
## 2410     558 Urban Wilderness Pale Ale      30 0.049  NA
##                        Style Ounces                        Name.y          City
## 2405         German Pilsener     12         Ukiah Brewing Company         Ukiah
## 2406              Hefeweizen     12       Butternuts Beer and Ale Garrattsville
## 2407            American IPA     12       Butternuts Beer and Ale Garrattsville
## 2408      Milk / Sweet Stout     12       Butternuts Beer and Ale Garrattsville
## 2409 American Pale Ale (APA)     12       Butternuts Beer and Ale Garrattsville
## 2410        English Pale Ale     12 Sleeping Lady Brewing Company     Anchorage
##      State
## 2405    CA
## 2406    NY
## 2407    NY
## 2408    NY
## 2409    NY
## 2410    AK

3. Address the missing values in each column.

After comparing results with my team, I found that about 1000 of the 2410 beers in the merged data are missing an IBU value. Missingness on this scale can significantly interfere with our model. We therefore impute the missing ABV and IBU values with their medians before running a KNN model.

hello_NA = brewbeer[!complete.cases(brewbeer),]
dim(hello_NA)
## [1] 1005   10
head(hello_NA)
##    Brew_ID          Name.x Beer_ID   ABV IBU                            Style
## 17       2  Kamen Knuddeln    2676 0.065  NA                American Wild Ale
## 35       6      Blackbeard    2657 0.093  NA American Double / Imperial Stout
## 36       6        Rye Knot    2656 0.062  NA               American Brown Ale
## 37       6        Dead Arm    2655 0.060  NA          American Pale Ale (APA)
## 38       6 32°/50° Kölsch     2654 0.048  NA                           Kölsch
## 39       6          HopArt    2653 0.077  NA                     American IPA
##    Ounces                    Name.y       City State
## 17     16 Against the Grain Brewery Louisville    KY
## 35     12     COAST Brewing Company Charleston    SC
## 36     12     COAST Brewing Company Charleston    SC
## 37     12     COAST Brewing Company Charleston    SC
## 38     16     COAST Brewing Company Charleston    SC
## 39     16     COAST Brewing Company Charleston    SC
library(naniar)
gg_miss_var(brewbeer)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

There are over 1000 missing values in IBU, 62 in ABV (see the summary in question 6), and 5 in Style. We impute the missing ABV and IBU values with the column median for our KNN prediction model.
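
The imputation code itself is not shown in this report. The "object of type impute" message further down suggests a helper such as Hmisc's `impute()` was used, but that is only an inference. As a sketch under that caveat, median imputation can be written in base R as follows; the data frame `beers` and the function `impute_median` are hypothetical names with made-up values.

```r
# Toy data with missing values (hypothetical, for illustration only)
beers <- data.frame(
  ABV = c(0.045, NA, 0.060, 0.056, 0.070),
  IBU = c(50, 26, NA, NA, 38)
)

# Replace each NA with the median of the observed values in that column
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

beers_imputed <- beers
beers_imputed$ABV <- impute_median(beers$ABV)
beers_imputed$IBU <- impute_median(beers$IBU)

print(colSums(is.na(beers)))          # missing counts before imputation
print(colSums(is.na(beers_imputed)))  # missing counts after imputation
```

Keep in mind that filling roughly 40% of a column with a single value pulls many beers onto the same IBU, which can blur the class boundaries a KNN model relies on.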

4. Compute the median alcohol content and international bitterness unit for each state. Plot a bar chart to compare.

## Warning: Removed 1 rows containing missing values (position_stack).
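
The chart code is hidden, so this is only a sketch of the per-state medians that would feed the bar chart. The data frame `beers` and its values are made up. The "Removed 1 rows" warning above is consistent with one state having no recorded IBU at all, which the toy state "SD" reproduces here.

```r
# Hypothetical merged data: one row per beer
beers <- data.frame(
  State = c("CO", "CO", "CO", "SD", "SD", "TX", "TX"),
  ABV   = c(0.050, 0.070, 0.065, 0.055, 0.060, 0.048, 0.052),
  IBU   = c(20, NA, 60, NA, NA, 35, 45)
)

# Median ABV and IBU per state, ignoring missing values within each state
state_medians <- aggregate(
  beers[c("ABV", "IBU")],
  by  = list(State = beers$State),
  FUN = median, na.rm = TRUE
)
print(state_medians)

# A state whose IBU values are all missing ends up with an NA median,
# which a bar chart would then drop with a warning
print(state_medians$State[is.na(state_medians$IBU)])
```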

5. Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?

Colorado has the beer with the highest alcohol content, at 0.128 ABV.

Oregon has the most bitter beer, at 138 IBU.

## # A tibble: 6 × 2
##   State max_alc
##   <chr>   <dbl>
## 1 " CO"   0.128
## 2 " KY"   0.125
## 3 " IN"   0.12 
## 4 " NY"   0.1  
## 5 " CA"   0.099
## 6 " ID"   0.099
## # A tibble: 6 × 2
##   State max_ibu
##   <chr>   <dbl>
## 1 " OR"     138
## 2 " VA"     135
## 3 " MA"     130
## 4 " OH"     126
## 5 " MN"     120
## 6 " VT"     120

6. Comment on the summary statistics and distribution of the ABV variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00100 0.05000 0.05600 0.05977 0.06700 0.12800      62

The mean ABV is 0.05977, with a maximum of 0.128 and a minimum of 0.001, so the range is about 0.127. The distribution has a mode around 0.05, and the histogram is right-skewed, which is consistent with the mean (0.0598) sitting above the median (0.056).
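
As a quick check of the arithmetic, the range and the direction of skew can be read straight off the summary output. The variable names below are ours; the numbers are copied from the summary above.

```r
# Values copied from the summary() output above
abv_min    <- 0.00100
abv_median <- 0.05600
abv_mean   <- 0.05977
abv_max    <- 0.12800

# Spread between the weakest and the strongest beer
abv_range <- abv_max - abv_min

# A mean above the median is a common (not conclusive) sign of right skew
mean_above_median <- abv_mean > abv_median

print(abv_range)
print(mean_above_median)
```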

7. Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.

library(ggplot2)
library(dplyr)  # provides the %>% pipe used below
brewbeer %>% ggplot(aes(x = ABV, y = IBU)) + 
  geom_point(color = "blue")
## Warning: Removed 1005 rows containing missing values (geom_point).

scatter.smooth(x=brewbeer$ABV, y=brewbeer$IBU, main = "Mild positive linear relationship",xlab = "ABV", ylab = "IBU", col="blue")

The graph above compares the standardized ABV and IBU values, and the trend shows a mild positive linear relationship between bitterness and alcohol content (3rd-degree polynomial fit). Thus, there is an apparent relationship: as ABV increases, IBU tends to increase as well. Not every beer follows this pattern, though; for example, some beers with ABV around 0.09 have a bitterness of only about 2. However, the majority of our data sits in the center of the plot, where the positive linear relationship is clearer.

8. Budweiser KNN model ABV and IBU to predict beer style

As in the scatter plot above, we standardize our continuous variables (ABV and IBU) so that both are on a comparable scale for KNN.

## [1] 5
## [1] 0.5462379

k = 5 gives the maximum accuracy (about 0.546).
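
The tuning code is hidden, so the following is a sketch of one common way to pick k: fit KNN for a grid of k values and keep the one with the best held-out accuracy. Everything here is an assumption. The data are synthetic, the 70/30 split is a guess (the confusion matrix below covers 723 beers, about 30% of 2410), and the grid of 1 to 30 is arbitrary. `knn()` comes from the `class` package, which this report also uses.

```r
library(class)  # provides knn(); a recommended package bundled with most R installs

set.seed(1)

# Hypothetical, well-separated toy classes standing in for ALE / IPA / Other
n_per_class <- 60
centers <- rbind(ALE = c(-2, -2), IPA = c(2, 2), Other = c(2, -2))
features <- do.call(rbind, lapply(rownames(centers), function(cl) {
  cbind(ABV = rnorm(n_per_class, centers[cl, 1], 0.8),
        IBU = rnorm(n_per_class, centers[cl, 2], 0.8))
}))
type <- factor(rep(rownames(centers), each = n_per_class))

# 70/30 train/test split
train_idx <- sample(nrow(features), size = round(0.7 * nrow(features)))
train_x <- features[train_idx, ]
test_x  <- features[-train_idx, ]
train_y <- type[train_idx]
test_y  <- type[-train_idx]

# Held-out accuracy for each candidate k
k_grid <- 1:30
accuracy <- sapply(k_grid, function(k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

best_k <- k_grid[which.max(accuracy)]
print(best_k)         # k with the highest held-out accuracy
print(max(accuracy))  # the accuracy reached at that k
```

Because the best k is chosen on a single split, the accuracy it reports is somewhat optimistic; repeating the split or using cross-validation gives a steadier estimate.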

##                
## classifications ALE IPA Other
##           ALE   154  31   147
##           IPA    41 124    43
##           Other  66  13   104
## Confusion Matrix and Statistics
## 
##                
## classifications ALE IPA Other
##           ALE   154  31   147
##           IPA    41 124    43
##           Other  66  13   104
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5284          
##                  95% CI : (0.4912, 0.5653)
##     No Information Rate : 0.4066          
##     P-Value [Acc > NIR] : 2.727e-11       
##                                           
##                   Kappa : 0.2902          
##                                           
##  Mcnemar's Test P-Value : 1.872e-10       
## 
## Statistics by Class:
## 
##                      Class: ALE Class: IPA Class: Other
## Sensitivity              0.5900     0.7381       0.3537
## Specificity              0.6147     0.8486       0.8159
## Pos Pred Value           0.4639     0.5962       0.5683
## Neg Pred Value           0.7263     0.9146       0.6481
## Prevalence               0.3610     0.2324       0.4066
## Detection Rate           0.2130     0.1715       0.1438
## Detection Prevalence     0.4592     0.2877       0.2531
## Balanced Accuracy        0.6024     0.7934       0.5848

We ran the classification with k = 5; the confusion matrix and per-class statistics are shown above. Accuracy is about 53% on this run (0.5284), against 54.6% at best during tuning.
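
To make the statistics above less of a black box, here is a sketch that recomputes a few of them by hand from the confusion matrix. The counts are copied from the output above; the variable names are ours. Rows are the predicted class and columns the actual class, which is the reading that reproduces the reported sensitivities.

```r
# Confusion matrix copied from the output above (rows = predicted, columns = actual)
cm <- matrix(
  c(154,  31, 147,
     41, 124,  43,
     66,  13, 104),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = c("ALE", "IPA", "Other"),
                  actual    = c("ALE", "IPA", "Other"))
)

accuracy    <- sum(diag(cm)) / sum(cm)     # share of all beers classified correctly
sensitivity <- diag(cm) / colSums(cm)      # per class: correct / actual members
pos_pred    <- diag(cm) / rowSums(cm)      # per class: correct / predicted members
nir         <- max(colSums(cm)) / sum(cm)  # accuracy of always guessing the largest class

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(pos_pred, 4))
print(round(nir, 4))
```

The model beats the no-information rate, but the "Other" class is recovered only about a third of the time, which is where most of the lost accuracy comes from.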

Our KNN Model for Predictions

Now that our KNN model is running with its most accurate setting, k = 5, we can take any values of IBU and ABV and predict the type of beer.

### Example:
#Input
bitterness = 100
alcohol = 0.089

#scale
scaled_center_bitterness = mean(brewbeerimpute$IBU)
scaled_scale_bitterness = sd(brewbeerimpute$IBU)
scaled_center_alc = mean(brewbeerimpute$ABV)
scaled_scale_alc = sd(brewbeerimpute$ABV)

y=(bitterness-scaled_center_bitterness)/scaled_scale_bitterness
x=(alcohol-scaled_center_alc)/scaled_scale_alc
test1= c(x,y)
knn(train[,c(1,2)],test1,train$Type, prob = TRUE, k = 5)
## [1] IPA
## attr(,"prob")
## [1] 1
## Levels: ALE IPA Other
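
The key step in the example above is that a new beer must be scaled with the mean and standard deviation of the training data, not with its own. A minimal sketch of that step as a reusable helper follows; `scale_like_training` and the toy vectors are hypothetical and only stand in for `brewbeerimpute$ABV` and `brewbeerimpute$IBU`.

```r
# Hypothetical training values standing in for brewbeerimpute$ABV and $IBU
train_abv <- c(0.045, 0.050, 0.056, 0.060, 0.067, 0.089)
train_ibu <- c(19, 25, 35, 47, 64, 100)

# z-score a new value using the centre and spread of the training data
scale_like_training <- function(new_value, training_values) {
  (new_value - mean(training_values)) / sd(training_values)
}

new_point <- c(
  ABV = scale_like_training(0.089, train_abv),
  IBU = scale_like_training(100, train_ibu)
)
print(new_point)

# Sanity check: this matches what scale() computes for the same observation
scaled_abv <- as.numeric(scale(train_abv))
stopifnot(isTRUE(all.equal(new_point[["ABV"]], scaled_abv[6])))
```

The resulting vector must list its values in the same column order as the training matrix passed to `knn()`, otherwise ABV and IBU are silently swapped.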

9. If Budweiser were to release a beer for Texas, what would be the best type based on our KNN model and market analysis? We first run a market share analysis based on the number of beers each brewery produces, take the average ABV and IBU of the top three breweries, and then run our KNN model on those values to determine the type of beer for the Texas market.

##   Brew_ID  n                       Name         City State
## 1      11 62             Brewery Vivant Grand Rapids    MI
## 2      26 38   Sun King Brewing Company Indianapolis    IN
## 3     167 33        Oskar Blues Brewery     Longmont    CO
## 4     142 25 Cigar City Brewing Company        Tampa    FL
## 5      47 24        Sixpoint Craft Ales     Brooklyn    NY
## 6      81 23     Hopworks Urban Brewery     Portland    OR

The three breweries with the most beers are Brewery Vivant in MI (#1), Sun King Brewing Company in IN (#2), and Oskar Blues Brewery in CO (#3).

# texas beers abv and ibu
hello_texas=filter(brewbeerimpute, grepl('TX', State))
hello_texas %>% ggplot(aes(x = ABV, y = IBU, color=Type)) + 
geom_point() + ggtitle("Beers in Texas") +
stat_ellipse()
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

hello_texas %>% ggplot(aes(x=Type,fill=Type)) + geom_bar()

What is the best, optimized ABV and IBU for the Texas market compared to the best three markets?

Averaging over the top three breweries gives a mean ABV of 0.07107 and a mean IBU of 54.65313.
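
The code for this averaging step is not shown. One plausible reading, sketched below, is to count beers per brewery, keep the three largest, and average ABV and IBU over all of their beers. Whether the original averaged per beer or per brewery is an assumption on our part, and the data are made up.

```r
# Hypothetical merged data: one row per beer (values are made up)
beers <- data.frame(
  Brew_ID = c(11, 11, 11, 11, 26, 26, 26, 167, 167, 5, 9),
  ABV     = c(0.060, 0.070, 0.080, 0.090, 0.050, 0.060, 0.070,
              0.080, 0.100, 0.045, 0.050),
  IBU     = c(40, 50, 60, 70, 30, 40, 50, 60, 80, 15, 20)
)

# Count beers per brewery and keep the three largest
beer_counts <- sort(table(beers$Brew_ID), decreasing = TRUE)
top3_ids <- as.numeric(names(beer_counts)[1:3])

# Average ABV and IBU over every beer made by those three breweries
top3 <- beers[beers$Brew_ID %in% top3_ids, ]
target <- colMeans(top3[, c("ABV", "IBU")])

print(top3_ids)
print(target)
```

Averaging per beer weights the largest brewery most heavily; averaging the three brewery means instead would treat them equally and can give a different target.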

If we were to make a beer for Texas with these mean values (ABV 0.07107, IBU 54.65313), which type would our KNN model predict?

#Input
bitterness = 54.65313
alcohol = 0.07107107

#scale
scaled_center_bitterness = mean(brewbeerimpute$IBU)
scaled_scale_bitterness = sd(brewbeerimpute$IBU)
scaled_center_alc = mean(brewbeerimpute$ABV)
scaled_scale_alc = sd(brewbeerimpute$ABV)

y=(bitterness-scaled_center_bitterness)/scaled_scale_bitterness
x=(alcohol-scaled_center_alc)/scaled_scale_alc
test1= c(x,y)
knn(train[,c(1,2)],test1,train$Type, prob = TRUE, k = 7)
## [1] IPA
## attr(,"prob")
## [1] 0.5714286
## Levels: ALE IPA Other

Using our KNN model with an ABV of 0.07107107 and an IBU of 54.65313, we conclude that this beer would be an IPA. Note that this call uses k = 7 rather than the tuned k = 5, and the reported probability of 0.571 means that 4 of the 7 nearest neighbors were IPAs. However, the bar chart of Texas beers above shows that IPAs are the least-produced type there. Several factors could explain this mismatch. The first is that the top three breweries may lean toward IPAs, so the ABV and IBU values we chose may be biased toward that type.