This post was semi automatically converted from blogdown to Quarto and may contain errors. The original can be found in the archive.
In this post, I will use the results of the exploratory analysis from the previous post and try to predict the position of players in FIFA 18 using different machine learning algorithms.
As a quick reminder, these were the figures we obtained using PCA, t-SNE and a self organizing map.
#used packages
library(tidyverse) # for data wrangling
library(hrbrthemes) # nice themes for ggplot
library(caret) # ML algorithms
Data
The data we use comes from Kaggle and contains around 18,000 players of the game FIFA 18 with 75 features per player. Check the previous post. for the data preparation steps.
After the preprocessing, the data has the following structure.
glimpse(fifa_tbl)
## Observations: 17,981
## Variables: 36
## $ Acceleration <int> 89, 92, 94, 88, 58, 79, 57, 93, 60, 78, 7...
## $ Aggression <int> 63, 48, 56, 78, 29, 80, 38, 54, 60, 50, 8...
## $ Agility <int> 89, 90, 96, 86, 52, 78, 60, 93, 71, 75, 7...
## $ Balance <int> 63, 95, 82, 60, 35, 80, 43, 91, 69, 69, 6...
## $ `Ball control` <int> 93, 95, 95, 91, 48, 89, 42, 92, 89, 85, 8...
## $ Composure <int> 95, 96, 92, 83, 70, 87, 64, 87, 85, 86, 8...
## $ Crossing <int> 85, 77, 75, 77, 15, 62, 17, 80, 85, 68, 6...
## $ Curve <int> 81, 89, 81, 86, 14, 77, 21, 82, 85, 74, 7...
## $ Dribbling <int> 91, 97, 96, 86, 30, 85, 18, 93, 79, 84, 6...
## $ Finishing <int> 94, 95, 89, 94, 13, 91, 13, 83, 76, 91, 6...
## $ `Free kick accuracy` <int> 76, 90, 84, 84, 11, 84, 19, 79, 84, 62, 6...
## $ `GK diving` <int> 7, 6, 9, 27, 91, 15, 90, 11, 10, 5, 11, 1...
## $ `GK handling` <int> 11, 11, 9, 25, 90, 6, 85, 12, 11, 12, 8, ...
## $ `GK kicking` <int> 15, 15, 15, 31, 95, 12, 87, 6, 13, 7, 9, ...
## $ `GK positioning` <int> 14, 14, 15, 33, 91, 8, 86, 8, 7, 5, 7, 10...
## $ `GK reflexes` <int> 11, 8, 11, 37, 89, 10, 90, 8, 10, 10, 11,...
## $ `Heading accuracy` <int> 88, 71, 62, 77, 25, 85, 21, 57, 54, 86, 9...
## $ Interceptions <int> 29, 22, 36, 41, 30, 39, 30, 41, 85, 20, 8...
## $ Jumping <int> 95, 68, 61, 69, 78, 84, 67, 59, 32, 79, 9...
## $ `Long passing` <int> 77, 87, 75, 64, 59, 65, 51, 81, 93, 59, 7...
## $ `Long shots` <int> 92, 88, 77, 86, 16, 83, 12, 82, 90, 82, 5...
## $ Marking <int> 22, 13, 21, 30, 10, 25, 13, 25, 63, 12, 8...
## $ Penalties <int> 85, 74, 81, 85, 47, 81, 40, 86, 73, 70, 6...
## $ Positioning <int> 95, 93, 90, 92, 12, 91, 12, 85, 79, 92, 5...
## $ Reactions <int> 96, 95, 88, 93, 85, 91, 88, 85, 86, 88, 8...
## $ `Short passing` <int> 83, 88, 81, 83, 55, 83, 50, 86, 90, 75, 7...
## $ `Shot power` <int> 94, 85, 80, 87, 25, 88, 31, 79, 87, 88, 7...
## $ `Sliding tackle` <int> 23, 26, 33, 38, 11, 19, 13, 22, 69, 18, 9...
## $ `Sprint speed` <int> 91, 87, 90, 77, 61, 83, 58, 87, 52, 80, 7...
## $ Stamina <int> 92, 73, 78, 89, 44, 79, 40, 79, 77, 72, 8...
## $ `Standing tackle` <int> 31, 28, 24, 45, 10, 42, 21, 27, 82, 22, 8...
## $ Strength <int> 80, 59, 53, 80, 83, 84, 64, 65, 74, 85, 8...
## $ Vision <int> 85, 90, 80, 84, 70, 78, 68, 86, 88, 70, 6...
## $ Volleys <int> 88, 85, 83, 88, 11, 87, 13, 79, 82, 88, 6...
## $ position <fctr> ST, RW, LW, ST, GK, ST, GK, LW, CDM, ST,...
## $ position2 <fctr> O, O, O, O, GK, O, GK, O, M, O, D, M, GK...
Machine Learning in R and the Caret Package
If you have looked into machine learning in R or own a copy of this book, you may have noticed that there are a gazillion packages for it. This can be quite overwhelming as a start, also given that most packages use different syntax and data structures. Fortunately, there is the comprehensive caret
package that brings together many of these packages in a standardized way. At the time of writing this post, the package contained 238(!!!) implementations of machine learning algorithms. Knowing this package is thus enough to solve almost any supervised problem. It comes with a uniform interface to several machine learning algorithms and standardizes tasks such as
- Data splitting
- Data pre-processing
- Feature selection
- Variable importance estimation
There is an extensive manual for it and I can only recommend to take a look at it. I also found this introduction quite helpful.
Creating Training and Test Sets
Once the data is prepared for an analysis, you should split it into a training and test set. As far as I understand, there is no golden standard on the ratio. I have seen 70/30, 75/25 and 90/10 as training/test split. I will go with the Machine Learning with R book and take an 80/20 split. You may also want to consider this stackoverflow question.
Splitting the data can be done with the createDataPartition()
function. The function tries to balance out the class distribution so that they are similar in both the training and test set. The parameter p controls the split ratio.
set.seed(6185)
<- createDataPartition(fifa_tbl$position2,p = 0.8,list = FALSE)
train_sample <- fifa_tbl[train_sample,] %>% select(-position)
train_data <- fifa_tbl[-train_sample,] %>% select(-position) test_data
position | GK | D | M | O |
training set | 0.11 | 0.30 | 0.40 | 0.19 |
test set | 0.11 | 0.3 | 0.4 | 0.19 |
Notice how evenly the data is split.
Training a Model
So now we are good to go for some machine learning mastery. As I mentioned before, the caret package implements more than 200 methods making it hard to choose the “right” one. I settled for four rather common methods listed below, together with some helpful links.
- K-Nearest Neighbors (Wiki, 1, 2)
- Random Forest (Wiki, 1)
- Support Vector Machine (Wiki, 1, 2)
- Neural Network (Wiki, 1)
I will not spend any time on the details of these algorithms, since others did a much greater job in introducing them.
Before any of these algorithms is trained, we have to set some basic control parameters for the training step. This is done with the function trainControl()
. There are many different arguments that can be set, where the following are the most important for our task¹.
- method defines the resampling method during training. We here use a repeated k-fold cross validation.
- number sets the “k” in k-fold. We will use the standard 10 here.
- repeats defines how often the cross validation should be repeated (3 times).
The train()
function is used to train all our models. Most of its arguments are self-explanatory, the once that are not are listed below.
- preProcess is used to scale and center the data.
- tuneGrid defines a grid of hyperparameters to be used in the training. At the end of the cross-validation, the best set of hyperparameters is chosen for the final model.²
After the training, we can check the performance on the training set with the function predict()
. The resulting performance can then be viewed with the function confusionMatrix()
.
K-Nearest Neighbors
The only hyperparameter of this model is k, the number of considered neighbors.
<- trainControl(method = "repeatedcv", number = 10, repeats = 3)
train_control <- expand.grid(.k=seq(20,120,5))
grid_knn <- train(position2~., data=train_data, method = "knn",
fifa_knn trControl = train_control,preProcess = c("center","scale"),
tuneGrid = grid_knn)
The summary of the training is given below.
fifa_knn
## k-Nearest Neighbors
##
## 14387 samples
## 34 predictor
## 4 classes: 'GK', 'D', 'M', 'O'
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 12948, 12947, 12948, 12949, 12948, 12949, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 20 0.8199772 0.7415707
## 25 0.8222250 0.7447378
## 30 0.8220862 0.7443668
## 35 0.8236380 0.7465378
## 40 0.8229891 0.7455043
## 45 0.8222942 0.7444655
## 50 0.8216221 0.7434531
## 55 0.8216686 0.7434513
## 60 0.8222477 0.7442770
## 65 0.8225490 0.7446787
## 70 0.8222019 0.7441699
## 75 0.8215065 0.7430829
## 80 0.8214832 0.7430529
## 85 0.8205563 0.7416563
## 90 0.8206723 0.7418073
## 95 0.8206489 0.7417507
## 100 0.8201158 0.7409817
## 105 0.8192590 0.7397423
## 110 0.8191429 0.7395715
## 115 0.8195599 0.7401216
## 120 0.8193513 0.7398058
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 35.
The highest accuracy is achieved for k=35, which can also be seen by plotting the result.
plot(fifa_knn)
<- predict(fifa_knn,newdata = test_data)
fifa_knn_predict confusionMatrix(fifa_knn_predict,test_data$position2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction GK D M O
## GK 405 0 0 0
## D 0 920 152 7
## M 0 167 1168 207
## O 0 1 114 453
##
## Overall Statistics
##
## Accuracy : 0.8197
## 95% CI : (0.8067, 0.8321)
## No Information Rate : 0.399
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7409
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: GK Class: D Class: M Class: O
## Sensitivity 1.0000 0.8456 0.8145 0.6792
## Specificity 1.0000 0.9366 0.8269 0.9607
## Pos Pred Value 1.0000 0.8526 0.7575 0.7975
## Neg Pred Value 1.0000 0.9332 0.8704 0.9293
## Prevalence 0.1127 0.3027 0.3990 0.1856
## Detection Rate 0.1127 0.2560 0.3250 0.1260
## Detection Prevalence 0.1127 0.3002 0.4290 0.1580
## Balanced Accuracy 1.0000 0.8911 0.8207 0.8199
Random Forest
For the random forest, we have to decide on how many features should be used for each split in the decision try. A rule of thumb is to used the square root of the number of features. After some trial-and-error, I settled with 4, 8, 12, 16 and 20.
<- trainControl(method = "repeatedcv",number = 10, repeats = 3)
train_control <- expand.grid(.mtry=c(4,8,12,16,20))
grid_rf <- train(position2~., data=train_data, method = "rf",
fifa_rf trControl = train_control,preProcess = c("center","scale"),
tuneGrid = grid_rf)
fifa_rf
## Random Forest
##
## 14387 samples
## 34 predictor
## 4 classes: 'GK', 'D', 'M', 'O'
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 12949, 12948, 12949, 12949, 12948, 12947, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 4 0.8367979 0.7664186
## 8 0.8372147 0.7671352
## 12 0.8385584 0.7691039
## 16 0.8386507 0.7692242
## 20 0.8382106 0.7686364
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.
plot(fifa_rf)
<- predict(fifa_rf,newdata = test_data)
fifa_rf_predict confusionMatrix(fifa_rf_predict,test_data$position2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction GK D M O
## GK 405 0 0 0
## D 0 957 162 7
## M 0 129 1149 186
## O 0 2 123 474
##
## Overall Statistics
##
## Accuracy : 0.8306
## 95% CI : (0.8179, 0.8427)
## No Information Rate : 0.399
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7576
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: GK Class: D Class: M Class: O
## Sensitivity 1.0000 0.8796 0.8013 0.7106
## Specificity 1.0000 0.9326 0.8542 0.9573
## Pos Pred Value 1.0000 0.8499 0.7848 0.7913
## Neg Pred Value 1.0000 0.9469 0.8662 0.9356
## Prevalence 0.1127 0.3027 0.3990 0.1856
## Detection Rate 0.1127 0.2663 0.3197 0.1319
## Detection Prevalence 0.1127 0.3133 0.4073 0.1667
## Balanced Accuracy 1.0000 0.9061 0.8277 0.8340
Support Vector Machines
We use a linear SVM here. If you are interested in non-linear ones, you should familiarize yourself with the Kernel trick. The hyperparameter to be set is a cost value, which determines the margin around the separating hyperplanes (the bigger the value, the smaller the margin).
<- trainControl(method = "repeatedcv", number = 10, repeats = 3)
train_control <- expand.grid(.cost=c(0.75, 0.9, 1, 1.1, 1.25,2))
grid_svm <- train(position2 ~., data = train_data, method = "svmLinear2",
fifa_svm_linear trControl=train_control,
preProcess = c("center", "scale"),
tuneGrid = grid_svm)
fifa_svm_linear
## Support Vector Machines with Linear Kernel
##
## 14387 samples
## 34 predictor
## 4 classes: 'GK', 'D', 'M', 'O'
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 12949, 12949, 12949, 12949, 12947, 12949, ...
## Resampling results across tuning parameters:
##
## cost Accuracy Kappa
## 0.75 0.8333450 0.7614291
## 0.90 0.8333682 0.7614680
## 1.00 0.8334145 0.7615323
## 1.10 0.8333682 0.7614632
## 1.25 0.8332987 0.7613696
## 2.00 0.8330671 0.7610370
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cost = 1.
plot(fifa_svm_linear)
<- predict(fifa_svm_linear,newdata = test_data)
fifa_svm_linear_predict confusionMatrix(fifa_svm_linear_predict,test_data$position2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction GK D M O
## GK 405 0 0 0
## D 0 945 153 8
## M 0 140 1164 194
## O 0 3 117 465
##
## Overall Statistics
##
## Accuracy : 0.8289
## 95% CI : (0.8162, 0.8411)
## No Information Rate : 0.399
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7547
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: GK Class: D Class: M Class: O
## Sensitivity 1.0000 0.8686 0.8117 0.6972
## Specificity 1.0000 0.9358 0.8454 0.9590
## Pos Pred Value 1.0000 0.8544 0.7770 0.7949
## Neg Pred Value 1.0000 0.9425 0.8712 0.9329
## Prevalence 0.1127 0.3027 0.3990 0.1856
## Detection Rate 0.1127 0.2629 0.3239 0.1294
## Detection Prevalence 0.1127 0.3077 0.4168 0.1628
## Balanced Accuracy 1.0000 0.9022 0.8285 0.8281
Neural Network
To train a neural network, we need to decide on the number of hidden nodes. The more you choose, the more complex the model (but not necessarily better!).
<- trainControl(method = "repeatedcv",number = 10, repeats = 3)
train_control <- expand.grid(.decay=c(0.2,0.5,0.7),.size=c(1,5,10,15))
grid_nnet <- train(position2~., data=train_data, method = "nnet",
fifa_nnet trControl = train_control, preProcess = c("center","scale"),
tuneGrid = grid_nnet, maxit = 500, abstol=1e-2)
fifa_nnet
## Neural Network
##
## 14387 samples
## 34 predictor
## 4 classes: 'GK', 'D', 'M', 'O'
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 12948, 12949, 12948, 12949, 12948, 12948, ...
## Resampling results across tuning parameters:
##
## decay size Accuracy Kappa
## 0.00 1 0.7963667 0.7100156
## 0.00 5 0.8301466 0.7570811
## 0.00 10 0.8170800 0.7387677
## 0.00 15 0.8030628 0.7187341
## 0.00 20 0.7938647 0.7063276
## 0.01 1 0.7930771 0.7049725
## 0.01 5 0.8320470 0.7598107
## 0.01 10 0.8223396 0.7462697
## 0.01 15 0.8170799 0.7387687
## 0.01 20 0.8082991 0.7262713
## 0.10 1 0.7961579 0.7102102
## 0.10 5 0.8319543 0.7596776
## 0.10 10 0.8291743 0.7559585
## 0.10 15 0.8215055 0.7451117
## 0.10 20 0.8138834 0.7342370
## 0.20 1 0.7943741 0.7078687
## 0.20 5 0.8338773 0.7624642
## 0.20 10 0.8298462 0.7568273
## 0.20 15 0.8233824 0.7476862
## 0.20 20 0.8179835 0.7400740
## 0.50 1 0.7937018 0.7067714
## 0.50 5 0.8346653 0.7634174
## 0.50 10 0.8308658 0.7582532
## 0.50 15 0.8264866 0.7521366
## 0.50 20 0.8229418 0.7471506
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.5.
plot(fifa_nnet)
<- predict(fifa_nnet,newdata = test_data)
fifa_nnet_predict confusionMatrix(fifa_nnet_predict,test_data$position2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction GK D M O
## GK 405 0 0 0
## D 0 967 162 9
## M 0 119 1144 190
## O 0 2 128 468
##
## Overall Statistics
##
## Accuracy : 0.8303
## 95% CI : (0.8176, 0.8424)
## No Information Rate : 0.399
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7573
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: GK Class: D Class: M Class: O
## Sensitivity 1.0000 0.8888 0.7978 0.7016
## Specificity 1.0000 0.9318 0.8569 0.9556
## Pos Pred Value 1.0000 0.8497 0.7873 0.7826
## Neg Pred Value 1.0000 0.9507 0.8645 0.9336
## Prevalence 0.1127 0.3027 0.3990 0.1856
## Detection Rate 0.1127 0.2691 0.3183 0.1302
## Detection Prevalence 0.1127 0.3166 0.4043 0.1664
## Balanced Accuracy 1.0000 0.9103 0.8274 0.8286
Summary
#best training performances
<- t(rbind(knn = apply(fifa_knn$results[,2:3],2,max),
train_performance rf = apply(fifa_rf$results[,2:3],2,max),
svm = apply(fifa_svm_linear$results[,2:3],2,max),
nnet= apply(fifa_nnet$results[,3:4],2,max)))
<- tibble(knn = fifa_knn_predict,
predicted rf = fifa_rf_predict,
svm = fifa_svm_linear_predict,
nnet= fifa_nnet_predict)
#test performance
<- apply(predicted, 2, postResample, obs = test_data$position2) test_performance
Best training performances
knn | rf | svm | nnet | |
---|---|---|---|---|
Accuracy | 0.8236 | 0.8387 | 0.8334 | 0.8348 |
Kappa | 0.7465 | 0.7692 | 0.7615 | 0.7637 |
Best test performances
knn | rf | svm | nnet | |
---|---|---|---|---|
Accuracy | 0.8197 | 0.8306 | 0.8289 | 0.8303 |
Kappa | 0.7409 | 0.7576 | 0.7547 | 0.7573 |
The random forest predicted the correct position of players with the highest accuracy. But really only by a tiny margin. If this would be some kind of competition, this would make a difference. But since we did this “just for fun”, even the knn results are acceptable.
Overall, all models did a very good job in predicting positions. If you take a closer look at the confusion matrices, you will notice that goalkeepers were always 100% correctly classified. This comes with no real surprise, given that their skills are rather distinct from other players. Also, the dimension reduction in the last post was a dead giveaway that it will be easy to separate them from the rest.
Addendum
This post should not be taken as a serious introduction to the caret package. I have left out many tuning possibilities or hyperparameter for each model. I tried to link as much as possible to other sources that do a way better job in describing the functionality of the package.
You may notice that I set the same trainingControl in each code chunk. This is certainly not necessary. I just realized to late that I did that and I did not want to re-train all the models (which took an awful long time).↩︎
You can also use tuneLength and let the function decide on the hyperparameters to use.↩︎
Reuse
Citation
@online{schoch2017,
author = {Schoch, David},
title = {Predicting {Player} {Positions} of {FIFA} 18 {Players}},
date = {2017-11-24},
url = {http://blog.schochastics.net/posts/2017-11-24_predicting-player-positions},
langid = {en}
}