Most of us have come across situations where we do not have enough data to build reliable models, for various reasons: data is expensive to collect (human studies), resources are limited, or historical data is simply unavailable (earthquakes). Before we even start talking about how to overcome the challenge, let's first discuss why we need a minimum number of samples before thinking about building a model. First of all, can we build a model with only a few samples? It is definitely possible! But as the number of samples decreases, the margin of error increases, and vice versa. If you want to build the most accurate model possible, you need as many samples as you can get. If the model is for a real-world application, you will need data spanning several days so the model can respond to changes in the system. There is a formula that can be used to calculate the required sample size, and it is as follows:
n = (Z × σ / MOE)²
where, n = sample size
Z = Z-score value
σ = population standard deviation
MOE = acceptable margin of error
You can also compute it with an online calculator, such as the one at this link:
https://www.qualtrics.com/blog/calculating-sample-size/
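To make the formula concrete, here is a minimal sketch of the calculation in R, assuming a 95% confidence level and illustrative values for the standard deviation and margin of error:

z = qnorm(0.975)   # Z-score for a 95% confidence level
sigma = 15         # assumed population standard deviation (illustrative)
moe = 2            # acceptable margin of error (illustrative)
n = ceiling((z * sigma / moe)^2)  # n = (Z * sigma / MOE)^2, rounded up
n
# 217

With these numbers, about 217 samples would be needed; tightening the margin of error quickly inflates that count.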
Now that we know why a minimum number of samples is needed to achieve the required accuracy, let's say that in some cases we do not have the opportunity to collect more samples, or more are simply not available. Then we have the option of doing one of the following:
- K-fold cross-validation
- Leave-p-out cross-validation
- Leave-one-out cross-validation
- Creating new data by estimation
In the k-fold method, the data is split into k partitions; the model is trained on k−1 partitions and tested on the remaining kth partition, rotating through all k. Not every possible combination of samples is considered, only the user-specified partitions. In leave-p-out (and leave-one-out), all combinations are considered, which makes it a more exhaustive validation technique. The above two techniques are the most popular techniques used in machine learning and deep learning. A rough illustration of the k-fold idea follows below.
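Here is a minimal base-R sketch of k-fold cross-validation; it assumes a data frame df with a factor label column y, and uses the same SVM we use later in this post, but any model and metric would do:

k = 5
folds = sample(rep(1:k, length.out = nrow(df)))  # random fold assignment
accuracy = numeric(k)
for (i in 1:k) {
  train = df[folds != i, ]  # train on the other k-1 folds
  test  = df[folds == i, ]  # test on the held-out fold
  fit = e1071::svm(y ~ ., data = train)
  accuracy[i] = mean(predict(fit, test) == test$y)
}
mean(accuracy)  # cross-validated accuracy estimate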
When it comes to handling NAs in a dataset, we usually impute them with the mean, median, zero, or random numbers. But that would probably not make sense when we want to create new data.
In creating new data through estimation, rows of missing data are appended to the dataset and a separate data-imputation model is used to fill in the missing values in those rows. Multivariate Imputation by Chained Equations (MICE) is one of the most popular algorithms available for imputing missing data irrespective of data type, such as mixes of continuous, binary, unordered categorical, and ordered categorical data.
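For a quick taste of the API before we apply it to our own data, the mice package ships with a small demo dataset, nhanes, that already contains missing values; a minimal run looks like this:

library(mice)
head(nhanes)  # built-in demo data with NAs in bmi, hyp and chl
imp = mice(nhanes, m = 5, method = 'pmm', seed = 1, printFlag = FALSE)
head(complete(imp, 1))  # first of the five completed datasets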
There are plenty of tutorials available for k-fold and leave-one-out. This tutorial will focus on the fourth option, where new data is created to deal with a small sample size. A simple classification model will then be trained to see whether there is a significant improvement. In addition, the distributions of imputed and non-imputed data will be compared for any significant differences.
Contents
- 1 Load data into a data frame
- 2 Create some features from the data
- 3 Study the data
- 4 Impute NAs with the best values using an iterative method
- 5 View imputed results
- 6 Looking at the distribution of actual data and imputed data
- 7 Building a classification model on actual data and imputed data
Load libraries
Let's load all the required libraries for now.
options(warn=-1)

# load libraries
library(mice)
library(dplyr)
Load data into a data frame
The data available in my GitHub repository is used for this analysis.
setwd("C:/OpenSourceWork/Experiment") #read csv files file1 = read.csv("dry run.csv", sep=",", header =T) file2 = read.csv("base.csv", sep=",", header =T) file3 = read.csv("imbalance 1.csv", sep=",", header =T) file4 = read.csv("imbalance 2.csv", sep=",", header =T) #Add labels to data file1$y = 1 file2$y = 2 file3$y = 3 file4$y = 4 #view top rows of data head(file1)
time | ax | ay | az | aT | y |
---|---|---|---|---|---|
0.002 | -0.3246 | 0.2748 | 0.1502 | 0.451 | 1 |
0.009 | 0.6020 | -0.1900 | -0.3227 | 0.709 | 1 |
0.019 | 0.9787 | 0.3258 | 0.0124 | 1.032 | 1 |
0.027 | 0.6141 | -0.4179 | 0.0471 | 0.744 | 1 |
0.038 | -0.3218 | -0.6389 | -0.4259 | 0.833 | 1 |
0.047 | -0.3607 | 0.1332 | -0.1291 | 0.406 | 1 |
Create some features from the data
The data used in this study is vibration data in different machine states, collected at 100 Hz. The raw data is high-dimensional, and used as-is it does not give us a good summary. Hence, a few statistical features are extracted: the sample standard deviation, sample mean, sample minimum, sample maximum, and sample median. The data is also aggregated over 1-second windows.
file1$group = as.factor(round(file1$time))
file2$group = as.factor(round(file2$time))
file3$group = as.factor(round(file3$time))
file4$group = as.factor(round(file4$time))

# head(file1, 20)

# list of all files
files = list(file1, file2, file3, file4)

# loop through all files and combine
features = NULL
for (i in 1:4){
  res = files[[i]] %>%
    group_by(group) %>%
    summarize(ax_mean = mean(ax), ax_sd = sd(ax), ax_min = min(ax), ax_max = max(ax), ax_median = median(ax),
              ay_mean = mean(ay), ay_sd = sd(ay), ay_min = min(ay), ay_may = max(ay), ay_median = median(ay),
              az_mean = mean(az), az_sd = sd(az), az_min = min(az), az_maz = max(az), az_median = median(az),
              aT_mean = mean(aT), aT_sd = sd(aT), aT_min = min(aT), aT_maT = max(aT), aT_median = median(aT),
              y = mean(y))
  features = rbind(features, res)
}
features = subset(features, select = -group)

# store it in a df for future reference
actual.features = features
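As a quick sanity check on the aggregation (an optional step, not part of the original workflow), we can confirm the number of 1-second windows and how they are spread across the labels:

dim(features)      # expect 362 rows and 21 columns, as str() confirms below
table(features$y)  # number of aggregated rows per class label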
Study the data
First, let's look at the size of our data and a summary of our features, along with their data types.
# show data types
str(features)
Classes 'tbl_df', 'tbl' and 'data.frame': 362 obs. of 21 variables:
 $ ax_mean  : num  -0.03816 -0.00581 0.06985 0.01155 0.04669 ...
 $ ax_sd    : num  0.659 0.633 0.667 0.551 0.643 ...
 $ ax_min   : num  -1.26 -1.62 -1.46 -1.93 -1.78 ...
 $ ax_max   : num  1.38 1.19 1.47 1.2 1.48 ...
 $ ax_median: num  -0.0955 -0.0015 0.107 0.0675 0.0836 ...
 $ ay_mean  : num  -0.068263 0.003791 0.074433 0.000826 -0.017759 ...
 $ ay_sd    : num  0.751 0.782 0.802 0.789 0.751 ...
 $ ay_min   : num  -1.39 -1.56 -1.48 -2 -1.66 ...
 $ ay_may   : num  1.64 1.54 1.8 1.56 1.44 ...
 $ ay_median: num  -0.19 0.0101 0.1186 -0.0027 -0.0253 ...
 $ az_mean  : num  -0.138 -0.205 -0.0641 -0.0929 -0.1399 ...
 $ az_sd    : num  0.985 0.925 0.929 0.889 0.927 ...
 $ az_min   : num  -2.68 -3.08 -1.82 -2.16 -1.85 ...
 $ az_maz   : num  2.75 2.72 2.49 3.24 3.55 ...
 $ az_median: num  0.0254 -0.2121 -0.1512 -0.1672 -0.1741 ...
 $ aT_mean  : num  1.27 1.26 1.3 1.2 1.23 ...
 $ aT_sd    : num  0.583 0.545 0.513 0.513 0.582 ...
 $ aT_min   : num  0.4 0.41 0.255 0.393 0.313 0.336 0.275 0.196 0.032 0.358 ...
 $ aT_maT   : num  3.03 3.2 2.64 3.32 3.6 ...
 $ aT_median: num  1.08 1.14 1.28 1.12 1.17 ...
 $ y        : num  1 1 1 1 1 1 1 1 1 1 ...
Create observations with NA values at the end
Next, for this tutorial, we will append some rows of NAs at the bottom of the table.
features1 = features
for(i in 363:400){
  features1[i,] = NA
}
View the bottom 50 rows
We can see the missing values at the end of the table.
Disclaimer: here the appended rows are entirely NA. In the real world that is highly unlikely; you might have only a few missing values.
tail(features1, 50)
ax_mean | ax_sd | ax_min | ax_max | ax_median | ay_mean | ay_sd | ay_min | ay_may | ay_median | … | az_sd | az_min | az_maz | az_median | aT_mean | aT_sd | aT_min | aT_maT | aT_median | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-0.016097030 | 0.8938523 | -2.3445 | 2.3006 | -0.07360 | -0.009759406 | 1.311817 | -3.4215 | 2.5028 | 0.10890 | … | 1.264572 | -2.8751 | 3.3718 | -0.07070 | 1.866030 | 0.7808319 | 0.380 | 4.098 | 1.8200 | 4 |
-0.015565347 | 0.8956615 | -2.2661 | 2.5089 | 0.08640 | 0.027313861 | 1.294063 | -2.9421 | 2.3497 | 0.15260 | … | 1.368576 | -3.3165 | 2.6989 | -0.01660 | 1.930426 | 0.7749686 | 0.127 | 4.463 | 1.8350 | 4 |
0.024006250 | 0.8653758 | -2.4040 | 2.5328 | -0.03170 | 0.008440625 | 1.376398 | -3.0422 | 2.3727 | 0.11390 | … | 1.449783 | -4.2171 | 4.7703 | 0.00110 | 2.003552 | 0.8300253 | 0.387 | 5.138 | 1.9920 | 4 |
-0.015563000 | 0.8720967 | -2.3451 | 2.3329 | -0.05325 | 0.013962000 | 1.240091 | -3.1360 | 2.8563 | 0.09145 | … | 1.418988 | -3.3758 | 3.4279 | -0.10410 | 1.895380 | 0.8351505 | 0.173 | 4.458 | 1.8735 | 4 |
0.003894898 | 0.8806773 | -2.3098 | 3.1902 | -0.09260 | 0.022575510 | 1.301955 | -3.2561 | 2.7833 | -0.05380 | … | 1.271799 | -3.8035 | 3.1323 | -0.26115 | 1.852265 | 0.7909640 | 0.436 | 3.944 | 1.7570 | 4 |
-0.039379208 | 0.8127135 | -2.1523 | 1.8828 | -0.11250 | 0.005454455 | 1.189519 | -2.8057 | 2.4852 | 0.03040 | … | 1.366368 | -3.3928 | 2.4507 | 0.05430 | 1.828059 | 0.7562042 | 0.580 | 3.573 | 1.6960 | 4 |
0.021469000 | 0.8272527 | -1.5895 | 3.7505 | -0.08995 | 0.011312000 | 1.285206 | -2.7423 | 2.6785 | -0.03640 | … | 1.177012 | -2.6649 | 2.1685 | 0.02755 | 1.785930 | 0.7120829 | 0.298 | 3.895 | 1.7575 | 4 |
0.005917000 | 0.9139808 | -2.3310 | 2.8131 | -0.07800 | -0.040868000 | 1.320873 | -2.9778 | 2.2841 | -0.01435 | … | 1.401567 | -3.3728 | 3.3165 | 0.19485 | 1.947570 | 0.8513573 | 0.397 | 4.191 | 1.8180 | 4 |
-0.034448571 | 0.8640626 | -2.4917 | 2.4113 | -0.01960 | -0.013410476 | 1.235196 | -3.3305 | 2.4912 | 0.09420 | … | 1.327886 | -2.9864 | 2.8430 | -0.05300 | 1.882590 | 0.6971337 | 0.370 | 3.775 | 1.9030 | 4 |
0.046837374 | 0.9776022 | -1.8688 | 2.66644 | -0.03600 | 0.019817172 | 1.293644 | -2.7836 | 2.6166 | 0.12540 | … | 1.245906 | -2.4813 | 3.2677 | -0.11460 | 1.901646 | 0.7296095 | 0.283 | 3.813 | 1.8440 | 4 |
-0.014453061 | 0.9553743 | -2.7118 | 2.4640 | -0.01000 | -0.037717347 | 1.285358 | -3.1225 | 2.4506 | 0.03085 | … | 1.457232 | -4.2512 | 3.3754 | 0.09325 | 1.984418 | 0.8511168 | 0.446 | 4.351 | 1.8600 | 4 |
0.046810870 | 0.9259427 | -1.5309 | 1.9420 | -0.11455 | 0.230676087 | 1.491983 | -2.8435 | 2.8405 | 0.33060 | … | 1.111205 | -2.1748 | 2.9009 | -0.03790 | 1.927174 | 0.7622031 | 0.491 | 3.335 | 2.1620 | 4 |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | … | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | … | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
(the remaining appended rows, 363-400, are likewise all NA)
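Before imputing, it is worth confirming where the gaps actually are. A suggested check, using base R plus mice's pattern inspector:

colSums(is.na(features1))    # missing count per column
mice::md.pattern(features1)  # matrix of observed/missing value patterns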
Impute NAs with the best values using an iterative method
Next, to impute the missing values, we will use the mice function. We will keep the maximum number of iterations at 50 and the method as 'pmm' (predictive mean matching).
imputed_Data = mice(features1, m = 1, maxit = 50, method = 'pmm', seed = 999, printFlag = FALSE)
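Because mice is iterative, it is good practice to glance at the convergence diagnostics before trusting the imputations; plotting the returned mids object draws trace lines of the imputed values' mean and SD across iterations (an added check, not in the original post):

plot(imputed_Data)  # flat, well-mixed trace lines suggest convergence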
View imputed results
Now we have imputed results. We will use the first imputed data frame for this study. You can test the different imputations to see which one works best.
imputedResultData = mice::complete(imputed_Data, 1)
tail(imputedResultData, 50)
 | ax_mean | ax_sd | ax_min | ax_max | ax_median | ay_mean | ay_sd | ay_min | ay_may | ay_median | … | az_sd | az_min | az_maz | az_median | aT_mean | aT_sd | aT_min | aT_maT | aT_median | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
351 | -0.016097030 | 0.8938523 | -2.3445 | 2.3006 | -0.07360 | -0.009759406 | 1.3118166 | -3.4215 | 2.5028 | 0.10890 | … | 1.2645719 | -2.8751 | 3.3718 | -0.07070 | 1.8660297 | 0.7808319 | 0.380 | 4.098 | 1.8200 | 4 |
352 | -0.015565347 | 0.8956615 | -2.2661 | 2.5089 | 0.08640 | 0.027313861 | 1.2940627 | -2.9421 | 2.3497 | 0.15260 | … | 1.3685757 | -3.3165 | 2.6989 | -0.01660 | 1.9304257 | 0.7749686 | 0.127 | 4.463 | 1.8350 | 4 |
353 | 0.024006250 | 0.8653758 | -2.4040 | 2.5328 | -0.03170 | 0.008440625 | 1.3763983 | -3.0422 | 2.3727 | 0.11390 | … | 1.4497833 | -4.2171 | 4.7703 | 0.00110 | 2.0035521 | 0.8300253 | 0.387 | 5.138 | 1.9920 | 4 |
354 | -0.015563000 | 0.8720967 | -2.3451 | 2.3329 | -0.05325 | 0.013962000 | 1.2400913 | -3.1360 | 2.8563 | 0.09145 | … | 1.4189884 | -3.3758 | 3.4279 | -0.10410 | 1.8953800 | 0.8351505 | 0.173 | 4.458 | 1.8735 | 4 |
355 | 0.003894898 | 0.8806773 | -2.3098 | 3.1902 | -0.09260 | 0.022575510 | 1.3019546 | -3.2561 | 2.7833 | -0.05380 | … | 1.2717989 | -3.8035 | 3.1323 | -0.26115 | 1.8522653 | 0.7909640 | 0.436 | 3.944 | 1.7570 | 4 |
356 | -0.039379208 | 0.8127135 | -2.1523 | 1.8828 | -0.11250 | 0.005454455 | 1.1895194 | -2.8057 | 2.4852 | 0.03040 | … | 1.3663678 | -3.3928 | 2.4507 | 0.05430 | 1.8280594 | 0.7562042 | 0.580 | 3.573 | 1.6960 | 4 |
357 | 0.021469000 | 0.8272527 | -1.5895 | 3.7505 | -0.08995 | 0.011312000 | 1.2852056 | -2.7423 | 2.6785 | -0.03640 | … | 1.1770121 | -2.6649 | 2.1685 | 0.02755 | 1.7859300 | 0.7120829 | 0.298 | 3.895 | 1.7575 | 4 |
358 | 0.005917000 | 0.9139808 | -2.3310 | 2.8131 | -0.07800 | -0.040868000 | 1.3208731 | -2.9778 | 2.2841 | -0.01435 | … | 1.4015674 | -3.3728 | 3.3165 | 0.19485 | 1.9475700 | 0.8513573 | 0.397 | 4.191 | 1.8180 | 4 |
359 | -0.034448571 | 0.8640626 | -2.4917 | 2.4113 | -0.01960 | -0.013410476 | 1.2351957 | -3.3305 | 2.4912 | 0.09420 | … | 1.3278861 | -2.9864 | 2.8430 | -0.05300 | 1.8825905 | 0.6971337 | 0.370 | 3.775 | 1.9030 | 4 |
360 | 0.046837374 | 0.9776022 | -1.8688 | 2.66644 | -0.03600 | 0.019817172 | 1.2936436 | -2.7836 | 2.6166 | 0.12540 | … | 1.2459059 | -2.4813 | 3.2677 | -0.11460 | 1.9016465 | 0.7296095 | 0.283 | 3.813 | 1.8440 | 4 |
361 | -0.014453061 | 0.9553743 | -2.7118 | 2.4640 | -0.01000 | -0.037717347 | 1.2853576 | -3.1225 | 2.4506 | 0.03085 | … | 1.4572321 | -4.2512 | 3.3754 | 0.09325 | 1.9844184 | 0.8511168 | 0.446 | 4.351 | 1.8600 | 4 |
362 | 0.046810870 | 0.9259427 | -1.5309 | 1.9420 | -0.11455 | 0.230676087 | 1.4919834 | -2.8435 | 2.8405 | 0.33060 | … | 1.1112049 | -2.1748 | 2.9009 | -0.03790 | 1.9271739 | 0.7622031 | 0.491 | 3.335 | 2.1620 | 4 |
363 | 0.011238614 | 0.8127502 | -1.9602 | 2.1430 | 0.00680 | -0.013367308 | 1.3019546 | -3.0628 | 2.7338 | 0.00070 | … | 1.4534581 | -4.4325 | 2.9648 | -0.03520 | 1.9383000 | 0.8526128 | 0.373 | 4.351 | 1.8705 | 4 |
364 | -0.009812264 | 0.7680463 | -2.3492 | 1.3919 | 0.03110 | 0.013984158 | 0.6084791 | -1.4155 | 0.9273 | 0.11860 | … | 0.9997898 | -3.0031 | 3.5781 | -0.25930 | 1.2219510 | 0.6450616 | 0.233 | 3.603 | 1.0730 | 1 |
365 | -0.026760000 | 0.4780558 | -1.1826 | 0.9934 | 0.05560 | -0.035218269 | 0.5632648 | -1.0761 | 1.2307 | -0.08165 | … | 0.7635922 | -2.3115 | 1.8934 | 0.03005 | 0.9714200 | 0.4214891 | 0.214 | 2.180 | 0.9265 | 1 |
366 | 0.029083000 | 0.7515921 | -2.2628 | 2.4640 | -0.00820 | 0.011159596 | 1.3073606 | -3.1360 | 2.8527 | 0.04010 | … | 1.4534581 | -3.6751 | 2.6187 | -0.22680 | 1.9367549 | 0.7439326 | 0.354 | 4.156 | 1.8450 | 4 |
367 | 0.002401000 | 0.5641062 | -1.1533 | 1.4479 | -0.04215 | 0.011159596 | 1.0358946 | -1.9850 | 2.9217 | -0.07040 | … | 0.7141977 | -1.7791 | 1.3013 | -0.20785 | 1.2607358 | 0.4523664 | 0.376 | 2.106 | 1.2830 | 4 |
368 | 0.017670707 | 0.4158231 | -0.9785 | 1.0647 | 0.07680 | -0.026719608 | 0.4759174 | -0.9340 | 0.9077 | -0.03650 | … | 0.6919936 | -1.6094 | 2.0555 | -0.19365 | 0.8742105 | 0.3962710 | 0.230 | 2.123 | 0.8120 | 1 |
369 | -0.078038776 | 0.4413032 | -1.1099 | 0.9826 | -0.03910 | -0.010626042 | 0.4768587 | -0.9392 | 0.8497 | -0.04655 | … | 0.8165436 | -2.2936 | 2.1036 | -0.29570 | 0.9319524 | 0.4517633 | 0.193 | 2.380 | 0.8865 | 2 |
370 | 0.004372632 | 0.8352791 | -1.6966 | 2.3897 | 0.00845 | -0.010064000 | 1.2746954 | -2.7832 | 2.2841 | 0.03085 | … | 1.2177225 | -3.1289 | 3.0919 | 0.01905 | 1.7844653 | 0.7343952 | 0.489 | 3.764 | 1.7520 | 3 |
371 | 0.016103000 | 0.3997476 | -0.9537 | 1.1546 | 0.03655 | -0.031622772 | 0.4828770 | -0.9772 | 1.1237 | -0.14540 | … | 0.7672163 | -1.9818 | 1.8173 | -0.09240 | 0.9053800 | 0.4160549 | 0.201 | 2.053 | 0.8520 | 2 |
372 | -0.020355446 | 0.4178729 | -1.0524 | 0.9076 | -0.09340 | 0.044400000 | 0.5439558 | -0.9828 | 1.0798 | 0.14000 | … | 0.7552593 | -2.0607 | 1.6134 | -0.17990 | 0.9498911 | 0.3846176 | 0.222 | 1.752 | 0.8950 | 1 |
373 | 0.001363636 | 0.4868077 | -0.9027 | 1.5155 | 0.04820 | 0.031339000 | 1.0619675 | -2.3261 | 2.4081 | -0.00210 | … | 0.7598489 | -1.7482 | 1.3013 | -0.20075 | 1.3272772 | 0.4315494 | 0.478 | 2.288 | 1.3220 | 4 |
374 | -0.008122222 | 0.8831968 | -1.9394 | 3.3244 | -0.09610 | 0.017400971 | 1.3778757 | -3.7580 | 2.4527 | 0.16935 | … | 1.4260617 | -3.1893 | 3.5781 | 0.09325 | 1.9576857 | 0.9167571 | 0.295 | 4.830 | 1.9430 | 4 |
375 | -0.065401010 | 0.8489219 | -2.4871 | 2.1672 | -0.11250 | -0.043491753 | 0.5648206 | -1.5188 | 0.8497 | 0.05440 | … | 1.4259974 | -3.1893 | 4.6557 | 0.08010 | 1.4950297 | 0.8012418 | 0.198 | 4.290 | 1.2550 | 1 |
376 | 0.039720000 | 0.5946125 | -1.5250 | 1.7390 | 0.05040 | 0.061424510 | 0.8133879 | -1.2303 | 1.6255 | 0.05660 | … | 0.9355264 | -2.2936 | 2.9202 | 0.02420 | 1.2507900 | 0.5391791 | 0.294 | 3.081 | 1.1770 | 3 |
377 | 0.022841000 | 0.8646867 | -2.1253 | 2.6378 | 0.05720 | 0.052515306 | 1.1332836 | -2.5429 | 2.3692 | 0.10620 | … | 1.0360114 | -3.0924 | 3.0590 | 0.00110 | 1.5811275 | 0.7053254 | 0.326 | 3.742 | 1.5815 | 3 |
378 | -0.001924510 | 0.5975310 | -1.4775 | 1.4089 | -0.11455 | -0.040868000 | 1.0363392 | -2.3289 | 2.2123 | 0.03025 | … | 0.7546022 | -1.6175 | 1.2922 | -0.18510 | 1.3324845 | 0.5131552 | 0.305 | 2.091 | 1.2830 | 4 |
379 | 0.017975000 | 0.4780750 | -1.2011 | 1.4923 | -0.07450 | -0.022319802 | 0.5072372 | -1.1404 | 1.0361 | -0.04135 | … | 0.7439169 | -2.0052 | 1.7066 | -0.09450 | 0.9151400 | 0.4541700 | 0.226 | 2.264 | 0.8270 | 2 |
380 | -0.070804000 | 0.4780558 | -1.9254 | 0.9244 | -0.05830 | -0.074927551 | 0.5037149 | -1.0485 | 1.0710 | -0.07750 | … | 0.7598489 | -2.1735 | 2.0385 | -0.24560 | 0.9281400 | 0.4813814 | 0.150 | 2.084 | 0.7900 | 2 |
381 | -0.002204762 | 0.9310547 | -2.7832 | 2.5242 | -0.07875 | -0.019305882 | 1.3019546 | -2.4215 | 2.8615 | -0.02880 | … | 1.1771775 | -3.0903 | 2.4800 | -0.19155 | 1.8377451 | 0.7254306 | 0.377 | 3.348 | 1.7770 | 4 |
382 | 0.021469000 | 0.8646867 | -2.0001 | 2.4477 | -0.03400 | 0.051977895 | 1.3628383 | -2.6574 | 2.7414 | 0.15305 | … | 1.1474602 | -2.9516 | 2.6371 | 0.08870 | 1.7884124 | 0.7520192 | 0.400 | 3.661 | 1.9180 | 4 |
383 | -0.015468354 | 0.8127502 | -2.2034 | 2.3405 | -0.02150 | 0.046179798 | 1.3628383 | -2.8594 | 2.7288 | 0.02130 | … | 1.1112049 | -4.2171 | 1.7215 | 0.09600 | 1.7592828 | 0.7680118 | 0.295 | 3.671 | 1.7780 | 4 |
384 | -0.002143000 | 0.4442709 | -0.9949 | 1.0734 | -0.04265 | -0.007904000 | 0.5386439 | -1.2828 | 1.2250 | -0.06765 | … | 0.7335329 | -2.2694 | 2.1640 | -0.30150 | 0.9293627 | 0.4517633 | 0.266 | 2.407 | 0.8000 | 2 |
385 | 0.027587129 | 0.4551125 | -1.2785 | 1.0285 | 0.05660 | -0.035263725 | 0.4854652 | -1.0143 | 1.1332 | -0.03650 | … | 0.7048400 | -2.1237 | 1.8689 | 0.11100 | 0.8571800 | 0.4493956 | 0.164 | 2.222 | 0.8120 | 2 |
386 | 0.017670707 | 0.6981887 | -1.5387 | 2.1808 | -0.04500 | 0.043603191 | 1.2152972 | -2.6631 | 3.1973 | 0.09380 | … | 0.8017314 | -1.6094 | 1.2922 | -0.10680 | 1.4910700 | 0.5158915 | 0.376 | 2.428 | 1.5820 | 4 |
387 | 0.017401000 | 0.7680463 | -1.4528 | 2.2822 | -0.00350 | 0.055612871 | 1.0989870 | -2.7737 | 2.3134 | 0.16785 | … | 1.0468209 | -2.8051 | 1.7055 | -0.01470 | 1.5737525 | 0.6825190 | 0.428 | 2.988 | 1.5810 | 4 |
388 | 0.001363636 | 0.4354711 | -1.0677 | 0.9579 | 0.03655 | -0.017115842 | 0.5501718 | -1.1134 | 1.0798 | -0.01640 | … | 0.7466890 | -2.1237 | 2.0555 | 0.02230 | 0.9342100 | 0.4437911 | 0.266 | 2.222 | 0.8410 | 1 |
389 | 0.036087000 | 0.8741671 | -2.2967 | 3.3393 | -0.03330 | -0.019919792 | 1.4065464 | -2.9778 | 3.0511 | -0.04680 | … | 1.2155255 | -3.8281 | 1.9302 | 0.08820 | 1.8953800 | 0.7778120 | 0.242 | 4.098 | 1.9170 | 4 |
390 | 0.007588000 | 0.8409728 | -1.9602 | 2.2383 | -0.07985 | 0.025797000 | 1.3525870 | -3.1511 | 2.7414 | -0.02135 | … | 1.4189884 | -3.6947 | 2.7486 | -0.14945 | 1.9648889 | 0.8489206 | 0.397 | 3.963 | 1.8600 | 4 |
391 | 0.065754545 | 0.4533416 | -0.7769 | 1.1179 | 0.10470 | 0.047955446 | 0.5539467 | -0.9340 | 1.0356 | 0.03360 | … | 0.7569361 | -2.1362 | 2.3655 | -0.10495 | 0.9663913 | 0.4276036 | 0.285 | 2.353 | 0.8930 | 2 |
392 | -0.030526733 | 0.4442709 | -1.7119 | 1.0302 | 0.03000 | -0.021866667 | 0.6103892 | -1.0198 | 1.6418 | -0.01105 | … | 1.4149706 | -3.3599 | 5.0202 | -0.11600 | 1.3062900 | 0.7562042 | 0.131 | 4.443 | 1.1075 | 1 |
393 | -0.001643000 | 0.8086920 | -1.9033 | 2.5242 | -0.03200 | -0.033747959 | 1.3111909 | -3.0231 | 2.3208 | 0.01690 | … | 1.1671442 | -3.7451 | 2.0425 | -0.19155 | 1.7976224 | 0.7133729 | 0.326 | 3.651 | 1.7310 | 4 |
394 | -0.023916346 | 0.4139117 | -0.6977 | 1.1179 | -0.04360 | 0.011312000 | 0.4828770 | -1.2828 | 1.1237 | 0.04940 | … | 0.7135787 | -1.9553 | 1.8769 | -0.23950 | 0.8609714 | 0.4064190 | 0.054 | 2.031 | 0.7900 | 2 |
395 | 0.037914706 | 0.4369138 | -0.9701 | 0.9937 | 0.07080 | -0.011703810 | 0.4883374 | -1.0822 | 1.1166 | -0.08405 | … | 0.7141977 | -1.9285 | 2.0766 | 0.08010 | 0.8621584 | 0.4222442 | 0.193 | 2.180 | 0.7910 | 2 |
396 | -0.024820792 | 0.8127135 | -1.9299 | 2.6378 | 0.01800 | -0.044580000 | 1.1363141 | -2.5429 | 2.4081 | -0.12910 | … | 1.0066063 | -2.4043 | 1.5056 | -0.12860 | 1.6121359 | 0.5853224 | 0.052 | 2.517 | 1.6945 | 4 |
397 | -0.016237500 | 0.7620745 | -2.4099 | 1.7855 | -0.05150 | 0.032355102 | 1.1534694 | -2.6734 | 2.4506 | 0.07725 | … | 1.4259974 | -4.1238 | 4.2297 | -0.24790 | 1.7976224 | 0.9082928 | 0.212 | 5.397 | 1.6595 | 3 |
398 | -0.039379208 | 0.5614528 | -1.7119 | 1.4600 | -0.11620 | -0.032463000 | 1.1096189 | -2.4111 | 2.4533 | -0.09910 | … | 1.1076786 | -3.1215 | 2.2947 | -0.14000 | 1.5025833 | 0.7521618 | 0.168 | 3.790 | 1.4420 | 3 |
399 | 0.026206186 | 0.7980083 | -1.9033 | 2.3863 | 0.00210 | 0.009870874 | 1.2557210 | -2.8507 | 2.4343 | 0.13105 | … | 1.2135140 | -2.5112 | 2.1638 | -0.22680 | 1.7924158 | 0.6828006 | 0.397 | 3.197 | 1.7150 | 3 |
400 | 0.072777778 | 0.4051881 | -0.8386 | 0.8847 | 0.15575 | 0.015370408 | 0.4759174 | -0.9340 | 1.2039 | 0.01090 | … | 0.7135787 | -2.1186 | 1.5632 | -0.13970 | 0.9087400 | 0.3767882 | 0.170 | 2.507 | 0.8120 | 1 |
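mice also ships with a built-in diagnostic for exactly the comparison we do next: densityplot() overlays the densities of observed and imputed values for each variable. This is an optional shortcut:

library(lattice)
densityplot(imputed_Data)  # observed vs. imputed densities per variable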
Looking at the distribution of actual data and imputed data
We will first compare basic statistics and then the distributions of a couple of features. Comparing the statistics of actual and imputed data, we can observe that the mean and SD are almost equal for both.
data.frame(actual_ax_mean = c(mean(features$ax_mean), sd(features$ax_mean)),
           imputed_ax_mean = c(mean(imputedResultData$ax_mean), sd(imputedResultData$ax_mean)),
           actual_ax_median = c(mean(features$ax_median), sd(features$ax_median)),
           imputed_ax_median = c(mean(imputedResultData$ax_median), sd(imputedResultData$ax_median)),
           actual_az_sd = c(mean(features$az_sd), sd(features$az_sd)),
           imputed_az_sd = c(mean(imputedResultData$az_sd), sd(imputedResultData$az_sd)),
           row.names = c("mean", "sd"))
 | actual_ax_mean | imputed_ax_mean | actual_ax_median | imputed_ax_median | actual_az_sd | imputed_az_sd |
---|---|---|---|---|---|---|
mean | 0.006307909 | 0.005851233 | -0.001328867 | -0.00214025 | 1.0588650 | 1.0528059 |
sd | 0.030961085 | 0.031125848 | 0.059619834 | 0.06011342 | 0.2446782 | 0.2477697 |
Now, let's look at the distributions in the data. From the plots below, we can observe that the distributions of actual data and imputed data are almost identical. We can confirm this with the bandwidths shown in the plots.
par(mfrow = c(3,2))
plot(density(features$ax_mean), main = "Actual ax_mean", type = "l", col = "red")
plot(density(imputedResultData$ax_mean), main = "Imputed ax_mean", type = "l", col = "red")
plot(density(features$ax_median), main = "Actual ax_median", type = "l", col = "red")
plot(density(imputedResultData$ax_median), main = "Imputed ax_median", type = "l", col = "red")
plot(density(features$az_sd), main = "Actual az_sd", type = "l", col = "red")
plot(density(imputedResultData$az_sd), main = "Imputed az_sd", type = "l", col = "red")
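Beyond eyeballing the densities, a two-sample Kolmogorov-Smirnov test can quantify the similarity (an added check, not in the original post); a large p-value means we cannot reject that the actual and imputed columns come from the same distribution:

ks.test(features$ax_mean, imputedResultData$ax_mean)
ks.test(features$ax_median, imputedResultData$ax_median)
ks.test(features$az_sd, imputedResultData$az_sd)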
Building a classification model on actual data and imputed data
In the following, y will be our classification variable. We will build a classification model using a simple support vector machine (SVM) on both the actual and the imputed data. No transformation will be done on the data. At the end, we will compare the results.
Actual Data
Sample data creation
Let's split the data into train and test sets with an 80:20 ratio.
# create samples of 80:20 ratio
features$y = as.factor(features$y)
sample = sample(nrow(features), nrow(features) * 0.8)
train = features[sample,]
test = features[-sample,]
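Note that sample() is random, so this split, and every accuracy figure below, will vary from run to run. For a reproducible split you could set a seed first (the value is arbitrary):

set.seed(42)  # arbitrary seed; makes the train/test split reproducible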
Build an SVM model
Now we can train the model using the train set. We will not do any parameter tuning in this example.
library(e1071)
library(caret)

actual.svm.model = svm(y ~ ., data = train)
summary(actual.svm.model)
Loading required package: ggplot2

Call:
svm(formula = y ~ ., data = train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.05

Number of Support Vectors:  142
 ( 47 18 47 30 )

Number of Classes:  4

Levels:
 1 2 3 4
Validate the SVM model
In the confusion matrix below, we observe the following:
- accuracy > NIR (no-information rate), indicating the model is very good
- high accuracy and kappa values indicate a very accurate model
- even the balanced accuracy is close to 1, indicating the model is highly accurate
# build a confusion matrix using the caret package
confusionMatrix(predict(actual.svm.model, test), test$y)
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 10  1  0  0
         2  0 26  0  0
         3  0  0 22  0
         4  0  0  3 11

Overall Statistics

               Accuracy : 0.9452
                 95% CI : (0.8656, 0.9849)
    No Information Rate : 0.3699
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9234
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            1.0000   0.9630   0.8800   1.0000
Specificity            0.9841   1.0000   1.0000   0.9516
Pos Pred Value         0.9091   1.0000   1.0000   0.7857
Neg Pred Value         1.0000   0.9787   0.9412   1.0000
Prevalence             0.1370   0.3699   0.3425   0.1507
Detection Rate         0.1370   0.3562   0.3014   0.1507
Detection Prevalence   0.1507   0.3562   0.3014   0.1918
Balanced Accuracy      0.9921   0.9815   0.9400   0.9758
Imputed Data
Sample data creation
# create samples of 80:20 ratio
imputedResultData$y = as.factor(imputedResultData$y)
sample = sample(nrow(imputedResultData), nrow(imputedResultData) * 0.8)
train = imputedResultData[sample,]
test = imputedResultData[-sample,]
Build an SVM model
imputed.svm.model = svm(y ~ ., data = train)
summary(imputed.svm.model)
Call:
svm(formula = y ~ ., data = train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.05

Number of Support Vectors:  167
 ( 59 47 36 25 )

Number of Classes:  4

Levels:
 1 2 3 4
Validate the SVM model
In the confusion matrix below, we observe the following:
- accuracy > NIR (no-information rate), indicating the model is very good
- high accuracy and kappa values indicate a very accurate model
- even the balanced accuracy is close to 1, indicating the model is highly accurate
confusionMatrix(predict(imputed.svm.model, test), test$y)
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 15  0  0  0
         2  1 21  0  0
         3  0  0 17  0
         4  0  0  0 26

Overall Statistics

               Accuracy : 0.9875
                 95% CI : (0.9323, 0.9997)
    No Information Rate : 0.325
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9831
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            0.9375   1.0000   1.0000    1.000
Specificity            1.0000   0.9831   1.0000    1.000
Pos Pred Value         1.0000   0.9545   1.0000    1.000
Neg Pred Value         0.9846   1.0000   1.0000    1.000
Prevalence             0.2000   0.2625   0.2125    0.325
Detection Rate         0.1875   0.2625   0.2125    0.325
Detection Prevalence   0.1875   0.2750   0.2125    0.325
Balanced Accuracy      0.9688   0.9915   1.0000    1.000
Overall results
What we saw above and its interpretation is entirely subjective. One way to truly validate the result is to create random train and test samples many times over (100 in the code below), build a model each time, validate it, and capture the accuracy. Finally, use a simple t-test to check whether there is a significant difference.
Null hypothesis:
H0: there is no significant difference between the two samples.
# let's create a function to simplify the process
test.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]
  # build model
  svm.model = svm(y ~ ., data = train)
  # get metrics
  metrics = confusionMatrix(predict(svm.model, test), test$y)
  return(metrics$overall['Accuracy'])
}
# now let's calculate accuracy with actual data, 100 times
actual.results = NULL
for(i in 1:100) {
  actual.results[i] = test.function(features)
}
head(actual.results)
# 0.978021978021978
# 0.978021978021978
# 0.978021978021978
# 0.945054945054945
# 0.989010989010989
# 0.967032967032967
# now let's calculate accuracy with imputed data, 100 times
imputed.results = NULL
for(i in 1:100) {
  imputed.results[i] = test.function(imputedResultData)
}
head(imputed.results)
# 0.97
# 0.95
# 0.92
# 0.96
# 0.92
# 0.96
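Before the formal test, a quick summary of the two accuracy vectors already hints at the gap (an optional step added here):

summary(actual.results)
summary(imputed.results)
boxplot(list(actual = actual.results, imputed = imputed.results), ylab = "Accuracy")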
T-test on the results
What's better than statistically proving whether there is a significant difference? So, we will run a t-test to see whether the difference in accuracy is statistically significant.
# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.results, y = imputed.results, conf.level = 0.95)
	Welch Two Sample t-test

data:  actual.results and imputed.results
t = 7.9834, df = 194.03, p-value = 1.222e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01673213 0.02771182
sample estimates:
mean of x mean of y 
 0.968022  0.945800
In the above t-test we set the confidence level at 95%. From the results we can observe that the p-value is far below 0.05, indicating a significant difference in accuracy between actual data and imputed data. From the means we can see that the average accuracy on actual data is about 96.8%, while on imputed data it is about 94.6%, a gap of roughly 2 percentage points. So, does that mean imputing more data reduces accuracy across various models?
Why not run a test to compare the results? Let's consider four other models for that:
- Random forest
- Decision tree
- KNN
- Naive Bayes
Random Forest
Let's use the same steps as above and fit the different models. The accuracy results are shown in the tables below.
library(randomForest)

# let's create a function to simplify the process
test.rf.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]
  # build model
  rf.model = randomForest(y ~ ., data = train)
  # get metrics
  metrics = confusionMatrix(predict(rf.model, test), test$y)
  return(metrics$overall['Accuracy'])
}

# calculate accuracy with actual data, 100 times
actual.rf.results = NULL
for(i in 1:100) {
  actual.rf.results[i] = test.rf.function(features)
}
# head(actual.rf.results)

# calculate accuracy with imputed data, 100 times
imputed.rf.results = NULL
for(i in 1:100) {
  imputed.rf.results[i] = test.rf.function(imputedResultData)
}

head(data.frame(Actual = actual.rf.results, Imputed = imputed.rf.results))

# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.rf.results, y = imputed.rf.results, conf.level = 0.95)
Actual | Imputed |
---|---|
0.956044 | 0.95 |
1.000000 | 0.93 |
0.967033 | 0.96 |
0.967033 | 0.96 |
1.000000 | 0.97 |
0.967033 | 0.93 |
	Welch Two Sample t-test

data:  actual.rf.results and imputed.rf.results
t = 11.734, df = 183.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02183138 0.03065654
sample estimates:
mean of x mean of y 
 0.976044  0.949800
From the above t-test results we come to a similar conclusion as before: there is a significant difference between the accuracy on actual data and on imputed data, of approximately 2.6 percentage points.
Decision Tree
library(rpart)

# let's create a function to simplify the process
test.dt.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]
  # build model
  dt.model = rpart(y ~ ., data = train, method = "class")
  # get metrics
  metrics = confusionMatrix(predict(dt.model, test, type = "class"), test$y)
  return(metrics$overall['Accuracy'])
}

# calculate accuracy with actual data, 100 times
actual.dt.results = NULL
for(i in 1:100) {
  actual.dt.results[i] = test.dt.function(features)
}

# calculate accuracy with imputed data, 100 times
imputed.dt.results = NULL
for(i in 1:100) {
  imputed.dt.results[i] = test.dt.function(imputedResultData)
}

head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
Actual | Imputed |
---|---|
0.978022 | 0.92 |
0.967033 | 0.94 |
0.967033 | 0.95 |
0.956044 | 0.94 |
0.956044 | 0.94 |
0.978022 | 0.95 |
	Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 16.24, df = 167.94, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.03331888 0.04254046
sample estimates:
mean of x mean of y 
0.9703297 0.9324000
From the above t-test results we come to a similar conclusion: there is a significant difference between the accuracy on actual data and on imputed data, of approximately 3.8 percentage points.
K-Nearest Neighbor (KNN)
library(class)

# let's create a function to simplify the process
test.knn.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]
  # build model (note: the label column y is still part of train/test here,
  # so it also enters the distance computation; drop it for a stricter test)
  knn.model = knn(train, test, cl = train$y, k = 5)
  # get metrics
  metrics = confusionMatrix(knn.model, test$y)
  return(metrics$overall['Accuracy'])
}

# calculate accuracy with actual data, 100 times
actual.dt.results = NULL
for(i in 1:100) {
  actual.dt.results[i] = test.knn.function(features)
}

# calculate accuracy with imputed data, 100 times
imputed.dt.results = NULL
for(i in 1:100) {
  imputed.dt.results[i] = test.knn.function(imputedResultData)
}

head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
Actual | Imputed |
---|---|
0.967033 | 0.97 |
1.000000 | 0.98 |
0.978022 | 0.99 |
0.978022 | 1.00 |
0.967033 | 1.00 |
0.978022 | 1.00 |
	Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 3.2151, df = 166.45, p-value = 0.001566
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.002126868 0.008895110
sample estimates:
mean of x mean of y 
 0.989011  0.983500
From the above t-test results we come to a similar conclusion: the difference between the accuracy on actual data and on imputed data is still statistically significant, although here it is only about 0.6 percentage points.
Naive Bayes
# naiveBayes comes from e1071, which is already loaded

# let's create a function to simplify the process
test.nb.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]
  # build model
  nb.model = naiveBayes(y ~ ., data = train)
  # get metrics
  metrics = confusionMatrix(predict(nb.model, test), test$y)
  return(metrics$overall['Accuracy'])
}

# calculate accuracy with actual data, 100 times
actual.nb.results = NULL
for(i in 1:100) {
  actual.nb.results[i] = test.nb.function(features)
}

# calculate accuracy with imputed data, 100 times
imputed.nb.results = NULL
for(i in 1:100) {
  imputed.nb.results[i] = test.nb.function(imputedResultData)
}

head(data.frame(Actual = actual.nb.results, Imputed = imputed.nb.results))

# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.nb.results, y = imputed.nb.results, conf.level = 0.95)
Actual | Imputed |
---|---|
0.989011 | 0.95 |
0.967033 | 0.92 |
0.978022 | 0.94 |
1.000000 | 0.95 |
0.989011 | 0.90 |
0.967033 | 0.93 |
	Welch Two Sample t-test

data:  actual.nb.results and imputed.nb.results
t = 18.529, df = 174.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.04214191 0.05218996
sample estimates:
mean of x mean of y 
0.9740659 0.9269000
From the above t-test results we come to a similar conclusion: there is a significant difference between the accuracy on actual data and on imputed data, of approximately 4.7 percentage points.
Conclusion
From the above results we observe that, irrespective of the type of model built, accuracy was consistently lower with imputed data, by roughly 0.5 to 5 percentage points depending on the model. In all cases, the actual data produced a better model than the imputed data.
If you enjoyed this tutorial, check out my other tutorials and my GitHub page for all the source code and various R packages.
- Testing the Effect of Data Imputation on Model Accuracy
- Anomaly Detection for Predictive Maintenance using Keras
- Predictive Maintenance: Zero to Deployment in Manufacturing
- Free coding education in the time of Covid-19
- AutoML Frameworks in R & Python