Testing the Effect of Data Imputation on Model Accuracy


Most of us have run into situations where we don't have enough data to build reliable models, for various reasons: collecting data can be expensive (human studies), resources can be limited, or historical data may simply not be available (earthquakes). Before we even talk about how to overcome this challenge, let's first talk about why we need a minimum number of samples before thinking about building a model. First of all, can we build a model with few samples? It's definitely possible! But as the number of samples decreases, the margin of error increases, and vice versa. If you want to build the most accurate model you can, you need as many samples as possible. And if the model is for a real-world application, you will need data spanning several days so the model can respond to changes in the system. There is a formula that can be used to calculate the required sample size, and it is as follows:

n = (Z × σ / MOE)²

Where, n = sample size

Z = Z-score value

σ = population standard deviation

MOE = acceptable margin of error

You can also compute it with an online calculator, such as this one:
https://www.qualtrics.com/blog/calculating-sample-size/
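
As a quick check, here is a minimal R sketch of the formula above (the Z value of 1.96 for 95% confidence and the σ and MOE values are example assumptions, not from the post):

# sample-size formula sketch; all input values are example assumptions
sample.size = function(Z, sigma, MOE) ceiling((Z * sigma / MOE)^2)
sample.size(Z = 1.96, sigma = 0.5, MOE = 0.05)
# 385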

Now that we know why a minimum number of samples is needed to reach the required accuracy, let's say that in some cases we don't have the opportunity to collect more samples, or none are available. Then we have the option of doing the following:

  1. K-fold cross-validation
  2. Leave-p-out cross-validation
  3. Leave-one-out cross-validation
  4. Creating new data by estimation

In the K-fold method, the data is split into k partitions; the model is trained on k-1 partitions and tested on the held-out partition, rotating through all k. Not every possible split is considered, only the number of partitions specified by the user. In leave-p-out (and leave-one-out), by contrast, all combinations of partitions are considered, which makes it a more exhaustive validation technique. These are among the most popular validation techniques used in machine learning and deep learning.
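
Neither resampling scheme is the focus of this post, but as a hedged sketch (assuming the caret package, which is loaded later in this tutorial), declaring them typically looks like this:

# sketch: declaring k-fold and leave-one-out resampling with caret (assumed setup)
library(caret)
cv10 = trainControl(method = "cv", number = 10)  # 10-fold cross-validation
loocv = trainControl(method = "LOOCV")           # leave-one-out cross-validation
# e.g. train(y ~ ., data = train, method = "svmRadial", trControl = cv10)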

When it comes to handling NAs in a dataset, we usually impute them with the mean, the median, zero, or random numbers. But that probably wouldn't make sense when we want to create entirely new rows of data.
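
For contrast, here is a minimal sketch of the simplest of those approaches, column-mean imputation, on a small hypothetical data frame:

# naive imputation sketch: replace NAs with the column mean (toy data)
df = data.frame(x = c(1, 2, NA, 4), z = c(NA, 1, 3, 5))
df[] = lapply(df, function(col) {
  col[is.na(col)] = mean(col, na.rm = TRUE)
  col
})
df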

In creating new data through estimation, rows of missing data are added to the dataset and a separate data-imputation model is used to fill in the missing values in those rows. Multivariate Imputation by Chained Equations (MICE) is one of the most popular algorithms available for imputing missing data regardless of data type, handling mixes of continuous, binary, unordered-categorical and ordered-categorical data.

There are plenty of tutorials available for k-fold and leave-one-out validation. This tutorial will focus on the fourth option, where new data is created to deal with a small sample size. A simple classification model will be trained on both the original and the imputed data to see whether there is a significant change in performance. In addition, the distributions of imputed and non-imputed data will be compared to check for any significant difference.

Load libraries

Let's load all the libraries we need for now.

options(warn=-1)

# load libraries
library(mice)
library(dplyr)

Load the data into a data frame

The data available in my GitHub repository is used for this analysis.

setwd("C:/OpenSourceWork/Experiment")
#read csv files
file1 = read.csv("dry run.csv", sep=",", header =T)
file2 = read.csv("base.csv", sep=",", header =T)
file3 = read.csv("imbalance 1.csv", sep=",", header =T)
file4 = read.csv("imbalance 2.csv", sep=",", header =T)

#Add labels to data
file1$y = 1
file2$y = 2
file3$y = 3
file4$y = 4

#view top rows of data
head(file1)
   time      ax      ay      az    aT y
1 0.002 -0.3246  0.2748  0.1502 0.451 1
2 0.009  0.6020 -0.1900 -0.3227 0.709 1
3 0.019  0.9787  0.3258  0.0124 1.032 1
4 0.027  0.6141 -0.4179  0.0471 0.744 1
5 0.038 -0.3218 -0.6389 -0.4259 0.833 1
6 0.047 -0.3607  0.1332 -0.1291 0.406 1
Raw data

Create some features from the data

The data used in this study is vibration data with different states, collected at 100 Hz. Used as-is, the data is high-dimensional, and we don't have a good summary of it, so some statistical features are extracted: in this case, the sample standard deviation, mean, minimum, maximum and median. The data is also aggregated into 1-second windows.

file1$group = as.factor(round(file1$time))
file2$group = as.factor(round(file2$time))
file3$group = as.factor(round(file3$time))
file4$group = as.factor(round(file4$time))
#head(file1, 20)

#list of all files
files = list(file1, file2, file3, file4)

#loop through all files and combine
features = NULL
for (i in 1:4){
res = files[[i]] %>%
    group_by(group) %>%
    summarize(ax_mean = mean(ax),
              ax_sd = sd(ax),
              ax_min = min(ax),
              ax_max = max(ax),
              ax_median = median(ax),
              ay_mean = mean(ay),
              ay_sd = sd(ay),
              ay_min = min(ay),
              ay_may = max(ay),
              ay_median = median(ay),
              az_mean = mean(az),
              az_sd = sd(az),
              az_min = min(az),
              az_maz = max(az),
              az_median = median(az),
              aT_mean = mean(aT),
              aT_sd = sd(aT),
              aT_min = min(aT),
              aT_maT = max(aT),
              aT_median = median(aT),
              y = mean(y)
             )
    features = rbind(features, res)
}

features = subset(features, select = -group)

# store it in a df for future reference
actual.features = features

Examine the data

First, let's look at the size of our data and a summary of our features, along with their data types.

# show data types
str(features)
Classes 'tbl_df', 'tbl' and 'data.frame':	362 obs. of  21 variables:
 $ ax_mean  : num  -0.03816 -0.00581 0.06985 0.01155 0.04669 ...
 $ ax_sd    : num  0.659 0.633 0.667 0.551 0.643 ...
 $ ax_min   : num  -1.26 -1.62 -1.46 -1.93 -1.78 ...
 $ ax_max   : num  1.38 1.19 1.47 1.2 1.48 ...
 $ ax_median: num  -0.0955 -0.0015 0.107 0.0675 0.0836 ...
 $ ay_mean  : num  -0.068263 0.003791 0.074433 0.000826 -0.017759 ...
 $ ay_sd    : num  0.751 0.782 0.802 0.789 0.751 ...
 $ ay_min   : num  -1.39 -1.56 -1.48 -2 -1.66 ...
 $ ay_may   : num  1.64 1.54 1.8 1.56 1.44 ...
 $ ay_median: num  -0.19 0.0101 0.1186 -0.0027 -0.0253 ...
 $ az_mean  : num  -0.138 -0.205 -0.0641 -0.0929 -0.1399 ...
 $ az_sd    : num  0.985 0.925 0.929 0.889 0.927 ...
 $ az_min   : num  -2.68 -3.08 -1.82 -2.16 -1.85 ...
 $ az_maz   : num  2.75 2.72 2.49 3.24 3.55 ...
 $ az_median: num  0.0254 -0.2121 -0.1512 -0.1672 -0.1741 ...
 $ aT_mean  : num  1.27 1.26 1.3 1.2 1.23 ...
 $ aT_sd    : num  0.583 0.545 0.513 0.513 0.582 ...
 $ aT_min   : num  0.4 0.41 0.255 0.393 0.313 0.336 0.275 0.196 0.032 0.358 ...
 $ aT_maT   : num  3.03 3.2 2.64 3.32 3.6 ...
 $ aT_median: num  1.08 1.14 1.28 1.12 1.17 ...
 $ y        : num  1 1 1 1 1 1 1 1 1 1 ...

Create observations with NA values at the end

Next, we will append some rows of NAs at the end of the table for this tutorial.

features1 = features
for(i in 363:400){
  features1[i,] = NA
}

View the bottom 50 rows

We can see the missing values at the end of the table.


Disclaimer: here the last 38 rows (363–400) are entirely NA. In the real world this is highly unlikely; you would more often have just a few missing values scattered across the dataset.
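
If you wanted to mimic that scattered pattern instead, a minimal sketch (the 5% fraction and the features.sparse name are illustrative choices) could be:

# sketch: blank ~5% of the values in each column at random
set.seed(123)
features.sparse = as.data.frame(lapply(features, function(col) {
  col[sample(length(col), round(0.05 * length(col)))] = NA
  col
}))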

tail(features1, 50)
        ax_mean     ax_sd  ax_min ax_max ax_median      ay_mean    ay_sd  ay_min ay_may ay_median    az_sd  az_min az_maz az_median  aT_mean     aT_sd aT_min aT_maT aT_median y
351 -0.016097030 0.8938523 -2.3445 2.3006  -0.07360 -0.009759406 1.311817 -3.4215 2.5028   0.10890 1.264572 -2.8751 3.3718  -0.07070 1.866030 0.7808319  0.380  4.098    1.8200 4
352 -0.015565347 0.8956615 -2.2661 2.5089   0.08640  0.027313861 1.294063 -2.9421 2.3497   0.15260 1.368576 -3.3165 2.6989  -0.01660 1.930426 0.7749686  0.127  4.463    1.8350 4
353  0.024006250 0.8653758 -2.4040 2.5328  -0.03170  0.008440625 1.376398 -3.0422 2.3727   0.11390 1.449783 -4.2171 4.7703   0.00110 2.003552 0.8300253  0.387  5.138    1.9920 4
... (rows 354–362, the remaining actual observations, omitted here for brevity) ...
363          NA        NA      NA     NA        NA           NA       NA      NA     NA        NA       NA      NA     NA        NA       NA        NA     NA     NA        NA NA
... (rows 364–400 are likewise all NA) ...

Fill the NAs with better values using an iterative method

Next, to fill in the missing values, we will use the mice() function. We will keep the maximum number of iterations at 50 and the method as 'pmm' (predictive mean matching).

imputed_Data = mice(features1, 
                    m=1, 
                    maxit = 50, 
                    method = 'pmm', 
                    seed = 999, 
                    printFlag =FALSE)

View the imputed results

Now we have imputed results. We will use the first imputed data frame for this study. You can test the different imputations to see which one works best, as sketched below.
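
For instance, with m greater than 1, mice returns several completed datasets that you can extract and compare; a minimal sketch (m = 5 is an arbitrary choice):

# sketch: generate 5 imputations and pull out, e.g., the second completed dataset
imputed_many = mice(features1, m = 5, maxit = 50, method = 'pmm',
                    seed = 999, printFlag = FALSE)
second_imputation = mice::complete(imputed_many, 2)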

imputedResultData = mice::complete(imputed_Data,1)
tail(imputedResultData, 50)
        ax_mean     ax_sd  ax_min ax_max ax_median      ay_mean     ay_sd  ay_min ay_may ay_median     az_sd  az_min az_maz az_median   aT_mean     aT_sd aT_min aT_maT aT_median y
351 -0.016097030 0.8938523 -2.3445 2.3006  -0.07360 -0.009759406 1.3118166 -3.4215 2.5028   0.10890 1.2645719 -2.8751 3.3718  -0.07070 1.8660297 0.7808319  0.380  4.098    1.8200 4
... (rows 352–362 match the actual data shown above) ...
363  0.011238614 0.8127502 -1.9602 2.1430   0.00680 -0.013367308 1.3019546 -3.0628 2.7338   0.00070 1.4534581 -4.4325 2.9648  -0.03520 1.9383000 0.8526128  0.373  4.351    1.8705 4
364 -0.009812264 0.7680463 -2.3492 1.3919   0.03110  0.013984158 0.6084791 -1.4155 0.9273   0.11860 0.9997898 -3.0031 3.5781  -0.25930 1.2219510 0.6450616  0.233  3.603    1.0730 1
365 -0.026760000 0.4780558 -1.1826 0.9934   0.05560 -0.035218269 0.5632648 -1.0761 1.2307  -0.08165 0.7635922 -2.3115 1.8934   0.03005 0.9714200 0.4214891  0.214  2.180    0.9265 1
366  0.029083000 0.7515921 -2.2628 2.4640  -0.00820  0.011159596 1.3073606 -3.1360 2.8527   0.04010 1.4534581 -3.6751 2.6187  -0.22680 1.9367549 0.7439326  0.354  4.156    1.8450 4
... (rows 367–400, likewise filled in by mice, omitted here for brevity) ...

Looking at the distributions of actual and imputed data

We will first compare basic statistics, and then the distributions of a couple of features. Comparing the statistics of actual versus imputed data, we can observe that the mean and SD are almost equal in both.

data.frame(actual_ax_mean = c(mean(features$ax_mean), sd(features$ax_mean)) 
           , imputed_ax_mean = c(mean(imputedResultData$ax_mean), sd(imputedResultData$ax_mean))
           , actual_ax_median = c(mean(features$ax_median), sd(features$ax_median)) 
           , imputed_ax_median = c(mean(imputedResultData$ax_median), sd(imputedResultData$ax_median))
           , actual_az_sd = c(mean(features$az_sd), sd(features$az_sd)) 
           , imputed_az_sd = c(mean(imputedResultData$az_sd), sd(imputedResultData$az_sd))
           , row.names = c("mean", "sd"))
     actual_ax_mean imputed_ax_mean actual_ax_median imputed_ax_median actual_az_sd imputed_az_sd
mean    0.006307909     0.005851233     -0.001328867       -0.00214025    1.0588650     1.0528059
sd      0.030961085     0.031125848      0.059619834        0.06011342    0.2446782     0.2477697

Now, let's look at the distributions of the data. From the density plots below, we can observe that the distributions of the actual data and the imputed data are almost identical, which we can confirm from the bandwidths shown in the plots.

par(mfrow=c(3,2))
plot(density(features$ax_mean), main = "Actual ax_mean", type="l", col="red")
plot(density(imputedResultData$ax_mean), main = "Imputed ax_mean", type="l", col="red")
plot(density(features$ax_median), main = "Actual ax_median", type="l", col="red")
plot(density(imputedResultData$ax_median), main = "Imputed ax_median", type="l", col="red")
plot(density(features$az_sd), main = "Actual az_sd", type="l", col="red")
plot(density(imputedResultData$az_sd), main = "Imputed az_sd", type="l", col="red")
Density plots

Building a classification model based on actual data and imputed data

In the following, y will be our classification variable. We will build a classification model using a simple support vector machine (SVM) on both the actual and the imputed data. No transformation will be done on the data. At the end we will compare the results.


Actual Data

Sample data creation

Let's split the data into train and test sets with an 80:20 ratio.

#create samples of 80:20 ratio
features$y = as.factor(features$y)
sample = sample(nrow(features) , nrow(features)* 0.8)
train = features[sample,]
test = features[-sample,]

Build an SVM model

Now we can train the model using the train set. We will not do any parameter tuning in this example (a tuning sketch follows the model summary below).

library(e1071)
library(caret)

actual.svm.model = svm(y ~., data = train)
summary(actual.svm.model)
Call:
svm(formula = y ~ ., data = train)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.05 

Number of Support Vectors:  142

 ( 47 18 47 30 )


Number of Classes:  4 

Levels: 
 1 2 3 4
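
If you did want to tune the cost and gamma parameters, a hedged sketch using e1071's tune.svm (the search grids here are arbitrary choices) would be:

# sketch: grid-search cost and gamma with 10-fold CV (grids are arbitrary)
tuned = tune.svm(y ~ ., data = train, cost = 10^(-1:2), gamma = 10^(-3:-1))
summary(tuned)
best.svm.model = tuned$best.model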


Validate SVM model

In the confusion matrix below, we observe the following:

  1. Accuracy > NIR (no-information rate), indicating the model performs far better than random guessing
  2. High accuracy and kappa values indicate a very accurate model
  3. Even the balanced accuracy is close to 1, indicating the model is highly accurate
# build a confusion matrix using caret package
confusionMatrix(predict(actual.svm.model, test), test$y)
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 10  1  0  0
         2  0 26  0  0
         3  0  0 22  0
         4  0  0  3 11

Overall Statistics
                                          
               Accuracy : 0.9452          
                 95% CI : (0.8656, 0.9849)
    No Information Rate : 0.3699          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9234          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            1.0000   0.9630   0.8800   1.0000
Specificity            0.9841   1.0000   1.0000   0.9516
Pos Pred Value         0.9091   1.0000   1.0000   0.7857
Neg Pred Value         1.0000   0.9787   0.9412   1.0000
Prevalence             0.1370   0.3699   0.3425   0.1507
Detection Rate         0.1370   0.3562   0.3014   0.1507
Detection Prevalence   0.1507   0.3562   0.3014   0.1918
Balanced Accuracy      0.9921   0.9815   0.9400   0.9758

Imputed Data

Sample data creation

# create samples of 80:20 ratio
imputedResultData$y = as.factor(imputedResultData$y)
sample = sample(nrow(imputedResultData) , nrow(imputedResultData)* 0.8)
train = imputedResultData[sample,]
test = imputedResultData[-sample,]

Build an SVM model

imputed.svm.model = svm(y ~., data = train)
summary(imputed.svm.model)
Call:
svm(formula = y ~ ., data = train)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.05 

Number of Support Vectors:  167

 ( 59 47 36 25 )


Number of Classes:  4 

Levels: 
 1 2 3 4


Validate SVM model

In the confusion matrix below, we observe the following:

  1. Accuracy > NIR (no-information rate), indicating the model performs far better than random guessing
  2. High accuracy and kappa values indicate a very accurate model
  3. Even the balanced accuracy is close to 1, indicating the model is highly accurate
confusionMatrix(predict(imputed.svm.model, test), test$y)
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 15  0  0  0
         2  1 21  0  0
         3  0  0 17  0
         4  0  0  0 26

Overall Statistics
                                          
               Accuracy : 0.9875          
                 95% CI : (0.9323, 0.9997)
    No Information Rate : 0.325           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9831          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            0.9375   1.0000   1.0000    1.000
Specificity            1.0000   0.9831   1.0000    1.000
Pos Pred Value         1.0000   0.9545   1.0000    1.000
Neg Pred Value         0.9846   1.0000   1.0000    1.000
Prevalence             0.2000   0.2625   0.2125    0.325
Detection Rate         0.1875   0.2625   0.2125    0.325
Detection Prevalence   0.1875   0.2750   0.2125    0.325
Balanced Accuracy      0.9688   0.9915   1.0000    1.000

Overall results

What we saw above and its interpretation is completely subjective. One way to truly validate it is to create random train and test samples many times (say 100), build a model each time, validate it, and capture the accuracy. Finally, use a simple t-test to see whether there is a significant difference.

Null hypothesis:
H0: there is no significant difference between two samples.

# lets create functions to simplify the process

test.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    svm.model = svm(y ~., data = train)
    
    # get metrics
    metrics = confusionMatrix(predict(svm.model, test), test$y)
    return(metrics$overall['Accuracy'])
    
}
# now lets calculate accuracy with actual data to get 100 results
actual.results  = NULL
for(i in 1:100) {
    actual.results[i] = test.function(features)
}
head(actual.results)

# 0.978021978021978
# 0.978021978021978
# 0.978021978021978
# 0.945054945054945
# 0.989010989010989
# 0.967032967032967
# now lets calculate accuracy with imputed data to get 100 results
imputed.results  = NULL
for(i in 1:100) {
    imputed.results[i] = test.function(imputedResultData)
}
head(imputed.results)
# 0.97
# 0.95
# 0.92
# 0.96
# 0.92
# 0.96

T-test to test the results

What's better than statistically proving whether there is a significant difference, right? So we will run a t-test to check for a statistical difference in accuracy.

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.results, y = imputed.results, conf.level = 0.95)
	Welch Two Sample t-test

data:  actual.results and imputed.results
t = 7.9834, df = 194.03, p-value = 1.222e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01673213 0.02771182
sample estimates:
mean of x mean of y 
 0.968022  0.945800 

In the above t-test we set the confidence level at 95%. From the results we can observe that the p-value is less than 0.05, indicating that there is a significant difference in accuracy between the actual data and the imputed data. From the means we can see that the average accuracy with actual data is about 96.8%, while with imputed data it is about 94.6%, a drop of roughly 2.2 percentage points. So, does that mean imputing more data reduces accuracy across various models?

Why not run the same test with other models to compare the results? Let's consider 4 other models:

  1. Random forest
  2. Decision tree
  3. KNN
  4. Naive Bayes

Random Forest

Let's use all the same steps as above and fit different models. The accuracy results are shown in the tables below.

library(randomForest)

# lets create functions to simplify the process

test.rf.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    rf.model = randomForest(y ~., data = train)
    
    # get metrics
    metrics = confusionMatrix(predict(rf.model, test), test$y)
    return(metrics$overall['Accuracy'])
    
}

# now lets calculate accuracy with actual data to get 100 results
actual.rf.results  = NULL
for(i in 1:100) {
    actual.rf.results[i] = test.rf.function(features)
}
#head(actual.rf.results)

# now lets calculate accuracy with imputed data to get 100 results
imputed.rf.results  = NULL
for(i in 1:100) {
    imputed.rf.results[i] = test.rf.function(imputedResultData)
}
head(data.frame(Actual = actual.rf.results, Imputed = imputed.rf.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.rf.results, y = imputed.rf.results, conf.level = 0.95)
Actual Imputed
0.956044 0.95
1.000000 0.93
0.967033 0.96
0.967033 0.96
1.000000 0.97
0.967033 0.93
Random forest accuracy for actual and imputed data
	Welch Two Sample t-test

data:  actual.rf.results and imputed.rf.results
t = 11.734, df = 183.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02183138 0.03065654
sample estimates:
mean of x mean of y 
 0.976044  0.949800 

In the above t-test results we come to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 2.6 percentage points.

Decision Tree

library(rpart)

# lets create functions to simplify the process

test.dt.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    dt.model = rpart(y ~., data = train, method="class")
    
    # get metrics
    metrics = confusionMatrix(predict(dt.model, test, type="class"), test$y)
    return(metrics$overall['Accuracy'])
    
}

# now lets calculate accuracy with actual data to get 100 results
actual.dt.results  = NULL
for(i in 1:100) {
    actual.dt.results[i] = test.dt.function(features)
}
#head(actual.rf.results)

# now lets calculate accuracy with imputed data to get 100 results
imputed.dt.results  = NULL
for(i in 1:100) {
    imputed.dt.results[i] = test.dt.function(imputedResultData)
}
head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
Actual Imputed
0.978022 0.92
0.967033 0.94
0.967033 0.95
0.956044 0.94
0.956044 0.94
0.978022 0.95
Decision tree accuracy for actual and imputed data
	Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 16.24, df = 167.94, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.03331888 0.04254046
sample estimates:
mean of x mean of y 
0.9703297 0.9324000 

In the above t-test results we come to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 3.8 percentage points.

K-Nearest Neighbor (KNN)

library(class)

# lets create functions to simplify the process

test.knn.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    # note: train and test here still contain the class column y among the predictors
    knn.model = knn(train, test, cl = train$y, k = 5)
    
    # get metrics
    metrics = confusionMatrix(knn.model, test$y)
    return(metrics$overall['Accuracy'])
    
}

# now lets calculate accuracy with actual data to get 100 results
actual.dt.results  = NULL
for(i in 1:100) {
    actual.dt.results[i] = test.knn.function(features)
}
#head(actual.rf.results)

# now lets calculate accuracy with imputed data to get 100 results
imputed.dt.results  = NULL
for(i in 1:100) {
    imputed.dt.results[i] = test.knn.function(imputedResultData)
}
head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
Actual Imputed
0.967033 0.97
1.000000 0.98
0.978022 0.99
0.978022 1.00
0.967033 1.00
0.978022 1.00
KNN accuracy for actual and imputed data
	Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 3.2151, df = 166.45, p-value = 0.001566
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.002126868 0.008895110
sample estimates:
mean of x mean of y 
 0.989011  0.983500 

In the above t-test results we come to a similar conclusion: the difference between the actual-data and imputed-data accuracy is statistically significant, though here it is only about 0.6 percentage points.

Naive Bayes

# lets create functions to simplify the process

test.nb.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    nb.model = naiveBayes(y ~., data = train)
    
    # get metrics
    metrics = confusionMatrix(predict(nb.model, test), test$y)
    return(metrics$overall['Accuracy'])
    
}

# now lets calculate accuracy with actual data to get 100 results
actual.nb.results  = NULL
for(i in 1:100) {
    actual.nb.results[i] = test.nb.function(features)
}
#head(actual.rf.results)

# now lets calculate accuracy with imputed data to get 100 results
imputed.nb.results  = NULL
for(i in 1:100) {
    imputed.nb.results[i] = test.nb.function(imputedResultData)
}
head(data.frame(Actual = actual.nb.results, Imputed = imputed.nb.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.nb.results, y = imputed.nb.results, conf.level = 0.95)
Actual Imputed
0.989011 0.95
0.967033 0.92
0.978022 0.94
1.000000 0.95
0.989011 0.90
0.967033 0.93
Naive Bayes accuracy for actual and imputed data
	Welch Two Sample t-test

data:  actual.nb.results and imputed.nb.results
t = 18.529, df = 174.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.04214191 0.05218996
sample estimates:
mean of x mean of y 
0.9740659 0.9269000 

In the above t-test results we come to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 4.7 percentage points.

Conclusion

From the above results we observe that, irrespective of the type of model built, using imputed data reduced accuracy, with drops ranging from under 1 percentage point (KNN) to almost 5 (Naive Bayes), as summarized below. In all cases, the actual data produced a better model than the model built on imputed data.
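
For reference, here are the mean accuracies reported by the t-tests above (pp = percentage points):

Model           Actual   Imputed   Drop (pp)
SVM             0.968    0.946     2.2
Random forest   0.976    0.950     2.6
Decision tree   0.970    0.932     3.8
KNN             0.989    0.984     0.6
Naive Bayes     0.974    0.927     4.7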

If you enjoyed this tutorial, then check out my other tutorials and my GitHub page for all the source code and various R-packages.

  • Testing the Effect of Data Imputation on Model Accuracy
  • Anomaly Detection for Predictive Maintenance using Keras
  • Predictive Maintenance: Zero to Deployment in Manufacturing
  • Free coding education in the time of Covid-19
  • AutoML Frameworks in R & Python

