Testing the Effect of Data Imputation on Model Accuracy



[This article was first published on R – Hi! I am Nagdev, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here.)



Most of us have run into situations where we don't have enough data to build reliable models, for various reasons: collecting data may be expensive (studies on humans), resources may be limited, or historical data may simply not exist (earthquakes). Before we even talk about how to overcome this challenge, let's first discuss why we need a minimum number of samples before thinking about building a model. First of all, can we build a model with only a few samples? It is definitely possible! However, as the number of samples decreases, the margin of error increases, and vice versa. If you want to build the most accurate model you can, you need as many samples as possible. If the model is intended for a real-world application, you will need data collected over several days so that it can respond to any changes in the system. There is a formula that can be used to calculate the required sample size, and it is as follows:

n = (Z × σ / MOE)²

where, n = sample size

Z = Z-score value

σ = population standard deviation

MOE = acceptable margin of error

You can also compute this with an online calculator, such as the one at this link:
https://www.qualtrics.com/blog/calculating-sample-size/
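
As a quick sanity check, the formula can be evaluated directly in R. The values and variable names below are my own and purely illustrative: a Z-score of 1.96 for a 95% confidence level, an assumed population standard deviation of 0.5, and a 5% margin of error.

# illustrative sample-size calculation using the formula above
z     = 1.96   # Z-score for a 95% confidence level
sigma = 0.5    # assumed population standard deviation
moe   = 0.05   # acceptable margin of error

n = (z * sigma / moe)^2
ceiling(n)     # round up to the next whole sample, 385 here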

Now that we know why a minimum number of samples is needed to reach the required precision, let's say that in some cases we do not have the opportunity to collect more samples, or no more are available. We then have the option of doing the following:

  1. K-fold cross-validation
  2. Leave-p-out cross-validation
  3. Leave-one-out cross-validation
  4. Creating new data by estimation

In the K-fold method, the data is split into k partitions; the model is trained on all but one partition and tested on the held-out partition, rotating until every partition has been used for testing. Not all possible combinations are considered, only the partitions specified by the user. In leave-one-out (and leave-p-out), every combination or partition is considered, which makes it a more exhaustive validation technique. K-fold and leave-one-out are the most popular techniques used in machine learning and deep learning.
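
As a rough illustration, this is how a k-fold and a leave-one-out setup are typically declared with caret's trainControl; the object names are my own and these objects are not used in the rest of the post.

# minimal sketch of k-fold and leave-one-out CV setups with caret
library(caret)
cv10  = trainControl(method = "cv", number = 10)   # 10-fold cross-validation
loocv = trainControl(method = "LOOCV")             # leave-one-out cross-validation
# these objects are then passed to train(), e.g. train(y ~ ., data = df, method = "svmRadial", trControl = cv10)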

When it comes to handling NAs in a dataset, we usually impute them with the mean, the median, zero, or random values. But that would probably not make sense when we want to create new data.

When creating new data by estimation, rows with missing values are added to the dataset and a separate data-imputation model is used to fill in the missing values in those rows. Multivariate Imputation by Chained Equations (MICE) is one of the most popular algorithms available for imputing missing data regardless of data type, handling mixes of continuous, binary, unordered categorical and ordered categorical data.

There are plenty of tutorials available for k-fold and leave-one-out validation, so this tutorial will focus on the fourth option, in which new data is created to deal with a small sample size. A simple classification model will be trained on both the original and the imputed data to see whether there is a significant difference in performance. In addition, the distributions of the imputed and non-imputed data will be compared to check for any significant differences.

Load libraries

Let's load all the libraries we need for now.

# suppress warnings for cleaner output
options(warn=-1)

# load libraries
library(mice)
library(dplyr)

Load the data into a data frame

The data available in my GitHub repository is used for the analysis.

setwd("C:/OpenSourceWork/Experiment")
#read csv files
file1 = read.csv("dry run.csv", sep=",", header =T)
file2 = read.csv("base.csv", sep=",", header =T)
file3 = read.csv("imbalance 1.csv", sep=",", header =T)
file4 = read.csv("imbalance 2.csv", sep=",", header =T)

#Add labels to data
file1$y = 1
file2$y = 2
file3$y = 3
file4$y = 4

#view top rows of data
head(file1)
 time      ax      ay      az     aT y
0.002 -0.3246  0.2748  0.1502 0.4511 1
0.009  0.6020 -0.1900 -0.3227 0.7091 1
0.019  0.9787  0.3258  0.0124 1.0321 1
0.027  0.6141 -0.4179  0.0471 0.7441 1
0.038 -0.3218 -0.6389 -0.4259 0.8331 1
0.047 -0.3607  0.1332 -0.1291 0.4061 1
Raw data

Create some features from the data

The data used in this study is vibration data with different states, collected at 100 Hz. Used as-is the data is high-dimensional, and we do not have a good summary of it. Therefore, a few statistical features are extracted: the sample standard deviation, sample mean, sample minimum, sample maximum and sample median. The data is also aggregated into 1-second windows.

file1$group = as.factor(round(file1$time))
file2$group = as.factor(round(file2$time))
file3$group = as.factor(round(file3$time))
file4$group = as.factor(round(file4$time))
#(file1,20)

#list of all files
files = list(file1, file2, file3, file4)

#loop through all files and combine
features = NULL
for (i in 1:4){
res = files[[i]] %>%
    group_by(group) %>%
    summarize(ax_mean = mean(ax),
              ax_sd = sd(ax),
              ax_min = min(ax),
              ax_max = max(ax),
              ax_median = median(ax),
              ay_mean = mean(ay),
              ay_sd = sd(ay),
              ay_min = min(ay),
              ay_may = max(ay),
              ay_median = median(ay),
              az_mean = mean(az),
              az_sd = sd(az),
              az_min = min(az),
              az_maz = max(az),
              az_median = median(az),
              aT_mean = mean(aT),
              aT_sd = sd(aT),
              aT_min = min(aT),
              aT_maT = max(aT),
              aT_median = median(aT),
              y = mean(y)
             )
    features = rbind(features, res)
}

features = subset(features, select = -group)

# store it in a df for future reference
actual.features = features

Study the data

First, let's look at the size of our populations and a summary of our features, along with their data types.

# show data types
str(features)
Classes 'tbl_df', 'tbl' and 'data.frame':	362 obs. of  21 variables:
 $ ax_mean  : num  -0.03816 -0.00581 0.06985 0.01155 0.04669 ...
 $ ax_sd    : num  0.659 0.633 0.667 0.551 0.643 ...
 $ ax_min   : num  -1.26 -1.62 -1.46 -1.93 -1.78 ...
 $ ax_max   : num  1.38 1.19 1.47 1.2 1.48 ...
 $ ax_median: num  -0.0955 -0.0015 0.107 0.0675 0.0836 ...
 $ ay_mean  : num  -0.068263 0.003791 0.074433 0.000826 -0.017759 ...
 $ ay_sd    : num  0.751 0.782 0.802 0.789 0.751 ...
 $ ay_min   : num  -1.39 -1.56 -1.48 -2 -1.66 ...
 $ ay_may   : num  1.64 1.54 1.8 1.56 1.44 ...
 $ ay_median: num  -0.19 0.0101 0.1186 -0.0027 -0.0253 ...
 $ az_mean  : num  -0.138 -0.205 -0.0641 -0.0929 -0.1399 ...
 $ az_sd    : num  0.985 0.925 0.929 0.889 0.927 ...
 $ az_min   : num  -2.68 -3.08 -1.82 -2.16 -1.85 ...
 $ az_maz   : num  2.75 2.72 2.49 3.24 3.55 ...
 $ az_median: num  0.0254 -0.2121 -0.1512 -0.1672 -0.1741 ...
 $ aT_mean  : num  1.27 1.26 1.3 1.2 1.23 ...
 $ aT_sd    : num  0.583 0.545 0.513 0.513 0.582 ...
 $ aT_min   : num  0.4 0.41 0.255 0.393 0.313 0.336 0.275 0.196 0.032 0.358 ...
 $ aT_maT   : num  3.03 3.2 2.64 3.32 3.6 ...
 $ aT_median: num  1.08 1.14 1.28 1.12 1.17 ...
 $ y        : num  1 1 1 1 1 1 1 1 1 1 ...
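
The population sizes mentioned above are not shown by str(); a quick way to list how many aggregated rows each class label ended up with (output omitted here) is:

# number of aggregated rows per class label
table(features$y)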

Create observations with NA values at the end

Next, we will assign some NAs at the end of the table for this tutorial.

features1 = features
# append empty rows (363 to 400) filled with NA
for(i in 363:400){
  features1[i,] = NA
}

View the bottom 50 rows

We can see the missing values at the end of the table.


Disclaimer: here, all of the rows appended at the end are NA. In the real world this is highly unlikely; you may have only a few missing values scattered around.
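
If you would rather scatter missing values through the table instead of adding a solid block of NA rows at the end, mice also ships an ampute() helper. A small sketch, not used in this tutorial, with an illustrative 10% proportion of incomplete cases:

# sketch: scatter ~10% incomplete cases through the features instead of a block of NA rows
amp = mice::ampute(features[, -21], prop = 0.10, mech = "MCAR")   # column 21 is the label y
features.scattered = amp$amp   # the amputed data frame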

tail(features1, 50)
[Output of tail(features1, 50): rows 351 to 362 show the observed feature values; rows 363 to 400 are entirely NA.]

Fill the NAs with better values using an iterative method

Next, to fill in the missing values, we will use the mice function. We will keep the maximum number of iterations at 50 and use 'pmm' (predictive mean matching) as the method.

imputed_Data = mice(features1, 
                    m=1, 
                    maxit = 50, 
                    method = 'pmm', 
                    seed = 999, 
                    printFlag =FALSE)
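
Before using the result, it is worth a quick look at the diagnostics mice provides. An optional check (output not shown):

# optional diagnostics on the mids object returned by mice()
plot(imputed_Data)          # trace plots of the mean and sd of imputed values per iteration
densityplot(imputed_Data)   # observed vs. imputed densities (uses lattice)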

View the imputed results

Now we have the imputed results. We will use the first imputed data frame for this study. You can test each of the different imputations to see which one works best.
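
With m = 1 above there is only a single completed dataset. As a sketch of what testing several imputations could look like (a hypothetical m = 5, reusing the same settings):

# sketch: impute with m = 5 and pull out every completed dataset for comparison
imp5 = mice(features1, m = 5, maxit = 50, method = 'pmm', seed = 999, printFlag = FALSE)
completedSets = lapply(1:5, function(i) mice::complete(imp5, i))
# each element of completedSets could then be run through the same modeling steps below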

imputedResultData = mice::complete(imputed_Data,1)
tail(imputedResultData, 50)
[Output of tail(imputedResultData, 50): rows 351 to 362 are unchanged, and rows 363 to 400, which were previously NA, now contain values imputed by pmm.]

Looking at the distributions of the actual data and the imputed data

We will first compare basic statistics and then the distributions of a couple of features. Comparing the statistics of the actual and imputed data, we can observe that the mean and SD are almost equal in both.

data.frame(actual_ax_mean = c(mean(features$ax_mean), sd(features$ax_mean)) 
           , imputed_ax_mean = c(mean(imputedResultData$ax_mean), sd(imputedResultData$ax_mean))
           , actual_ax_median = c(mean(features$ax_median), sd(features$ax_median)) 
           , imputed_ax_median = c(mean(imputedResultData$ax_median), sd(imputedResultData$ax_median))
           , actual_az_sd = c(mean(features$az_sd), sd(features$az_sd)) 
           , imputed_az_sd = c(mean(imputedResultData$az_sd), sd(imputedResultData$az_sd))
           , row.names = c("mean", "sd"))
     actual_ax_mean imputed_ax_mean actual_ax_median imputed_ax_median actual_az_sd imputed_az_sd
mean    0.006307909     0.005851233     -0.001328867       -0.00214025    1.0588650     1.0528059
sd      0.030961085     0.031125848      0.059619834        0.06011342    0.2446782     0.2477697

Now, let's look at the distributions of the data. From the density plots below, we can observe that the distributions of the actual data and the imputed data are almost identical. We can confirm this with the bandwidths shown in the plots.

par(mfrow=c(3,2))
plot(density(features$ax_mean), main = "Actual ax_mean", type="l", col="red")
plot(density(imputedResultData$ax_mean), main = "Imputed ax_mean", type="l", col="red")
plot(density(features$ax_median), main = "Actual ax_median", type="l", col="red")
plot(density(imputedResultData$ax_median), main = "Imputed ax_median", type="l", col="red")
plot(density(features$az_sd), main = "Actual az_sd", type="l", col="red")
plot(density(imputedResultData$az_sd), main = "Imputed az_sd", type="l", col="red")
Density plots
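
If you want a formal check on top of eyeballing the density plots, a two-sample Kolmogorov-Smirnov test from base R is one option; a large p-value means no detectable difference between the two distributions.

# optional: compare actual vs. imputed distributions with a two-sample KS test
ks.test(features$ax_mean, imputedResultData$ax_mean)
ks.test(features$az_sd,   imputedResultData$az_sd)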

Building a classification model based on actual data and Imputed data

In the following data, y will be our classification variable. We will build a classification model using a simple support vector machine (SVM) with both the actual and the imputed data. No transformation will be done on the data. At the end, we will compare the results.


Actual Data

Sample data creation

Let's split the data into train and test sets with an 80:20 ratio.

#create samples of 80:20 ratio
features$y = as.factor(features$y)
sample = sample(nrow(features) , nrow(features)* 0.8)
train = features[sample,]
test = features[-sample,]

Build a SVM model

Now we can train the model using the train set. We will not do any parameter tuning in this example.

library(e1071)
library(caret)

actual.svm.model = svm(y ~., data = train)
summary(actual.svm.model)
Loading required package: ggplot2
Call:
svm(formula = y ~ ., data = train)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.05 

Number of Support Vectors:  142

 ( 47 18 47 30 )


Number of Classes:  4 

Levels: 
 1 2 3 4
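
No tuning was done above; if you do want to tune, e1071 includes a small grid-search helper. A sketch with an illustrative grid, not run here:

# sketch: grid search over cost and gamma with e1071's tune.svm
tuned = tune.svm(y ~ ., data = train, gamma = 10^(-3:0), cost = 10^(0:2))
summary(tuned)
tuned$best.model   # the refit model with the best parameter pair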


Validate SVM model

In the confusion matrix below, we observe the following:

  1. Accuracy > NIR (no information rate), indicating the model is very good
  2. The high accuracy and kappa values indicate a very accurate model
  3. Even the balanced accuracy is close to 1, indicating the model is highly accurate
# build a confusion matrix using caret package
confusionMatrix(predict(actual.svm.model, test), test$y)
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 10  1  0  0
         2  0 26  0  0
         3  0  0 22  0
         4  0  0  3 11

Overall Statistics
                                          
               Accuracy : 0.9452          
                 95% CI : (0.8656, 0.9849)
    No Information Rate : 0.3699          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9234          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            1.0000   0.9630   0.8800   1.0000
Specificity            0.9841   1.0000   1.0000   0.9516
Pos Pred Value         0.9091   1.0000   1.0000   0.7857
Neg Pred Value         1.0000   0.9787   0.9412   1.0000
Prevalence             0.1370   0.3699   0.3425   0.1507
Detection Rate         0.1370   0.3562   0.3014   0.1507
Detection Prevalence   0.1507   0.3562   0.3014   0.1918
Balanced Accuracy      0.9921   0.9815   0.9400   0.9758

Imputed Data

Sample data creation

# create samples of 80:20 ratio
imputedResultData$y = as.factor(imputedResultData$y)
sample = sample(nrow(imputedResultData) , nrow(imputedResultData)* 0.8)
train = imputedResultData[sample,]
test = imputedResultData[-sample,]

Build a SVM model

imputed.svm.model = svm(y ~., data = train)
summary(imputed.svm.model)
Call:
svm(formula = y ~ ., data = train)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.05 

Number of Support Vectors:  167

 ( 59 47 36 25 )


Number of Classes:  4 

Levels: 
 1 2 3 4


Validate SVM model

In the confusion matrix below, we observe the following:

  1. Accuracy > NIR (no information rate), indicating the model is very good
  2. The high accuracy and kappa values indicate a very accurate model
  3. Even the balanced accuracy is close to 1, indicating the model is highly accurate
confusionMatrix(predict(imputed.svm.model, test), test$y)
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 15  0  0  0
         2  1 21  0  0
         3  0  0 17  0
         4  0  0  0 26

Overall Statistics
                                          
               Accuracy : 0.9875          
                 95% CI : (0.9323, 0.9997)
    No Information Rate : 0.325           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9831          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            0.9375   1.0000   1.0000    1.000
Specificity            1.0000   0.9831   1.0000    1.000
Pos Pred Value         1.0000   0.9545   1.0000    1.000
Neg Pred Value         0.9846   1.0000   1.0000    1.000
Prevalence             0.2000   0.2625   0.2125    0.325
Detection Rate         0.1875   0.2625   0.2125    0.325
Detection Prevalence   0.1875   0.2750   0.2125    0.325
Balanced Accuracy      0.9688   0.9915   1.0000    1.000

Overall results

What we saw above, and how we interpret it, is somewhat subjective. One way to truly validate it is to create random train and test samples many times (100 in the code below), build a model on each split, validate the model, and capture the accuracy. Finally, a simple t-test is used to check whether there is a significant difference.

Null hypothesis:
H0: there is no significant difference between two samples.

# lets create functions to simplify the process

test.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    svm.model = svm(y ~., data = train)
    
    # get metrics
    metrics = confusionMatrix(predict(svm.model, test), test$y)
    return(metrics$overall['Accuracy'])
    
}
# now lets calculate accuracy with actual data to get 100 results
actual.results  = NULL
for(i in 1:100) {
    actual.results[i] = test.function(features)
}
head(actual.results)

# 0.978021978021978
# 0.978021978021978
# 0.978021978021978
# 0.945054945054945
# 0.989010989010989
# 0.967032967032967
# now lets calculate accuracy with imputed data to get 100 results
imputed.results  = NULL
for(i in 1:100) {
    imputed.results[i] = test.function(imputedResultData)
}
head(imputed.results)
# 0.97
# 0.95
# 0.92
# 0.96
# 0.92
# 0.96

T-test to test the results

What's better than statistically proving whether there is a significant difference? So, we will run a t-test to check for a statistically significant difference in accuracy.

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.results, y = imputed.results, conf.level = 0.95)
	Welch Two Sample t-test

data:  actual.results and imputed.results
t = 7.9834, df = 194.03, p-value = 1.222e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01673213 0.02771182
sample estimates:
mean of x mean of y 
 0.968022  0.945800 

In the above t-test we set the confidence level at 95%. From the results we can observe that the p-value is well below 0.05, indicating that there is a significant difference in accuracy between the actual data and the imputed data. From the means we can see that the average accuracy with the actual data is about 96.8%, while the average accuracy with the imputed data is about 94.6%, a difference of roughly 2.2 percentage points. So, does that mean that imputing more data reduces accuracy across various models?

Why not run a test to compare the results? Let's consider 4 other models for that (a generic version of the test helper is sketched after the list):

  1. Random forest
  2. Decision tree
  3. KNN
  4. Naive Bayes
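
To avoid copy-pasting an almost identical helper for each of these models, one option is to pass the fitting and prediction steps in as arguments. This is only a sketch under the same assumptions as above; the names are my own, and the per-model helpers used in the rest of the post are kept as written.

# sketch: a generic runner that accepts any fit/predict pair
run.experiment = function(data, fit, pred = predict, n = 100, split = 0.75) {
    replicate(n, {
        idx   = sample(nrow(data), nrow(data) * split)
        train = data[idx, ]
        test  = data[-idx, ]
        model = fit(train)
        confusionMatrix(pred(model, test), test$y)$overall['Accuracy']
    })
}

# e.g. for the SVM case:
# actual.results  = run.experiment(features, function(d) svm(y ~ ., data = d))
# imputed.results = run.experiment(imputedResultData, function(d) svm(y ~ ., data = d))
# (rpart would need pred = function(m, d) predict(m, d, type = "class"))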

Random Forest

Let's use the same steps as above and fit the different models. The accuracy results are shown in the table below.

library(randomForest)

# lets create functions to simplify the process

test.rf.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    rf.model = randomForest(y ~., data = train)
    
    # get metrics
    metrics = confusionMatrix(predict(rf.model, test), test$y)
    return(metrics$overall['Accuracy'])
    
}

# now lets calculate accuracy with actual data to get 100 results
actual.rf.results  = NULL
for(i in 1:100) {
    actual.rf.results[i] = test.rf.function(features)
}
#head(actual.rf.results)

# now lets calculate accuracy with imputed data to get 100 results
imputed.rf.results  = NULL
for(i in 1:100) {
    imputed.rf.results[i] = test.rf.function(imputedResultData)
}
head(data.frame(Actual = actual.rf.results, Imputed = imputed.rf.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.rf.results, y = imputed.rf.results, conf.level = 0.95)
   Actual Imputed
 0.956044    0.95
 1.000000    0.93
 0.967033    0.96
 0.967033    0.96
 1.000000    0.97
 0.967033    0.93
Random forest accuracy for actual and imputed data
	Welch Two Sample t-test

data:  actual.rf.results and imputed.rf.results
t = 11.734, df = 183.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02183138 0.03065654
sample estimates:
mean of x mean of y 
 0.976044  0.949800 

From the above t-test results we can come to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 2.6%.

Decision Tree

library(rpart)

# lets create functions to simplify the process

test.dt.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    dt.model = rpart(y ~., data = train, method="class")
    
    # get metrics
    metrics = confusionMatrix(predict(dt.model, test, type="class"), test$y)
    return(metrics$overall['Accuracy'])
    
}

# now lets calculate accuracy with actual data to get 100 results
actual.dt.results  = NULL
for(i in 1:100) {
    actual.dt.results[i] = test.dt.function(features)
}
#head(actual.rf.results)

# now lets calculate accuracy with imputed data to get 100 results
imputed.dt.results  = NULL
for(i in 1:100) {
    imputed.dt.results[i] = test.dt.function(imputedResultData)
}
head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
   Actual Imputed
 0.978022    0.92
 0.967033    0.94
 0.967033    0.95
 0.956044    0.94
 0.956044    0.94
 0.978022    0.95
Decision tree accuracy for actual and imputed data
	Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 16.24, df = 167.94, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.03331888 0.04254046
sample estimates:
mean of x mean of y 
0.9703297 0.9324000 

From the above t-test results we can come to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 3.8%.

K-Nearest Neighbor (KNN)

library(class)

# lets create functions to simplify the process

test.knn.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    # note: train and test here still contain the y column, so it is also used as a predictor
    knn.model = knn(train, test, cl = train$y, k = 5)
    
    # get metrics
    metrics = confusionMatrix(knn.model, test$y)
    return(metrics$overall['Accuracy'])
    
}

# now lets calculate accuracy with actual data to get 100 results
actual.dt.results  = NULL
for(i in 1:100) {
    actual.dt.results[i] = test.knn.function(features)
}
#head(actual.rf.results)

# now lets calculate accuracy with imputed data to get 100 results
imputed.dt.results  = NULL
for(i in 1:100) {
    imputed.dt.results[i] = test.knn.function(imputedResultData)
}
head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
   Actual Imputed
 0.967033    0.97
 1.000000    0.98
 0.978022    0.99
 0.978022    1.00
 0.967033    1.00
 0.978022    1.00
KNN accuracy for actual and imputed data
	Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 3.2151, df = 166.45, p-value = 0.001566
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.002126868 0.008895110
sample estimates:
mean of x mean of y 
 0.989011  0.983500 

From the above t-test results, the difference between the actual-data and imputed-data accuracy is still statistically significant, but it is much smaller here, approximately 0.5%.

Naive Bayes

# lets create functions to simplify the process

test.nb.function = function(data){
    # create samples
    sample = sample(nrow(data) , nrow(data)* 0.75)
    train = data[sample,]
    test = data[-sample,]
    
    # build model
    nb.model = naiveBayes(y ~., data = train)
    
    # get metrics
    metrics = confusionMatrix(predict(nb.model, test), test$y)
    return(metrics$overall['Accuracy'])
    
}

# now lets calculate accuracy with actual data to get 100 results
actual.nb.results  = NULL
for(i in 1:100) {
    actual.nb.results[i] = test.nb.function(features)
}
#head(actual.rf.results)

# now lets calculate accuracy with imputed data to get 100 results
imputed.nb.results  = NULL
for(i in 1:100) {
    imputed.nb.results[i] = test.nb.function(imputedResultData)
}
head(data.frame(Actual = actual.nb.results, Imputed = imputed.nb.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x= actual.nb.results, y = imputed.nb.results, conf.level = 0.95)
   Actual Imputed
 0.989011    0.95
 0.967033    0.92
 0.978022    0.94
 1.000000    0.95
 0.989011    0.90
 0.967033    0.93
Naive Bayes accuracy for actual and imputed data
	Welch Two Sample t-test

data:  actual.nb.results and imputed.nb.results
t = 18.529, df = 174.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.04214191 0.05218996
sample estimates:
mean of x mean of y 
0.9740659 0.9269000 

From the above t-test results we can come to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 4.7%.

Conclusion

From the above results we observe that, regardless of the type of model built, using imputed data cost between roughly 0.5% and 5% of accuracy compared with using the actual data. In every case, the actual data produced a better model than the imputed data.
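
One compact way to eyeball all of these runs together, assuming the result vectors are still in the workspace: the decision-tree and KNN loops above reuse the actual.dt.results and imputed.dt.results names, so only the vectors with distinct names are combined in this sketch.

# side-by-side accuracy distributions for the SVM, random forest and naive Bayes runs
results = data.frame(
    accuracy = c(actual.results,    imputed.results,
                 actual.rf.results, imputed.rf.results,
                 actual.nb.results, imputed.nb.results),
    source = rep(rep(c("actual", "imputed"), each = 100), times = 3),
    model  = rep(c("SVM", "RF", "NB"), each = 200)
)
boxplot(accuracy ~ source + model, data = results, las = 2,
        main = "Accuracy: actual vs. imputed data")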

If you enjoyed this tutorial, then check out my other tutorials and my GitHub page for all the source code and various R-packages.

  • Testing the Effect of Data Imputation on Model Accuracy
  • Anomaly Detection for Predictive Maintenance using Keras
  • Predictive Maintenance: Zero to Deployment in Manufacturing
  • Free coding education in the time of Covid-19
  • AutoML Frameworks in R & Python

The post Testing the Effect of Data Imputation on Model Accuracy appeared first on Hi! I am Nagdev.


To leave a comment for the author, please follow the link and comment on their blog: R – Hi! I am Nagdev.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.


Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


