% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/mcgraph.R
\name{mcg.impute}
\alias{mcg.impute}
\title{Impute missing values using rpart, knn, mean or median}
\usage{
mcg.impute(data, method = "rpart", k = 5)
}
\arguments{
\item{data}{the data frame or matrix with missing values.}

\item{method}{which method to use for imputing the values, one of 'mean', 'median', 'rpart', 'knn', default: 'rpart'}

\item{k}{number of nearest neighbors to use if method is 'knn', default: 5}
}
\value{
data with imputed values.
}
\description{
\code{mcg.impute} imputes missing data in data frames or matrices using either
  decision trees, a k-nearest-neighbor approach with Euclidean distances, or
  simple mean and median computations for the variables.
}
\section{Details}{

    Many mathematical methods do not allow missing data in their inputs,
    so the missing values have to be estimated first.
    The most basic approaches replace the NA values with the median or mean of the variable.
    More advanced methods use the mean of only a few samples that are very similar to the sample where the value is missing.
    This approach is called knn-imputation; it uses only the k nearest neighbors (default here: 5) to compute the replacement value.
    For data where the order of values within a column matters, for instance data ordered by time, it is often desirable to replace a missing value with the mean of its two closest neighbors in the data frame or matrix. The method timemean follows this approach.
    The method rpart uses decision trees to impute the values. It is currently the only method that can impute factor variables as well as numerical values.
}

\examples{
data(iris)
set.seed(123)
ir=as.matrix(iris[,1:4])
ir.mv=ir
# introduce 5 percent NA's
mv=sample(1:length(ir),as.integer(0.05*length(ir)))
ir.mv[mv]=NA
ir.imp.med=mcg.impute(ir.mv,method='median') # simple baseline, usually less accurate
ir.imp.rpart=mcg.impute(ir.mv) # method rpart (default)
ir.imp.knn=mcg.impute(ir.mv,method='knn')
# root mean squared error between true and imputed values
rmse = function (x,y) { return(sqrt(mean((x-y)^2))) }
rmse(ir[mv],ir.imp.med[mv]) # should be high
rmse(ir[mv],ir.imp.rpart[mv]) # should be low!
rmse(ir[mv],ir.imp.knn[mv]) # should be low!
cor(ir[mv],ir.imp.med[mv])
cor(ir[mv],ir.imp.rpart[mv])
cor(ir[mv],ir.imp.knn[mv]) # should be high!
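# Illustration only: a minimal base-R sketch of the knn idea described in
# the Details section. This is NOT the code used inside mcg.impute; the
# helper name knn.sketch and the toy matrix x.toy are hypothetical.
# Assumption: only complete rows are used as neighbor candidates.
knn.sketch = function (x, k = 5) {
  res = x
  # candidate neighbors: rows without any missing value
  cand = which(rowSums(is.na(x)) == 0)
  for (i in which(rowSums(is.na(x)) > 0)) {
    miss = is.na(x[i, ])
    # Euclidean distance to each candidate, using the columns observed in row i
    d = sqrt(colSums((t(x[cand, !miss, drop = FALSE]) - x[i, !miss])^2))
    # indices of the k closest candidate rows
    nn = cand[order(d)[seq_len(min(k, length(cand)))]]
    # replace each missing entry by the column mean over these neighbors
    res[i, miss] = colMeans(x[nn, miss, drop = FALSE])
  }
  return(res)
}
x.toy = as.matrix(iris[1:20, 1:4])
x.toy[3, 2] = NA
x.toy[10, c(1, 4)] = NA
x.imp = knn.sketch(x.toy, k = 5)
x.imp[c(3, 10), ]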
# factor variables
data(iris)
ciris=iris
idx=sample(1:nrow(ciris),15) # 10 percent NA's
ciris$Species[idx]=NA
summary(ciris)
ciris=mcg.impute(ciris,method="rpart")
table(ciris$Species[idx],iris$Species[idx])
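# Illustration only: a base-R sketch of the neighbor-mean idea for ordered
# data described in the Details section. This is NOT the package code; the
# helper name timemean.sketch is hypothetical.
# Assumption: a missing value is replaced by the mean of the nearest observed
# value before it and the nearest observed value after it in the same vector;
# at the borders only the single available neighbor is used.
timemean.sketch = function (v) {
  obs = which(!is.na(v))
  res = v
  for (i in which(is.na(v))) {
    lo = obs[obs < i]
    hi = obs[obs > i]
    # positions of the closest observed neighbors on each side
    nb = c(if (length(lo) > 0) max(lo), if (length(hi) > 0) min(hi))
    res[i] = mean(v[nb])
  }
  return(res)
}
tv = c(1, 2, NA, 4, 5, NA, 7)
tv.imp = timemean.sketch(tv)
tv.imp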
}
\seealso{
\link{mcg.new}.
}
\author{
Detlef Groth <email: dgroth@uni-potsdam.de>
}
