How do you remove outliers within each group when tidying data?

11354 views | 2019-2-21 10:18 | Personal category: Statistical analysis | System category: Research notes

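Suppose a sample of 500 people includes both men and women, and we want to compare a physiological measurement taken before a meal (before) and after a meal (after). Because of transcription errors, the measurements contain some outliers. Question: if the data are grouped by gender and by meal status (before or after), how can the outliers within each group be removed?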

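Approach: first define a function that identifies the outliers in a vector and converts them to NA, then use the dplyr package to apply that function to each group of the data.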

Solution:

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

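## Reference: https://stackoverflow.com/questions/49982794/remove-outliers-by-group-in-r
## For a vector x, first compute the lower and upper quartiles (Q1 and Q3)
## and the interquartile range, IQR = Q3 - Q1.
## A value above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR is conventionally treated as an outlier.
## The function below converts such values to NA.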
remove_outliers <- function(x, na.rm = TRUE, ...) {
    qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
    H <- 1.5 * IQR(x, na.rm = na.rm)
    y <- x
    y[x < (qnt[1] - H)] <- NA
    y[x > (qnt[2] + H)] <- NA
    y
}
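
To make the rule concrete, here is a minimal sketch on a toy vector. The helper name is_outlier and its argument k are hypothetical (they do not appear in the original post); the fences follow the same Q1 - 1.5 * IQR and Q3 + 1.5 * IQR rule used by remove_outliers() above.

```r
# Hypothetical helper (not part of the original post): returns TRUE for
# values outside the fences [Q1 - k * IQR, Q3 + k * IQR].
is_outlier <- function(x, k = 1.5) {
    qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    h <- k * (qnt[2] - qnt[1])          # k times the interquartile range
    !is.na(x) & (x < qnt[1] - h | x > qnt[2] + h)
}

x <- c(1, 2, 3, 4, 5, 100)              # made-up values; 100 is the planted error
flags <- is_outlier(x)                  # logical vector, one entry per value
x_clean <- replace(x, flags, NA)        # same effect as remove_outliers(x)
print(x_clean)
```

For this vector the quartiles are 2.25 and 4.75, so the fences are -1.5 and 8.5, and only the value 100 is replaced by NA.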

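## Generate a simulated data set
## (within each gender, the last 10 values are drawn with 5 times the spread to mimic transcription errors)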

test_dat <- data.frame(
    ID = c(1:500,1:500), 
    age = rep(sample(18:70, 500, replace = TRUE), 2) ,
    gender = gl(2, 500, labels = c("male", "female")), 
    meal = gl(2, 500, labels = c("before", "after"))[sample(1:1000)],
    value = c(c(rnorm(490), rnorm(10)*5),  c(rnorm(490), rnorm(10)*5) + 3)
)

head(test_dat)

##   ID age gender  meal        value
## 1  1  25   male after -1.233542780
## 2  2  25   male after -0.003499835
## 3  3  54   male after -0.265865215
## 4  4  24   male after  1.281983039
## 5  5  21   male after -0.114771555
## 6  6  41   male after  0.763784322

ggplot(test_dat, aes(x = gender, y = value, fill = meal)) + 
    geom_boxplot() + 
    ggtitle("Original")

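## remove_outliers() is evaluated separately within each meal x gender group;
## the original case_when() wrapper was redundant because its first branch always matched
test_dat2 <- test_dat %>%
    group_by(meal, gender) %>%
    mutate(value_new = remove_outliers(value)) %>%
    ungroup()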

ggplot(test_dat2, aes(x = gender, y = value_new, fill = meal)) + 
    geom_boxplot() + 
    ggtitle("Outlier Removed")

## Warning: Removed 18 rows containing non-finite values (stat_boxplot).
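
As a cross-check on the grouped step, the same per-group replacement can be sketched in base R with ave(), followed by a count of how many values were set to NA in each group. The data frame toy and its values are made up for illustration, and remove_outliers() is repeated (without the ... argument) only so that the block runs on its own.

```r
# Same fence rule as remove_outliers() in the post, repeated here so the
# block is self-contained.
remove_outliers <- function(x, na.rm = TRUE) {
    qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
    H <- 1.5 * IQR(x, na.rm = na.rm)
    y <- x
    y[x < (qnt[1] - H)] <- NA
    y[x > (qnt[2] + H)] <- NA
    y
}

# Made-up data: four gender x meal groups of six values each;
# the two "before" groups each contain one planted outlier.
toy <- data.frame(
    gender = rep(c("male", "female"), each = 12),
    meal   = rep(rep(c("before", "after"), each = 6), times = 2),
    value  = c(1, 2, 3, 4, 5, 100,    1, 2, 3, 4, 5, 6,
               11, 12, 13, 14, 15, -80,    11, 12, 13, 14, 15, 16)
)

# ave() applies FUN within each combination of the grouping vectors
# and returns a vector aligned with the original rows.
toy$value_new <- ave(toy$value, toy$gender, toy$meal, FUN = remove_outliers)

# Number of values replaced by NA in each gender x meal group.
n_removed <- tapply(is.na(toy$value_new), list(toy$gender, toy$meal), sum)
print(n_removed)

# If the rows themselves should be dropped rather than kept as NA:
toy_clean <- toy[!is.na(toy$value_new), ]
```

Keeping the NA values, as the post does, preserves the row count and leaves the decision to drop rows to the analysis step; the last line shows the alternative.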

To repost this article, please contact the original author for permission and state that it comes from 张金龙's ScienceNet (科学网) blog.
Link: https://blog.sciencenet.cn/blog-255662-1163375.html

张金龙's blog: http://blog.sciencenet.cn/u/zjlcas (Species adaptation, distribution, and evolution)