dplyr
dplyr: A Grammar of Data Manipulation
A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
Transforming and adding variables
| Function | Description |
|---|---|
| mutate(data, …) | Adds new variables *(cf. base::transform)*; can reference existing variables and can be combined with aggregate functions |
| mutate_all(.tbl, .funs, …) | .funs: List of function calls generated by funs(), or a character vector of function names, or simply a function |
| mutate_if(.tbl, .predicate, .funs, …) | .predicate: A predicate function to be applied to the columns or a logical vector |
| mutate_at(.tbl, .vars, .funs, …) | .vars: A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions |
| transmute(data, …) | Adds new variables and drops the existing ones |
| case_when(…) | A vectorised if … else if … else; conditions are checked in order and the first one that holds is used |
For example:
mtcars <- mutate(mtcars, hp = ifelse(hp > 100, NA, hp)) # handle outliers by replacing them with NA
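The functions in the table above can be sketched as follows. This is a minimal illustration, assuming a dplyr version in which the scoped variants such as mutate_at() are still available; the data frame `df` and its columns are made up for the example.

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
df <- data.frame(id = 1:4, score = c(35, 72, 150, 90))

# mutate() keeps the existing columns and adds a new one.
df2 <- mutate(df, score_clean = ifelse(score > 100, NA, score))

# transmute() keeps only the columns it creates.
df3 <- transmute(df, id = id, half = score / 2)

# case_when() checks the conditions in order; the first TRUE one wins.
df4 <- mutate(df, grade = case_when(
  score > 100 ~ "invalid",
  score >= 60 ~ "pass",
  TRUE        ~ "fail"
))

# mutate_at() applies a function to the columns selected with vars().
df5 <- mutate_at(df, vars(score), sqrt)
```

Because case_when() stops at the first matching condition, the `score > 100` check has to come before `score >= 60`; otherwise 150 would be labelled "pass".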
Filtering
| Function | Description |
|---|---|
| filter(data, …) | Filters rows *(cf. base::subset(x, subset, select))*; also works per group, e.g. popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365) |
| filter_all(.tbl, .vars_predicate) | .vars_predicate: A quoted predicate expression as returned by all_vars() or any_vars() |
| filter_if(.tbl, .predicate, .vars_predicate) | |
| filter_at(.tbl, .vars, .vars_predicate) | .vars: A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions |
| slice(.data, …) | Selects rows by position, e.g. slice(mtcars, n()) for the last row, slice(mtcars, 5:n()) for row 5 through the last row |
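A short sketch of the filtering functions on the built-in mtcars data. The thresholds are arbitrary and chosen only for illustration, and filter_at() with any_vars() is assumed to be available in the installed dplyr version.

```r
library(dplyr)

# Rows with 4 cylinders and more than 100 horsepower.
small_fast <- filter(mtcars, hp > 100, cyl == 4)

# Grouped filter: keep only cylinder groups that contain more than 10 cars.
common_cyl <- mtcars %>% group_by(cyl) %>% filter(n() > 10)

# filter_at(): keep rows where any of the selected columns exceeds 300.
big <- filter_at(mtcars, vars(disp, hp), any_vars(. > 300))

# slice(): pick rows by position; n() is the number of rows.
last_row <- slice(mtcars, n())
```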
Select and rename
| Function | Description |
|---|---|
| select() | Selects variables by name, e.g. select(flights, -(year:day)) |
| select_all(.tbl, .funs = list(), …) | |
| select_if(.tbl, .predicate, .funs = list(), …) | |
| select_at(.tbl, .vars, .funs = list(), …) | |
| rename() | Renames variables and keeps all the others, e.g. rename(iris, new_name = old_name, ...) |
| rename_all(.tbl, .funs = list(), …) | |
| rename_if(.tbl, .predicate, .funs = list(), …) | |
| rename_at(.tbl, .vars, .funs = list(), …) | |
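Helper functions for selecting variables (.vars):
| Helper | Description |
|---|---|
| starts_with() | Starts with a prefix, e.g. select(iris, starts_with("Petal")) |
| ends_with() | Ends with a suffix |
| contains() | Contains a literal string |
| matches() | Matches a regular expression |
| num_range() | A numerical range such as x01, x02, x03 |
| one_of() | Variables named in a character vector |
| everything() | All variables; useful for reordering columns, e.g. select(flights, time_hour, air_time, everything()) |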
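The selection helpers can be tried on the built-in iris data. This is only a sketch; the small data frame `wide` is a hypothetical example for num_range().

```r
library(dplyr)

petal     <- select(iris, starts_with("Petal"))   # Petal.Length, Petal.Width
widths    <- select(iris, ends_with("Width"))     # Sepal.Width, Petal.Width
sepal     <- select(iris, contains("Sepal"))      # Sepal.Length, Sepal.Width
lengths   <- select(iris, matches("^[SP].*\\.L")) # regular expression match
reordered <- select(iris, Species, everything())  # move Species to the front

# rename() takes new_name = old_name and keeps every other column.
renamed <- rename(iris, species = Species)

# num_range() builds names from a prefix and zero-padded numbers.
wide <- data.frame(x01 = 1, x02 = 2, x03 = 3, y = 4)
first_two <- select(wide, num_range("x", 1:2, width = 2))
```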
Grouping
| Function | Description |
|---|---|
| group_by(.data, …, add = FALSE) | Groups the data; most functions then operate on each group |
| group_by_all(.tbl, .funs = list(), …) | |
| group_by_at(.tbl, .vars, .funs = list(), …) | |
| group_by_if(.tbl, .predicate, .funs = list(), …) | |
| ungroup() | Removes the grouping |
| base::by(data, INDICES, FUN) | INDICES: a factor or a list of factors to group by |
Aggregation
| Function | Description |
|---|---|
| summarise() | Reduces multiple values down to a single summary |
| summarise_all(.tbl, .funs, …) | .funs: List of function calls generated by funs(), or a character vector of function names, or simply a function |
| summarise_if(.tbl, .predicate, .funs, …) | .predicate: A predicate function to be applied to the columns or a logical vector |
| summarise_at(.tbl, .vars, .funs, …) | .vars: A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions |
| do(data, …) | Supplementary function that can apply any function to each group *(cf. plyr::dlply)*, e.g. a <- iris %>% group_by(Species) %>% do(s = summary(.)) |
For example:
mtcars %>% group_by(cyl) %>% summarise(a = n(), b = a + 1)
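A slightly fuller sketch of grouping and aggregation on mtcars. The output column names are chosen only for illustration, and summarise_at() is assumed to be available in the installed dplyr version.

```r
library(dplyr)

# One summary row per cylinder group.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg), max_hp = max(hp))

# summarise_at(): apply the same function to several selected columns.
means <- mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(mpg, hp), mean)

# ungroup() drops the grouping so later steps work on the whole table.
flat <- mtcars %>% group_by(cyl) %>% ungroup()
```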
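Sorting and ranking
| Sorting | Description |
|---|---|
| arrange(data, …, .by_group = FALSE) | Sorts rows by the given variables |
| arrange_all, arrange_at, arrange_if | |
| desc(x) | Descending order |
| Ranking | *(cf. base::rank)* |
| row_number(x) | Ties are ranked in order of appearance |
| min_rank(x) | Ties all receive the smallest rank |
| dense_rank(x) | Like min_rank(), but with no gaps between ranks |
| percent_rank(x) | Percentile rank: min_rank() rescaled to [0, 1] |
| cume_dist(x) | Cumulative distribution: the proportion of values <= the current value |
| ntile(x, n) | Rough rank that splits the values into n buckets |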
For example:
arrange(mtcars, cyl, desc(disp))
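The differences between the ranking functions are easiest to see on a vector with ties. A minimal sketch, with a made-up input vector:

```r
library(dplyr)

x <- c(10, 20, 20, 40)

r_row   <- row_number(x)    # 1 2 3 4: ties broken by position
r_min   <- min_rank(x)      # 1 2 2 4: ties share the smallest rank
r_dense <- dense_rank(x)    # 1 2 2 3: no gap after the tie
r_pct   <- percent_rank(x)  # 0, 1/3, 1/3, 1
r_cume  <- cume_dist(x)     # 0.25 0.75 0.75 1
r_tile  <- ntile(x, 2)      # 1 1 2 2: two buckets

# Ranking functions combine naturally with filter(), here the two largest hp values.
top_hp <- filter(mtcars, min_rank(desc(hp)) <= 2)
```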
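Joins
| Function | Description *(cf. base::merge)* |
|---|---|
| inner_join(x, y, by = NULL); left_join; right_join; full_join | 1. When the key names are the same: by = c("id1", "id2") 2. When the key names differ: by = c("a" = "b") |
| semi_join | Returns the rows of x that have a match in y (columns of x only) |
| anti_join | Returns the rows of x that have no match in y (columns of x only) |
| Binding rows/columns: | |
| bind_rows(…, .id = NULL) | cf. base::rbind |
| bind_cols(…) | cf. base::cbind |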
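A small sketch of the join variants. The two data frames `orders` and `customers`, with differently named key columns, are hypothetical and exist only to illustrate the `by = c("a" = "b")` form.

```r
library(dplyr)

# Hypothetical tables whose key columns have different names.
orders <- data.frame(id = c(1, 2, 3), item = c("a", "b", "c"),
                     stringsAsFactors = FALSE)
customers <- data.frame(cust_id = c(2, 3, 4), name = c("Ann", "Bob", "Cy"),
                        stringsAsFactors = FALSE)

key <- c("id" = "cust_id")

inner <- inner_join(orders, customers, by = key)  # only ids present in both
left  <- left_join(orders, customers, by = key)   # every order, NA where unmatched
semi  <- semi_join(orders, customers, by = key)   # matching orders, columns of x only
anti  <- anti_join(orders, customers, by = key)   # orders without a customer

# bind_rows() stacks tables; .id records which input each row came from.
stacked <- bind_rows(first = orders, second = orders, .id = "source")
```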
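Set operations
| Function | Description |
|---|---|
| intersect(x, y) | Returns the intersection of x and y |
| union(x, y) | Returns the union of x and y (duplicates removed) |
| setdiff(x, y) | Returns the rows that are in x but not in y (set difference) |
| setequal(x, y) | Logical: whether x and y contain the same rows |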
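A minimal sketch of the set operations on two made-up single-column data frames. With dplyr attached, these functions work row-wise on data frames.

```r
library(dplyr)

a <- data.frame(x = c(1, 2, 3))
b <- data.frame(x = c(2, 3, 4))

both   <- intersect(a, b)  # rows present in a and in b
either <- union(a, b)      # rows present in a or b, duplicates removed
only_a <- setdiff(a, b)    # rows present in a but not in b

# setequal() ignores row order.
same <- setequal(a, a[3:1, , drop = FALSE])
```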
Counting
| Function | Description *(cf. base::nrow)* |
|---|---|
| n() | Number of rows; used inside summarise(), mutate() and filter() |
| n_distinct(…, na.rm = FALSE) | Number of unique values |
| tally(x, wt, sort = FALSE) | Counts rows, e.g. mtcars %>% tally() |
| count(x, …, wt = NULL, sort = FALSE) | Counts rows per group; equivalent to group_by() + tally(), e.g. mtcars %>% count(cyl) # count per value of cyl |
| add_tally(x, wt, sort = FALSE) | Adds a count column; roughly mutate() + tally(), e.g. mtcars %>% add_tally() |
| add_count(x, …, wt = NULL, sort = FALSE) | Adds a per-group count column; equivalent to group_by() + add_tally(), e.g. mtcars %>% add_count(cyl) |
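The equivalences in the table can be checked directly on mtcars. This is a sketch, and the variable names are arbitrary.

```r
library(dplyr)

total   <- mtcars %>% tally()                      # one row holding the row count
per_cyl <- mtcars %>% count(cyl)                   # one row per value of cyl
same    <- mtcars %>% group_by(cyl) %>% tally()    # same counts as count(cyl)

# add_count() keeps every row and appends the size of its group as column n.
with_n <- mtcars %>% add_count(cyl)

n_cyl <- n_distinct(mtcars$cyl)                    # number of distinct cyl values
```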
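Removing duplicates
distinct(.data, ..., .keep_all = FALSE)
(cf. base::duplicated, base::unique)
- On grouped data, duplicates are removed within each group
- …: the columns used to decide whether rows are duplicates
- .keep_all: whether to keep all columns
df %>% group_by(col1) %>% distinct(col2, .keep_all = TRUE)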
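A sketch of how the arguments above change the result. The data frame `df` and its columns col1, col2 and col3 are hypothetical.

```r
library(dplyr)

df <- data.frame(col1 = c("a", "a", "b", "b"),
                 col2 = c(1, 1, 1, 2),
                 col3 = 1:4,
                 stringsAsFactors = FALSE)

# Only col2 is returned, one row per distinct value.
d1 <- distinct(df, col2)

# .keep_all = TRUE keeps the first full row for each distinct col2.
d2 <- distinct(df, col2, .keep_all = TRUE)

# On grouped data, uniqueness is judged within each col1 group.
d3 <- df %>% group_by(col1) %>% distinct(col2, .keep_all = TRUE)
```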
Sampling
For a data.frame:
sample_n(tbl, size, replace = FALSE)
sample_frac(tbl, size, replace = FALSE)
For a vector:
base::sample(x, size, replace = FALSE, prob = NULL)
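A short sketch of the three sampling calls, assuming sample_n() and sample_frac() are available in the installed dplyr version. The seed is set only to make the draw repeatable.

```r
library(dplyr)

set.seed(42)

rows_5  <- sample_n(mtcars, 5)        # 5 random rows
quarter <- sample_frac(mtcars, 0.25)  # 25% of the rows
values  <- sample(1:10, 3)            # 3 values from a plain vector, without replacement
```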
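Other functions
| Function | Description |
|---|---|
| nth(x, n, order_by = NULL); first(); last(); top_n(x, n, wt) | Extract the first, last or nth value from a vector; in top_n(), wt is the variable used for ordering |
| lead(x, n) | Leading value: the value n positions ahead (n is the offset) |
| lag(x, n) | Lagging value: the value n positions behind (n is the offset) |
| between(x, left, right) | |
| if_else(condition, true, false, missing = NULL) | |
| near(x, y) | Tests whether floating-point numbers are equal |
| all_vars(expr), any_vars(expr) | Apply a predicate to all variables (all_vars) or to any variable (any_vars) |
| vars() | Select variables |
| funs() | Create a list of function calls |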
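The vector helpers above can be sketched on a small made-up vector:

```r
library(dplyr)

x <- c(1, 3, 6, 10)

prev_val <- lag(x)        # NA 1 3 6: the value one position behind
next_val <- lead(x)       # 3 6 10 NA: the value one position ahead
step     <- x - lag(x)    # difference from the previous element

second   <- nth(x, 2)     # 3
ends     <- c(first(x), last(x))

in_range <- between(x, 3, 6)  # inclusive on both sides

# Floating-point comparison: == can fail where near() succeeds.
exact  <- sqrt(2)^2 == 2
approx <- near(sqrt(2)^2, 2)

# if_else() is a stricter ifelse(): both branches must have the same type.
size <- if_else(x > 5, "big", "small")
```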
plyr
plyr: Tools for Splitting, Applying and Combining Data
Functions follow a uniform naming pattern:
XYply, where X is the input type and Y is the output type:
a: array
l: list
d: data.frame
m: multiple inputs
r: repeat multiple times
_: nothing
For example:
aaply(.data, .margins, .fun = NULL, ..., .progress = "none")
The .progress argument is one of "none", "text", "tk" or "win".
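The naming pattern can be illustrated with three members of the family. This is a sketch that calls plyr through the `plyr::` prefix rather than attaching it, an assumption made here to avoid masking dplyr functions with the same names.

```r
# ddply: data.frame in, data.frame out (one result row per cyl value).
res <- plyr::ddply(mtcars, "cyl", function(d) data.frame(mean_mpg = mean(d$mpg)))

# dlply: data.frame in, list out (one list element per cyl value).
sizes <- plyr::dlply(mtcars, "cyl", function(d) nrow(d))

# laply: list in, array out.
sums <- plyr::laply(list(1:3, 4:6), sum)
```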