dplyr
dplyr: A Grammar of Data Manipulation
A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
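Transforming and adding variables
Function | Description |
---|---|
mutate(data, …) | Add new variables *(base::transform)*; new variables can be computed from existing ones and combined with aggregate functions |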
mutate_all(.tbl, .funs, …) | .funs : List of function calls generated by funs(), or a character vector of function names, or simply a function |
mutate_if(.tbl, .predicate, .funs, …) | .predicate : A predicate function to be applied to the columns or a logical vector |
mutate_at(.tbl, .vars, .funs, …) | .vars : A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions |
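transmute(data, …) | Add new variables and drop the existing ones |
case_when(…) | A vectorised if … else if … else; conditions are evaluated in order and the first match wins |
For example:
mtcars <- mutate(mtcars, x = ifelse(x > 100, NA, x))  # handle outliers; x stands for any numeric column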
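The difference between mutate(), transmute() and case_when() is easier to see on a tiny table. The following is a minimal sketch; the data frame df and its columns are made up for illustration.

```r
library(dplyr)

# Small illustrative data frame (hypothetical values).
df <- data.frame(id = 1:4, score = c(50, 120, 80, 300))

# mutate() keeps the existing columns and adds new ones.
# case_when() checks its conditions in order and uses the first match.
labelled <- mutate(
  df,
  score_clean = ifelse(score > 100, NA, score),  # treat values above 100 as outliers
  band = case_when(
    score > 100 ~ "outlier",
    score >= 80 ~ "high",
    TRUE        ~ "low"
  )
)

# transmute() keeps only the columns it creates.
doubled <- transmute(df, score2 = score * 2)
```

Here labelled has four columns (id, score, score_clean, band), while doubled has only score2.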
Filtering
Function | Description |
---|---|
filter(data, …) | Keep rows that match the conditions *(base::subset(x, subset, select))* |
filter_all(.tbl, .vars_predicate) | .vars_predicate : A quoted predicate expression as returned by all_vars() or any_vars() |
filter_if(.tbl, .predicate, .vars_predicate) | |
filter_at(.tbl, .vars, .vars_predicate) | .vars : A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions |
slice(.data, …) | Select rows by position, e.g. slice(mtcars, n()) for the last row, slice(mtcars, 5:n()) for row 5 onward |
For example:
popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365)
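A short sketch of filter() and slice() on the built-in mtcars data may help; the variable names on the left-hand side are only illustrative.

```r
library(dplyr)

# Rows with more than 100 horsepower and exactly 4 cylinders.
strong_four <- filter(mtcars, hp > 100, cyl == 4)

# Grouped filter: keep only the cylinder groups that contain more than 10 cars.
big_groups <- mtcars %>%
  group_by(cyl) %>%
  filter(n() > 10) %>%
  ungroup()

# slice() selects rows by position; n() is the number of rows.
last_row  <- slice(mtcars, n())
tail_rows <- slice(mtcars, 5:n())
```

Conditions separated by commas in filter() are combined with a logical AND.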
Select and rename
Function | Description |
---|---|
select() | Select variables by name, e.g. select(flights, -(year:day)) |
select_all(.tbl, .funs = list(), …) | |
select_if(.tbl, .predicate, .funs = list(), …) | |
select_at(.tbl, .vars, .funs = list(), …) | |
rename() | Rename variables and keep all the others, e.g. rename(iris, newname = oldname, ...) |
rename_all(.tbl, .funs = list(), …) | |
rename_if(.tbl, .predicate, .funs = list(), …) | |
rename_at(.tbl, .vars, .funs = list(), …) | |
Helpers for the .vars argument:
Value | Description |
---|---|
starts_with() | Starts with a prefix, e.g. select(iris, starts_with("Petal")) |
ends_with() | Ends with a suffix |
contains() | Contains a literal string |
matches() | Matches a regular expression |
num_range() | A numerical range such as x01, x02, x03 |
one_of() | Variables in a character vector |
everything() | All variables; useful for reordering, e.g. select(flights, time_hour, air_time, everything()) |
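The selection helpers can be tried on the built-in iris data. This is a small sketch, with arbitrary names for the results.

```r
library(dplyr)

petal     <- select(iris, starts_with("Petal"))      # Petal.Length, Petal.Width
widths    <- select(iris, ends_with("Width"))        # Sepal.Width, Petal.Width
no_sepal  <- select(iris, -contains("Sepal"))        # drops both Sepal columns
by_regex  <- select(iris, matches("^Sepal"))         # regular expression match
reordered <- select(iris, Species, everything())     # Species moved to the front

# rename() keeps every column; the syntax is new_name = old_name.
renamed <- rename(iris, species = Species)
```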
Grouping
Function | Description |
---|---|
group_by(.data, …, add = FALSE) | Group the data; most functions then operate within the existing groups |
group_by_all(.tbl, .funs = list(), …) | |
group_by_at(.tbl, .vars, .funs = list(), …,) | |
group_by_if(.tbl, .predicate, .funs = list(), …,) | |
ungroup() | Remove grouping |
base::by(data, INDICES, FUN) | INDICES : a factor or list of factors to group by |
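As a sketch of what "operating within groups" means, the example below centers mpg on the mean of each cylinder group and then shows a rough base R counterpart with by(). The column mpg_diff is a name chosen for illustration.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)
group_vars(by_cyl)  # "cyl"

# Verbs applied to a grouped table work within each group:
# each car is compared with the mean of its own cylinder group.
centered <- mutate(by_cyl, mpg_diff = mpg - mean(mpg))

# ungroup() removes the grouping again.
flat <- ungroup(centered)

# Base R counterpart: by() applies a function to each subset of the data frame.
base_means <- by(mtcars, mtcars$cyl, function(d) mean(d$mpg))
```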
Aggregation
Function | Description |
---|---|
summarise() | Reduces multiple values down to a single summary |
summarise_all(.tbl, .funs, …) | .funs : List of function calls generated by funs(), or a character vector of function names, or simply a function |
summarise_if(.tbl, .predicate, .funs, …) | .predicate : A predicate function to be applied to the columns or a logical vector |
summarise_at(.tbl, .vars, .funs, …) | .vars : A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions |
do(data, …) | General-purpose complement that can apply any function per group *(plyr::dlply)*, e.g. a <- iris %>% group_by(Species) %>% do(s = summary(.)) |
For example:
mtcars %>% group_by(cyl) %>% summarise(a = n(), b = a + 1)
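A slightly fuller sketch of grouped aggregation on mtcars, plus one scoped variant; the summary column names are arbitrary.

```r
library(dplyr)

# One row per cylinder group, with several summaries.
per_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg), max_hp = max(hp))

# summarise_if() applies one function to every column that passes the predicate.
numeric_means <- summarise_if(iris, is.numeric, mean)
```

per_cyl has three rows (cyl 4, 6 and 8), and numeric_means has one row with the four numeric columns of iris.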
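Sorting and ranking
Sorting | Description |
---|---|
arrange(data, …, .by_group = FALSE) | Sort rows by the given variables |
arrange_all, arrange_at, arrange_if | Scoped variants of arrange() |
desc(x) | Descending order |
Ranking | Description *(base::rank)* |
---|---|
row_number(x) | Ties are broken by order of appearance |
min_rank(x) | Ties share the smallest rank |
dense_rank(x) | Like min_rank(), but with no gaps between ranks |
percent_rank(x) | Percentile rank |
cume_dist(x) | Cumulative distribution: the proportion of values less than or equal to the current one |
ntile(x, n) | Rough rank that splits the values into n buckets |
For example:
arrange(mtcars, cyl, desc(disp))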
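The ranking functions differ mainly in how they treat ties. A small sketch on a vector with one tie (the vector x is made up) shows the differences side by side.

```r
library(dplyr)

x <- c(10, 20, 20, 40)

row_number(x)    # 1 2 3 4          ties broken by position
min_rank(x)      # 1 2 2 4          ties share the smallest rank, leaving a gap
dense_rank(x)    # 1 2 2 3          ties share a rank, no gap
percent_rank(x)  # 0 1/3 1/3 1      (min_rank - 1) / (n - 1)
cume_dist(x)     # 0.25 0.75 0.75 1 proportion of values <= x
ntile(x, 2)      # 1 1 2 2          two roughly equal buckets

# Sort by cyl ascending, then by disp descending within each cyl.
sorted <- arrange(mtcars, cyl, desc(disp))
```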
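Joining tables
Function | Description *(base::merge)* |
---|---|
inner_join(x, y, by = NULL); left_join; right_join; full_join | 1. When the key columns have the same names: by = c("id1", "id2"). 2. When the key names differ: by = c("a" = "b") |
semi_join | Return the rows of x that have a match in y (x columns only) |
anti_join | Return the rows of x that have no match in y (x columns only) |
Binding rows/columns: | |
bind_rows(…, .id = NULL) | base::rbind |
bind_cols(…) | base::cbind |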
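The join types are easiest to compare on two small tables that only partly overlap. The tables orders, customers and clients below are hypothetical and exist only for this sketch.

```r
library(dplyr)

# Hypothetical tables for illustration.
orders    <- data.frame(id = c(1, 2, 3), item = c("pen", "ink", "pad"))
customers <- data.frame(id = c(1, 2, 4), name = c("Ann", "Bob", "Dee"))
clients   <- rename(customers, client_id = id)

inner <- inner_join(orders, customers, by = "id")  # ids 1, 2
left  <- left_join(orders, customers, by = "id")   # ids 1, 2, 3; name is NA for id 3
full  <- full_join(orders, customers, by = "id")   # ids 1, 2, 3, 4

# Key columns with different names on each side.
diff_keys <- inner_join(orders, clients, by = c("id" = "client_id"))

# Filtering joins return columns of x only.
matched   <- semi_join(orders, customers, by = "id")  # orders with a matching customer
unmatched <- anti_join(orders, customers, by = "id")  # orders without one

# Binding: rows stacked with a source label, columns paired by position.
stacked      <- bind_rows(a = orders, b = orders, .id = "source")
side_by_side <- bind_cols(orders, select(customers, name))
```

Note that bind_cols() pairs rows by position rather than by key, so it is only appropriate when both tables are already in the same order.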
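Set operations
Function | Description |
---|---|
intersect(x, y) | Return the rows that appear in both x and y (intersection) |
union(x, y) | Return the unique rows of x and y combined (union) |
setdiff(x, y) | Return the rows that are in x but not in y (difference) |
setequal(x, y) | Logical: whether x and y contain the same rows |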
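With dplyr loaded, the set operations work row-wise on data frames that have the same columns. A minimal sketch with two made-up one-column tables:

```r
library(dplyr)

a <- data.frame(x = c(1, 2, 3))
b <- data.frame(x = c(2, 3, 4))

common   <- intersect(a, b)  # rows with x = 2, 3
combined <- union(a, b)      # rows with x = 1, 2, 3, 4 (duplicates removed)
only_a   <- setdiff(a, b)    # row with x = 1

setequal(a, b)                           # FALSE
setequal(a, data.frame(x = c(3, 2, 1)))  # TRUE: row order is ignored
```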
Counting
Function | Description *(base::nrow)* |
---|---|
n() | Number of rows; used inside summarise(), mutate() and filter() |
n_distinct(…, na.rm = FALSE) | Number of unique values |
tally(x, wt, sort = FALSE) | Count rows, e.g. mtcars %>% tally() |
count(x, …, wt = NULL, sort = FALSE) | Count within groups; equivalent to group_by() + tally(), e.g. mtcars %>% count(cyl) counts rows per cyl |
add_tally(x, wt, sort = FALSE) | Add a count column; equivalent to mutate() + tally(), e.g. mtcars %>% add_tally() |
add_count(x, …, wt = NULL, sort = FALSE) | Add a within-group count column; equivalent to group_by() + add_tally(), e.g. mtcars %>% add_count(cyl) |
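The counting helpers differ in whether they collapse the table or keep every row. A short sketch on mtcars, with arbitrary result names:

```r
library(dplyr)

total   <- tally(mtcars)            # one row: n = 32
per_cyl <- count(mtcars, cyl)       # one row per cyl value, with column n
with_n  <- add_count(mtcars, cyl)   # keeps all 32 rows and appends column n

# n_distinct() counts unique values, here inside summarise().
n_cyl <- summarise(mtcars, cylinders = n_distinct(cyl))  # 3
```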
Removing duplicates
distinct(.data, ..., .keep_all = FALSE)
(base::duplicated, base::unique)
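- After grouping, duplicates are removed within each group
- … : the columns to deduplicate on
- .keep_all : whether to keep all columns
df %>% group_by(col1) %>% distinct(col2, .keep_all = TRUE)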
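To make the effect of grouping and .keep_all concrete, here is a small sketch. The data frame df and its values are invented for the example; the column names follow the line above.

```r
library(dplyr)

df <- data.frame(
  col1 = c("a", "a", "a", "b"),
  col2 = c(1, 1, 2, 1),
  col3 = c(10, 20, 30, 40)
)

only_col2 <- distinct(df, col2)                    # 2 rows, col2 only
first_rows <- distinct(df, col2, .keep_all = TRUE) # 2 rows, first occurrence, all columns

# On a grouped table the grouping columns are part of the key,
# so duplicates are removed within each group.
within_groups <- df %>%
  group_by(col1) %>%
  distinct(col2, .keep_all = TRUE)                 # 3 rows
```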
Sampling
For a data.frame:
sample_n(tbl, size, replace = FALSE)
sample_frac(tbl, size, replace = FALSE)
For a vector:
base::sample(x, size, replace = FALSE, prob = NULL)
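A brief sketch of the sampling functions on mtcars follows. The seed value is arbitrary and is set only so that repeated runs give the same sample.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, for reproducibility only

ten_rows <- sample_n(mtcars, 10)               # 10 rows without replacement
quarter  <- sample_frac(mtcars, 0.25)          # 25% of 32 rows = 8 rows
boot     <- sample_n(mtcars, 50, replace = TRUE)  # more rows than the table has, so replace = TRUE

five_values <- sample(1:100, size = 5)         # base R, for plain vectors
```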
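Other functions
Function | Description |
---|---|
nth(x, n, order_by = NULL); first(); last() | Extract the first, last or nth value from a vector |
top_n(x, n, wt) | Keep the top n rows; wt is the variable used for ordering |
lead(x, n) | Value n positions ahead of the current one |
lag(x, n) | Value n positions behind the current one |
between(x, left, right) | |
if_else(condition, true, false, missing = NULL) | |
near(x, y) | Compare floating-point numbers for equality within a tolerance |
all_vars(expr), any_vars(expr) | Apply predicate to all variables |
vars() | Select variables |
funs() | Create a list of function calls |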
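Several of these vector helpers can be checked on a short made-up vector. The sketch below assumes dplyr is attached, so that lag() refers to dplyr::lag() rather than stats::lag().

```r
library(dplyr)

x <- c(5, 3, 8, 1)

first(x)          # 5
last(x)           # 1
nth(x, 2)         # 3
lag(x)            # NA 5 3 8   previous value
lead(x)           # 3 8 1 NA   next value
between(x, 2, 6)  # TRUE TRUE FALSE FALSE   inclusive on both ends

# if_else() is a stricter ifelse(); missing supplies the value for NA conditions.
size <- if_else(x > 4, "big", "small", missing = "unknown")

# near() compares with a tolerance, which == does not.
near(sqrt(2)^2, 2)  # TRUE, whereas sqrt(2)^2 == 2 is FALSE
```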
plyr
plyr: Tools for Splitting, Applying and Combining Data
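Common naming pattern for the functions:
* * ply, where the first letter is the input type and the second letter is the output type: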
a: array
l: list
d: data.frame
m: multiple inputs
r: repeat multiple times
_: nothing
For example:
aaply(.data, .margins, .fun = NULL, ..., .progress = "none")
The .progress argument
is one of "none", "text", "tk", "win"
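The naming pattern can be illustrated with three of the functions. This is a minimal sketch that assumes the plyr package is installed; the functions are called with the plyr:: prefix so that attaching plyr does not mask dplyr verbs such as summarise().

```r
# l -> a: list in, array out (a plain vector in this case).
lengths_vec <- plyr::laply(list(1:3, 1:5), length)

# d -> d: data frame in, data frame out, split by Species.
species_n <- plyr::ddply(iris, "Species", function(d) data.frame(n = nrow(d)))

# a -> a: array in, array out; .margins = 1 means row by row.
m <- matrix(1:6, nrow = 2)
row_sums <- plyr::aaply(m, 1, sum)
```

species_n has one row per species with the row count n, and row_sums holds the sum of each row of m.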