R: Using dplyr to search for and filter a string across an entire data frame

#r #dplyr

Question:

I need something like Ctrl+F in Microsoft Excel to search for a string across an entire data frame (I would prefer a dplyr solution if possible).

I have revised my reprex based on the suggestions from Ronak and Akrun. Both are excellent: one relies on base R and the other on str_detect. Personally I prefer the latter, only because it performs better with large datasets on my machine. Thank you both!

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)

##Two functions suggested by Ronak

find_text <- function(df, tt, ...){
    res <- df %>%
        mutate(across(where(is.character), ~grepl(tt,.x, ...)))
    return(res)
}




find_text_filter <- function(df, tt, ...){
  res <- df %>%
    filter(if_any(where(is.character), ~grepl(tt,.x, ...)))
  return(res)
}


### And now the str_detect variation by Akrun

find_text2 <- function(df, tt){
    res <- df %>%
        mutate(across(where(is.character), ~str_detect(.x,tt)))
    return(res)
}




find_text_filter2 <- function(df, tt){
  res <- df %>%
    filter(if_any(where(is.character), ~str_detect(.x,tt)))
  return(res)
}


df <- tibble(a=seq(5), b=c("hfh", "gjgkjguk", "jyfyujyuj ygujyg", "uyyhjg",
                           "776uj"),
             d=c("ggg", "hhh", "gfrr", "67hn", "jnug"),
             e=c("gtdfdc", "  kjihi", "hgwjhfg", "ujyggg", "ut 089jhjm")    )



df1 <- df %>%
    find_text("gj")

df1 ## this works: I know in which text column and where the text appears
#> # A tibble: 5 x 4
#>       a b     d     e    
#>   <int> <lgl> <lgl> <lgl>
#> 1     1 FALSE FALSE FALSE
#> 2     2 TRUE  FALSE FALSE
#> 3     3 FALSE FALSE FALSE
#> 4     4 FALSE FALSE FALSE
#> 5     5 FALSE FALSE FALSE


## and now this also does
df2 <- df %>%
    find_text_filter("gj")

df2
#> # A tibble: 1 x 4
#>       a b        d     e        
#>   <int> <chr>    <chr> <chr>    
#> 1     2 gjgkjguk hhh   "  kjihi"



### same with the str_detect functions

df3 <- df %>%
    find_text2("gj")

df3  
#> # A tibble: 5 x 4
#>       a b     d     e    
#>   <int> <lgl> <lgl> <lgl>
#> 1     1 FALSE FALSE FALSE
#> 2     2 TRUE  FALSE FALSE
#> 3     3 FALSE FALSE FALSE
#> 4     4 FALSE FALSE FALSE
#> 5     5 FALSE FALSE FALSE


df4 <- df %>%
    find_text_filter2("gj")

df4 
#> # A tibble: 1 x 4
#>       a b        d     e        
#>   <int> <chr>    <chr> <chr>    
#> 1     2 gjgkjguk hhh   "  kjihi"

Created on 2021-05-20 by the reprex package (v2.0.0)
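The logical map returned by find_text() can also be turned into a short list of positions, closer to how Ctrl+F jumps from hit to hit. This is only a sketch: the helper name locate_text and the sample data df_demo are made up for illustration and are not part of the original post.

```r
library(dplyr)

# Hypothetical helper (not from the original post): list every (row, column)
# position where the pattern `tt` matches in a character column.
locate_text <- function(df, tt) {
  # Logical flags for the character columns only
  flags <- df %>%
    transmute(across(where(is.character), function(x) grepl(tt, x)))
  m <- as.matrix(flags)
  # which(arr.ind = TRUE) returns one (row, col) index pair per TRUE cell
  pos <- which(m, arr.ind = TRUE)
  tibble(row    = as.integer(pos[, "row"]),
         column = colnames(m)[pos[, "col"]])
}

# Illustrative data, shaped like the question's example
df_demo <- tibble(a = 1:3,
                  b = c("hfh", "gjgkjguk", "uyyhjg"),
                  d = c("ggg", "hhh", "gfrr"))

hits <- locate_text(df_demo, "gj")
hits
```

The result has one row per matching cell, so a value that appears in several columns of the same row is reported once per column.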

Answer 1:

We could use str_detect:

library(dplyr)
library(stringr)
find_text_filter <- function(df, tt){
  df %>%
    filter(if_any(where(is.character), ~str_detect(.x, tt)))
}

Testing:

df %>%
    find_text_filter("gj")
# A tibble: 1 x 4
#      a b        d     e        
#  <int> <chr>    <chr> <chr>    
#1     2 gjgkjguk hhh   "  kjihi"

1. Also a good idea. Probably faster than using grepl.

2. Something doesn't add up. If I try my example with str_detect instead of grepl, I don't get any match…

3. @larry77 The str_ functions in stringr are built on stringi, which is very fast.

4. I think the problem was the `...`, since str_detect has only one additional parameter, negate; everything else is done with pattern modifiers, e.g. fixed(yourpattern).

5. Sorry, I swapped the arguments inside str_detect. I'll fix the post tomorrow.
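The two issues raised in these comments, argument order and where the options go, can be seen side by side in a small sketch. The sample vector x is made up for illustration.

```r
library(stringr)

x <- c("a.b", "A.B", "axb")

# grepl(): pattern first, strings second; options are ordinary arguments
grepl("a.b", x)                      # regex, "." matches any character
grepl("a.b", x, fixed = TRUE)        # literal text
grepl("a.b", x, ignore.case = TRUE)  # case-insensitive regex

# str_detect(): strings first, pattern second; options wrap the pattern
str_detect(x, "a.b")
str_detect(x, fixed("a.b"))
str_detect(x, regex("a.b", ignore_case = TRUE))
```

Swapping the two arguments of either function does not raise an error, it just searches for the wrong thing, which is why the mistake in comment 5 showed up as "no match" rather than as a failure.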

Answer 2:

You can use if_any here:

library(dplyr)

find_text_filter <- function(df, tt, ...){
  res <- df %>%
    filter(if_any(where(is.character), ~grepl(tt,.x, ...)))
  return(res)
}

df %>% find_text_filter("gj")

# A tibble: 1 x 4
#      a b        d     e        
#  <int> <chr>    <chr> <chr>    
#1     2 gjgkjguk hhh   "  kjihi"
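One caveat when forwarding `...` to grepl: inside a `~` lambda, `...` can refer to the lambda's own arguments rather than to those of the wrapper function (this depends on how the dplyr version in use builds the lambda), so options such as ignore.case = TRUE may not reach grepl. A minimal sketch that avoids the question by using an explicit function(x) instead. The name find_text_filter_dots and the data df_demo are made up for illustration.

```r
library(dplyr)

# Hypothetical variant of find_text_filter(); the name is illustrative
find_text_filter_dots <- function(df, tt, ...) {
  df %>%
    # function(x) has no `...` of its own, so the `...` below is looked up
    # in the wrapper and carries the caller's extra arguments to grepl()
    filter(if_any(where(is.character), function(x) grepl(tt, x, ...)))
}

# Illustrative data
df_demo <- tibble(a = 1:3, b = c("hfh", "gjgkjguk", "uyyhjg"))

res_default <- find_text_filter_dots(df_demo, "GJ")                      # case-sensitive
res_nocase  <- find_text_filter_dots(df_demo, "GJ", ignore.case = TRUE)  # option forwarded
```

With the default settings the upper-case pattern finds nothing, while the forwarded ignore.case = TRUE makes the second row match.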

Answer 3:

You can sum the matches across each row and keep the row only if the sum is > 0:

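find_text_filter <- function(df, tt, ...){
    res <- df %>%
        # flag matches in every character column (these columns become logical)
        mutate(across(where(is.character), ~grepl(tt,.x, ...))) %>%
        rowwise() %>%
        # keep rows with at least one match, in any number of columns
        filter(sum(c_across(where(is.logical))) > 0) %>%
        ungroup()
    return(res)
}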
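Note that the function above returns the TRUE/FALSE flags rather than the original text, because mutate() overwrites the character columns. If the original values are wanted, the same row-sum idea can be applied inside filter() without touching the columns. This is a sketch under that assumption: find_text_rowsums and df_demo are made-up names, not part of the answer.

```r
library(dplyr)

# Hypothetical variant: count matches per row, but leave the data unchanged
find_text_rowsums <- function(df, tt) {
  df %>%
    # across() yields one logical column per character column;
    # rowSums() counts the TRUE values in each row
    filter(rowSums(across(where(is.character), function(x) grepl(tt, x))) > 0)
}

# Illustrative data: row 1 matches in two columns, row 3 in one
df_demo <- tibble(a = 1:3,
                  b = c("gj1", "x", "y"),
                  d = c("gjz", "q", "gj"))

res <- find_text_rowsums(df_demo, "gj")
res
```

Row 1 matches in both b and d, so its count is 2; comparing with > 0 rather than == 1 is what keeps such rows.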