Sometimes datasets are expected to be tidy but aren't. Finding the distinct rows for duplicated IDs is easy, but finding why they are distinct in tables with many columns is less straightforward. This function returns the values that resulted in any duplicated IDs, in one of two forms: a named list or a tibble.

Usage
exposeDupes(x, grouping_var, listout = TRUE)

Arguments

x

Tibble or data frame

grouping_var

Column in which to look for duplicated values (the ID column)

listout

Logical flag: TRUE (the default) returns a named list, FALSE returns a tibble

Value

Named list or tibble of results, depending on listout

Details

With listout = TRUE: a named list of two-column tibbles, one per column whose values differ within a duplicated ID. Each tibble contains:

  • Grouping Variable

  • Distinct values

With listout = FALSE: a tibble with the following columns:

  • Grouping Variable

  • X.grpNdistinct (one per column X): number of distinct values for the duplicated ID

  • X.values (one per column X): values for that duplicated ID
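
The list form described above can be illustrated with a small base R sketch. This is not the package's implementation: `expose_dupes_sketch` and the `toy` data frame are hypothetical names introduced here, and the ID column is passed as a string rather than as an unquoted name. It assumes, as the list output in the Examples suggests, that an ID is reported under a column only when its values in that column differ.

```r
# Hypothetical helper (an assumption, not the package's code): for each non-ID
# column, return the distinct (id, value) pairs for IDs that occur more than
# once and whose values in that column differ.
expose_dupes_sketch <- function(x, id) {
  # IDs that appear on more than one row
  dup_ids <- unique(x[[id]][duplicated(x[[id]])])
  dupes <- x[x[[id]] %in% dup_ids, , drop = FALSE]
  other_cols <- setdiff(names(x), id)

  out <- lapply(other_cols, function(col) {
    # Distinct (id, value) pairs for this column
    pairs <- unique(dupes[, c(id, col), drop = FALSE])
    # Number of distinct values per ID in this column
    n_per_id <- table(pairs[[id]])
    keep <- names(n_per_id)[n_per_id > 1]
    res <- pairs[pairs[[id]] %in% keep, , drop = FALSE]
    rownames(res) <- NULL
    res
  })
  names(out) <- other_cols

  # Drop columns that never differ within a duplicated ID
  Filter(function(d) nrow(d) > 0, out)
}

toy <- data.frame(
  id     = c("a", "a", "b", "b", "c"),
  colour = c("red", "blue", "green", "green", "red"),
  size   = c(1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

result <- expose_dupes_sketch(toy, "id")
```

Here `result$colour` holds the two rows for ID a (the only ID whose colour differs), and `result$size` holds the two rows for ID b. ID c occurs once and is never reported.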

https://dplyr.tidyverse.org/articles/programming.html

Examples

# \donttest{
  df <- data.frame(name = sample(letters, 20, replace = TRUE),
                   month = sample(month.name, 20, replace = TRUE),
                   letters = sample(LETTERS[1:10], 20, replace = TRUE),
                   nums = floor(runif(20, 1, 15)))
  dplyr::count(df, name)
#>    name n
#> 1     a 1
#> 2     b 2
#> 3     c 3
#> 4     d 1
#> 5     e 1
#> 6     f 1
#> 7     h 2
#> 8     i 1
#> 9     j 1
#> 10    k 1
#> 11    l 1
#> 12    n 1
#> 13    o 1
#> 14    q 1
#> 15    t 2
  exposeDupes(df, name)
#> $letters
#> # A tibble: 9 x 2
#>   name  letters
#>   <chr> <chr>  
#> 1 b     J      
#> 2 b     F      
#> 3 c     C      
#> 4 c     B      
#> 5 c     J      
#> 6 h     I      
#> 7 h     G      
#> 8 t     F      
#> 9 t     B      
#> 
#> $month
#> # A tibble: 7 x 2
#>   name  month    
#>   <chr> <chr>    
#> 1 b     March    
#> 2 b     August   
#> 3 c     September
#> 4 c     February 
#> 5 c     January  
#> 6 h     May      
#> 7 h     August   
#> 
#> $nums
#> # A tibble: 9 x 2
#>   name  nums 
#>   <chr> <chr>
#> 1 b     2    
#> 2 b     9    
#> 3 c     1    
#> 4 c     6    
#> 5 c     8    
#> 6 h     9    
#> 7 h     3    
#> 8 t     12   
#> 9 t     13   
#> 
  exposeDupes(df, name, listout = FALSE)
#> # A tibble: 9 x 7
#>   name  letters.grpNdistinct letters.values month.grpNdistinct month.values
#>   <chr>                <int> <chr>                       <int> <chr>       
#> 1 b                        2 J                               2 March       
#> 2 b                        2 F                               2 August      
#> 3 c                        3 C                               3 September   
#> 4 c                        3 B                               3 February    
#> 5 c                        3 J                               3 January     
#> 6 h                        2 I                               2 May         
#> 7 h                        2 G                               2 August      
#> 8 t                        2 F                               1 April       
#> 9 t                        2 B                               1 April       
#> # ... with 2 more variables: nums.grpNdistinct <int>, nums.values <dbl>
  # }