exposeDupes.Rd
Sometimes datasets are expected to be tidy but aren't. Finding the distinct rows for duplicated IDs is easy, but finding why they are distinct in tables with many columns is less straightforward. This function returns the values that resulted in any duplicated IDs, in one of two forms: a named list or a tibble.
exposeDupes(x, grouping_var, listout = TRUE)
x: A tibble or data frame.
grouping_var: Column to check for duplicated values.
listout: Logical flag. If TRUE (the default), return a named list; if FALSE, return a tibble.
A named list or a tibble of results, depending on listout.
Named list of two-column tibbles, one for each column whose values differ within a duplicated ID
Grouping Variable
Distinct values
Tibble with the following columns
Grouping Variable
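X.grpNdistinct: number of distinct values of column X for the duplicated ID (one such column for each non-grouping column X)
X.values: the values of column X for that duplicated ID (one such column for each non-grouping column X)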
https://dplyr.tidyverse.org/articles/programming.html
# \donttest{
df <- data.frame(name = sample(letters, 20, replace = TRUE),
                 month = sample(month.name, 20, replace = TRUE),
                 letters = sample(LETTERS[1:10], 20, replace = TRUE),
                 nums = floor(runif(20, 1, 15)))
dplyr::count(df, name)
#> name n
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 1
#> 5 e 1
#> 6 f 1
#> 7 h 2
#> 8 i 1
#> 9 j 1
#> 10 k 1
#> 11 l 1
#> 12 n 1
#> 13 o 1
#> 14 q 1
#> 15 t 2
exposeDupes(df, name)
#> $letters
#> # A tibble: 9 x 2
#> name letters
#> <chr> <chr>
#> 1 b J
#> 2 b F
#> 3 c C
#> 4 c B
#> 5 c J
#> 6 h I
#> 7 h G
#> 8 t F
#> 9 t B
#>
#> $month
#> # A tibble: 7 x 2
#> name month
#> <chr> <chr>
#> 1 b March
#> 2 b August
#> 3 c September
#> 4 c February
#> 5 c January
#> 6 h May
#> 7 h August
#>
#> $nums
#> # A tibble: 9 x 2
#> name nums
#> <chr> <chr>
#> 1 b 2
#> 2 b 9
#> 3 c 1
#> 4 c 6
#> 5 c 8
#> 6 h 9
#> 7 h 3
#> 8 t 12
#> 9 t 13
#>
exposeDupes(df, name, listout = FALSE)
#> # A tibble: 9 x 7
#> name letters.grpNdistinct letters.values month.grpNdistinct month.values
#> <chr> <int> <chr> <int> <chr>
#> 1 b 2 J 2 March
#> 2 b 2 F 2 August
#> 3 c 3 C 3 September
#> 4 c 3 B 3 February
#> 5 c 3 J 3 January
#> 6 h 2 I 2 May
#> 7 h 2 G 2 August
#> 8 t 2 F 1 April
#> 9 t 2 B 1 April
#> # ... with 2 more variables: nums.grpNdistinct <int>, nums.values <dbl>
# }
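The per-ID distinct counts reported in the X.grpNdistinct columns can be approximated with dplyr alone. The following is a minimal sketch, not the implementation of exposeDupes(): it assumes dplyr 1.0.0 or later (for across()), and the data frame and the name distinct_counts are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Hypothetical fixed data: id "a" is duplicated and differs in `month`,
# id "b" is duplicated but identical in every column, id "c" is unique.
df <- data.frame(
  id    = c("a", "a", "b", "b", "c"),
  month = c("May", "June", "April", "April", "March"),
  nums  = c(1, 1, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Keep only rows whose id occurs more than once, then count the distinct
# values of every non-grouping column within each id.
distinct_counts <- df %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  summarise(across(everything(), n_distinct), .groups = "drop")

# id "a" has 2 distinct months; every other count is 1.
distinct_counts
```

A count greater than 1 marks a column that keeps the rows of a duplicated ID from collapsing into one, which is the kind of column the list output above reports.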