Read in SAM files as tibble


toy <- read_sam(system.file("extdata", "toy.sam", package = "utilitarian"))

toy
#> # A tibble: 13 x 12
#>    QNAME  FLAG RNAME   POS  MAPQ CIGAR   RNEXT PNEXT  TLEN SEQ     QUAL   X12   
#>    <chr> <dbl> <chr> <dbl> <dbl> <chr>   <chr> <dbl> <dbl> <chr>   <chr>  <chr> 
#>  1 r001    163 ref       7    30 8M4I4M~ =        37    39 TTAGAT~ *      XX:B:~
#>  2 r002      0 ref       9    30 1S2I6M~ *         0     0 AAAAGA~ *      NA    
#>  3 r003      0 ref       9    30 5H6M    *         0     0 AGCTAA  *      NA    
#>  4 r004      0 ref      16    30 6M14N1~ *         0     0 ATAGCT~ *      NA    
#>  5 r003     16 ref      29    30 6H5M    *         0     0 TAGGC   *      NA    
#>  6 r001     83 ref      37    30 5S9M6S  =         7   -39 CAGCGC~ *      NA    
#>  7 x1        0 ref2      1    30 20M     *         0     0 aggttt~ ?????~ NA    
#>  8 x2        0 ref2      2    30 21M     *         0     0 ggtttt~ ?????~ NA    
#>  9 x3        0 ref2      6    30 9M4I13M *         0     0 ttataa~ ?????~ NA    
#> 10 x4        0 ref2     10    30 25M     *         0     0 CaaaTa~ ?????~ NA    
#> 11 x5        0 ref2     12    30 24M     *         0     0 aaTaat~ ?????~ NA    
#> 12 x6        0 ref2     14    30 23M     *         0     0 Taatta~ ?????~ NA    
#> 13 x3        0 ref      22    30 4S5M2I~ *         0     0 ttataa~ ?????~ NA

Create Tidy CIGAR Data

Results in a long table of tidy data showing start and end positions of the operations along with their English translations.

Due to the facetting of CIGAR visualizations on Reference names (RNAME) it is wise to group them together if the organism is what matters. For example in the case of the human genome vs sars coronavirus grouping all the human chromosomes together is useful.

sam %>% 
  mutate(RNAME = case_when(RNAME == "MN908947.3" ~ "hcov",
                           RNAME == "*" ~ "unmapped",
                           TRUE ~ "human")) %>% 
  tidy_cigar()


toy.tc <- tidy_cigar(toy)

toy.tc
#> # A tibble: 39 x 9
#> # Groups:   QNAME, RNAME, POS [13]
#>    QNAME RNAME  MAPQ   POS POSstart POSend CIGARstart CIGARend Operation
#>    <chr> <chr> <dbl> <dbl>    <dbl>  <dbl>      <dbl>    <dbl> <chr>    
#>  1 r001  ref      30     7        8     15          1        8 Match    
#>  2 r001  ref      30     7       16     19          9       12 Insertion
#>  3 r001  ref      30     7       20     23         13       16 Match    
#>  4 r001  ref      30     7       24     24         17       17 Deletion 
#>  5 r001  ref      30     7       25     27         18       20 Match    
#>  6 r001  ref      30    37       38     42          1        5 Soft Clip
#>  7 r001  ref      30    37       43     51          6       14 Match    
#>  8 r001  ref      30    37       52     57         15       20 Soft Clip
#>  9 r002  ref      30     9       10     10          1        1 Soft Clip
#> 10 r002  ref      30     9       11     12          2        3 Insertion
#> # ... with 29 more rows

Plot a CIGAR

Plotting a cigar from a single read in the tidied cigar table is done using plot_cigar() and specifying the read. Plots are facetted based on the references and the mapped position of the read. Returns a plot.

plot_cigar(toy.tc, "r002")

Plot all CIGARs

A wrapper around calling plot_cigar() on all the reads in the SAM file. Will warn and stop if more than 20 elements to be created, override this with the lg=TRUE argument.

p <- plot_all_cigars(toy.tc)

toy[toy$QNAME == "r001", "CIGAR"]
#> # A tibble: 2 x 1
#>   CIGAR     
#>   <chr>     
#> 1 8M4I4M1D3M
#> 2 5S9M6S
p[["r001"]]

toy[toy$QNAME == "x3", "CIGAR"]
#> # A tibble: 2 x 1
#>   CIGAR       
#>   <chr>       
#> 1 9M4I13M     
#> 2 4S5M2I2D8M5S
p[["x3"]]

Postscript

SAM headers contain useful information but for the purposes of cigar visualization they are ignored. Having access to them can be helpful so there is an included function to pull headers (as I couldn’t figure out a way to do so with readr).

Show Reference Lengths


h <- read_sam_headers(system.file("extdata", "toy.sam", package = "utilitarian"))

h$References
#> # A tibble: 0 x 2
#> # ... with 2 variables: Name <chr>, Length <chr>

Show Command line params and software details


h$Mapping
#> # A tibble: 0 x 4
#> # ... with 4 variables: ID <chr>, PN <chr>, VN <chr>, CL <chr>