Learning new-tidyr and a cheeky steal from purrr

Ah, pivots…

So, ‘pivots’ have been part of my life in data analysis since my first interview, where I missed the point of the test (to make a pivot) and manually filtered loads of Excel data until I got there. Managed to get the job though (nothing to do with the sympathy vote for the broken shoulder I had at the time, of course …).

They are so easy to build in Excel, a real pain in SQL (as you need to specify the column names, or make it dynamic), but thank heavens for the tidyr package. Now, I’m not the first to suggest it could be a bit confusing, but I got my head around spread() and gather() and was a happy bunny. Then they changed it… Hadley Wickham and contributors made the wise choice to adopt the phrase ‘pivot’ in pivot_wider() and pivot_longer(). So I felt like a right idiot when I couldn’t figure it out…. I took the sensible approach of ignoring it until they forced my hand by removing it.

That said, I was working on a solution in some consultancy work today, and I needed to pivot the data but also to leave a strong markdown documentation trail. I felt it would have been a bit unfair if I left them with spread() knowing it would go the way of the Dodo soon. This brief post is just showing a working bit of pivot for my own memory and anyone else who’s interested.

I’m pivoting the NHSRdatasets LOS_models data.frame for organisation by columns, and applying function. This could be mean() or meadian() or similar, but I defined a function to add some Poisson distributed noise.

library(NHSRdatasets)

data("LOS_model")

library(tidyr)
library(dplyr)

set.seed(123) # For reproducibility

LOS_model %>% 
  pivot_wider(id_cols = ID
              , names_from = Organisation
              , values_from = LOS
              , values_fill = list(LOS = 0)
              , values_fn = list(LOS= function(x){x+ rpois(n = 1, lambda = 3)})
              )
## # A tibble: 300 × 11
##       ID Trust1 Trust2 Trust3 Trust4 Trust5 Trust6 Trust7 Trust8 Trust9 Trust10
##    <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>   <int>
##  1     1      4      0      0      0      0      0      0      0      0       0
##  2     2      0      5      0      0      0      0      0      0      0       0
##  3     3      0      0     14      0      0      0      0      0      0       0
##  4     4      0      0      0      8      0      0      0      0      0       0
##  5     5      0      0      0      0     17      0      0      0      0       0
##  6     6      0      0      0      0      0      7      0      0      0       0
##  7     7      0      0      0      0      0      0      7      0      0       0
##  8     8      0      0      0      0      0      0      0      9      0       0
##  9     9      0      0      0      0      0      0      0      0     10       0
## 10    10      0      0      0      0      0      0      0      0      0       4
## # … with 290 more rows

Bonus round:

I remember hearing Tom Jemmett @tomjemmett talk about the magic of purrr. Now I’m a bit slow off the mark with purrr because I couldn’t see why it was different to apply functions. Now I’ve hear more about it, I can see there’s more consistency and it allows some nice applications of functions. The thing that was really nice though was anonymous functions! They look so much cleaner, and are defined as a one-sided formula. I’ve replaced the aggregate function with an anonymous one in the example below:

library(purrr)

set.seed(123) # For reproducibility

LOS_model %>% 
  pivot_wider(id_cols = ID
              , names_from = Organisation
              , values_from = LOS
              , values_fill = list(LOS = 0)
              , values_fn = list(LOS= ~ . + rpois(n = 1, lambda = 3))
  )
## # A tibble: 300 × 11
##       ID Trust1 Trust2 Trust3 Trust4 Trust5 Trust6 Trust7 Trust8 Trust9 Trust10
##    <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>   <int>
##  1     1      4      0      0      0      0      0      0      0      0       0
##  2     2      0      5      0      0      0      0      0      0      0       0
##  3     3      0      0     14      0      0      0      0      0      0       0
##  4     4      0      0      0      8      0      0      0      0      0       0
##  5     5      0      0      0      0     17      0      0      0      0       0
##  6     6      0      0      0      0      0      7      0      0      0       0
##  7     7      0      0      0      0      0      0      7      0      0       0
##  8     8      0      0      0      0      0      0      0      9      0       0
##  9     9      0      0      0      0      0      0      0      0     10       0
## 10    10      0      0      0      0      0      0      0      0      0       4
## # … with 290 more rows

Maybe that’s of use to you. Hope so!

Chris Mainey
Chris Mainey
Deputy Director of Specialist Analytics

NHS Data Scientist and analytical leader with interests in statistical modelling and machine learning in healthcare data.

Related