class: center, middle, inverse, title-slide # Fitting interactive web graphics into a data science workflow ### Carson Sievert ### Slides:
https://talks.cpsievert.me
@cpsievert
@cpsievert
cpsievert1@gmail.com
https://cpsievert.me/
Slides released under
Creative Commons
--- class: principles ## About me * PhD in statistics at Iowa State with Dr. Heike Hofmann (Dec 2016) * Thesis: [Interfacing R with Web Technologies for Interactive Statistical Graphics and Computing with Data](http://lib.dr.iastate.edu/etd/15422/) * CEO of [Sievert Consulting](https://consulting.cpsievert.me/) LLC (Jan 2017) * Clients include: plotly, NOAA, O'Reilly, Sandia * Looking for more part-time projects * The majority of my work is in interactive data visualization * Primarily work in R, dabble in JavaScript and python. * Maintain numerous R packages including: plotly, LDAvis, pitchRx --- background-image: url(../20171207/workflow.svg) background-size: contain class: inverse # Data Science Workflow --- background-image: url(../20171207/workflow1.svg) background-size: contain class: inverse # Expository vis <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> ### The web is the preferred medium for communicating results ### Assuming you know _exactly_ what you want to visualize, many good data vis frameworks exist (e.g. [d3.js](https://d3js.org/)) --- background-image: url(../20171207/workflow2.svg) background-size: contain class: inverse # Exploratory vis <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> ### Web technologies weren't designed for data exploration ### People [seem to have forgotten](http://flowingdata.com/2016/03/08/what-i-use-to-visualize-data/) how interative graphics can augment exploration --- class: principles ## Interactive graphics augment exploration! * Identify structure that otherwise goes missing ([Tukey 1972](http://stat-graphics.org/movies/prim9.html)). * "Search for information quickly without fully specified questions" ([Unwin & Hofmann, 2000](https://www.researchgate.net/publication/2425912_GUI_and_Command-line_-_Conflict_or_Synergy)) * "[Multiple linked views] are the optimal framework for posing queries about data" ([Buja et al 1996](https://www.jstor.org/stable/1390754)) * What about inference? See visual ([Majumder et al 2013](http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2013.808157?journalCode=uasa20#.Wl01_ZM-dTY)) and post-selection ([Berk et al 2013](https://projecteuclid.org/euclid.aos/1369836961)) inference. * Better understand/diagnose models ([Wickham, Cook, & Hofmann 2015](http://onlinelibrary.wiley.com/doi/10.1002/sam.11271/abstract)). .footnote[ --- Statisticians were building (very advanced!) int graphics systems decades ago -- <http://stat-graphics.org/movies/> ] --- class: principles ## Interactive graphics augment exploration! * Identify structure that otherwise goes missing ([Tukey 1972](http://stat-graphics.org/movies/prim9.html)). * "Search for information quickly without fully specified questions" ([Unwin & Hofmann, 2000](https://www.researchgate.net/publication/2425912_GUI_and_Command-line_-_Conflict_or_Synergy)) * "[Multiple linked views] are the optimal framework for posing queries about data" ([Buja et al 1996](https://www.jstor.org/stable/1390754)) * What about inference? See visual ([Majumder et al 2013](http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2013.808157?journalCode=uasa20#.Wl01_ZM-dTY)) and post-selection ([Berk et al 2013](https://projecteuclid.org/euclid.aos/1369836961)) inference. * <div class="highlight">Better understand/diagnose models</div> ([Wickham, Cook, & Hofmann 2015](http://onlinelibrary.wiley.com/doi/10.1002/sam.11271/abstract)). --- ## Deeper look at linear model diagnostics ```r # choose a model by AIC stepping backwards mod <- step(lm(mpg ~ ., data = mtcars), trace = FALSE) formula(mod) ``` ```r mpg ~ wt + qsec + am ``` ```r # produce (static ggplot2) diagnostic plot library(GGally) pm <- ggnostic(mod, mapping = aes(color = am)) # make it interactive library(plotly) ggplotly(pm) ``` --- <iframe src="ggnostic.html" width="100%" height="650" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> ??? The noticeably high value of cooksd suggests a large influence on the fitted model. Highlighting this point makes it obvious that it is influential since it has a unusually high response/residual in a fairly sparse region of the design space (i.e., it has a pretty high value of wt) and removing it would significantly reduce the estimated standard deviation (sigma). By comparison, the other two observations with similar values of wt have a response value very close to the overall mean, so even though their value of hat is high, their value of sigma is low. --- ### See relationships evolve over time ([made via ggplot2](https://plotly-book.cpsievert.me/linking-animated-views.html)) <a href="gapminder.html" target="_blank"> <img src="https://plotly-book.cpsievert.me/images/gapminder-highlight-animation.gif" width = "100%" /> </a> --- background-image: url(../20170803/tour.gif) background-size: contain class: middle ### Interactively [plot models in data space](tour.html) ([code](https://github.com/ropensci/plotly/blob/master/demo/animation-tour-USArrests.R)) --- class: middle, inverse # Designing, building, maintaining, and publishing a *general purpose* vis library (e.g. plotly) is very hard! ### Grad students: make your life easier...design a *domain specific* tool --- background-image: url(LDAvis.svg) background-size: contain ## LDAvis: interactive vis tools for interpreting topic models .footnote[ ### <https://github.com/cpsievert/LDAvis> ] --- class: principles ## Interactive graphics augment exploration! * Identify structure that otherwise goes missing ([Tukey 1972](http://stat-graphics.org/movies/prim9.html)). * Search for information quickly without fully specified questions ([Unwin & Hofmann, 2000](https://www.researchgate.net/publication/2425912_GUI_and_Command-line_-_Conflict_or_Synergy)) * <div class="highlight">[Multiple linked views] are the optimal framework for posing queries about data</div> ([Buja et al 1996](https://www.jstor.org/stable/1390754)) * What about inference? See visual ([Majumder et al 2013](http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2013.808157?journalCode=uasa20#.Wl01_ZM-dTY)) and post-selection ([Berk et al 2013](https://projecteuclid.org/euclid.aos/1369836961)) inference. * Better understand/diagnose models ([Wickham, Cook, & Hofmann 2015](http://onlinelibrary.wiley.com/doi/10.1002/sam.11271/abstract)). --- background-image: url(../20171207/epl.png) background-size: contain class: inverse --- background-image: url(../20171207/epl.png) background-size: contain class: middle .pull-left[ # Neat, but I can't: <div class="principles"> • Query different years <br /> • Zoom/pan </div> ] --- background-image: url(../20171207/epl.gif) background-size: contain class: principles <div align="right"> <a href="epl.html" target="_blank">Live demo </a> (<a href="https://github.com/ropensci/plotly/blob/master/demo/crosstalk-highlight-epl.R">code</a>) </div> --- background-image: url(../20171207/epl.gif) background-size: contain # Nice for comparing across seasons within teams .footnote[ ### What about comparing across teams with seasons? ] --- background-image: url(https://i.imgur.com/Id2nVFG.gif) background-size: contain <div align="right"> <a href="epl2.html" target="_blank">Live demo </a> (<a href="https://github.com/ropensci/plotly/blob/master/demo/crosstalk-highlight-epl2.R">code</a>) </div> --- background-image: url(https://i.imgur.com/Id2nVFG.gif) background-size: contain # Generally useful for comparing within/across panels! --- ## An example with Texas housing prices ```r library(dplyr) tx <- txhousing %>% select(city, year, month, median) %>% filter(city %in% c("Galveston", "Midland", "Odessa", "South Padre Island")) tx ``` ```r #> # A tibble: 748 x 4 #> city year month median #> <chr> <int> <int> <dbl> #> 1 Galveston 2000 1 95000 #> 2 Galveston 2000 2 100000 #> 3 Galveston 2000 3 98300 #> 4 Galveston 2000 4 111100 #> 5 Galveston 2000 5 89200 #> 6 Galveston 2000 6 108600 #> 7 Galveston 2000 7 99000 #> 8 Galveston 2000 8 96200 #> 9 Galveston 2000 9 104000 #> 10 Galveston 2000 10 118800 #> # ... with 738 more rows ``` --- ### Compare within and across cities ```r *TX <- crosstalk::SharedData$new(tx, ~year) p <- ggplot(TX, aes(month, median, group = year)) + geom_line() + facet_wrap(~city, ncol = 2) highlight(ggplotly(p), dynamic = T, persistent = T, selectize = T) ``` <iframe src="../20171207/txhousing.html" width="100%" height="450" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- class: inverse, middle, center # Note: 'crosstalk highlighting' works _without shiny_ (more later). Here are some examples _with shiny_ --- background-image: url(https://i.imgur.com/T7GSpv9.gif) background-size: contain class: principles <https://github.com/cpsievert/zikar> --- background-image: url(https://i.imgur.com/MGbR5AI.gif) background-size: contain class: principles <div align="right"> <a href='https://github.com/cpsievert/bcviz'>https://github.com/cpsievert/bcviz</a> </div> --- background-image: url(https://i.imgur.com/csIUJX0.gif) background-size: contain class: principles <https://github.com/cpsievert/apps> --- .pull-left[ <img src="../gifs/drowning.gif" width="80%"/> ] .pull-right[ # Lost in a sea of interactive techniques? ### Focus on the analysis task/question first ### Often many different ways to do the same thing! ] --- background-image: url(../20171207/taxonomy.svg) background-size: contain class: bottom, inverse ### _Interactive High-Dimensional Data Visualization_, [Buja et al 1996](https://www.jstor.org/stable/1390754); authors of [GGobi](http://ggobi.org) --- class: center, middle # Should an interactive graphics system be purely graphical (from a user perspective)? .footnote[ ### Over a system where users have to write code? ] --- class: inverse, center, middle ### From Hofmann & Unwin, [*GUI and Command-line - Conflict or Synergy?*](https://www.researchgate.net/publication/2425912_GUI_and_Command-line_-_Conflict_or_Synergy) # "Statistical software should be expressive, powerful, fast and reliable, but it should also be intuitive, easy to use, flexible, forgiving and consistent" --- class: inverse, center, middle # Statistical software should be <font color='red'>expressive, powerful, fast and reliable</font>, but it should also be <font color='#119dff'>intuitive, easy to use, flexible, forgiving and consistent</font> .footnote[ ### <font color='red'>Programming languages are good at this!</font> ### <font color='#119dff'>Graphical interfaces are good at this!</font> ] --- background-image: url(https://i.imgur.com/c7NJRa2.gif) background-size: contain class: inverse # The ideal 'power' system has best of both worlds --- background-image: url(../gifs/swamp.gif) background-size: contain class: inverse, center, bottom ## It is all too easy for statistical thinking to be **swamped by programming tasks.** -- Brian D. Ripley --- ## What's in the swamp? .pull-left[ ## Exploratory mud <div class="principles"> • Start-up <br/> • Iteration <br/> • Dead-end </div> ] .pull-right[ ## Expository mud <div class="principles"> • Deployment <br/> • Latency </div> ] --- class: inverse ## *How* to drain the swamp? ☝ 🍊 .pull-left[ ## Exploratory mud <div class="principles"> • Start-up <br/> • Iteration <br/> • Dead-end </div> ] .pull-right[ ## Expository mud <div class="principles"> • Deployment <br/> • Latency </div> ] .footnote[ ### What follows are some details & my suggestions about how one might go about it ] --- ## How to drain the swamp? <div class="highlight"> R 📦s that avoid </div> .pull-left[ ## Exploratory <div class="principles"> <div class="highlight"> • Start-up mud</div> <br/> • Iteration <br/> • Dead-end </div> ] .pull-right[ ## Expository <div class="principles"> • Deployment <br/> • Latency </div> ] <br/> <div class="highlight"> * Getting started should be easy! * Easy install (CRAN) and good documentation (e.g., [pkgdown](http://pkgdown.r-lib.org/) site) * Build upon existing work! <div> --- background-image: url(../20171207/europe.png) background-size: contain class: inverse --- ```r library(tidyverse) *# read and clean data d <- read_csv('GEOSTAT_grid_POP_1K_2011_V2_0_1.csv') %>% rbind(read_csv('JRC-GHSL_AIT-grid-POP_1K_2011.csv') %>% mutate(TOT_P_CON_DT = '')) %>% mutate( lat = as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', GRD_ID))/100, lng = as.numeric(gsub('.*[EW]([0-9]+)', '\\1', GRD_ID)) * ifelse(gsub('.*([EW]).*', '\\1', GRD_ID) == 'W', -1, 1) / 100 ) %>% filter(lng > 25, lng < 60) %>% group_by(lat = round(lat, 1), lng = round(lng, 1)) %>% summarize(value = sum(TOT_P, na.rm = T)) %>% ungroup() %>% tidyr::complete(lat, lng) *# visualize ggplot(d, aes(lng, lat + 5*(value / max(value, na.rm = T)))) + geom_line( aes(group = lat, text = paste("Population:", value)), size = 0.4, alpha = 0.8, color = '#5A3E37', na.rm = T ) ``` --- ```r library(tidyverse) *library(crosstalk) *library(plotly) d <- read_csv('GEOSTAT_grid_POP_1K_2011_V2_0_1.csv') %>% rbind(read_csv('JRC-GHSL_AIT-grid-POP_1K_2011.csv') %>% mutate(TOT_P_CON_DT = '')) %>% mutate( lat = as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', GRD_ID))/100, lng = as.numeric(gsub('.*[EW]([0-9]+)', '\\1', GRD_ID)) * ifelse(gsub('.*([EW]).*', '\\1', GRD_ID) == 'W', -1, 1) / 100 ) %>% filter(lng > 25, lng < 60) %>% group_by(lat = round(lat, 1), lng = round(lng, 1)) %>% summarize(value = sum(TOT_P, na.rm = T)) %>% ungroup() %>% tidyr::complete(lat, lng) *# define latitude as "interaction unit" *sd <- SharedData$new(d, ~lat) ggplot(sd, aes(lng, lat + 5*(value / max(value, na.rm = T)))) + geom_line( aes(group = lat, text = paste("Population:", value)), size = 0.4, alpha = 0.8, color = '#5A3E37', na.rm = T ) *ggplotly() ``` --- <iframe src="../20170803/europe.html" width="100%" height="750" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- ## How to drain the swamp? <div class="highlight"> R 📦s that avoid </div> .pull-left[ ## Exploratory <div class="principles"> • Start-up <br/> • <div class="highlight">Iteration mud</div> <br/> • Dead-end </div> ] .pull-right[ ## Expository <div class="principles"> • Deployment <br/> • Latency </div> ] <br/> <div class="highlight"> * Reduce distance between data and visuals * Leverage [grammar of graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448) (e.g. **ggplot2**, **vega**, etc) * Leverage tidy data principles (e.g., **dplyr**, **tidyr**, **broom**, etc) <div> --- ## Some motivation: query missing values by city <iframe src="../20171207/txhousing-missing.html" width="100%" height="500" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- class: bottom, left background-image: url(../20171207/pipeline.svg) background-size: contain ## The 'data pipeline' --- ## The implementation ```r library(plotly) library(crosstalk) # define city as 'unit of interaction' sd <- SharedData$new(txhousing, ~city) # initiate plot object, ensuring one mark per group base <- plot_ly(sd, color = I("black")) %>% group_by(city) # transform plot object as if it's a data frame p1 <- base %>% summarise(miss = sum(is.na(median))) %>% filter(miss > 0) %>% arrange(miss) %>% add_bars(x = ~miss, y = ~factor(city, levels = city)) %>% layout(barmode = "overlay") p2 <- add_lines(base, x = ~date, y = ~median, alpha = 0.3) subplot(p1, p2, titleX = TRUE, widths = c(0.3, 0.7)) %>% highlight(dynamic = TRUE, persistent = TRUE, selectize = TRUE) ``` --- ## How to drain the swamp? <div class="highlight"> R 📦s that avoid </div> .pull-left[ ## Exploratory <div class="principles"> • Start-up <br/> • Iteration<br/> • Dead-end </div> ] .pull-right[ ## Expository <div class="principles"> <div class="highlight"> • Deployment mud </div> <br/> • Latency </div> ] <br/> <div class="highlight"> Produce a standalone web page whenever possible <div> --- background-image: url(../20171207/server-client.svg) background-size: contain --- background-image: url(../20171207/server-client.svg) background-size: contain class: middle, center # Standalone web pages are **much** easier to share, deploy, scale, and maintain. --- class: middle, bottom background-image: url(../20171207/pipeline.svg) background-size: contain # Where is the pipeline? --- background-image: url(../20171207/crosstalk.svg) background-size: contain ## The general model .footnote[ ### Links are specified in R, but the "updating logic" is JavaScript -- no server required! ] --- ## How to drain the swamp? <div class="highlight"> R 📦s that avoid </div> .pull-left[ ## Exploratory <div class="principles"> • Start-up <br/> • Iteration <br/> • Dead-end </div> ] .pull-right[ ## Expository <div class="principles"> • Deployment <br/> <div class="highlight">• Latency mud</div> </div> ] <br/> <div class="highlight"> * User activity drops w/ delay of 500ms (<a href="https://idl.cs.washington.edu/papers/latency/">Liu & Heer 2014</a>) * Support canvas-based (e.g. WebGL) as well as vector (e.g. SVG) * Provide API for modifying plots in R (& JavaScript!) <div> --- ## Base map layer does not require a redraw with a new fillcolor! <img src="http://i.imgur.com/PzDMqZ0.gif" /> --- ### Leverage any [plotly.js function](https://plot.ly/javascript/plotlyjs-function-reference/) to *modify* a plot via `plotlyProxy()` ```r library(shiny) library(plotly) ui <- fluidPage( selectInput("color", "Canada's fillcolor", colors()), plotlyOutput("map") ) server <- function(input, output, session) { output$map <- renderPlotly({ map_data("world", "canada") %>% group_by(group) %>% plot_mapbox(x = ~long, y = ~lat, color = I("black")) %>% add_polygons() }) observeEvent(input$color, { plotlyProxy("map", session) %>% plotlyProxyInvoke("restyle", list(fillcolor = toRGB(input$color))) }) } shinyApp(ui, server) ``` --- ## How to drain the swamp? <div class="highlight"> R 📦s that avoid </div> .pull-left[ ## Exploratory <div class="principles"> • Start-up <br/> • Iteration <br/> <div class="highlight">• Dead-end mud</div> </div> ] .pull-right[ ## Expository <div class="principles"> • Deployment <br/> • Latency </div> ] <br/> <div class="highlight"> * Integrate with other libraries that complement yours <div> --- background-image: url(../20170803/plotlyLeaflet.gif) background-size: contain ## plotly & leaflet --- background-image: url(../20171207/workflow2.svg) background-size: 250px background-position: 90% 8% class: inverse, center, middle ### No matter how complex and polished the individual operations are, it is often # the quality of the glue that most directly determines the power of the system. .footnote[ Quote from Hal Abelson -- part of the [tidyverse manifesto](https://cran.r-project.org/package=tidyverse) ] --- ```r library(plotly) library(leaflet) library(crosstalk) *# good glue requires uniform data structures *sd <- SharedData$new(quakes) p <- plot_ly(sd, x = ~depth, y = ~mag) %>% add_markers(alpha = 0.5) %>% highlight("plotly_selected", dynamic = TRUE) map <- leaflet(sd) %>% addTiles() %>% addCircles() bscols(widths = c(6, 6), p, map) ``` --- ## How to drain the swamp? <div class="highlight"> R 📦s that avoid </div> .pull-left[ ## Exploratory <div class="principles"> • Start-up <br/> • Iteration <br/> <div class="highlight">• Dead-end mud</div> </div> ] .pull-right[ ## Expository <div class="principles"> • Deployment <br/> • Latency </div> ] <br/> <div class="highlight"> * Every 📦 that reduces start-up/iteration will *always* have dead ends. * What you leave out is just as important as what you put in! * Ask yourself: can others do powerful things _quickly_? * Strive for 80/20 rule: 80% easy, other 20% possible <div> --- #### Provide tools for "power users" ``` plot_ly(mtcars, x = ~wt, y = ~mpg) %>% add_markers(text = ~paste0("http://google.com/#q=", rownames(mtcars))) %>% htmlwidgets::onRender("function(el, x) { el.on('plotly_click', function(d) { var pt = d.points[0]; var url = pt.data.text[pt.pointNumber]; window.open(url); }); }") ``` <iframe src="../20171207/onrender.html" width="100%" height="350" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- ## How to drain the swamp? <div class="highlight"> R 📦s that avoid </div> .pull-left[ ## Exploratory <div class="principles"> • Start-up <br/> • Iteration <br/> <div class="highlight">• Dead-end mud</div> </div> ] .pull-right[ ## Expository <div class="principles"> • Deployment <br/> • Latency </div> ] <br/> <div class="highlight"> * Interactive graphics software must be opinionated * Not enough statisticians influence design nowadays * Tech landscape shifted from C/Java to JavaScript * We need more focus on techniques for data analysis tasks! <div> --- background-image: url(../20161212a/plotly.svg) background-size: contain ## The plotly ecosystem --- background-image: url(../20161212a/plotly.svg) background-size: contain ## TODO: highlight the interface --- ## Summary #### Interactive graphics *can* augment exploratory analysis, but are only *practically* useful when we can iterate quickly .pull-left[ ## Exploratory mud * Start-up * Iteration * Reduce distance between data and visuals * Leverage grammar of graphics and tidy-data principles * Dead-end * Integrate with other libraries that complement yours * Strive for 80/20 rule: 80% easy, other 20% possible * Interactive techniques should empower data analysis tasks ] .pull-right[ ## Expository mud * Deployment * Use standalone web pages whenever possible * Latency * User activity drops w/ delay of 500ms ] --- class: middle, center, principles ## Thanks! Questions? Slides: <https://talks.cpsievert.me> Learn more: <https://plotly-book.cpsievert.me> #### Contact
<a href='https://twitter.com/cpsievert'>@cpsievert</a> <br />
<a href='https://github.com/cpsievert'>@cpsievert</a> <br />
<cpsievert1@gmail.com> <br />
<https://cpsievert.me/>