Should I Move to a Database?

Long ago at a real-life meetup (remember those?), I received a t-shirt that said: “biggeR than R”. I think it was from Microsoft, which developed a special version of R with automatic parallelization. Anyway, it got me thinking about the bigness (is that a word? it is now!) of your data. Is your data becoming too big?

[silly “big data” gif]

Your dataset becomes so big and unwieldy that operations take a long time. How long is too long? That depends on you: I get annoyed if I don’t get feedback within 20 seconds (and I love it when a program shows me a progress bar at that point, at least I know how long it will take!), but your boundary may lie at some other point. When you reach that point of annoyance, or the point where you can no longer do your work, you should improve your workflow.

I will show you a few ways to speed things up: using other R packages, moving from pandas to polars in Python, or leveraging databases. I see some hesitancy about moving to a database for analytical work, and that is too bad. Too bad for two reasons: one, it is super simple; two, it will save you a lot of time.

Using dplyr (the baseline for the rest of the examples)

Imagine we have a dataset of sales (see the github page for the dataset generation details). I imagine that analysts have to do some manipulation to figure out sales and plot the details for reports (if my approach looks stupid: I never do this kind of work). The end result of this computation is a monthly table. See this github page for easy-to-copy code examples for all my steps.

source("datasetgeneration.R")
suppressPackageStartupMessages(
library(dplyr)  
)

Load in the dataset.

# this is where you would read in data, but I generate it.
sales <- 
  as_tibble(create_dataset(rows = 1E6))
sales %>% arrange(year, month, SKU) %>% head()
## # A tibble: 6 × 5
##   month  year sales_units SKU     item_price_eur
##   <chr> <int>       <dbl> <chr>            <dbl>
## 1 Apr    2001           1 1003456          49.9 
## 2 Apr    2001           1 1003456          43.6 
## 3 Apr    2001           1 1003456           9.04
## 4 Apr    2001           1 1003456          37.5 
## 5 Apr    2001           1 1003456          22.1 
## 6 Apr    2001           1 1003456          28.0

This is a dataset with 1 million (1E6) rows of sales, where every row is a single sale; sales_units in this case can be 1, 2 or -1 (a return). You’d like to see monthly and yearly aggregates of sales per Stock Keeping Unit (SKU).

# create monthly aggregates
montly_sales <- 
  sales %>% 
  group_by(month, year, SKU) %>% 
  mutate(pos_sales = case_when(
    sales_units > 0 ~ sales_units,
    TRUE ~ 0
  )) %>% 
  summarise(
    total_revenue = sum(sales_units * item_price_eur),
    max_order_price = max(pos_sales * item_price_eur),
    avg_price_SKU = mean(item_price_eur),
    items_sold = n()
    )
## `summarise()` has grouped output by 'month', 'year'. You can override using the `.groups` argument.
# create yearly aggregates
yearly_sales <-
  sales %>% 
  group_by(year, SKU) %>% 
  mutate(pos_sales = case_when(
    sales_units > 0 ~ sales_units,
    TRUE ~ 0
  )) %>% 
  summarise(
    total_revenue = sum(sales_units * item_price_eur),
    max_order_price = max(pos_sales * item_price_eur),
    avg_price_SKU = mean(item_price_eur),
    items_sold = n()
  )
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
head(montly_sales)
## # A tibble: 6 × 7
## # Groups:   month, year [1]
##   month  year SKU     total_revenue max_order_price avg_price_SKU items_sold
##   <chr> <int> <chr>           <dbl>           <dbl>         <dbl>      <int>
## 1 Apr    2001 1003456       291261.           100.           27.6      10083
## 2 Apr    2001 109234         59375.            99.8          27.5       2053
## 3 Apr    2001 112348         87847             99.8          27.7       3053
## 4 Apr    2001 112354         30644.            99.5          27.4       1081
## 5 Apr    2001 123145         29485.            99.7          27.4        993
## 6 Apr    2001 123154         28366.            99.9          27.4       1005

The analyst reports this data to the CEO in the form of an inappropriate bar graph (a line graph would be best, but you lost all of your bargaining power when you vetoed pie charts last week). This is a plot of just 2 of the products.

library(ggplot2)
ggplot(yearly_sales %>% 
         filter(SKU %in% c("112348", "109234")), 
       aes(year, total_revenue, fill = SKU))+
  geom_col(alpha = 2/3)+
  geom_line()+
  geom_point()+
  facet_wrap(~SKU)+
  labs(
    title = "Yearly revenue for two products",
    subtitle = "Clearly no one should give me an analyst job",
    caption = "bars are inappropriate for this data, but sometimes it is just easier to give in ;)",
    y = "yearly revenue"
  )

Computing this took quite some time with 1E8 rows (see the github page), so I simplified it to 1E6 rows for this blogpost.
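
If you want to put a number on “too long” instead of going by gut feeling, you can wrap the pipeline in system.time(). A minimal sketch, timing the monthly aggregation from above (the numbers you get will obviously depend on your machine):

system.time({
  sales %>% 
    group_by(month, year, SKU) %>% 
    mutate(pos_sales = case_when(sales_units > 0 ~ sales_units, TRUE ~ 0)) %>% 
    summarise(
      total_revenue = sum(sales_units * item_price_eur),
      max_order_price = max(pos_sales * item_price_eur),
      avg_price_SKU = mean(item_price_eur),
      items_sold = n(),
      .groups = "drop"
    )
})
# look at the "elapsed" value: that is wall-clock time in seconds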

Improving speeeeeeeeeed!

Let’s use specialized libraries: for R, use data.table; for Python, move from pandas to polars.

Using data.table

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
source("datasetgeneration.R")
salesdt <- 
  as.data.table(create_dataset(10E5))

Pure data.table syntax for the total_revenue step (I think there are better ways to do this):

salesdt[, .(total_revenue = sum(sales_units * item_price_eur)),
        keyby = .(year, SKU)]
##     year     SKU total_revenue
##  1: 2001 1003456       3506290
##  2: 2001  109234        703168
##  3: 2001  112348       1058960
##  4: 2001  112354        352691
##  5: 2001  123145        342890
##  6: 2001  123154        350893
##  7: 2001  123194        174627
##  8: 2001  153246        350923
##  9: 2001 1923456        349300
## 10: 2002 1003456       3529040
## 11: 2002  109234        701677
## 12: 2002  112348       1047698
## 13: 2002  112354        354164
## 14: 2002  123145        348351
## 15: 2002  123154        355113
## 16: 2002  123194        177576
## 17: 2002  153246        355111
## 18: 2002 1923456        348666
## 19: 2003 1003456       3520253
## 20: 2003  109234        704738
## 21: 2003  112348       1043208
## 22: 2003  112354        355979
## 23: 2003  123145        350588
## 24: 2003  123154        350832
## 25: 2003  123194        178416
## 26: 2003  153246        356952
## 27: 2003 1923456        346832
## 28: 2004 1003456       3530158
## 29: 2004  109234        701551
## 30: 2004  112348       1053773
## 31: 2004  112354        353585
## 32: 2004  123145        362977
## 33: 2004  123154        355703
## 34: 2004  123194        175472
## 35: 2004  153246        352139
## 36: 2004 1923456        354396
##     year     SKU total_revenue
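
For completeness, the full set of yearly aggregates from the dplyr example can also be written in pure data.table. A sketch (one of several possible ways; fifelse() takes the place of the case_when() step):

# positive sales only, used for the max order price
salesdt[, pos_sales := fifelse(sales_units > 0, sales_units, 0)]
salesdt[, .(
  total_revenue   = sum(sales_units * item_price_eur),
  max_order_price = max(pos_sales * item_price_eur),
  avg_price_SKU   = mean(item_price_eur),
  items_sold      = .N
), keyby = .(year, SKU)]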

Using dplyr on top of data.table:

library(dtplyr)
salesdt <- 
  as.data.table(create_dataset(10E5))
salesdt %>% 
  group_by(year, SKU) %>% 
  mutate(pos_sales = case_when(
    sales_units > 0 ~ sales_units,
    TRUE ~ 0
  )) %>% 
  summarise(
    total_revenue = sum(sales_units * item_price_eur),
    max_order_price = max(pos_sales * item_price_eur),
    avg_price_SKU = mean(item_price_eur),
    items_sold = n()
  )
## Source: local data table [36 x 6]
## Groups: year
## Call:   copy(`_DT1`)[, `:=`(pos_sales = fcase(sales_units > 0, sales_units, 
##     rep(TRUE, .N), 0)), by = .(year, SKU)][, .(total_revenue = sum(sales_units * 
##     item_price_eur), max_order_price = max(pos_sales * item_price_eur), 
##     avg_price_SKU = mean(item_price_eur), items_sold = .N), keyby = .(year, 
##     SKU)]
## 
##    year SKU     total_revenue max_order_price avg_price_SKU items_sold
##   <int> <chr>           <dbl>           <dbl>         <dbl>      <int>
## 1  2001 1003456      3506290.           100            27.5     121462
## 2  2001 109234        703168.           100            27.5      24402
## 3  2001 112348       1058960.           100.           27.5      36759
## 4  2001 112354        352691.           100.           27.4      12323
## 5  2001 123145        342890.            99.8          27.4      11903
## 6  2001 123154        350893.           100.           27.5      12228
## # … with 30 more rows
## 
## # Use as.data.table()/as.data.frame()/as_tibble() to access results

What if I use Python locally?

The pandas library has a lot of functionality but can be a bit slow at large data sizes.

# write csv so pandas and polars can read it in again.
# arrow is another way to transfer data.
readr::write_csv(sales, "sales.csv")

The old, slower version (a better one follows below):

import pandas as pd
sales = pd.read_csv("sales.csv")
# chained assignment like this is slow and raises a SettingWithCopyWarning,
# but it is kept here as the "old" approach
sales["pos_sales"] = 0
sales["pos_sales"][sales["sales_units"] > 0] = sales["sales_units"][sales["sales_units"] > 0]
sales["euros"] = sales["sales_units"] * sales["item_price_eur"]
sales.groupby(["month", "year", "SKU"]).agg({
    "item_price_eur": ["mean"],
    "euros": ["sum", "max"]
}).reset_index()
    month  year      SKU item_price_eur        euros
                                   mean          sum     max
0     Apr  2001   109234      27.538506   5876923.23  100.00
1     Apr  2001   112348      27.506314   8774064.08  100.00
2     Apr  2001   112354      27.436687   2945084.13  100.00
3     Apr  2001   123145      27.594551   2943957.39   99.98
4     Apr  2001   123154      27.555665   2931884.68  100.00
..    ...   ...      ...            ...          ...     ...
427   Sep  2004   123154      27.508490   2932012.98  100.00
428   Sep  2004   123194      27.515314   1467008.19   99.98
429   Sep  2004   153246      27.491941   2949899.86  100.00
430   Sep  2004  1003456      27.530511  29326323.18  100.00
431   Sep  2004  1923456      27.483273   2927890.77  100.00

[432 rows x 6 columns]

A newer version, suggested by Andrea Dalsano, is quite a bit faster:

import pandas as pd
sales = pd.read_csv("sales.csv")
(sales.assign(
    pos_sales= lambda x: x['sales_units'].where(x['sales_units'] > 0, 0),
    euros = lambda x : x["sales_units"] * x["item_price_eur"],
    euros_sku = lambda x : x["pos_sales"] * x["item_price_eur"] )
    .groupby(["month", "year", "SKU"], as_index=False)
    .agg({
        "item_price_eur":[("avg_price_SKU","mean")],
        "euros":[("total_revenue","sum")],
        "euros_sku":[("max_price_SKU","max"), ("items_sold","count")]
     }))

There is a Python version of data.table (it is all C or C++, so it is quite portable). There is also a new pandas replacement called polars, and it is superfast!

import polars as pl
sales = pl.read_csv("sales.csv")
# 44 sec read time.
sales["euros"] = sales["sales_units"] * sales["item_price_eur"]
sales.groupby(["month", "year", "SKU"]).agg({
"item_price_eur":["mean"],
"euros":["sum", "max"]
})
shape: (432, 6)
┌───────┬──────┬─────────┬─────────────────────┬──────────────────────┬───────────┐
│ month ┆ year ┆ SKU     ┆ item_price_eur_mean ┆ euros_sum            ┆ euros_max │
│ ---   ┆ ---  ┆ ---     ┆ ---                 ┆ ---                  ┆ ---       │
│ str   ┆ i64  ┆ i64     ┆ f64                 ┆ f64                  ┆ f64       │
╞═══════╪══════╪═════════╪═════════════════════╪══════════════════════╪═══════════╡
│ Mar   ┆ 2002 ┆ 123154  ┆ 27.483172388110916  ┆ 2.946295520000007e6  ┆ 100       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Jun   ┆ 2004 ┆ 1923456 ┆ 27.491890680384582  ┆ 2.9289146600000123e6 ┆ 100       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Feb   ┆ 2003 ┆ 1003456 ┆ 27.50122395426729   ┆ 2.9425509809999317e7 ┆ 100       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Jul   ┆ 2003 ┆ 1923456 ┆ 27.515498919450454  ┆ 2.9408777300000447e6 ┆ 100       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...   ┆ ...  ┆ ...     ┆ ...                 ┆ ...                  ┆ ...       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Sep   ┆ 2003 ┆ 109234  ┆ 27.47832064931681   ┆ 5.875787689999974e6  ┆ 100       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Oct   ┆ 2004 ┆ 123145  ┆ 27.51980323559326   ┆ 2.9235666999999783e6 ┆ 100       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Mar   ┆ 2004 ┆ 123145  ┆ 27.532764418358507  ┆ 2.9523948500000383e6 ┆ 100       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ May   ┆ 2003 ┆ 1003456 ┆ 27.496404438507874  ┆ 2.9371373149999738e7 ┆ 100       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ May   ┆ 2004 ┆ 109234  ┆ 27.47367882104357   ┆ 5.862501800000172e6  ┆ 100       │
└───────┴──────┴─────────┴─────────────────────┴──────────────────────┴───────────┘

Combine Python and R

Alternatively, you could do part of your data work in R and part in Python and share the data using the Apache Arrow file format: you can write the results with Arrow in R and read them in through Python. Or you can use parquet files, which are very optimized too. See this post by Rich Pauloo for amazingly fast examples!
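
A minimal sketch of the R side, assuming the {arrow} package is installed (the file names are just examples); Python can then read the same files with pyarrow or polars:

library(arrow)
# parquet: compressed, columnar, readable from R, Python, Spark, duckdb, ...
write_parquet(yearly_sales, "yearly_sales.parquet")
# feather / Arrow IPC: very fast to write and read
write_feather(yearly_sales, "yearly_sales.feather")
# and reading it back
yearly_sales_again <- read_parquet("yearly_sales.parquet")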

Using a local database

There comes a point where your data becomes too big, or you have to make use of several datasets that, together, are bigger in size than your memory.

We can make use of all the brainwork that went into database design since the 1980s. A lot of people spent a lot of time on making sure these things work.

The easiest to start with is SQLite, a super simple database that can run in memory or on disk and needs nothing from R except the {RSQLite} package. In fact, SQLite is used so much in computing that you probably have dozens of SQLite databases on your computer or smartphone.

# Example with SQLite
# Write data set to sqlite
source("datasetgeneration.R")
con <- DBI::dbConnect(RSQLite::SQLite(), "sales.db")
DBI::dbWriteTable(con, name = "sales",value = sales)
# write sql yourself 
# it is a bit slow. 
head(DBI::dbGetQuery(con, "SELECT SKU, year, SUM(sales_units * item_price_eur) AS total_revenue FROM sales GROUP BY year, SKU"))
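
The full set of yearly aggregates from the dplyr example translates to SQL along these lines (a sketch; the CASE WHEN takes the place of case_when()):

DBI::dbGetQuery(con, "
  SELECT year, SKU,
         SUM(sales_units * item_price_eur) AS total_revenue,
         MAX(CASE WHEN sales_units > 0 THEN sales_units ELSE 0 END * item_price_eur) AS max_order_price,
         AVG(item_price_eur) AS avg_price_SKU,
         COUNT(*) AS items_sold
  FROM sales
  GROUP BY year, SKU")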

The R community has made sure that almost every database can talk to the Database Interface package (DBI). Other packages build on DBI, and that combination allows R to do things you cannot do in Python: you can use the same code to run a query against a data frame (or data.table) in memory, or against a table in the database!

library(dplyr)
library(dbplyr)
sales_tbl <- tbl(con, "sales") # link to table in database on disk
sales_tbl %>% # Now dplyr talks to the database.
  group_by(year, SKU) %>% 
  mutate(pos_sales = case_when(
    sales_units > 0 ~ sales_units,
    TRUE ~ 0
  )) %>% 
  summarise(
    total_revenue = sum(sales_units * item_price_eur),
    max_order_price = max(pos_sales * item_price_eur),
    avg_price_SKU = mean(item_price_eur),
    items_sold = n()
  )
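
Note that dbplyr is lazy: the pipeline above only builds a query, which is sent to the database when you print or collect() the result. A sketch of how to inspect and materialize it (show_query() and collect() are standard dplyr verbs):

monthly_db <- sales_tbl %>% 
  group_by(year, SKU) %>% 
  summarise(total_revenue = sum(sales_units * item_price_eur, na.rm = TRUE))
show_query(monthly_db)        # print the SQL that dbplyr generated
result <- collect(monthly_db) # run the query in the database and pull the result into R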

Recently duckdb came out; it is also a database you can run inside your R or Python process with no frills. So while I used to recommend SQLite (and you can still use it), I now recommend duckdb for most analytics work. SQLite is amazing for transactional work; for instance, many shiny apps work very nicely with SQLite.

source("datasetgeneration.R")
#(you also don't want to load all the data like this, it is usually better to load directly into duckdb, read the docs for more info)
duck = DBI::dbConnect(duckdb::duckdb(), dbdir="duck.db", read_only=FALSE)
DBI::dbWriteTable(duck, name = "sales",value = sales)
library(dplyr)
# SQL queries work exactly the same as SQLite, so I'm not going to show it.
# It's just an amazing piece of technology!
sales_duck <- tbl(duck, "sales")
sales_duck %>% 
  group_by(year, SKU) %>% 
  mutate(pos_sales = case_when(
    sales_units > 0 ~ sales_units,
    TRUE ~ 0
  )) %>% 
  summarise(
    total_revenue = sum(sales_units * item_price_eur),
    max_order_price = max(pos_sales * item_price_eur),
    avg_price_SKU = mean(item_price_eur),
    items_sold = n()
  )
DBI::dbDisconnect(duck)

The results are the same, but duckdb is way faster for most analytics queries (sums, aggregates etc).

You can use SQLite and duckdb in memory only too! That is even faster, but of course you need the data to fit into memory, which was our problem from the start…
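
A sketch of what the in-memory variants look like (same connection code as before, just no file on disk):

# SQLite fully in memory
con_mem <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# duckdb is in-memory by default when you give it no dbdir
duck_mem <- DBI::dbConnect(duckdb::duckdb())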

So at what point should you move from data.table to SQLite/duckdb? I think when you start to have multiple datasets, or when you want to combine columns from one table with columns from another table, you should consider going the local database route.
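
For example, joining a second table inside the database stays ordinary dplyr code, and the join itself runs in duckdb. A sketch, assuming the duck connection from above is still open and that a hypothetical products table (one row per SKU, with a product_group column) has been written to it:

# products is a hypothetical lookup table: DBI::dbWriteTable(duck, "products", products)
products_tbl <- tbl(duck, "products")
sales_tbl    <- tbl(duck, "sales")
sales_tbl %>% 
  inner_join(products_tbl, by = "SKU") %>%   # the join happens inside duckdb
  group_by(product_group) %>% 
  summarise(total_revenue = sum(sales_units * item_price_eur, na.rm = TRUE))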

Dedicated databases

In practice you work with data from a business, and that data already sits inside a database, hopefully in a data warehouse that you can access. For example, many companies use cloud data warehouses like Amazon Redshift, Google BigQuery, Azure Synapse Analytics or Snowflake to enable analytics in the company.

Or, when you work on-prem, there are dedicated analytics databases like MonetDB or the newer and faster Russian kid on the block, ClickHouse.

DBI has connectors to all of those databases. It is just a matter of writing the correct configuration, and then you can create a tbl() connection to a table in that database and work with it like you would locally!
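
As a sketch of what such a configuration can look like (the driver name, host and credentials below are placeholders, not a working config):

con <- DBI::dbConnect(
  odbc::odbc(),
  driver   = "PostgreSQL",              # or the Redshift/Snowflake/... driver you installed
  server   = "warehouse.example.com",   # placeholder host
  database = "analytics",
  uid      = Sys.getenv("DB_USER"),     # keep credentials out of your scripts
  pwd      = Sys.getenv("DB_PASSWORD")
)
sales_tbl <- tbl(con, "sales")  # from here on, the same dplyr code as before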

What if I use Python? There is no {dbplyr} equivalent in Python, so in practice you have to write SQL to get your data (there are tools to make that easier). Still, it is super useful to push as much computation and prep work into the database and let your Python session do only the things that databases cannot do.

Clean up

file.remove("sales.db")
file.remove("duck.db")

Reproducibility

At the moment of creation (when I knitted this document) this was the state of my machine:
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.0 (2021-05-18)
##  os       macOS Big Sur 10.16         
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Amsterdam            
##  date     2021-11-22                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
##  blogdown      1.5     2021-09-02 [1] CRAN (R 4.1.0)
##  bookdown      0.24    2021-09-02 [1] CRAN (R 4.1.0)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.0)
##  cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
##  colorspace    2.0-2   2021-06-24 [1] CRAN (R 4.1.0)
##  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
##  data.table  * 1.14.2  2021-09-27 [1] CRAN (R 4.1.0)
##  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
##  digest        0.6.28  2021-09-23 [1] CRAN (R 4.1.0)
##  dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
##  dtplyr      * 1.1.0   2021-02-20 [1] CRAN (R 4.1.0)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
##  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
##  farver        2.1.0   2021-02-28 [1] CRAN (R 4.1.0)
##  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
##  generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
##  ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.1.0)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.1.0)
##  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.0)
##  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
##  knitr         1.36    2021-09-29 [1] CRAN (R 4.1.0)
##  labeling      0.4.2   2020-10-20 [1] CRAN (R 4.1.0)
##  lattice       0.20-45 2021-09-22 [1] CRAN (R 4.1.0)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
##  Matrix        1.3-4   2021-06-01 [1] CRAN (R 4.1.0)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
##  pillar        1.6.4   2021-10-18 [1] CRAN (R 4.1.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
##  png           0.1-7   2013-12-03 [1] CRAN (R 4.1.0)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
##  Rcpp          1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
##  reticulate    1.22    2021-09-17 [1] CRAN (R 4.1.0)
##  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.0)
##  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.0)
##  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.0)
##  scales        1.1.1   2020-05-11 [1] CRAN (R 4.1.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
##  stringi       1.7.5   2021-10-04 [1] CRAN (R 4.1.0)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
##  tibble        3.1.5   2021-09-30 [1] CRAN (R 4.1.0)
##  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
##  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
##  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
##  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
##  xfun          0.27    2021-10-18 [1] CRAN (R 4.1.0)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
## 
## [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
