Reading in an epub (ebook) file with the pubcrawl package

In this tutorial I show how to read an epub file (for instance from the ebook collection on your computer) into R with the pubcrawl package. In emoji speak: 🍺📖📦. I will show the reading-in part (one line of code) and some other actions you might want to perform on text files before they are ready for text analysis.

After you read in your epub file you can do some cool analyses on it, but that is part of the next blogpost. See how cool this is?

A look at the top 2 words (tf-idf) in every chapter of Hitchhikers guide to the Galaxy

A short diversion into how the package came to be (not required)

Recently I wanted to read an epub book into R. There was no package for that!

Twitter #rstats hive mind to the rescue:

I did some digging and found out that epub is a relatively easy format: it is a zip file (a compressed archive) with xml files in it (incidentally, much like Word's docx file format). I went to work, but before my day was over Bob Rudis had already created a package to read in epub files!

So here is the link: https://github.com/hrbrmstr/pubcrawl where you can download the package. It returns the files in a nice tidy format.

Any epub is a zip archive (a compressed folder) that contains several xml documents (documents formatted much like web pages). The pubcrawl package unpacks the archive and returns these files with one row per document.
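Because an epub is just a zip archive, you can peek inside one with base R before reaching for a package. The following is a minimal sketch, not how pubcrawl does it internally; `epub_path` is a hypothetical placeholder and `list_epub_files()` is a helper name I made up for illustration.

```r
# Hypothetical path: replace with the location of an epub on your machine.
epub_path <- "path/to/your_book.epub"

# An epub is a zip archive, so base R can list its contents without unpacking.
list_epub_files <- function(path) {
  utils::unzip(path, list = TRUE)  # data frame with Name, Length and Date
}

# Only run when the placeholder has been replaced by a real file.
if (file.exists(epub_path)) {
  contents <- list_epub_files(epub_path)
  # Keep only the (x)html documents, which usually hold the book text.
  print(contents[grepl("\\.x?html$", contents$Name), ])
}
```

The `Name` column of that listing should resemble the `path` column that pubcrawl returns.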

Preparation

  • Install the pubcrawl package (see below)
  • Load the pubcrawl package
  • Load the tidyverse package
  • Locate the epub file you want to read in and point to it
library(pubcrawl)
suppressPackageStartupMessages(library(tidyverse))
# placeholder path: point this at your own epub file
epublocation <- "path/to/your_book.epub"

In my case I cannot share the real file with you, because it is copyrighted, but it is The Hitchhiker's Guide to the Galaxy, the first of the series and a lovely book.

Exploration

HH1 <- epub_to_text(epublocation)
HH1
## # A tibble: 73 x 4
##    path            size date                content                       
##    <chr>          <dbl> <dttm>              <chr>                         
##  1 OEBPS/part1.x…  4826 2010-06-03 17:20:56 "HH1 - Hitchhiker's Guide to …
##  2 OEBPS/part10_…   678 2010-06-03 17:20:56 HH1 - Hitchhiker's Guide to t…
##  3 OEBPS/part10_… 11867 2010-06-03 17:20:56 "CHAPTER 9\n      A computer …
##  4 OEBPS/part11_…   678 2010-06-03 17:20:56 HH1 - Hitchhiker's Guide to t…
##  5 OEBPS/part11_…  3281 2010-06-03 17:20:56 "CHAPTER 10\n      The Infini…
##  6 OEBPS/part12_…   678 2010-06-03 17:20:56 HH1 - Hitchhiker's Guide to t…
##  7 OEBPS/part12_… 16537 2010-06-03 17:20:56 "CHAPTER 11\n      The Improb…
##  8 OEBPS/part13_…   678 2010-06-03 17:20:56 HH1 - Hitchhiker's Guide to t…
##  9 OEBPS/part13_… 11399 2010-06-03 17:20:56 "CHAPTER 12\n      A loud cla…
## 10 OEBPS/part14_…   678 2010-06-03 17:20:56 HH1 - Hitchhiker's Guide to t…
## # ... with 63 more rows

As you can see there is a path, size, date and content column. The files are not the same size, so after loading the epub you are most likely not done: you need to work a bit to get the text into a nice format for text analysis. Such is life.

Let's explore one of the files, file number 2: 'part10_…'

If you have only worked with tidyverse verbs this can be a bit difficult to understand: I asked for the second row and the first and second columns. It is equivalent to HH1 %>% filter(path == "OEBPS/part10_split_000.xhtml") %>% select(path, size)
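To make that equivalence concrete, here is a small sketch on a toy tibble. The tibble `toy` and its values are made up and only stand in for the real result of epub_to_text().

```r
library(dplyr)
library(tibble)

# Toy stand-in for the tibble returned by epub_to_text() (values are made up).
toy <- tibble(
  path    = c("OEBPS/part1.xhtml", "OEBPS/part10_split_000.xhtml"),
  size    = c(4826, 678),
  content = c("title page", "book title")
)

# Base R: second row, first and second column, selected by position.
by_position <- toy[2, 1:2]

# dplyr: filter on the value instead of the position, then pick columns by name.
by_name <- toy %>%
  filter(path == "OEBPS/part10_split_000.xhtml") %>%
  select(path, size)

by_position
by_name
```

Both objects hold the same single row. The base R version is shorter, but it depends on the row order, while the dplyr version keeps working when the rows are shuffled.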

HH1[2,1:2] # base R to the rescue!
## # A tibble: 1 x 2
##   path                          size
##   <chr>                        <dbl>
## 1 OEBPS/part10_split_000.xhtml   678
HH1[2,4]
## # A tibble: 1 x 1
##   content                               
##   <chr>                                 
## 1 HH1 - Hitchhiker's Guide to the Galaxy

Hmm, there is an almost empty page before every chapter, it seems. It just contains the book title.

Let's check another page:

HH1[3,4]
## # A tibble: 1 x 1
##   content                                                                  
##   <chr>                                                                    
## 1 "CHAPTER 9\n      A computer chatted to itself in alarm as it noticed an…

How many characters are there in this thingy?

#HH1[3,4] %>% nchar()  # old way
HH1[3,4] %>% str_length()  # stringr way
## [1] 8867
HH1[2,4] %>% str_length()  # stringr way
## [1] 38

Right, in the second row there are 38 characters, and in the third row 8867.

Filtering on filename

We could select the rows with more than a certain number of characters, but there is also another way. I noticed that the filenames in path are structured in a certain way.

There are files like "OEBPS/part10_split_000.xhtml" and files like "OEBPS/part20_split_001.xhtml". Only the files with split_001 in their name contain the text.

So we can filter on the name in 'path':

HH1 %>% filter(str_detect(path, "split_001.xhtml"))

This will only return rows where the string 'split_001.xhtml' is found somewhere in the path column. That leaves us with fewer rows, and with another peculiarity.
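A side note on the two filtering options, sketched on made-up data. By default str_detect() treats its pattern as a regular expression, so the "." in "split_001.xhtml" matches any character; wrapping the pattern in fixed() asks for a literal match. The character-count alternative is shown as well. The tibble `toy`, the ".bak" file name and the threshold of 100 characters are assumptions for illustration only.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for the epub tibble (contents are made up).
toy <- tibble(
  path = c("OEBPS/part10_split_000.xhtml",
           "OEBPS/part10_split_001.xhtml",
           "OEBPS/part10_split_001_xhtml.bak"),
  content = c("HH1 - Hitchhiker's Guide to the Galaxy",
              str_c("CHAPTER 9\n", strrep("lorem ipsum ", 50)),
              "backup")
)

# Regex match: the "." also matches the "_" in the .bak file name.
by_regex <- toy %>% filter(str_detect(path, "split_001.xhtml"))

# Literal match: fixed() switches off regex interpretation.
by_name <- toy %>% filter(str_detect(path, fixed("split_001.xhtml")))

# Alternative: keep rows whose text is longer than an arbitrary threshold.
by_length <- toy %>% filter(str_length(content) > 100)
```

For this book the plain pattern works fine, because no file name happens to trip over the wildcard.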

Extracting the chapter numbers

HH1 %>% 
    filter(str_detect(path, "split_001.xhtml")) %>% 
    select(content) %>% head(3)
## # A tibble: 3 x 1
##   content                                                                  
##   <chr>                                                                    
## 1 "CHAPTER 9\n      A computer chatted to itself in alarm as it noticed an…
## 2 "CHAPTER 10\n      The Infinite Improbability Drive is a wonderful new m…
## 3 "CHAPTER 11\n      The Improbability-proof control cabin of the Heart of…

Every chapter starts with CHAPTER followed by a number.

We can use regular expressions for that!

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. – Jamie Zawinski (1997)

Don't be afraid: it is not the use of regex [1] that is a problem, but the overuse of it. Let's see if we can extract the chapter, put it in a different column and remove that part from the main text. A regular expression tells the computer what to search for; in fact I already used one before: the 'split_001' part. But in our case such a precise match is not what we need. We need something that matches "CHAPTER" followed by ANY number. The regex for that number is "[0-9]{1,3}", which means: any digit from 0 to 9, repeated one up to and including three times, so it matches 9 but also 10 or 100 (there are not a hundred chapters, but I was a bit cautious).

HH1 %>% 
    filter(str_detect(path, "split_001.xhtml")) %>% 
    mutate(chapter = str_extract(content, "CHAPTER [0-9]{1,3}"))

But we are not there yet. I actually only want the number, but I don't want to match just any number in the text, only the number from the phrase CHAPTER [0-9]. So let's cut the number out of that phrase. I now use a pipe INSIDE a mutate; it might be confusing, but I think it is still very useful.

HH1 %>% 
    filter(str_detect(path, "split_001.xhtml")) %>% 
    mutate(chapter = str_extract(content, "CHAPTER [0-9]{1,3}") %>% 
               str_extract("[0-9]{1,3}") %>% 
               as.integer())

The first str_extract pulls the "CHAPTER 3"-like text parts out. From that, I pull out the number alone, and finally I convert it to an integer (because chapters are never negative and usually go up in steps of 1).
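Here is a small sketch of why the two-step extraction matters, using made-up chapter openings (the leading "12" in the first string is an invented stray page number). It also shows a one-step alternative with a regex look-behind, which is my own variation and not part of the pipeline in this post.

```r
library(stringr)

# Made-up chapter openings that also contain other numbers.
content <- c("12\nCHAPTER 9\n      A computer counted to 42.",
             "CHAPTER 10\n      The year was 1979.")

# Naive: the first run of digits anywhere in the text.
naive <- as.integer(str_extract(content, "[0-9]{1,3}"))

# Two steps, as in the pipeline: first the "CHAPTER n" phrase, then its digits.
phrase   <- str_extract(content, "CHAPTER [0-9]{1,3}")
two_step <- as.integer(str_extract(phrase, "[0-9]{1,3}"))

# One-step variation: digits that are directly preceded by "CHAPTER ".
one_step <- as.integer(str_extract(content, "(?<=CHAPTER )[0-9]{1,3}"))

naive     # 12 10  (picks up the stray page number)
two_step  #  9 10
one_step  #  9 10
```

The two-step version is a little longer, but each step is easy to read and to check on its own.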

Taking out the redundant information

The chapter number is now in a separate column, but it's also still in the text. That will not do.

HH1 %>% 
    filter(str_detect(path, "split_001.xhtml")) %>% 
    mutate(chapter = str_extract(content, "CHAPTER [0-9]{1,3}") %>% 
               str_extract("[0-9]{1,3}") %>% 
               as.integer(),
           content = str_remove(content, "CHAPTER [0-9]{1,3}"))

Now the chapters work out nicely. However, while checking the results I found that there is still a piece of annoying markup in the texts:

# A tibble: 35 x 5
   path             size date                content                                  chapter
   <chr>           <dbl> <dttm>              <chr>                                      <int>
 1 OEBPS/part10_s… 11867 2010-06-03 17:20:56 "\n      A computer chatted to itself i…       9
 2 OEBPS/part11_s…  3281 2010-06-03 17:20:56 "\n      The Infinite Improbability Dri…      10

\n translates to newline. When we split the text into words with tidytext, newlines are automatically dropped. Every chapter ends, though, with this markup: "Unknown", a newline with some spaces, and "Unknown" again.

If we do a text analysis, then Unknown will be a frequently found word while it is actually an artefact. Let's remove that:

HH1 %>% 
    filter(str_detect(path, "split_001.xhtml")) %>% 
    mutate(chapter = str_extract(content, "CHAPTER [0-9]{1,3}") %>% 
               str_extract("[0-9]{1,3}") %>% 
               as.integer(),
           content = str_remove(content, "CHAPTER [0-9]{1,3}"),
           content = str_remove(content, "Unknown\n      Unknown")) 
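The two removals can be checked on a single made-up string. This is only a sketch: the text is invented, and the final str_squish() call (which trims and collapses whitespace) is an extra step that is not part of the pipeline above.

```r
library(stringr)

# The artefact as it appears in the pipeline: two "Unknown"s around a line break.
artefact <- "Unknown\n      Unknown"

# Made-up chapter text with the heading in front and the artefact at the end.
raw <- str_c("CHAPTER 9\n      A computer chatted to itself.\n      ", artefact)

# str_remove() drops the first match of each pattern.
cleaned <- str_remove(raw, "CHAPTER [0-9]{1,3}")
cleaned <- str_remove(cleaned, artefact)

# Optional extra: trim and collapse the leftover whitespace.
cleaned <- str_squish(cleaned)

cleaned  # "A computer chatted to itself."
```

Note that str_remove() only removes the first match; if an artefact could occur several times in one chapter, str_remove_all() would be the variant to use.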

Rearranging and keeping only relevant information

I want the chapter number first, the tibble ordered by it, and only the chapter number and content columns. So the final steps are:

# `previous_steps` is a placeholder for the pipeline built up so far
previous_steps %>% 
    arrange(chapter) %>% 
    select(chapter, content)

Let's take a step back: creating a function out of the pipeline

We have a whole set of instructions. What if I want to do this on multiple books? I can copy the entire set of instructions 5 times and replace the source, but we can also create a function.

Cleaning up the file

We can copy the entire pipeline and make it a function.

Normally when we make a function it goes something like this:

nameoffunction <- function(argument){
    # do something with the argument
    # return something
}

But in this case we can also create a function by starting the chain not with a dataframe but with a dot (= . ) and assigning the entire chain to a name.

This creates a functional sequence (fseq for short), but you only have to remember that this is incredibly useful and saves you time in the future.

extract_TEXT <-  . %>% 
    filter(str_detect(path, "split_001.xhtml")) %>% 
    mutate(chapter = str_extract(content, "CHAPTER [0-9]{1,3}") %>% 
               str_extract("[0-9]{1,3}") %>% 
               as.integer(),
           content = str_remove(content, "CHAPTER [0-9]{1,3}"),
           content = str_remove(content, "Unknown\n      Unknown")) %>% 
    arrange(chapter) %>% 
    select(chapter, content)
class(extract_TEXT)
## [1] "fseq"     "function"

I now have a function that cleans up the entire data file. If this was a larger project I would place functions like this in a separate utilities.R file and load it at the top of this document.

HH1_cleaned <- 
    HH1 %>% 
    extract_TEXT()
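If the dot trick feels like magic, this tiny sketch shows the same idea on a plain character string. The names `shout` and `shout_explicit` are made up for illustration.

```r
library(magrittr)

# A pipeline that starts with a dot becomes a function of one argument.
shout <- . %>% toupper() %>% paste0("!")

# The same thing written as an ordinary function.
shout_explicit <- function(x) {
  paste0(toupper(x), "!")
}

shout("don't panic")           # "DON'T PANIC!"
shout_explicit("don't panic")  # "DON'T PANIC!"
```

The functional sequence saves you from inventing an argument name and reads in the same order as the steps are applied; an ordinary function is the better choice once you need more than one argument.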

A small tidytext exploration

This is a bit fast for beginners, but I will pay more attention to this process in a follow-up blog post, so bear with me.

What are the most typical words for every chapter (as in, more typical for that chapter compared to the entire book, known as tf-idf)?

I split the text into words and count them per chapter:

library(tidytext)
dataset <- HH1_cleaned %>% 
    unnest_tokens(output = word, input = content, token = "words") %>% 
    group_by(chapter) %>% 
    count(word) %>% 
    bind_tf_idf(term = word, document = chapter, n = n) %>% 
    top_n(5, wt = tf_idf) %>% 
    ungroup() %>% 
    mutate(word = reorder(word, tf_idf), Chapter = as.factor(chapter)) 
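To get a feeling for what tf-idf measures, here is a hand computation on made-up counts. As far as I know it mirrors the definition that bind_tf_idf() uses: tf is the share of a chapter taken up by a word, idf is the natural log of (number of chapters / number of chapters containing the word), and tf-idf is their product. The counts and the name `by_hand` are assumptions for this sketch.

```r
library(dplyr)
library(tibble)

# Made-up word counts for two tiny "chapters".
counts <- tribble(
  ~chapter, ~word,   ~n,
  1,        "towel", 3,
  1,        "the",   7,
  2,        "vogon", 4,
  2,        "the",   6
)

n_chapters <- n_distinct(counts$chapter)

by_hand <- counts %>%
  group_by(chapter) %>%
  mutate(tf = n / sum(n)) %>%                               # share within the chapter
  group_by(word) %>%
  mutate(idf = log(n_chapters / n_distinct(chapter))) %>%   # rarity across chapters
  ungroup() %>%
  mutate(tf_idf = tf * idf)

by_hand
```

A word like "the" appears in every chapter, so its idf is log(1) = 0 and its tf-idf is 0, no matter how often it occurs. Words that are frequent in one chapter and absent elsewhere float to the top.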

dataset %>% 
  filter(chapter < 8)  %>%
    ggplot(aes(word, tf_idf, fill = chapter))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~Chapter,scales = "free")+
    coord_flip()+
    labs(
        title = "Hitchhiker's Guide to the Galaxy",
        subtitle = "Top 5 most typical words per chapter (first 7)",
        x = "", y = "", caption = "Roel M. Hogervorst 2018 - clean code blog"
    )

dataset %>% 
  filter(chapter > 7, chapter <15)  %>%
    ggplot(aes(word, tf_idf, fill = chapter))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~Chapter,scales = "free")+
    coord_flip()+
    labs(
        title = "Hitchhiker's Guide to the Galaxy",
        subtitle = "Top 5 most typical words per chapter (second 7 chapters)",
        x = "", y = "", caption = "Roel M. Hogervorst 2018 - clean code blog"
    )

dataset %>% 
  filter(chapter >=15 , chapter < 22)  %>%
    ggplot(aes(word, tf_idf, fill = chapter))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~Chapter,scales = "free")+
    coord_flip()+
    labs(
        title = "Hitchhiker's Guide to the Galaxy",
        subtitle = "Top 5 most typical words per chapter (third 7 chapters)",
        x = "", y = "", caption = "Roel M. Hogervorst 2018 - clean code blog"
    )

dataset %>% 
  filter(chapter >=22 , chapter < 29)  %>%
    ggplot(aes(word, tf_idf, fill = chapter))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~Chapter,scales = "free")+
    coord_flip()+
    labs(
        title = "Hitchhiker's Guide to the Galaxy",
        subtitle = "Top 5 most typical words per chapter (fourth 7 chapters)",
        x = "", y = "", caption = "Roel M. Hogervorst 2018 - clean code blog"
    )    

dataset %>% 
  filter(chapter >=29 , chapter < 36)  %>%
    ggplot(aes(word, tf_idf, fill = chapter))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~Chapter,scales = "free")+
    coord_flip()+
    labs(
        title = "Hitchhiker's Guide to the Galaxy",
        subtitle = "Top 5 most typical words per chapter (fifth 7 chapters)",
        x = "", y = "", caption = "Roel M. Hogervorst 2018 - clean code blog"
    )
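The five plotting blocks above differ only in the chapter range and the subtitle, so they could be folded into one helper. This is a sketch of that idea, not the code I used for the post; `plot_chapters()` is a hypothetical name and the toy data only stands in for `dataset`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: plot the chapters from `first` to `last` (inclusive).
plot_chapters <- function(data, first, last, subtitle = NULL) {
  data %>%
    filter(chapter >= first, chapter <= last) %>%
    ggplot(aes(word, tf_idf, fill = chapter)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~Chapter, scales = "free") +
    coord_flip() +
    labs(title = "Hitchhiker's Guide to the Galaxy",
         subtitle = subtitle, x = "", y = "")
}

# Made-up stand-in for `dataset`, just to show the call.
toy <- tibble::tibble(
  chapter = c(1L, 1L, 2L, 2L),
  word    = c("towel", "earth", "vogon", "poetry"),
  tf_idf  = c(0.02, 0.01, 0.03, 0.015)
) %>%
  mutate(word = reorder(word, tf_idf), Chapter = as.factor(chapter))

p <- plot_chapters(toy, 1, 2, subtitle = "Top words per chapter (toy data)")
```

With the real data, a call like plot_chapters(dataset, 8, 14, "Top 5 most typical words per chapter (second 7 chapters)") would stand in for the second block.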

How do I install it?

Go to https://github.com/hrbrmstr/pubcrawl and see the instructions there, which will say something like: devtools::install_github("hrbrmstr/pubcrawl")

Resources, references and more

State of the machine

At the moment of creation (when I knitted this document) this was the state of my machine:

sessioninfo::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  os       Ubuntu 16.04.4 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US                       
##  collate  en_US.UTF-8                 
##  tz       Europe/Amsterdam            
##  date     2018-07-20                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version    date       source                            
##  archive       1.0.0      2018-07-03 Github (jimhester/archive@11e65d7)
##  assertthat    0.2.0      2017-04-11 CRAN (R 3.5.0)                    
##  backports     1.1.2      2017-12-13 CRAN (R 3.5.0)                    
##  bindr         0.1.1      2018-03-13 CRAN (R 3.5.0)                    
##  bindrcpp    * 0.2.2      2018-03-29 CRAN (R 3.5.0)                    
##  blogdown      0.8        2018-07-15 CRAN (R 3.5.1)                    
##  bookdown      0.7        2018-02-18 CRAN (R 3.5.0)                    
##  broom         0.4.5      2018-07-03 CRAN (R 3.5.1)                    
##  cellranger    1.1.0      2016-07-27 CRAN (R 3.5.0)                    
##  cli           1.0.0      2017-11-05 CRAN (R 3.5.0)                    
##  clisymbols    1.2.0      2017-05-21 CRAN (R 3.5.0)                    
##  colorspace    1.3-2      2016-12-14 CRAN (R 3.5.0)                    
##  crayon        1.3.4      2017-09-16 CRAN (R 3.5.0)                    
##  digest        0.6.15     2018-01-28 CRAN (R 3.5.0)                    
##  dplyr       * 0.7.6      2018-06-29 CRAN (R 3.5.1)                    
##  emo           0.0.0.9000 2018-07-18 Github (hadley/emo@02a5206)       
##  evaluate      0.10.1     2017-06-24 CRAN (R 3.5.0)                    
##  fansi         0.2.3      2018-05-06 CRAN (R 3.5.1)                    
##  forcats     * 0.3.0      2018-02-19 CRAN (R 3.5.0)                    
##  foreign       0.8-70     2018-04-23 CRAN (R 3.5.0)                    
##  ggplot2     * 3.0.0      2018-07-03 cran (@3.0.0)                     
##  glue          1.3.0      2018-07-18 Github (tidyverse/glue@66de125)   
##  gtable        0.2.0      2016-02-26 CRAN (R 3.5.0)                    
##  haven         1.1.2      2018-06-27 CRAN (R 3.5.1)                    
##  hms           0.4.2      2018-03-10 CRAN (R 3.5.0)                    
##  htmltools     0.3.6      2017-04-28 CRAN (R 3.5.0)                    
##  httr          1.3.1      2017-08-20 CRAN (R 3.5.0)                    
##  janeaustenr   0.1.5      2017-06-10 CRAN (R 3.5.0)                    
##  jsonlite      1.5        2017-06-01 CRAN (R 3.5.0)                    
##  knitr         1.20       2018-02-20 CRAN (R 3.5.0)                    
##  labeling      0.3        2014-08-23 CRAN (R 3.5.0)                    
##  lattice       0.20-35    2017-03-25 CRAN (R 3.5.0)                    
##  lazyeval      0.2.1      2017-10-29 CRAN (R 3.5.0)                    
##  lubridate     1.7.4      2018-04-11 CRAN (R 3.5.0)                    
##  magrittr      1.5        2014-11-22 CRAN (R 3.5.0)                    
##  Matrix        1.2-14     2018-04-09 CRAN (R 3.5.0)                    
##  mnormt        1.5-5      2016-10-15 CRAN (R 3.5.0)                    
##  modelr        0.1.2      2018-05-11 CRAN (R 3.5.0)                    
##  munsell       0.5.0      2018-06-12 CRAN (R 3.5.0)                    
##  nlme          3.1-137    2018-04-07 CRAN (R 3.5.0)                    
##  pillar        1.3.0      2018-07-14 CRAN (R 3.5.1)                    
##  pkgconfig     2.0.1      2017-03-21 CRAN (R 3.5.0)                    
##  plyr          1.8.4      2016-06-08 CRAN (R 3.5.0)                    
##  psych         1.8.4      2018-05-06 CRAN (R 3.5.0)                    
##  pubcrawl    * 0.1.0      2018-07-03 Github (hrbrmstr/pubcrawl@a977f3b)
##  purrr       * 0.2.5      2018-05-29 CRAN (R 3.5.0)                    
##  R6            2.2.2      2017-06-17 CRAN (R 3.5.0)                    
##  Rcpp          0.12.17    2018-05-18 CRAN (R 3.5.0)                    
##  readr       * 1.1.1      2017-05-16 CRAN (R 3.5.0)                    
##  readxl        1.1.0      2018-04-20 CRAN (R 3.5.0)                    
##  reshape2      1.4.3      2017-12-11 CRAN (R 3.5.0)                    
##  rlang         0.2.1      2018-05-30 CRAN (R 3.5.0)                    
##  rmarkdown     1.10       2018-06-11 CRAN (R 3.5.0)                    
##  rprojroot     1.3-2      2018-01-03 CRAN (R 3.5.0)                    
##  rstudioapi    0.7        2017-09-07 CRAN (R 3.5.0)                    
##  rvest         0.3.2      2016-06-17 CRAN (R 3.5.0)                    
##  scales        0.5.0      2017-08-24 CRAN (R 3.5.0)                    
##  sessioninfo   1.0.0      2017-06-21 CRAN (R 3.5.1)                    
##  SnowballC     0.5.1      2014-08-09 CRAN (R 3.5.0)                    
##  stringi       1.2.3      2018-06-12 CRAN (R 3.5.0)                    
##  stringr     * 1.3.1      2018-05-10 CRAN (R 3.5.0)                    
##  tibble      * 1.4.2      2018-01-22 CRAN (R 3.5.0)                    
##  tidyr       * 0.8.1      2018-05-18 CRAN (R 3.5.0)                    
##  tidyselect    0.2.4      2018-02-26 CRAN (R 3.5.0)                    
##  tidytext    * 0.1.9      2018-05-29 CRAN (R 3.5.0)                    
##  tidyverse   * 1.2.1      2017-11-14 CRAN (R 3.5.0)                    
##  tokenizers    0.2.1      2018-03-29 CRAN (R 3.5.0)                    
##  utf8          1.1.4      2018-05-24 CRAN (R 3.5.0)                    
##  withr         2.1.2      2018-03-15 CRAN (R 3.5.0)                    
##  xfun          0.3        2018-07-06 CRAN (R 3.5.1)                    
##  xml2          1.2.0      2018-01-24 CRAN (R 3.5.0)                    
##  xslt          1.3        2017-11-18 CRAN (R 3.5.0)                    
##  yaml          2.1.19     2018-05-01 CRAN (R 3.5.0)

How did I make the plot at the top? I created it separately and added the image later on top.

{HH1_cleaned %>% 
  unnest_tokens(output = word, input = content, token = "words") %>% 
  group_by(chapter) %>% 
  count(word) %>% 
  bind_tf_idf(term = word, document = chapter, n = n) %>% 
  top_n(2, wt = tf_idf) %>% 
  ungroup() %>%  
  mutate(word = reorder(word, tf_idf), Chapter = as.factor(chapter)) %>% 
  ggplot(aes(word, tf_idf, fill = chapter))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~Chapter,scales = "free")+
  coord_flip()+
  labs(
    title = "Hitchhiker's Guide to the Galaxy - Douglas Adams: what is each chapter about?",
    subtitle = "Top 2 most typical words per chapter (TF-IDF scores)",
    x = "", y = "", caption = "Roel M. Hogervorst 2018 - clean code blog"
  ) } %>% ggsave(filename = "trie2.png",plot = ., width = 9, height = 6, dpi = "screen")


  1. as we call it in the biz