Monday, October 26, 2020

Scraping RWeekly with rvest (learning from Giora Simchoni's post)

I am now reading a superb post by Giora Simchoni (http://giorasimchoni.com/2018/09/17/2018-09-17-the-actual-tidyverse/) on scraping with rvest. I will try to modify the functions from that post to scrape links and descriptions from RWeekly.

pacman::p_load(rvest, tidyverse, glue)

I modified the original function to get both the links and the descriptions from the `a` nodes.
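The modified function itself is not reproduced in this post, so here is only a minimal sketch of what it might look like. The helper name `extract_links`, the URL pattern `https://rweekly.org/{year}-{weeknum}.html` and the use of `sprintf` instead of glue are my assumptions, not necessarily what the original post does. The column names `links` and `descriptions` match the output shown further down.

```r
library(rvest)
library(tibble)

# Hypothetical helper: pull href and link text from every <a> node of a parsed page
extract_links <- function(page) {
  a_nodes <- html_nodes(page, "a")
  tibble(
    links = html_attr(a_nodes, "href"),   # NA when an <a> has no href
    descriptions = html_text(a_nodes)     # "" when an <a> has no text
  )
}

# Sketch of the scraping function; the URL pattern is an assumption
fn_get_rweekly_links <- function(year, weeknum) {
  url <- sprintf("https://rweekly.org/%s-%s.html", year, weeknum)
  extract_links(read_html(url))
}
```

Splitting the parsing step from the download makes it possible to try `extract_links()` on a small literal HTML string without touching the network.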

Using unnest

Now I would like to get links for a longer period. My only modification to the original scripts by Giora Simchoni is to unnest the tibbles embedded in the link column into separate links and descriptions columns.

Now let’s try getting links for all of 2017 and the first 37 weeks of 2018. You will see that each entry in the link column contains a tibble.

rweekly_links <- tibble(year = c(2017, 2018),
                        weeknum = list(1:52, 1:37)) %>%
  unnest(weeknum) %>%
  mutate(link = map2(year, weeknum,  fn_get_rweekly_links))

head(rweekly_links)
## # A tibble: 6 x 3
##    year weeknum link             
##   <dbl>   <int> <list>           
## 1  2017       1 <tibble [73 x 2]>
## 2  2017       2 <tibble [80 x 2]>
## 3  2017       3 <tibble [86 x 2]>
## 4  2017       4 <tibble [76 x 2]>
## 5  2017       5 <tibble [81 x 2]>
## 6  2017       6 <tibble [84 x 2]>

Let’s extract the tibbles into the two columns with unnest:

rweekly_links_unp <- 
  rweekly_links %>% 
  unnest(link)

head(rweekly_links_unp)
## # A tibble: 6 x 4
##    year weeknum links                                               descriptions
##   <dbl>   <int> <chr>                                               <chr>       
## 1  2017       1 https://github.com/rweekly/rweekly.org/blob/gh-pag~ ""          
## 2  2017       1 https://rweekly.fireside.fm/                        "Podcast"   
## 3  2017       1 https://feedburner.google.com/fb/a/mailverify?uri=~ "Mail"      
## 4  2017       1 https://www.facebook.com/rweekly                    ""          
## 5  2017       1 https://twitter.com/rweekly_live                    ""          
## 6  2017       1 https://www.patreon.com/rweekly                     ""
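To see what unnest does without scraping anything, here is a toy sketch with made-up example.org links (the data and the names `nested` and `flat` are purely illustrative). Each row of the list column holds a small tibble, and unnest repeats the outer columns for every row of the inner one.

```r
library(tibble)
library(tidyr)

# Two "weeks", each carrying a nested tibble of links and descriptions
nested <- tibble(
  year = c(2017, 2017),
  weeknum = 1:2,
  link = list(
    tibble(links = c("https://example.org/a", "https://example.org/b"),
           descriptions = c("A", "B")),
    tibble(links = "https://example.org/c",
           descriptions = "C")
  )
)

# The nested tibbles become regular rows and columns
flat <- unnest(nested, link)
```

Here `flat` has one row per link, with `year` and `weeknum` repeated as needed.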

I recommend studying the post further. The author extracts and analyses R code in articles recommended by RWeekly.

Sunday, October 25, 2020

First steps in text scraping with rvest

A small exercise in text scraping with rvest

pacman::p_load(tidyverse, rvest, textclean)

Scrape one page

html_address <- "https://www.r-bloggers.com/2020/10/daylight-charts-with-r/"
xpath1 <- "//article/div"

sometext <- 
  # read the whole page
  read_html(x = html_address) %>%
  # cut out the article body with XPath
  html_nodes(xpath = xpath1) %>% 
  # select paragraphs and list items
  html_nodes("p,li") %>% 
  # keep only the text of each node
  html_text() %>% 
  # replace non-ASCII characters
  replace_non_ascii2() %>%
  # replace leftover HTML markup and entities
  replace_html() 
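The two selection steps can be tried offline on a literal HTML string. This is only a sketch: the tiny document below is made up, and the textclean steps are left out so that it depends on rvest alone.

```r
library(rvest)

# A made-up page: one article body plus a footer that should be ignored
page <- read_html(
  '<html><body>
     <article><div>
       <p>First &amp; foremost</p>
       <ul><li>item one</li></ul>
     </div></article>
     <footer><p>ignored</p></footer>
   </body></html>'
)

# Step 1: XPath narrows the search to the article body
article <- html_nodes(page, xpath = "//article/div")

# Step 2: the CSS selector picks paragraphs and list items inside it
texts <- html_text(html_nodes(article, "p,li"))
```

The footer paragraph is not selected because it sits outside `//article/div`, and `html_text()` already decodes the `&amp;` entity.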

Build a function to do it

#' Function "Download text"
#'
#' @param html_address  URL of the page
#' @param xpath_to_text XPath to the text; start it with '//' to search the whole document
#'
#' @return text extracted from page
#' @export
#'
#' @examples fn_sciagnij_tekst(html_address = "https://www.r-bloggers.com/2020/10/rapid-internationalization-of-shiny-apps-shiny-i18n-version-0-2/")
#' 
fn_sciagnij_tekst <- function(html_address, xpath_to_text = "//article/div") {

  sometext <- 
    read_html(x = html_address) %>%
    html_nodes(xpath = xpath_to_text) %>% 
    html_nodes("p,li") %>% 
    html_text() %>% 
    replace_non_ascii2() %>%
    replace_html() 
  
  return(sometext)
}

Test the function

one_read <- 
  fn_sciagnij_tekst(
    html_address = "https://www.r-bloggers.com/2020/10/rapid-internationalization-of-shiny-apps-shiny-i18n-version-0-2/"
  )

Test it on more URLs

adresy <- 
  c("https://r-bloggers.com/2020/10/rapid-internationalization-of-shiny-apps-shiny-i18n-version-0-2/",
    "https://www.r-bloggers.com/2018/07/pca-vs-autoencoders-for-dimensionality-reduction/",
    "https://www.r-bloggers.com/2020/10/little-useless-useful-r-function-r-jobs-title-generator/"
    )

Using a for loop

df <- data.frame()

for (i in seq_along(adresy)) {

  # one row per URL: the address and its text collapsed into a single string
  df_temp <- data.frame(adresy = adresy[i],
                        zawartosc = paste0(fn_sciagnij_tekst(adresy[i]), collapse = "\n"))
  df <- bind_rows(df, df_temp)
}

Using map syntax

df <- 
  map_df(adresy, 
         # one-row tibble per URL; map_df binds the rows together
         ~ tibble(adresy = .x, 
                  teksty = paste0(fn_sciagnij_tekst(.x), collapse = "\n")))
TO DO: this needs some error handling, e.g. with `possibly`, as shown here:
https://stackoverflow.com/questions/50486527/how-to-use-map-with-possibly
Even better would be to use `safely` and then rectangle the embedded lists into regular columns `result` and `error`. Suitable functions:
https://tidyr.tidyverse.org/reference/hoist.html
https://purrr.tidyverse.org/reference/flatten.html
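A minimal sketch of both ideas follows. To keep it runnable offline, `fake_scraper` is a hypothetical stand-in for `fn_sciagnij_tekst` that simply fails on anything that does not look like an https URL, and the sample URLs are made up. For the `safely` variant I pull `result` and `error` out with plain `map()` calls rather than hoist or flatten, which is a simpler substitute for the rectangling mentioned above.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for fn_sciagnij_tekst: errors on bad input
fake_scraper <- function(url) {
  if (!grepl("^https://", url)) stop("unsupported URL: ", url)
  c("first paragraph", "second paragraph")
}

urls <- c("https://example.org/ok", "not-a-url")

# Variant 1: possibly() returns a default value instead of failing.
# The collapse happens inside the wrapped function so a failure yields a real NA.
collapse_text <- function(url) paste0(fake_scraper(url), collapse = "\n")
possibly_text <- possibly(collapse_text, otherwise = NA_character_)

df_possibly <- tibble(
  adresy = urls,
  teksty = map_chr(urls, possibly_text)
)

# Variant 2: safely() returns list(result = ..., error = ...) for every input
safe_out <- map(urls, safely(fake_scraper))

df_safely <- tibble(
  adresy = urls,
  # list column: character vector on success, NULL on failure
  result = map(safe_out, "result"),
  # error message on failure, NA on success
  error = map_chr(safe_out, function(x) {
    if (is.null(x$error)) NA_character_ else conditionMessage(x$error)
  })
)
```

With `possibly` the failed URL just gets an NA, while `safely` also keeps the error message, which helps when deciding whether a retry makes sense.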

An example of a bat file that shows dialogues

@echo off setlocal :: Prompt user for input file names set /p jpgfile="Enter the name of the JPG file: " set /p archive="Ent...