Monday, October 26, 2020

Scraping RWeekly with rvest (learning from Giora Simchoni's post)

I am now reading a superb post by Giora Simchoni (http://giorasimchoni.com/2018/09/17/2018-09-17-the-actual-tidyverse/) on scraping with rvest. I will try to modify the functions from that post to scrape links and descriptions from RWeekly.

pacman::p_load(rvest, tidyverse, glue)

I modified the original function so that it returns both the links and the descriptions from the <a> nodes.
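The function body is not shown here, so the following is only a minimal sketch of what such a function could look like, not the original code. It assumes that each issue lives at https://rweekly.org/{year}-{weeknum}.html and that the link text is a good enough description; extract_links is a hypothetical helper introduced so that the parsing step can be tried without a network connection.

```r
library(rvest)   # read_html(), html_nodes(), html_attr(), html_text()
library(tibble)
library(glue)

# Hypothetical helper: pull the href and the visible text of every <a> node
# from an already parsed page
extract_links <- function(page) {
  nodes <- html_nodes(page, "a")
  tibble(
    links = html_attr(nodes, "href"),   # NA when an <a> has no href
    descriptions = html_text(nodes)     # "" when the link has no text
  )
}

# Assumption: the URL pattern below matches the RWeekly archive pages
fn_get_rweekly_links <- function(year, weeknum) {
  url <- as.character(glue("https://rweekly.org/{year}-{weeknum}.html"))
  extract_links(read_html(url))
}
```

Calling fn_get_rweekly_links(2018, 37) should then return one tibble with a links column and a descriptions column, provided the page exists and the URL assumption holds.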

Using unnest

Now I would like to get links for a longer period. My only modification to the original scripts by Giora Simchoni is to unnest the tibbles embedded in the link column into separate links and descriptions columns.
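To see what unnest does to a list column of tibbles before running it on scraped data, here is a small sketch on made-up data. The example.org links and the column values are purely illustrative; only the column names mirror the ones used in this post.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: each row carries its own small tibble of links
nested <- tibble(
  weeknum = 1:2,
  link = list(
    tibble(links = c("https://example.org/a", "https://example.org/b"),
           descriptions = c("A", "B")),
    tibble(links = "https://example.org/c",
           descriptions = "C")
  )
)

# unnest() expands the list column: one output row per inner row,
# with the inner columns promoted to regular columns
flat <- unnest(nested, link)
```

The outer weeknum value is repeated for every row of its inner tibble, so flat has three rows and the columns weeknum, links and descriptions.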

Now let’s try getting two years of links. You will see that each entry in the link column contains a tibble.

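# One row per (year, week): weeks 1-52 of 2017 and weeks 1-37 of 2018
rweekly_links <- tibble(year = c(2017, 2018),
                        weeknum = list(1:52, 1:37)) %>%
  unnest(weeknum) %>%
  # Scrape each issue; every element of link is a tibble
  mutate(link = map2(year, weeknum, fn_get_rweekly_links))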

head(rweekly_links)
## # A tibble: 6 x 3
##    year weeknum link             
##   <dbl>   <int> <list>           
## 1  2017       1 <tibble [73 x 2]>
## 2  2017       2 <tibble [80 x 2]>
## 3  2017       3 <tibble [86 x 2]>
## 4  2017       4 <tibble [76 x 2]>
## 5  2017       5 <tibble [81 x 2]>
## 6  2017       6 <tibble [84 x 2]>

Let’s unpack the tibbles into the two columns with unnest:

rweekly_links_unp <- 
  rweekly_links %>% 
  unnest(link)

head(rweekly_links_unp)
## # A tibble: 6 x 4
##    year weeknum links                                               descriptions
##   <dbl>   <int> <chr>                                               <chr>       
## 1  2017       1 https://github.com/rweekly/rweekly.org/blob/gh-pag~ ""          
## 2  2017       1 https://rweekly.fireside.fm/                        "Podcast"   
## 3  2017       1 https://feedburner.google.com/fb/a/mailverify?uri=~ "Mail"      
## 4  2017       1 https://www.facebook.com/rweekly                    ""          
## 5  2017       1 https://twitter.com/rweekly_live                    ""          
## 6  2017       1 https://www.patreon.com/rweekly                     ""

I recommend studying the post further. The author extracts and analyses R code in articles recommended by RWeekly.
