I am currently reading a superb post by Giora Simchoni (http://giorasimchoni.com/2018/09/17/2018-09-17-the-actual-tidyverse/) on scraping with rvest. I will try to modify the functions from that post to scrape links and descriptions from RWeekly.
pacman::p_load(rvest, tidyverse, glue)
I modified the original function to get both the links and the descriptions from the a (anchor) nodes.
The function below gets all the links from one RWeekly post.
fn_get_rweekly_links <- function(year, weeknum) {
  Sys.sleep(1)
  url <- glue("https://rweekly.org/{year}-{weeknum}.html")
  nodes <- read_html(url) %>%
    html_nodes("a")
  # extract links from nodes
  links <- nodes %>%
    html_attr("href")
  # extract descriptions
  descriptions <- nodes %>%
    html_text()
  return(tibble(links, descriptions) %>%
    filter(str_detect(links, "^http")))
}
# let's test if it works
fn_get_rweekly_links(2018,5) %>% head(10)
## # A tibble: 10 x 2
## links descriptions
## <chr> <chr>
## 1 https://github.com/rweekly/rweekly.~ ""
## 2 https://rweekly.fireside.fm/ "Podcast"
## 3 https://feedburner.google.com/fb/a/~ "Mail"
## 4 https://www.facebook.com/rweekly ""
## 5 https://twitter.com/rweekly_live ""
## 6 https://www.patreon.com/rweekly ""
## 7 https://user2018.r-project.org/blog~ "Interview with Roger Peng"
## 8 https://www.datacamp.com/community/~ "DataFramed DataCamp’s official podcast~
## 9 https://www.youtube.com/watch?v=kTG~ "Video record of pdxrlang meetup: Andre~
## 10 https://www.datacamp.com/community/~ "DataFramed DataCamp’s official podcast~
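To see what the function does without hitting the network, here is a minimal sketch that runs the same extraction steps on a small inline HTML string. The page content and the example.com URLs are made up for illustration; only the rvest, dplyr and stringr calls mirror the function above.

```r
library(rvest)
library(tibble)
library(dplyr)
library(stringr)

# A tiny made-up page standing in for an RWeekly post (hypothetical content)
page <- read_html('
<html><body>
  <a name="top">Top</a>
  <a href="/about.html">About</a>
  <a href="https://example.com/post-1">First post</a>
  <a href="https://example.com/post-2"></a>
</body></html>')

# select every a node, as in fn_get_rweekly_links()
nodes <- html_nodes(page, "a")

link_table <- tibble(
  links = html_attr(nodes, "href"),   # NA when a node has no href attribute
  descriptions = html_text(nodes)     # "" when a node has no text
) %>%
  # keep absolute links only; rows with a missing href are dropped as well
  filter(str_detect(links, "^http"))

link_table
```

The relative link and the anchor without an href are filtered out, while the link with no text stays in the table with an empty description. That is why several rows in the output above have "" in the descriptions column.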
Using unnest
Now I would like to get links for a longer period. My only modification to the original scripts by Giora Simchoni is to unnest the tibbles embedded in the link column into separate links and descriptions columns.
Now let’s try getting two years of links. You will see that each entry in the link column contains a tibble.
rweekly_links <- tibble(year = c(2017, 2018),
                        weeknum = list(1:52, 1:37)) %>%
  unnest(weeknum) %>%
  mutate(link = map2(year, weeknum, fn_get_rweekly_links))
head(rweekly_links)
## # A tibble: 6 x 3
## year weeknum link
## <dbl> <int> <list>
## 1 2017 1 <tibble [73 x 2]>
## 2 2017 2 <tibble [80 x 2]>
## 3 2017 3 <tibble [86 x 2]>
## 4 2017 4 <tibble [76 x 2]>
## 5 2017 5 <tibble [81 x 2]>
## 6 2017 6 <tibble [84 x 2]>
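A loop over some ninety pages stops at the first request that fails. This is not part of the original scripts, but one possible way to guard against it is to wrap the scraping function with purrr's possibly(), as in the sketch below. The function fake_get_links() is a hypothetical offline stand-in for fn_get_rweekly_links() that fails on purpose for week 2, and empty_links is an assumed fallback value.

```r
library(purrr)
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical stand-in for fn_get_rweekly_links(): it fails for week 2
fake_get_links <- function(year, weeknum) {
  if (weeknum == 2) stop("page not found")
  tibble(links = paste0("https://example.com/", year, "-", weeknum),
         descriptions = paste("Post from week", weeknum))
}

# Fallback returned instead of an error: a tibble with the same columns and no rows
empty_links <- tibble(links = character(), descriptions = character())
safe_get_links <- possibly(fake_get_links, otherwise = empty_links)

result <- tibble(year = 2018, weeknum = 1:3) %>%
  mutate(link = map2(year, weeknum, safe_get_links)) %>%
  unnest(link)

result
```

The failing week contributes an empty tibble, so it simply disappears after unnest() while the other weeks are kept.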
Let’s expand the embedded tibbles into the two columns, links and descriptions, with unnest:
rweekly_links_unp <-
rweekly_links %>%
unnest(link)
head(rweekly_links_unp)
## # A tibble: 6 x 4
## year weeknum links descriptions
## <dbl> <int> <chr> <chr>
## 1 2017 1 https://github.com/rweekly/rweekly.org/blob/gh-pag~ ""
## 2 2017 1 https://rweekly.fireside.fm/ "Podcast"
## 3 2017 1 https://feedburner.google.com/fb/a/mailverify?uri=~ "Mail"
## 4 2017 1 https://www.facebook.com/rweekly ""
## 5 2017 1 https://twitter.com/rweekly_live ""
## 6 2017 1 https://www.patreon.com/rweekly ""
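The effect of unnest on a list-column of tibbles is easier to follow on a small hand-made example. The data below are invented for illustration; the column names follow the ones used above.

```r
library(tibble)
library(tidyr)

# Two rows, each holding a tibble in the link column (made-up data)
nested <- tibble(
  weeknum = c(1, 2),
  link = list(
    tibble(links = c("https://example.com/a", "https://example.com/b"),
           descriptions = c("A", "B")),
    tibble(links = "https://example.com/c", descriptions = "C")
  )
)

# Each inner tibble is expanded into rows; weeknum is repeated to match
flat <- unnest(nested, link)

flat
```

The first week has two links and the second has one, so the flat table has three rows, with the columns weeknum, links and descriptions.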
I recommend studying the post further. The author extracts and analyses R code in articles recommended by RWeekly.