11 Playground
11.1 Building a Hacker News scraper with 8 lines of R code using rvest library
- A neat, simple approach to web scraping in R, but the character vectors it produces do not line up: some links have no link_domain or score, so the vectors differ in length and data.frame() cannot combine them as they are.
#install.packages('rvest')
library(rvest)
## Loading required package: xml2
url <- 'https://news.ycombinator.com/'
#further pages
#url2 <- 'https://news.ycombinator.com/news?p=2'
content <- read_html(url)
#News Title
title <- content %>% html_nodes('a.storylink') %>% html_text()
#News Link Domain
link_domain <- content %>% html_nodes('span.sitestr') %>% html_text()
#Link Score / Upvote
score <- content %>% html_nodes('span.score') %>% html_text()
#Link Age (submission time)
age <- content %>% html_nodes('span.age') %>% html_text()
#Final Dataframe
#df <- data.frame(title = title, link_domain = link_domain, score = score, age = age)
#Naive way of extracting the entire page content with this table
#tb <- content %>% html_node('table.itemlist') %>% html_text()
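One way to keep the columns aligned is to select one node per story first and then look up each field inside that story, so a missing field becomes NA instead of shortening the vector. The sketch below is a minimal illustration, not the original post's method. It runs on a small inline HTML fragment whose structure (a `tr.athing` row followed by a metadata row) is an assumption about the page layout; the live site's markup and class names may differ.

```r
library(rvest)

# Hypothetical, simplified markup: one <tr class="athing"> per story,
# followed by a row holding the score and age. The second story has
# no link domain and no score.
html <- '<table class="itemlist">
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">Story A</a>
    <span class="sitestr">example.com</span></td></tr>
  <tr><td class="subtext"><span class="score">10 points</span>
    <span class="age">1 hour ago</span></td></tr>
  <tr class="athing"><td><a class="storylink" href="item?id=1">Story B</a></td></tr>
  <tr><td class="subtext"><span class="age">2 hours ago</span></td></tr>
</table>'

content <- read_html(html)

# One node per story. html_node() (singular) returns exactly one result
# per input node, and html_text() gives NA where nothing matched,
# so every column has one entry per story.
stories <- content %>% html_nodes('tr.athing')
meta <- stories %>% html_node(xpath = 'following-sibling::tr[1]')

df <- data.frame(
  title = stories %>% html_node('a.storylink') %>% html_text(),
  link_domain = stories %>% html_node('span.sitestr') %>% html_text(),
  score = meta %>% html_node('span.score') %>% html_text(),
  age = meta %>% html_node('span.age') %>% html_text(),
  stringsAsFactors = FALSE
)

print(df)
```

The design choice is to call html_nodes() only once, to find the stories, and html_node() for every field, which trades a little extra code for vectors that are guaranteed to have the same length.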
11.2 R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
11.3 Including Plots
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
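The plot itself does not appear in this text. As a minimal sketch, the hidden chunk could look like the one below. It assumes the default R Markdown template, which plots the built-in `pressure` dataset; the chunk label and dataset are assumptions, not taken from the page.

```r
# In the .Rmd source, the chunk header would carry the option, e.g.
#   {r pressure, echo=FALSE}
# so that only the figure, not this code, appears in the knitted output.
plot(pressure)
```

With echo = TRUE (the knitr default), the `plot(pressure)` line would be printed above the figure as well.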