11 Playground

11.1 Building a Hacker News scraper with 8 lines of R code using the rvest library

  • A cool, simple idea for web scraping in R, but the character vectors created by this method didn’t line up: some links have no link_domain or score, so the vectors have different lengths and cannot be combined row by row.
#install.packages('rvest')
library(rvest)
## Loading required package: xml2
url <- 'https://news.ycombinator.com/'

#further pages 
#url2 <- 'https://news.ycombinator.com/news?p=2'

content <- read_html(url)

#News Title

title <- content %>% html_nodes('a.storylink') %>% html_text()

#News Link Domain

link_domain <- content %>% html_nodes('span.sitestr') %>% html_text()

#Link Score / Upvote

score <- content %>% html_nodes('span.score') %>% html_text()

#Link Age (submission time)

age <- content %>% html_nodes('span.age') %>% html_text()

#Final Dataframe (commented out: data.frame() stops with an error when the vectors above differ in length)

#df <- data.frame(title = title, link_domain = link_domain, score = score, age = age)
#Naive way of extracting the entire page content with this table
#tb <- content %>% html_node('table.itemlist') %>% html_text()
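
One possible way around the misaligned vectors is to select one node per story first and then look inside each of them with html_node() (singular), which returns exactly one result per input node and NA where nothing matches. The sketch below is not run against the live site: the inline HTML is a hypothetical, simplified stand-in for the page, and the 'tr.athing' and 'td.subtext' selectors are assumptions about how the rows are laid out. Only the 'a.storylink', 'span.sitestr', 'span.score' and 'span.age' selectors come from the code above.

```r
library(rvest)

# Hypothetical, simplified stand-in for the Hacker News markup (an assumption,
# not the live page): each story is a 'tr.athing' row followed by a row that
# holds a 'td.subtext' cell with the score and age.
html <- '
<table class="itemlist">
  <tr class="athing"><td>
    <a class="storylink" href="https://example.com/a">Story A</a>
    <span class="sitestr">example.com</span>
  </td></tr>
  <tr><td class="subtext">
    <span class="score">10 points</span>
    <span class="age">1 hour ago</span>
  </td></tr>
  <tr class="athing"><td>
    <a class="storylink" href="item?id=1">Job post B</a>
  </td></tr>
  <tr><td class="subtext">
    <span class="age">2 hours ago</span>
  </td></tr>
</table>'

content <- read_html(html)

# One node per story, so everything extracted from them has the same length
story_rows <- content %>% html_nodes('tr.athing')
meta_rows  <- content %>% html_nodes('td.subtext')

# html_node() keeps one slot per input row; html_text() gives NA for empty slots
df <- data.frame(
  title       = story_rows %>% html_node('a.storylink') %>% html_text(),
  link_domain = story_rows %>% html_node('span.sitestr') %>% html_text(),
  score       = meta_rows  %>% html_node('span.score')  %>% html_text(),
  age         = meta_rows  %>% html_node('span.age')    %>% html_text(),
  stringsAsFactors = FALSE
)

print(df)
```

In this toy input the second story has no link_domain and no score, so those two cells come back as NA instead of shifting the later rows up. This only works if the two row selectors return the same number of nodes, which would need checking against the real page.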

11.2 R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

11.3 Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
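
To see what echo = FALSE does without opening RStudio, the following minimal sketch knits a tiny two-chunk document from a character vector with knitr::knit(text = ...). The two chunks and their contents are hypothetical and chosen only for illustration. They are not the chunks used in this chapter.

```r
library(knitr)

# Three backticks, built programmatically so this example does not nest fences
fence <- strrep("`", 3)

# Hypothetical two-chunk document: one chunk echoes its code, the other hides it
rmd <- c(
  paste0(fence, "{r, echo=TRUE}"),  "summary(cars$speed)", fence,
  "",
  paste0(fence, "{r, echo=FALSE}"), "summary(cars$dist)",  fence
)

# knit() returns the rendered markdown as a single string when given text
out <- knit(text = rmd, quiet = TRUE)
cat(out)
```

The rendered markdown should contain the source line of the first chunk together with its result, but only the result of the second chunk.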