Using web scraping in literature reviews

Data scraping is the practice of extracting data in an automated way. Web scraping simply means extracting that data from websites on the internet. Scraping can be very useful to quickly pull together vital pieces of information that would otherwise take hours of manual work, work that is also prone to human error.

If the data presents itself in a way that is easy to scrape, then I recommend scraping it (provided the website allows it). How do we find out whether a website is easy to scrape? Right-click on the page you are interested in and click on “View page source”. If you can see the elements you want to extract in the HTML or JavaScript code, then web scraping should be “easy”.
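As for whether the website allows it, one way to check is the robotstxt package (my suggestion; it is not used in the rest of this post), which tests a URL against the site’s robots.txt rules:

install.packages("robotstxt")
library(robotstxt)

# TRUE means the site's robots.txt permits crawling this path
paths_allowed("https://www.sciencedirect.com/science/article/abs/pii/S0048969716302224")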

To scrape data, I use the rvest package in R (plus stringr for the string matching further below).

install.packages(c("rvest", "stringr"))
library(rvest)
library(stringr)  # provides str_match(), used below

As usual, the first time I do something I follow a tutorial. To illustrate how to web-scrape, I will extract basic data about some scientific articles: Tuhkanen et al. (2016), Menendez-Carbo et al. (2020), Östberg et al. (2012), Whitehead and Wicker (2018), and Armbrecht (2014). The basic data I want are the title, abstract, year of publication, and whatever else I can “easily” extract from these articles.

Before I write more R code, I look at the page source to understand the code behind the page. The elements I want (title, year, abstract, etc.) sit in tags beginning with “<meta name=”, so “meta” will be an important indicator for identifying these elements.

Let me start with Tuhkanen et al. (2016). I copy-paste the article’s URL into the read_html() function:

# Download and parse the article's page
base_webpage <- read_html("https://www.sciencedirect.com/science/article/abs/pii/S0048969716302224")

# Keep the content attribute of every <meta> tag on the page
data <- base_webpage %>%
  html_nodes("meta") %>%
  html_attr("content")

The data object has 21 observations. This completes the easy part of the code: the web scraping itself. The trickier part is extracting the relevant information from each observation. When I print the data object, I can see where each piece of information is stored.
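One quick way to find the right indices is to line up each meta tag’s name attribute with its content (this inspection step is my own addition, not part of the original workflow):

meta_names <- base_webpage %>%
  html_nodes("meta") %>%
  html_attr("name")

# Show index, name, and content side by side to locate each field
data.frame(index = seq_along(data), name = meta_names, content = data)

Once I know which index holds which field, I can save the relevant pieces of information in a dataframe called df, as follows: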

# Set up a one-row dataframe with named columns
df <- data.frame(journal = NA, title = NA, year = NA,
                 abstract = NA, volumes = NA, doi = NA)

df$journal[1]  <- data[7]
df$title[1]    <- data[14]
df$year[1]     <- str_match(data[17], "\\d+")[1]  # keep only the digits
df$abstract[1] <- data[12]
df$volumes[1]  <- data[3]
df$doi[1]      <- data[9]

I now have a dataframe that stores the journal, title, year, abstract, volumes, and DOI of this specific article.

In this case, the data (the article’s title, year, etc.) comes through cleanly, but on other pages it may arrive in R in a more chaotic form.

In those cases, when extracted information comes through in an “unclean” way, there are shortcuts to extract specific strings. For example, to obtain the year I used the str_match() function. This function looks for a pattern, in this case “\\d+”, where “\\d” matches a single digit and “+” means one or more of them. For other patterns, see the R documentation on regular expressions.
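As a small illustration (the input string here is made up, since the exact content of data[17] varies by page):

str_match("Published 15 April 2016", "\\d+")[1]   # "15"  (first run of digits)
str_match("Published 15 April 2016", "\\d{4}")[1] # "2016" (exactly four digits)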

Now that I have working code, I run a loop to web-scrape multiple webpages. I copy-paste the link to each page into a list called link, and then use my code to scrape and extract the relevant information from each page.

link <- list("https://www.sciencedirect.com/science/article/abs/pii/S0048969716302224",
             "https://www.sciencedirect.com/science/article/abs/pii/S2211973620300945?casa_token=AR2fJhsoA04AAAAA:XxmgUhc3_9NjXFTqxn-5MJDqvcm_Sn_uAITVNdK-2JzvZnlha6c7NAxmljUZjfcaNCs-Y0FqXn4",
             "https://www.sciencedirect.com/science/article/pii/S0301479712003271",
             "https://www.sciencedirect.com/science/article/abs/pii/S0261517717302182?casa_token=Czxla95zDUcAAAAA:3d_HA7lS-VmYqynq1N4hRRdr_NZHlSY3KKDAsfTZzVko-V1qyZ4koLy3rVmMmShpvDkhFJixk0k",
             "https://www.sciencedirect.com/science/article/abs/pii/S0261517713002082?casa_token=xVlKnVTTOmUAAAAA:Yr0c6ZsLG9KrsudrP_lG2i6qTuS2OBrgzsNcJ3Gdq4fWxDB-yz3hoaexQDMqm-KokpwHYgfGKeA")

# The first article is already in row 1, so loop over the remaining four
for (i in 2:5) {
  base_webpage <- read_html(link[[i]])

  data <- base_webpage %>%
    html_nodes("meta") %>%
    html_attr("content")

  df[i, "journal"]  <- data[7]
  df[i, "title"]    <- data[14]
  df[i, "year"]     <- str_match(data[17], "\\d+")[1]
  df[i, "abstract"] <- data[12]
  df[i, "volumes"]  <- data[3]
  df[i, "doi"]      <- data[9]
}

This added four more rows to the df dataframe. Finally, I saved the dataframe as a csv file.
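The saving step is a one-liner with write.csv (the file name here is my choice):

write.csv(df, "scraped_articles.csv", row.names = FALSE)

The resulting file looks like this: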

journal | title | year | abstract | volumes | doi
Science of The Total Environment | Valuing the benefits of improved marine environmental quality under multiple stressors | 2016 | Many marine ecosystems are under increasing pressure from multiple stressors. In the Baltic Sea, these stressors include oil and chemical spills from … | 551-552 | 10.1016/j.scitotenv.2016.02.011
Tourism Management Perspectives | The economic value of Malecón 2000 in Guayaquil, Ecuador: An application of the travel cost method | 2020 | Malecón 2000 is one of the most important urban parks in Ecuador. It is a recreational and ecological park that combines history, trade, culture, and … | 2211-9736 | 10.1016/j.tmp.2020.100727
Journal of Environmental Management | Non-market valuation of the coastal environment – Uniting political aims, ecological and economic knowledge | 2012 | In this paper, we examine the feasibility of using an approach for estimating Willingness-To-Pay for marine environmental improvements, based on a hol… | 110 | 10.1016/j.jenvman.2012.06.012
Tourism Management | Estimating willingness to pay for a cycling event using a willingness to travel approach | 2017 | This study examines the monetary value of nonmarket benefits to participants of an active sport tourism event, such as happiness and pride from partic… | 65 | 10.1016/j.tourman.2017.09.023
Tourism Management | Use value of cultural experiences: A comparison of contingent valuation and travel cost | 2013 | Few applications to assess the value of cultural experiences exist. This is particularly frustrating for cultural institutions, as it provides them wi… | 42 | 10.1016/j.tourman.2013.11.010

The code runs smoothly for each link because these are all papers from the same publisher (Elsevier), whose pages present scientific articles using the same underlying HTML. The table can be expanded to include as many articles as you want, as long as the links are stored somewhere in R (and you stick to Elsevier articles). For other publishers, you can follow the same logic, but the code will not work without some changes.
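To adapt the approach to another publisher, one option (my own sketch, not from the original workflow) is to wrap the scraping step in a small helper that returns every meta name/content pair, so you can inspect the new layout before hard-coding indices:

scrape_meta <- function(url) {
  page <- read_html(url)
  data.frame(
    name    = page %>% html_nodes("meta") %>% html_attr("name"),
    content = page %>% html_nodes("meta") %>% html_attr("content")
  )
}

# Inspect a page from the new publisher, then pick out the relevant rows
# scrape_meta("https://...")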

References:

Tuhkanen, H., Piirsalu, E., Nõmmann, T., Karlõševa, A., Nõmmann, S., Czajkowski, M., & Hanley, N. (2016). Valuing the benefits of improved marine environmental quality under multiple stressors. Science of The Total Environment, 551, 367-375.

Menendez-Carbo, S., Ruano, M. A., & Zambrano-Monserrate, M. A. (2020). The economic value of Malecón 2000 in Guayaquil, Ecuador: An application of the travel cost method. Tourism Management Perspectives, 36, 100727.

Östberg, K., Hasselström, L., & Håkansson, C. (2012). Non-market valuation of the coastal environment – Uniting political aims, ecological and economic knowledge. Journal of Environmental Management, 110, 166-178.

Whitehead, J. C., & Wicker, P. (2018). Estimating willingness to pay for a cycling event using a willingness to travel approach. Tourism Management, 65, 160-169.

Armbrecht, J. (2014). Use value of cultural experiences: A comparison of contingent valuation and travel cost. Tourism Management, 42, 141-148.
