Today I will be illustrating how to extract specific types of data from scientific publications. I have been trying to automate literature syntheses and the extraction of core data.
As economists, we are usually interested in the money values reported in each paper. For example, the GDP, average income, average prices, or in the case of environmental valuation, the willingness to pay for avoiding degradation or obtaining an environmental improvement. My problem is: can I extract the money values from a scientific paper for further analysis?
To extract these values, I need to import text data into R and then analyze it. This requires some functions from text mining. For those interested in getting started with text mining, Feinerer (2013) offers a good introduction to the tm package, which provides many functions to facilitate text analysis. I will be using the same package:
library(tm)
The goal is to extract text data from a scientific paper. The article I use is the seminal paper by Carson et al. (2003) that estimates the use and non-use values lost due to the Exxon Valdez oil spill. First, I build the path to the corresponding PDF file:
pdf_1 <- file.path("DIRECTORY", "Carson et al. - 2003 - Contingent valuation and lost passive use damages.pdf")
The file.path function above takes my directory and then the name of the PDF file, and joins them into a single path. This produces a character string in R (called pdf_1 in my case). From this path, I want to produce a corpus object for further analysis with the tm package:
carson_2003 <- Corpus(URISource(pdf_1), readerControl = list(reader=readPDF))
The object carson_2003 is a list which contains the text from the scientific paper. To see the text in R, I can use the inspect function:
inspect(carson_2003)
Now the object is in a format I can use for further analysis. The Carson et al. (2003) study was conducted in the United States, so the currency of interest is US dollars. The goal here is to find and extract expressions in the body of text that contain the “$” symbol.
First, I want to find out which pages contain the dollar sign. I use the functions grepl and which. The function grepl returns TRUE or FALSE for each page, indicating whether it contains the dollar symbol. The function which then tells me which pages were TRUE (i.e. contained the dollar symbol).
> grepl("\\$", carson_2003[])
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
FALSE FALSE FALSE
> which(grepl("\\$", carson_2003[]))
13 14 15 16 17 20 21 22 26 27
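The same two-step idea can be tried on a small example without needing the PDF. This is only a sketch in base R: the vector pages is a hypothetical stand-in for the per-page text of a paper, and its sentences are made up for illustration.

```r
# Hypothetical stand-in for per-page text (not taken from Carson et al.)
pages <- c("Introduction with no prices.",
           "Households were willing to pay $53.60.",
           "Methods and survey design.",
           "Total damages of $2.8 billion.")

# One TRUE/FALSE per page: does the page contain a literal dollar sign?
has_dollar <- grepl("\\$", pages)
print(has_dollar)

# Indices of the pages for which the test was TRUE
dollar_pages <- which(has_dollar)
print(dollar_pages)
```

Here the second and fourth elements contain a dollar sign, so which returns 2 and 4.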
(Observation: One problem I face is that the dollar symbol is a special character when searching within text. In string searching, an unescaped “$” is used to “match at the end of a line” (Yarberry, 2021). So if I write “$” rather than “\\$” in the code above, the pattern matches the end of each page instead of looking for a literal dollar sign. Escaping it as “\\$” makes R search for the symbol itself.)
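To make the difference concrete, here is a small base R sketch comparing the unescaped pattern with three ways of asking for a literal dollar sign. The vector x is a made-up example; the alternatives fixed = TRUE and “[$]” are not used elsewhere in this post and are shown only for comparison.

```r
# Made-up example strings: only the second contains a dollar sign
x <- c("no currency here", "costs $5.00")

# Unescaped "$" is an end-of-string anchor, so every string matches
unescaped <- grepl("$", x)

# Three ways to search for the literal symbol instead
escaped    <- grepl("\\$", x)             # backslash escape
as_fixed   <- grepl("$", x, fixed = TRUE) # treat the pattern as plain text
char_class <- grepl("[$]", x)             # inside a character class

print(unescaped)   # TRUE TRUE
print(escaped)     # FALSE TRUE
print(as_fixed)    # FALSE TRUE
print(char_class)  # FALSE TRUE
```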
Several pages contain the dollar sign: pages 13 to 17, 20 to 22, and 26 to 27. This is not enough information yet; I want to extract the numbers that follow the dollar sign.
I can achieve this with the str_extract and str_extract_all functions from the stringr package. I am using str_extract_all because I want to retrieve all money values on each page, rather than only the first one.
To write the correct expression to find the values that I want, I had to get some help from my colleague. I want to retrieve all numbers after the dollar sign. Some of these numbers contain a full stop, followed by more numbers. All of this needs to be included when I write the expression that will enable R to find all money values. I ended up with the following expression:
"\\$\\d+\\.\\d+"
As mentioned, “\\$” finds the dollar symbol in the text, “\\d+” finds one or more digits, “\\.” finds the full stop, and the second “\\d+” finds the digits after it. I pass this expression as the pattern argument of the str_extract_all function. This function requires specifying the text object (carson_2003[]) and the string expression to look for. This yields the following output:
> str_extract_all(carson_2003[], "\\$\\d+\\.\\d+")
[[1]] character(0)
[[2]] character(0)
[[3]] character(0)
[[4]] character(0)
[[5]] character(0)
[[6]] character(0)
[[7]] character(0)
[[8]] character(0)
[[9]] character(0)
[[10]] character(0)
[[11]] character(0)
[[12]] character(0)
[[13]] character(0)
[[14]] character(0)
[[15]] character(0)
[[16]] "$60.33" "$53.60" "$53.60"
[[17]] "$30.30" "$26.18" "$35.08" "$97.18" "$85.82" "$108.54" "$79.20" "$67.93" "$90.47"
[[18]] character(0)
[[19]] character(0)
[[20]] character(0)
[[21]] character(0)
[[22]] "$2.8"
[[23]] character(0)
[[24]] character(0)
[[25]] character(0)
[[26]] character(0)
[[27]] "$2.71" "$48.28" "$58.91"
[[28]] character(0)
[[29]] character(0)
[[30]] character(0)
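For readers who want to experiment with the pattern without the PDF, the sketch below applies it to a few made-up sentences. To keep it runnable without extra packages it uses base R (regmatches with gregexpr) rather than str_extract_all; this is a swapped-in technique, not the call used above. The vector pages and the broader pattern broad are assumptions added for illustration only.

```r
# Hypothetical page text (invented sentences, not from the paper)
pages <- c("No prices on this page.",
           "Median WTP was $53.60; the mean was $97.18.",
           "Damages of $2.8 billion, or about $30 per household.")

# Same expression as in the post: "$", digits, a full stop, digits
pattern <- "\\$\\d+\\.\\d+"
found <- regmatches(pages, gregexpr(pattern, pages, perl = TRUE))
print(found)  # one character vector per page; "$30" is not picked up

# Assumed extension: optional thousands separators and optional decimals
broad <- "\\$\\d+(,\\d{3})*(\\.\\d+)?"
found_broad <- regmatches(pages, gregexpr(broad, pages, perl = TRUE))
print(found_broad)  # now also returns "$30"
```

The first pattern only returns amounts that include decimals, so whole-dollar amounts such as “$30” are skipped. Whether the broader pattern is appropriate depends on how the paper reports its values.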
Most pages do not contain any dollar values with decimals. I find a total of 16 money values in Carson et al. (2003) that include decimals, reported on pages 16, 17, 22 and 27. Some of these are the willingness-to-pay (WTP) estimates I was interested in retrieving. While this is not the end of this particular data extraction, it serves as a starting point for further analysis.
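One possible next step, sketched here under assumptions, is to turn the extracted strings into a table of numbers with their page index. The list hits is a small hypothetical stand-in for the output of the extraction step (one character vector per page), and the column names page and amount are my own choice.

```r
# Hypothetical extraction result: one character vector per page
hits <- list(character(0), c("$53.60", "$97.18"), "$2.8")

# Repeat each page index once per value found on that page,
# strip the dollar sign, and convert the strings to numbers
values <- data.frame(
  page   = rep(seq_along(hits), lengths(hits)),
  amount = as.numeric(sub("$", "", unlist(hits), fixed = TRUE))
)
print(values)
```

The resulting data frame has one row per extracted value, which makes it easier to filter, summarise or merge with values from other papers. Note that a figure such as “$2.8” may have been followed by a unit like “billion” in the text, so the numbers still need to be checked against their context.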
Carson, R. T., Mitchell, R. C., Hanemann, M., Kopp, R. J., Presser, S., & Ruud, P. A. (2003). Contingent valuation and lost passive use: damages from the Exxon Valdez oil spill. Environmental and Resource Economics, 25(3), 257-286.
Feinerer, I. (2013). Introduction to the tm Package: Text Mining in R. Available online: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf.
Yarberry, W. (2021). Regular Expressions in Stringr. In CRAN Recipes (pp. 219-220). Apress, Berkeley, CA.