Extracting meta data from scientific papers (2): extracting text data from multiple pdf files

I have gotten involved in a large literature review effort. After screening the literature, we found many potentially interesting papers: hundreds or thousands that could be relevant. To find out which are most relevant to the research question, one can manually screen the titles and abstracts or skim-read the papers.

This initial screening is informative, yet tedious. Instead, my goal is to automate the literature review as much as possible and obtain basic data from each paper. I am an economist, so perhaps I am most interested in the papers that mention economic impacts. Once I know which papers within the large literature review mention economic impacts, I can then read them.

I am able to read the papers into R because I have a local folder with all the pdf files from the literature review. After setting this folder as the working directory with the setwd function, I use the list.files function to create a vector with the names of the pdf files in that folder (I also keep the folder path in a variable, dir, which I will need later):

dir <- getwd()                          # path to the folder with the pdf files (set above with setwd)
files <- list.files(pattern = "pdf$")

This creates a character vector with 117 elements. Each element is the file name of one paper. To read the pdf files, I need the pdftools package in R:

library(pdftools)

I can now use the pdf_text function within this package to extract the text.
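For example, applied to a single file, pdf_text returns a character vector with one element per page (the object name one_paper below is just for illustration):

one_paper <- pdf_text(file.path(dir, files[1]))   # text of the first pdf
length(one_paper)                                  # number of pages in that pdf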

To read all the pdf files, I created a loop that generates 117 objects containing the text of each pdf, called text_##, where ## represents a number from 1 to 117.

for(i in 1:length(files)){
  # store the full path of the i-th pdf in an object called pdf_i
  assign(paste("pdf", i, sep = "_"), file.path(dir, files[i]))
  # extract the text of that pdf and store it in an object called text_i
  assign(paste("text", i, sep = "_"), pdf_text(get(paste("pdf", i, sep = "_"))))
}

I now have 117 character objects, called text_1, text_2 and so on, up to text_117. Each text object has one element per page of the corresponding pdf.
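To double-check this structure, I can look at one of the objects (I use text_1 here just as an example):

length(text_1)                     # number of pages in this paper
cat(substr(text_1[1], 1, 200))     # peek at the start of page 1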

I can now start the text analysis. Let us say I am interested in knowing which of the 117 papers mention words related to economics, such as “economics”, “economy”, “economical”, “financial”, “finances”, “profitable” or “profits”. I can start by checking whether the first three papers mention these keywords at all.

To accomplish this, I found a neat function called text_locate within the corpus package. For every instance in which the given words are mentioned, it also shows some text before and after. That way I might be able to understand in which context the word was used. It might also come in handy if I want to categorize what kind of economic values the papers are talking about.

To use the text_locate function, I need the name of the text object containing the paper’s text and a vector of words to search for. I also use the option stemmer = "en". That way, instead of specifying every word form I need (e.g. “economic”, “economics”, etc.), I can just specify “econom” and text_locate will match words that share that stem, such as “economic” or “economics”.

Here is the output:

> library(corpus)
> text_locate(text_1, c("econom", "profit", "financ"), stemmer = "en")
text                 before                 instance                 after                 
(0 rows)
> text_locate(text_2, c("econom", "profit", "financ"), stemmer = "en")
  text                before                 instance                 after                 
1 11   …\ntems and the essential social and  economic  benefits they                    26: …
> text_locate(text_3, c("econom", "profit", "financ"), stemmer = "en")
  text                before                 instance                 after                 
1 6    …problem of            environmental– economic –cultural disasters; however,\nglobal…

From the output I can see that no instances of these words show up in the first paper, but one instance shows up in each of papers 2 and 3. The code reports every instance in which the words appear. In the second paper, the word “economic” shows up on page 11 and is associated with benefits. This is not much information about the context, but if I want to know more, I know where to look (paper 2, page 11) to see in which context the word appears. The same applies to paper 3: again the word “economic” shows up, this time on page 6 and in the context of disasters.

Another interesting thing I can do with text_locate is to count how many times these words are used in each paper. I could probably do this more simply with other commands, but I am lazy and would rather reuse the command I already have. My trick is to count the number of rows in the text_locate output with the function nrow, since each row is one instance in which the words occur.
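For a single paper this looks as follows; paper 2 from above should give 1, since it had a single match:

# each row of the text_locate output is one match, so nrow counts the instances
nrow(text_locate(text_2, c("econom", "profit", "financ"), stemmer = "en"))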

I generate a loop that fills each element (from 1 to 117) of a new variable, ind_econ, with the number of instances in which these words are mentioned. I first create the ind_econ variable and then run the loop. Here is the output:

> ind_econ <- numeric(length(files))
> for(i in 1:length(files)){
+     ind_econ[i]        <- nrow(text_locate(get(paste("text", i, sep = "_")), c("econom", "profit", "financ"), stemmer = "en"))
+ }
> ind_econ
  [1]  0  1  1  2  0  1  2  1  0  1  0  0  1  0  0  1  2  0  7  2  0  3  0  1  2  1  0  0 53
 [30]  2  2  0  0  4  1  0  0  0  1  0  0  0  0  0  6  0  1  5  0  0  2  0 11  2  0  2  6 18
 [59]  3  1  0  0  1  0  1  4  2  1  0  0 63  4  5  1  0  2  0  3  0  7  0  0  0  1  0  0  6
 [88]  0  0  0 10  5  1  0  0  0  0  1  0  3  0  1  0  0  2  2  2  0  0  0  0  1  0  0  3  0
[117]  0

This results in a vector with 117 elements, where each element is the number of times the related words are mentioned in the corresponding paper. I can already see that a lot of papers do not mention these words at all, but there are two papers that probably focus on economic impacts, mentioning these words 53 and 63 times. When I went back and checked the titles, I could see that these are in fact resource economics papers, and are worth focusing on.
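To see at a glance which papers those are, I can pair the counts with the file names and sort them. This is just a rough sketch; the name econ_hits and the cutoff of 10 are arbitrary choices of mine:

# pair each pdf file name with its keyword count and sort, highest first
econ_hits <- data.frame(file = files, hits = ind_econ)
econ_hits <- econ_hits[order(econ_hits$hits, decreasing = TRUE), ]
head(econ_hits)                         # the papers mentioning the keywords most often
econ_hits$file[econ_hits$hits > 10]     # candidates probably worth reading first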

I find this way of automating literature searches very useful, especially if 1) there is a very large number of papers to go through and 2) we have a faint idea of which keywords they mention but no idea how prevalent they are. I recommend automating as much as possible, while always remembering that automation is no substitute for manually reading, analysing and interpreting each paper. There is (yet!) no substitute for the human brain.
