Text Analysis: What will we be talking about at the EAERE conference?

Next week, the EAERE conference, one of my favorite conferences, will be held online for the first time. Many papers will be presented from 23 June until 3 July 2020. The conference is free, but to attend it you need to register on the program webpage. Once you have created an account, you can also look up the program to see whether any presentation matches your interests.

The conference’s website reports an extraordinary number of papers to be presented during this week: 570 papers.

Similarly to my previous analysis of the themes discussed at the EARE conference, I want to get an overview of the topics that will be discussed at the EAERE 2020 conference. My goal is to run a text analysis on the presentation titles, identifying the most frequently used words and exploring how words are correlated with each other.

Extracting the text data may seem daunting, but it was surprisingly easy. In my past application I had problems with cleaning the text data, but this time it was a lot simpler: I logged in to the fleximeets platform and copy-pasted all the presentations into an Excel document. The result is a messy Excel document in which each row is a sentence. I do not share the document here on the blog because it contains the researchers’ names and Zoom links, which are sensitive information. But if you want to do the same, it should take you no longer than 10 minutes. My dataset is called eaere_conference.

I decided to clean this Excel document with R instead of manually deleting the unnecessary information. I want to keep only the presentation titles. Fortunately, the rows with presentation titles start with a digit followed by a full stop and a space, e.g. “1. ”. So I can write an R script that keeps only the rows containing “1. ”, “2. ”, “3. ” and so on. I use the grepl function to find these rows:

# TRUE if row i contains the literal string "0. " (fixed = TRUE disables regex matching)
grepl("0. ", eaere_conference[i, ], fixed = TRUE)

The code above finds rows containing the text “0. ”. Its output is a vector of TRUE or FALSE values denoting whether each row contains the expression “0. ” or not. I repeated the code above for the other digits (“1. ”, “2. ”, etc.) to find all presentation titles. The result is a dataframe with 649 rows, where each row represents a conference title.

> dim(eaere_conference)
[1] 649   1
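As an aside, the repeated grepl calls for each digit can be collapsed into a single regular-expression call. A minimal sketch, assuming the titles sit in the first column of eaere_conference:

# keep only rows that start with one or more digits followed by ". "
keep <- grepl("^[0-9]+\\. ", eaere_conference[[1]])
eaere_conference <- eaere_conference[keep, , drop = FALSE]

Anchoring the pattern with “^” has the added benefit of ignoring digit-and-full-stop sequences that appear in the middle of a row.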

My first task is to check the most frequent words used in presentation titles.

To do so, I create a tibble with the dplyr package that includes the text data. I define the line numbers with line = 1:dim(eaere_conference)[1] and pass the dataframe containing the text to the text = argument.

library(dplyr)
text_df <- tibble(line = 1:dim(eaere_conference)[1], text = eaere_conference)

The text_df tibble looks like this:

> head(text_df)
# A tibble: 6 x 2
   line text$`DATABASE (unpresented papers)`                                                                                               
  <int> <chr>                                                                                                                              
1     1 1. Access to and consumption of natural gas: spatial and socio-demographic drivers                                                 
2     2 2. Agri-environmental investment support: adoption and neighborhood effects                                                        
3     3 3. Analysis of Public Preferences toward Water Ecosystem Services: a Hybrid Latent Class Approach to Model Attitudinal Factors in ~
4     4 4. Are Emissions Trading Schemes Really Cost-effective?                                                                            
5     5 5. Asset prices and incentives for climate policy                                                                                  
6     6 6. Blue Sky or Bright Light?: Empirical Analysis for a Campaign-Style Environmental Enforcement in China                           
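Note the column header text$`DATABASE (unpresented papers)` above: because I passed a whole dataframe rather than a character vector to the text = argument, the tibble stores it as a dataframe column. A minimal sketch of a construction that avoids this, assuming the titles are in the first (and only) column:

# pass the character column itself rather than the enclosing dataframe
text_df <- tibble(line = seq_len(nrow(eaere_conference)),
                  text = eaere_conference[[1]])

With this version the mutate_all(as.character) step below would no longer be needed.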

The next step is to convert this tibble into a much longer object where each row is a single word; that way, I can count how many times each word appears. For this I use the tidytext package, specifying in the unnest_tokens() command that I want to unnest by words.

library(tidytext)
text_df <- text_df %>%
  mutate_all(as.character) %>%  # flatten every column to plain character
  unnest_tokens(word, text)     # split the text column into one word per row

This results in a text_df object with 5534023 rows, that is, 5534023 words. (That is far more than 649 short titles can contain; in fact 5534023 = 649 × 8527, which suggests every row ended up holding a copy of the entire set of titles. The likely culprit is the dataframe column noted above: mutate_all(as.character) collapses it into one long string that gets recycled to all 649 rows. All absolute counts below are therefore inflated by a factor of 649, but the relative frequencies, which are what matter here, are unaffected.) Before analyzing this data, I would like to get rid of common words in the English language which are not useful for the analysis, such as “and”, “to”, “about” or “the”. These are included in the stop_words tibble, which contains 1149 common words. I delete them from my text_df tibble using the anti_join function:

data(stop_words)
text_df <- text_df %>%
  anti_join(stop_words)

The result is a tibble object with 3808332 words (again 649 × 5868, so roughly 5868 words across the actual titles once the duplication is divided out). In order to check for the most frequent words, I typed:

> text_df %>%
+   count(word, sort = TRUE)
# A tibble: 1,869 x 2
   word              n
   <chr>         <int>
 1 climate       62953
 2 evidence      57761
 3 environmental 49973
 4 carbon        33099
 5 energy        31152
 6 effects       28556
 7 change        27258
 8 policy        25311
 9 economic      24662
10 electricity   22715
# ... with 1,859 more rows

It seems that across all 649 titles, the word climate shows up most often (62953 counts, or 97 actual occurrences once the 649-fold duplication is divided out), followed by evidence and environmental. The most frequent words are also illustrated in the graph below:
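For reference, here is a minimal ggplot2 sketch of how such a bar chart of the ten most frequent words can be drawn (the original figure may have been produced differently):

library(ggplot2)

text_df %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 10) %>%                      # ten most frequent words
  ggplot(aes(x = n, y = reorder(word, n))) +  # order bars by frequency
  geom_col(fill = "lightblue") +
  labs(x = "Number of occurrences", y = NULL)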

The second thing I want to show is word correlations. That is, which words are associated with each other in the titles of EAERE presentations? For this analysis, I follow the tutorial in the UC Business Analytics R Programming Guide. You can find the tutorial on text analysis here.

I create a tibble called ps_words.

library(dplyr)
library(tidytext)

# one id per title (if eaere_conference is a dataframe, seq_len(nrow(eaere_conference))
# is the safer way to get one id per row)
ps_words <- tibble(title = seq_along(eaere_conference),
                   text = eaere_conference) %>%
  mutate_all(as.character) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)  # drop common stop words

To calculate the correlation between pairs of words, I use the widyr package. Its pairwise_cor function computes, for every pair of words, the phi coefficient: a correlation based on how often the two words appear in the same title relative to how often they appear apart.

library(widyr)
(word_cor <- ps_words %>%
    group_by(word) %>%
    pairwise_cor(word, title) %>%   # phi coefficient for each pair of words
    filter(!is.na(correlation)))    # drop pairs with undefined correlations

This way, I can identify pairs of words that have a high correlation, such as “hybrid” and “latent”, or “attitudinal” and “class”. All these pairs have a correlation coefficient of one, which means that they always appear together in paper titles; here they all come from the title containing “Hybrid Latent Class Approach to Model Attitudinal Factors” shown earlier.

> word_cor %>%
+   arrange(desc(correlation))
# A tibble: 3,483,822 x 3
   item1       item2       correlation
   <chr>       <chr>             <dbl>
 1 latent      hybrid               1.
 2 class       hybrid               1.
 3 attitudinal hybrid               1.
 4 hybrid      latent               1.
 5 class       latent               1.
 6 attitudinal latent               1.
 7 hybrid      class                1.
 8 latent      class                1.
 9 attitudinal class                1.
10 hybrid      attitudinal          1.
# ... with 3,483,812 more rows
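A quick way to probe the neighborhood of a single term is to filter the correlation table for it; for instance, for the word “climate” (illustrative, output not shown):

word_cor %>%
  filter(item1 == "climate") %>%  # words that co-occur with "climate"
  arrange(desc(correlation))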

To plot the correlations as a network, I use the ggraph package. I tell the code to use the ps_words tibble. To avoid an illegible graph, I only want words to appear in the network if they are present in presentation titles 12 times or more, and I only want a link to show up if the correlation coefficient is higher than 0.15. The functions geom_edge_link, geom_node_point and geom_node_text define how the lines, dots and text in the network will appear.

library(ggraph)
ps_words %>%
  group_by(word) %>%
  filter(n() >= 12) %>%               # keep words appearing at least 12 times
  pairwise_cor(word, title) %>%
  filter(!is.na(correlation),
         correlation > 0.15) %>%      # keep only the stronger correlations
  ggraph(layout = "fr") +             # force-directed (Fruchterman-Reingold) layout
  geom_edge_link() +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

The code above yields the following graph (note: when plotting this graph yourself, it might look different, because the “fr” layout starts from a random node placement; calling set.seed() before plotting makes it reproducible):

If there is a line connecting two dots, that means that the two words have a correlation coefficient higher than 0.15. That is, they appear frequently in the same title.

The graph above is exploratory, but it does show interesting insights about the different topics to be discussed at EAERE. Starting with the cluster on the bottom left of the picture, there are several words associated with energy markets. Some presentations seem to deal with renewable energy, since “wind power”, “renewable” and “electricity” appear correlated. Other presentations seem to focus on the more policy-oriented side of energy topics, with terms such as “carbon”, “pricing/prices” and “emission”.

On the bottom right, there is a smaller network of words associated with “air pollution”, which is interlinked with “health”. Other minor word associations are “local” and “global”, and “model” and “dynamic”.

On the right side of the picture, there are a few connected words on the topic of “risk” and “insurance”. To assess risk, it is common to conduct “experiments”, which is another term associated with this cluster.

Finally, the cluster at the top of the picture covers perhaps three remaining topics, with the word interlinking all of them being “choice”. One big topic is “stated preferences” and “experiment”. Relatedly, the second topic seems to be “ecosystem service valuation”, perhaps differing from the first by being more applied, since it is also associated with the word “application”. Finally, “climate change adaptation” and “environmental policy” show up as separate topics, albeit related to the energy topics.

Overall, the EAERE conference will cover a wide range of relevant topics. Since climate (change) is such a hot topic, it is the theme of many conference presentations. I look forward to it!