This week I was presented with the challenge of handling data observed at different points in time. Since I was surprinsingly unsuccessful at finding the data in the specific format that I wanted (per day), I wanted to share with you some R codes to transform data specified at a smaller scale to get it in a different date format.
Time can be measured in different ways, ranging from billions of years to microseconds. Most economic analysis applications take into consideration data at year, month, or even day level. In my previous blog post, I looked at the impact of the Kyoto protocol on pollution given yearly data. Since my variable of interest (the introduction of the Kyoto protocol) only varied across years, it did not make sense to use more disaggregated data with months or days.
This blog post shows how to get data specified at a smaller time scale to a larger time scale. For example, transforming daily data into monthly data, or monthly data into yearly data. The specific application is that of weather data.
Finding weather data is a “nightmare” in a very positive way.
There are thousands of sources for weather data. By weather data I mean a lot of important variables for environmental economists, for example, temperature, wind speed, precipitation, etc. This discussion on ResearchGate summarizes very nicely the many data sources for weather data publicly available.
And the main problem I had with finding the information I need is that I could not find both daily and country level data. I thought it would be straightforward, but in fact weather data originates from measuring stations located in different places in a country. Instead of gathering data from all measuring stations, I decided to do the lazy thing and collect weather data from the measuring stations at the capital of each country I was interested in.
And then came the second problem that my post provides a solution to. It was very complicated to find weather data in the measurement I wanted: daily data. Most weather (free) databases provide data at hourly level, and if they had data at daily level, they did not always have data for the period I was interested in, nor all the weather variables I needed.
While there are many websites to get weather data, I downloaded my data from this website. You can download any data you want by selecting a city and then clicking on “weather archive” (right up corner). You then choose the time period you are interested in.
From hourly to daily data
I extracted weather data from measuring station close to Vilnius (the capital of Lithuania). This is an excel file, so I will open it in R with the readxl package and call this data frame “df”. (If you are using the same website as me, you have to delete the first rows in the excel file before you import it to R.)
library(readxl) df <- read_excel("DIRECTORY/Vilnius weather.xls")
The data looks like the picture below. Each row is about a different date and hour, e.g. “22.04.2020 21:00”.
Instead, I would like to have daily data. That is, each row corresponds to a day. So the first observation should correspond to the 22nd of April (“22.04.2020”).
Let us say that I am interested in the first variable “T”. This is the temperature, which changes throughout the day. I want to estimate the mean for each day. Instead of mean, the code below can be easily adapted to obtain the minimum, maximum or any other statistic from the variable T.
This solution is an adaptation from the solution shared here.
Since the first variable is a date, we need a special package to deal with it. I will use the xts package. Don’t forget to install it before you try using it.
The first step is to create a variable that R recognizes as a date. I’ll do this with the as.POSIXct function and call this variable “datehour”.
Notice that our time variable (df$`Local time in Vilnius (airport)`) separates days, months and years using a dot, and the hours using a colon symbol. For example: “22.04.2020 13:00”. So in the command below you should correctly specify the format of your data. Accordingly, I have specified the format as %d.%m.%Y %H:%M.
In breaks= you should specify the time scale you want to covert your data to. I want this hourly data into daily data, so I simply specified breaks=”day”.
require(xts) df$datehour <- cut(as.POSIXct(paste(df$`Local time in Vilnius (airport)`), format="%d.%m.%Y %H:%M"), breaks="day")
Once I run this code, I obtain the datehour variable which looks this this:
> head(df$datehour)  2020-04-22 01:00:00 2020-04-22 01:00:00 2020-04-22 01:00:00 2020-04-22 01:00:00 2020-04-22 01:00:00 2020-04-22 01:00:00 478 Levels: 2019-01-01 00:00:00 2019-01-02 00:00:00 2019-01-03 00:00:00 2019-01-04 00:00:00 2019-01-05 00:00:00 2019-01-06 00:00:00 ... 2020-04-22 01:00:00
This datehour variable takes 478 different values, one for each day in the data.
What I want is to estimate the average temperature per day. So I will create a new dataframe called “means” that calculates the mean temperature (“T”) per day.
means <- aggregate(T ~ datehour, df, mean)
> head(means) datehour T 1 2019-01-01 00:00:00 0.4625 2 2019-01-02 00:00:00 -0.0625 3 2019-01-03 00:00:00 -4.3250 4 2019-01-04 00:00:00 -4.9375 5 2019-01-05 00:00:00 -3.1250 6 2019-01-06 00:00:00 -5.8500
The data frame means now records the average temperature for each day. Let me just double-check that this variable is correct. The summary statistics for temperature on the 22nd April was 10.31 degrees Celsius:
> summary(df$T[df$datehour=="2020-04-22 01:00:00"]) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 3.30 6.95 12.30 10.31 13.85 15.00 2
In the means data frame, this corresponds to a single observation:
> means$T[means$datehour=="2020-04-22 01:00:00"]  10.31429
Fortunately, the means are the same in both data sets. I now have the data in the format I needed, that is each row corresponds to a day.
The following code can be adapted to for example extract the minimum or maximum temperature by changing the following function here:
aggregate(T ~ datehour, df, min)
aggregate(T ~ datehour, df, max)
And of course you extract daily averages or data of any variable of interest.
I could have calculated daily temperature averages by hand, but it’s convenient to create a code to do so because I have several weather variables as well as locations that are of interest to me. In the long-run, coding this task saves time.
The next step for me is to merge this data frame with another one to create a column for daily temperature as well as other weather variables of interest. I can then use temperature and other weather variables in my analysis.