How Can I Do a Principal Component Analysis in R?

In the previous article, I showed how I had generated a triangular network on a polygon.

The resulting network is shown below. I plotted the positions of the observation sites in red dots. The solar radiation data collected at these sites are what I wish to interpolate.

Why Do I Want EOFs of the Data?

Now, moving a step forward, I need to find the empirical orthogonal functions (EOFs) of my data set. Another way to say this is to perform a principal component analysis. This will be the precursor to a set of orthonormal functional bases.

The EOFs will also tell me about the nature of the data. The bases with higher eigenvalues indicate where most of the variation in the data lies.

Last but not least, the scores of the EOFs (the inner products of the EOFs with the data) are uncorrelated with each other, which brings some great statistical properties I can use to estimate the distribution of my interpolation in the end.
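That uncorrelatedness can be checked directly: a minimal sketch on synthetic data (not the solar data set) that projects centered observations onto the eigenvectors of their covariance matrix and verifies the scores have a diagonal covariance matrix.

```r
# Sketch on synthetic data: scores obtained by projecting onto EOFs
# are numerically uncorrelated.
set.seed(1)
X  <- matrix(rnorm(200 * 5), nrow = 200)       # 200 observations, 5 variables
Xc <- scale(X, center = TRUE, scale = FALSE)   # remove the mean
V  <- eigen(cov(Xc))$vectors                   # EOFs of the synthetic data
scores <- Xc %*% V                             # inner products with the EOFs
off_diag <- cov(scores)
diag(off_diag) <- 0
max(abs(off_diag))                             # ~ 0 up to rounding error
```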

The Packages I Used in This Article

The package "rgl" will be used in this article for the 3D scatterplots, along with "fields" for plotting the covariance matrices with image.plot().

Preparation of Data

I obtained the solar radiation data between July and August 2016 from 24 meteorological sites in Taiwan.

metadata=read.csv("局屬測站座標.csv")
series=c(2:7,9:12,16:26,28:30)
coor_site=cbind(metadata$經度[series],metadata$緯度[series])
meteorology_201607=read.csv("201607_cwb_hr.txt",sep="",na.strings=-9999)
meteorology_201608=read.csv("201608_cwb_hr.txt",sep="",na.strings=-9999)
A glimpse of the raw data

Then I sorted it into an array with dimensions m × n × o, where m is the number of sites, n the total number of days, and o the 24 hours of a day. I named the array solar_all_A.
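The reshaping step can be sketched like this, assuming the raw hourly records can be arranged as one row per (site, day, hour) observation; the data frame and column names below are hypothetical stand-ins, not those of the actual CWB files.

```r
# Hypothetical tidy records: one row per (site, day, hour) observation.
df <- expand.grid(site = 1:32, day = 1:62, hour = 1:24)
df$value <- rnorm(nrow(df))                    # stand-in for radiation values

# Fill the m x n x o array by matrix indexing (vectorized, no loops).
solar_all_A <- array(dim = c(32, 62, 24))
solar_all_A[cbind(df$site, df$day, df$hour)] <- df$value
```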

Because I was interested in the spatial covariation, I first computed the hourly mean of each site:

solar_hour_mean=matrix(ncol=32,nrow=24)
for(i in 1:32){
  for(j in 1:24){
    solar_hour_mean[j,i]=mean(solar_all_A[i,,j],na.rm=T)
  }
}

Subtracting this from the original data, I got the anomalies I shall be analyzing.

delta_solar_all_A=array(dim=c(32,62,24))
for(i in 1:32){
  for(j in 1:62){
    for(k in 1:24){
      delta_solar_all_A[i,j,k]=solar_all_A[i,j,k]-solar_hour_mean[k,i]
    }
  }
}

Find the (Unbiased) Covariance Matrix

R actually already has built-in functions (such as prcomp()) that perform principal component analysis for you. However, since the coding is quite simple, I did the analysis on my own.

Basically to do a PCA, we have to find the covariance matrix of a data set.
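The two routes agree: a sketch on synthetic data showing that eigen-decomposing the covariance matrix reproduces what the built-in prcomp() reports (its $sdev values are the square roots of the covariance eigenvalues).

```r
# Sketch on synthetic data: manual PCA vs. R's built-in prcomp().
set.seed(1)
X <- matrix(rnorm(100 * 6), nrow = 100)
manual  <- sqrt(eigen(cov(X))$values)   # sqrt of covariance eigenvalues
builtin <- prcomp(X)$sdev               # standard deviations of the PCs
max(abs(manual - builtin))              # ~ 0
```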

COV_solar=matrix(nrow=length(series),ncol=length(series))
for(i in 1:length(series)){
  for(j in 1:length(series)){
    if(i<=j){
      COV_solar[i,j]=var(c(delta_solar_all_A[series[i],,]),c(delta_solar_all_A[series[j],,]),na.rm=T)
    }
    else{
      COV_solar[i,j]=COV_solar[j,i]
    }
  }
}
image.plot(COV_solar,asp=1)
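The double loop can also be collapsed into a single cov() call if the anomalies are first flattened into a sites-in-columns matrix; `use = "pairwise.complete.obs"` mirrors the pairwise NA handling of var(x, y, na.rm=TRUE) above. A sketch on synthetic data of the same shape:

```r
# Sketch on synthetic data: the whole covariance matrix in one call.
set.seed(1)
A <- array(rnorm(32 * 62 * 24), dim = c(32, 62, 24))
A[sample(length(A), 500)] <- NA                # sprinkle in missing values
series <- c(2:7, 9:12, 16:26, 28:30)           # the 24 sites kept above
flat <- sapply(series, function(s) c(A[s, , ]))  # one column per site
COV  <- cov(flat, use = "pairwise.complete.obs") # 24 x 24, pairwise NAs
```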

The original matrix was a little messy, so I reordered the sites according to their locations to obtain another covariance matrix that captures the spatial pattern better.

series_sort=series[c(c(17,23,21,7,9,10),c(6,3,5,2,4,1,8),c(19,15,24),c(13,12,11,20,14),18,16,22)]
COV_solar_sort=matrix(nrow=length(series),ncol=length(series))
for(i in 1:length(series)){
  for(j in 1:length(series)){
    if(i<=j){  # i<=j (not i<j), so the diagonal is computed rather than left NA
      COV_solar_sort[i,j]=var(c(delta_solar_all_A[series_sort[i],,]),c(delta_solar_all_A[series_sort[j],,]),na.rm=T)
    }
    else{
      COV_solar_sort[i,j]=COV_solar_sort[j,i]
    }
  }
}
image.plot(COV_solar_sort,asp=1)

Sorting or not does not affect the result of the eigen-decomposition, so this was merely for better visualization.

Now, because there were only 24 sites, directly eigen-decomposing the matrix was a reasonable approach. I did, however, first try to find the variance of the uncorrelated measurement errors. I did so by assuming that all the values on the matrix with magnitude smaller than 0.1 were actually composed of a constant plus a normally distributed error.

error=c()
for(i in 1:length(COV_solar_sort)){
  if(!is.na(COV_solar_sort[i])){
    if(COV_solar_sort[i]<.1){
      error[i]=COV_solar_sort[i]
    }
    else{
      error[i]=NA
    }
  }
  else{
    error[i]=NA
  }
}
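The same filtering can be done in one vectorized line with ifelse(): keep entries below the threshold and set everything else (including NAs) to NA. A self-contained sketch on a toy matrix of stand-in covariance values:

```r
# Toy stand-in for the covariance matrix (column-major fill).
M <- matrix(c(0.05, 0.3, NA, 0.02), nrow = 2)

# Vectorized equivalent of the loop: sub-threshold values kept, rest NA.
error <- ifelse(!is.na(M) & M < 0.1, M, NA)
```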

The threshold 0.1 was arbitrary but not without reasoning. About 50% of the covariance values fell below this threshold.

It turned out that the variance of the uncorrelated error was quite small, about 0.000917 (the minimum variance of the data set was about 0.2). I subtracted it from the diagonal anyway and then eigen-decomposed the matrix.

COV_solar_unbiased=COV_solar
for(i in 1:nrow(COV_solar_unbiased)){
  COV_solar_unbiased[i,i]=COV_solar[i,i]-var(error,na.rm=T)
}
inform_solar=eigen(COV_solar_unbiased)
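Two quick sanity checks apply to any such eigen-decomposition: on a symmetric matrix, eigen() returns orthonormal eigenvectors, and the eigenvalues sum to the trace. A sketch on a synthetic symmetric matrix:

```r
# Sketch: sanity checks for eigen() on a symmetric matrix.
set.seed(1)
S <- crossprod(matrix(rnorm(24 * 24), 24))   # symmetric positive definite
e <- eigen(S)
max(abs(crossprod(e$vectors) - diag(24)))    # ~ 0: columns are orthonormal
sum(e$values) - sum(diag(S))                 # ~ 0: eigenvalues sum to trace
```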

There are more robust ways to determine the variance of the measurement errors of a data set, which I have demonstrated in another article:

The Resulting Eigenvectors

The function eigen() returns the eigenvalues and eigenvectors of a matrix. The columns of the returned matrix $vectors are the eigenvectors.

(Were we to have a much larger matrix, it might be preferable to use iterative methods.)

The first five eigenvalues constituted about 72% of the total variance, which is quite low compared with other data sets. This implies that solar radiation might have more local variations and characteristics.
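The explained-variance shares follow directly from the eigenvalues (inform_solar$values in this article): each eigenvalue divided by their sum, cumulated. A sketch with toy eigenvalues:

```r
# Sketch: share of total variance each EOF explains, from toy eigenvalues.
values <- c(5, 3, 2, 1.5, 1, rep(0.5, 7))
explained <- values / sum(values)        # per-EOF fraction of variance
cumsum(explained)                        # cumulative fraction
```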

Percentage of Variance Each Eigenvector Represents

To see what these eigenvectors look like, I interpolated their values onto all the triangulation points using a radial kernel with a bandwidth of 0.5.

bw=.5
dis_points=matrix(ncol=length(series),nrow=nrow(points))
for(i in 1:nrow(dis_points)){
  for(j in 1:ncol(dis_points)){
    dis_points[i,j]=sqrt((points[i,1]-coor_site[j,1])^2+(points[i,2]-coor_site[j,2])^2)
  }
}
weight_points=exp(-dis_points/(2*bw^2))
for(i in 1:nrow(weight_points)){
  weight_points[i,]=weight_points[i,]/sum(weight_points[i,])
}
z_in=weight_points%*%inform_solar$vectors
plot3d(points[,1],points[,2],z_in[,1])
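Because each row of the weight matrix is normalized to sum to one, every interpolated value is a convex combination of the station values. A self-contained check of that normalization on toy coordinates, using the same decay kernel:

```r
# Sketch on toy coordinates: kernel weights row-normalize to 1.
set.seed(1)
pts  <- matrix(runif(20), ncol = 2)   # 10 interpolation points
site <- matrix(runif(8),  ncol = 2)   # 4 stations
bw <- 0.5
d <- sqrt(outer(pts[, 1], site[, 1], "-")^2 +
          outer(pts[, 2], site[, 2], "-")^2)
w <- exp(-d / (2 * bw^2))             # same decay kernel as above
w <- w / rowSums(w)                   # row-normalize the weights
range(rowSums(w))                     # both ends ~ 1
```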

This was what the first principal component looked like:

It obviously represented the difference in solar radiation between the mountains and the plains of the island, although, due to the island's narrower east-west extent, central Taiwan was "lifted" too much.

There are other kernels that take into account the distance to the coastline. Those might fix this problem, but I decided to stick with this result for now.

This was what the second principal component looked like:

It was also obvious that this component captured the variation caused by latitude.

I show some of the other EOFs here; for some of them, the origins are more difficult to guess.

The third EOF didn’t seem any different from EOF 1.
The fourth EOF seemed to represent the difference between east and west.
The fifth EOF

What is Next?

The EOFs we obtained are only orthonormal with respect to the original data set. This is already a very good property, but I would like to develop a set of EOFs that are orthonormal over the entire domain. That is, I want to build a set of orthonormal functional bases that are also EOFs.

We will see if I can do that in my next article.

About This Series…

This article is part of the series My Coding Diary with R. To know how the series has been developing, check the intro article out:

Tony Yen

A Taiwanese student who studied Renewable Energy in Freiburg. Now studying smart distribution grids / energy systems in Trondheim.