Projects‎ > ‎Solar activity‎ > ‎

3. Data analysis & results

Time series of solar activity

We load the results of the batch processing script. This is a data frame with for each image (or each day) some statistics.


We can easily plot the variables in the data frame A as follows. This is the mean value, for example:

plot(A$date, A$mean, type="l")
which results in the following plot.

More fancy plotting is available through the ggplot2 package (see here). The code below constructs a plot which visualizes the distribution of pixel intensities over the two year period covered by our data set. The bottom and top quintiles are omitted for clarity.

myCol = brewer.pal(5,"Blues")

p = ggplot(A, aes(x=date))
p = p + geom_ribbon(aes(ymin=q20,ymax=q80),fill=myCol[2]) 
p = p + geom_ribbon(aes(ymin=q30,ymax=q70),fill=myCol[3]) 
p = p + geom_ribbon(aes(ymin=q40,ymax=q60),fill=myCol[4]) 
p = p + geom_line(aes(y=q50)) + xlab('date') + ylab('intensity')
p + scale_x_date(major = "months",format="%m-%y") 

This code first loads the required libraries, then constructs the plot. This is the result:

The black line represents the median value. The darkest shade of blue ranges from the 40% to the 60% quantile, the lighter shade from 30% to 40 % and 60% to 70%, and the lightest shade from 20% to 30% and 70% to 80%. Hence 60% of all values are in the shaded area. The tails of the distribution - the lower and upper quintiles - are not shown in this plot. The tick marks on the horizontal axis are put at the beginning of each month. For example '05-11' really is May 1st, 2011. We clearly see periodic changes, as well as prolonged periods of more intense activity. 

Correlations and unusual highs

We have 11 quantiles for each day. The correlation between these is obtained as:


The result is this matrix:

The q100 value, which is the maximum value, correlates the least with all others. We investigate this some further, by looking at how much greater the maximum is compared to the 90% quantile. The following code plots the distribution of this difference for the approx. two years worth of data that we have.

hist(A$q100 - A$q90, breaks=40) 

The distribution looks like a normal distribution, with some additional unusually high numbers. Using methods like Gaussian mixture models we could try to identify some optimal decision boundary separating the normal distribution from the higher values, but here we pick a sensible value by eye, for simplicity: 11500. The red line in the image above shows the location of this value. We write a little function that will generate a plot marking these unusual values.

myPlot = function(Asub) {
    Asub$index = 1:(dim(Asub)[1])
    unusualIndex = Asub[Asub$unusual,"index"]
    p = ggplot() 
    p = p + geom_ribbon(data=Asub, mapping=aes(x=date,ymin=q90,ymax=q100),fill=myCol[2]) 
    for (i in unusualIndex) {
        p = p + geom_area(data=Asub[(i-1):(i+1),],mapping=aes(x = date, y = q100), fill="red")
    p = p + geom_line(data=Asub,mapping=aes(x=date,y=q100)) + ylab('intensity')
    p = p + geom_line(data=Asub,mapping=aes(x=date,y=q90)) 
    p + scale_x_date(major = "months",format="%m-%y") 

A$unusual = (A$q100 - A$q90)
myPlot(subset(A, date >= "2011-7-1" & date < "2011-12-31"))

This allows us to easily restrict the date range covered in the plot. The example shows only a 6 month period, since that is nicer than all two years in one plot. This is what the plot looks like:

The top black line is the maximum value, the bottom black line the 90% quantile, with the area in between shaded in light blue. In red the days are indicated with unusually high activity, in the sense that the difference between the maximum and the 90% quantile is unlikely to come from the normal distribution governing this quantity most of the time.

Comparing with daily sunspot numbers

We download the international daily sunspot number data for the year 2011. Information about sunspot indices can be found at the NOAA National Geophysical Data Center. The international sunspot number is produced by the Solar Influences Data Analysis Center (SIDC). We downloaded data from this NOAA FTP site. Data credits: SIDC, RWC Belgium, World Data Center for the Sunspot Index, Royal Observatory of Belgium, `year(s)-of-data'.

First we rearrange the data in a .csv text file for easy reading in R. We combine the sunspot number with our data frame A, thereby restricting our data to the year 2011. We put all this in data frame B:

S = read.csv2("2011daily.csv", stringsAsFactors = FALSE) # a simple csv file with 2 columns: date and number
m = match(A$datechar,S$date)
A$ssn = S[m,"number"]
B = subset(A,!

Consider some correlations between our data and the ssn (sunspot number).

cor(B$ssn, B$mean)

The correlation with the mean is highest, 0.80. Second highest comes the 90% quantile with 0.77. We plot the mean and the ssn together (below). The values on the horizontal axis now refer to the days in 2011. The black line are the sunspot numbers, the red is our daily mean.


The combined correlation of the mean and q90 (the 90% quantile) with the ssn can be analyzed using linear modeling.

fit = lm(ssn ~ q90 + mean, B)
summary(fit) # gives some useful details
cor(B$ssn, fit$fitted.values)

This fits a model with the ssn as dependent variable, and q90 and mean as predictors. The correlation between the predicted values and the ssn, 0.81, is only a little larger than between the mean alone and the ssn. In the plot below, the black are again the sunspot numbers, and the red are our predictions under the model. Our predictions seem to be smoother than the original sunspot numbers, with fewer extremes. The predictions correspond fairly well to the sunspot numbers, although not all the time.


Ideas for further work

  • Downloading data colected at other wavelengths, deriving similar statistics, and doing a correlation analysis.
  • Using the time series data for forecasting, using some time series modeling technique.
  • ...(please add your suggestions below).