Daily Wind Aggregation

How and what do we measure in the wind?

Since 1972, we have been collecting wind speed and direction data at the H.J. Andrews Experimental Forest. Over the years, our methods have changed along with technology and personnel. Our earliest measurements were made with an R.M. Young cup-type anemometer, with results recorded on a Rustrak strip chart recorder. Subsequently, we used various forms of automated Campbell dataloggers and R.M. Young propeller anemometers. Currently, we use more advanced Campbell dataloggers and Campbell CSAT3 sonic anemometers.

Sonic anemometers are the most advanced wind measurement devices. The older anemometers mentioned above were based on the mechanical principle of transforming a wind path into a measurable rotational velocity through a known function based on the size and working range of the instrument. This is problematic at low wind speeds, because a certain amount of inertia is required for the instrument to register a change in wind velocity (Foken, 2008). When wind speeds are low or winds are turbulent, the path required to change the rotational velocity of a mechanical anemometer may not be met, resulting in over- or underspeeding. Although corrections can be applied to adjust wind speeds in non-ideal conditions, these differ across instruments and across varying conditions. Sonic anemometers, on the other hand, measure the time a sonic signal takes to travel across a measuring path. A signal is emitted from both ends of the path simultaneously, but it is not received simultaneously because of the wind velocity along the path. The discrepancy in transmission time, together with the known speed of sound and path length, is used to determine the velocity and direction of the wind. Sonic anemometers can be two- or three-dimensional, and can have varying path lengths. While shorter paths allow us to resolve smaller wind patterns, they also generate higher-precision inputs which are computationally intensive. The balance between data quantity and data quality is very literal: shorter paths and longer floating-point numbers lead to longer computation time and more data.
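To make the transit-time idea concrete, here is a minimal sketch in Python of recovering the along-path wind component from the two transit times. The function name and the numbers are illustrative only, not the CSAT3's actual algorithm; in this common two-way formulation the speed of sound conveniently cancels out of the final expression.

```python
def wind_along_path(d, t_with, t_against):
    """Along-path wind component from the two sonic transit times (seconds)
    over a path of length d (meters): v = (d / 2) * (1/t_with - 1/t_against)."""
    return (d / 2.0) * (1.0 / t_with - 1.0 / t_against)

# A 0.1 m path, speed of sound 343 m/s, and a 2 m/s wind along the path:
c, d, v = 343.0, 0.1, 2.0
t_with = d / (c + v)       # pulse travelling with the wind arrives sooner
t_against = d / (c - v)    # pulse travelling against the wind arrives later
print(round(wind_along_path(d, t_with, t_against), 6))  # -> 2.0
```

Two or three such paths at different orientations give the two- or three-dimensional wind vector.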

The output from the CSAT3 sonic anemometer that we bring in through our telemetry is a five-minute aggregation of one-Hertz values. In the past we sampled at one-quarter Hertz on our propeller anemometers, so this is a massive increase in the number of values incorporated into each computational operation. On the datalogger, a five-minute mean is computed that reduces the 300 one-second values down to one mean five-minute value. While this alleviates some work in our post-collection data flow, the time cost of performing this operation on the logger is an issue (for each measurement, both the raw value and its time or index must be given an address in logger memory so that later computations can be done). In the past, when we collected less frequently, we could also compute daily summaries on the logger; now these have to be done after the data is collected. Unfortunately, in making this methodological change, we stopped collecting certain interesting daily results. The old daily summaries included the maximum wind speed (which we call max gust) with its direction and time, the resultant wind speed and direction, and an average wind speed representing the day, all based on the high-resolution data.
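As a rough sketch of the reduction the logger performs (this is illustrative Python, not the actual logger program), collapsing 1 Hz samples into five-minute means looks like:

```python
def five_minute_means(samples_1hz):
    """Collapse a sequence of 1 Hz samples into five-minute (300-sample) means.

    A full day of 86400 one-second values reduces to 288 five-minute values.
    """
    window = 300
    chunks = (samples_1hz[i:i + window]
              for i in range(0, len(samples_1hz), window))
    return [sum(chunk) / float(len(chunk)) for chunk in chunks]

# Ten minutes of fake data: 300 ones followed by 300 threes.
print(five_minute_means([1.0] * 300 + [3.0] * 300))  # -> [1.0, 3.0]
```

Everything finer than five minutes is gone after this step, which is exactly why the daily summaries discussed below have to be rebuilt from the five-minute values.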

Here I will discuss only the latter two of these three daily wind components. While collecting maximum gust is important, recovering it after the fact is simply a matter of reprogramming the logger. The only note of interest here is that taking the maximum of five-minute mean wind speeds is not the same as getting the instantaneous max gust. As even basic statistics would suggest, a maximum value is a distinctive outlier, and the information it contains is smoothed away by the mean.

The average wind speed, however, is a different story. What is important to note here is that average wind speed and resultant wind speed are not computed in the same way, and represent two very different, but important, components of a daily wind summary. While the average wind speed is a simple arithmetic mean of the high-resolution wind velocities, regardless of their directions, the resultant wind vector is a vector average. A vector average is exactly what it sounds like: visualize each wind observation over the course of the day as a vector. The first vector starts with its tail at the origin and points out in its direction with a length equal to its speed. The second vector puts its tail on the tip of the first and points off in its own direction (oriented to the origin), with a magnitude equal to its speed. If you add up all these vectors into one giant vector, the direction of that vector is the resultant direction, and the magnitude of that vector divided by the number of observations (one per second would ideally give 86400, but since we are only outputting five-minute data, we have 288) is the resultant speed.
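The head-to-tail picture above can be sketched in a few lines of Python; `vector_average` is a hypothetical helper, and note that the direction it returns is in plain trigonometric coordinates, not the meteorological convention discussed next.

```python
import math

def vector_average(u_components, v_components):
    """Head-to-tail vector average: sum the x and y components, then divide
    the resultant's magnitude by the number of observations."""
    sum_u = sum(u_components)
    sum_v = sum(v_components)
    n = len(u_components)
    speed = math.hypot(sum_u, sum_v) / n
    # Direction in plain trigonometric degrees, counterclockwise from +x.
    direction = math.degrees(math.atan2(sum_v, sum_u)) % 360
    return speed, direction

# Two equal and opposite observations cancel, so the resultant speed is zero,
# even though the arithmetic mean of the speeds would be 1.0.
speed, direction = vector_average([1.0, -1.0], [0.0, 0.0])
print(speed)  # -> 0.0
```

This cancellation is the whole point: a day of wind sloshing back and forth up a valley can have a large average speed but a near-zero resultant.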

This math might seem like a high-school physics problem, but it gets more complicated. One of the key issues is that meteorological direction is not the same as trigonometric coordinates, which in turn are not the same as the cardinal directions we use on a map. Additionally, different sensors orient differently to the meteorological direction. For example, on our Campbell CSAT3 sonic, the “x vector” runs from north to south, with south-blowing wind positive and north-blowing wind negative. The “y vector” runs from west to east, with east-blowing wind positive and west-blowing wind negative. Other sensors, such as the Metek, do not obey these same sign conventions. Some sensors may also have to be physically installed at a certain angle, requiring an additional directional offset. So for every sensor, a very different methodology is needed.

For the Campbell CSAT3 sonic, the goal is to take the x-vector and y-vector means, which are given at five minutes, add the component-wise vectors along each axis, solve for the resultant wind speed, and then decompose that resultant to get the daily resultant wind direction. We understand that a lot of the interesting variation at sub-five-minute resolution is lost with this approach; however, we do not have the capacity to salvage that data. We then divide our resultant vector by the number of observations included in it to get the vector-averaged daily wind speed.

As such, we need a function that can compute windspeed, and a function that can compute wind direction. We’ll also need a few ancillary functions to group data by days, perform the appropriate reductions on other columns, and to flag missing or questionable values. While I do not present all the code here, it can be found in this repository as chicago.jl.

The language I will show this in is Julia, a new technical computing language which is similar to R and Python. Although it is still in development, the flexibility of data structures and types in Julia lets us harness all the functions we’ll need to do these computations efficiently. I will assume that the reader does not want to know how to find and isolate columns in a pesky developmental language, and instead focus on the tools that are critical for these results, and that could be extended to other analyses.


Date grouping

Probably the most common analysis tool I use is some form of mapping high-resolution values onto less frequent aggregate dates.

Let me begin by saying that date math is not fun or beautiful, and that you should never, ever trust a serial datetime. If you find yourself doing this too, let me suggest that you trust a very reliable library named “Joda” and nothing else. Joda is a Java-based date library that lets language designers implement dates and times using object-oriented concepts that humans understand, such as “durations”, “days”, and “intervals”. Because it is based on conversions of strings into meaningful “ticks” rather than on accumulated ticks since an epochal time, all the oddities of “when did the epoch start” disappear, and this is nice. To aggregate values into dates, I find the best way is to create a data structure such as a “dictionary” (in Python semantics) whose keys are the dates at the aggregate frequency, and whose values are lists or other data structures containing the observations you need to compute your solutions. In Python this is pretty much as easy as pie. In Julia, it’s actually somewhat annoying to do, but worth it in the end. Of course, I could have just used PyCall, which lets you call Python functions natively from Julia, but what fun is that?
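For comparison, the grouping described above really is nearly a one-liner in Python with a defaultdict; the `obs` list here is hypothetical sample data:

```python
from collections import defaultdict
from datetime import datetime

# (timestamp, value) observations -- hypothetical sample data
obs = [(datetime(2013, 11, 20, 0, 0), 0.2),
       (datetime(2013, 11, 20, 0, 5), 0.1),
       (datetime(2014, 11, 12, 0, 0), -0.5)]

by_day = defaultdict(list)
for stamp, value in obs:
    # Key on (year, month, day), discarding hours and minutes.
    by_day[(stamp.year, stamp.month, stamp.day)].append(value)

print(dict(by_day))  # -> {(2013, 11, 20): [0.2, 0.1], (2014, 11, 12): [-0.5]}
```

The defaultdict handles the “is this key already present?” bookkeeping automatically, which is the part that takes explicit branching in the Julia version below.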

For this problem, I created this function:

    function date_to_dict(array2, col, d2)
        #= group the values of column `col` by date, keyed on (year, month, day) without hhmm =#
        dateDict = Dict()
        for index = 1:length(d2)
            key = (year(d2[index]), month(d2[index]), day(d2[index]))
            # if the key is not in the dictionary, add it with this observation
            if !haskey(dateDict, key)
                dateDict[key] = [array2[index, col]]
            # if the key is in the dictionary, append the new value to the existing array
            else
                push!(dateDict[key], array2[index, col])
            end
        end
        return dateDict
    end

There, I created an empty data structure and then walked over our data, which is in some form of delimited file that can be column-indexed (the argument “col” refers to a column of data). If a date is not already in the data structure, we make it a key and give it a list containing its observation as a value. If the date is already in the dictionary, we push the new observation onto that day’s list of observations. Ultimately, we end up with something like this:

    Dict{Any,Any} with 625 entries:
    (2013,11,20) => [0.2,0.1,0.1,0.1,0.1,...
    (2014,11,12) => [-0.5,-0.5,-0.5,-0.5,...

In short, we have 625 days in this record, now keyed on the day. The data shown here is the x-vector data. So, on 2013-11-20, one five-minute interval measured 0.2. Note that we don’t know which interval that was. Also note that the dictionary is not in “order”. Don’t worry, though: once the mapping is established, it doesn’t matter.

Calculating the windspeed

The x-vector and the y-vector form a right triangle with the resultant vector being the hypotenuse. Due to the component method of vector addition, these components can be added on each axis, in any order, as long as the direction (positive or negative) is included.

The previous method created for us two date-maps of x-vectors and y-vectors for each day. We can now add up the vectors in each axis. To do this efficiently, we go ahead and include a day-mapping to keep this analysis indexed on the date as well:

    vector_sums = [x=>reduce(+,input[x]) for x in keys(input)]

This instruction tells us to take each “key” (date) and shorthand it to “x”, then reduce all the inputs that are associated with that key (“input[x]”) to a single value using the operator specified (“+”), generating a map of daily vector sums for both the x- and y- vectors.

The output, before averaging, now appears as

     Dict{Any,Any} with 625 entries:
      (2013,8,19)  => 60.3
      (2013,11,20) => -286.4
      (2014,11,12) => -106.4
      (2013,8,9)   => 39.4

Which I have rounded for your viewing pleasure.

Now we need to compute the windspeed. This just means using the Pythagorean theorem and dividing by the number of entries in each day. Each x-vector/y-vector pair counts as one observation. We define a simple function to compute windspeed.

    function windspeed(ux, uy)
        #= computes windspeed from the Pythagorean theorem, rounded to two decimals =#
        wspd = round(sqrt(ux^2 + uy^2)*100)/100
        return wspd
    end

And apply it to each date in both date mappings, using the unique date values to match across the two structures.

    function wind_iter(wux_dict, wuy_dict)
        #= computes windspeed for each day, dividing by number of observations =#
        wux_sums = reduce_dicts(wux_dict)
        wuy_sums = reduce_dicts(wuy_dict)
        wspd = [x=>windspeed(wux_sums[x], wuy_sums[x])/length(wux_dict[x]) for x in keys(wux_dict)]
        return wspd
    end

Resultant wind speed is returned!

    (2013,8,19)  => 0.25
    (2013,11,20) => 0.92
    (2014,11,12) => 0.42
    (2013,8,9)   => 0.20
    (2014,2,25)  => 0.66
    (2014,5,23)  => 0.31

Calculating the wind direction

Similar to calculating the wind speed, we now need to compute the wind direction for the day. This, however, is a little trickier because of the conventions associated with Campbell instruments, meteorological coordinates, and the Cartesian system.

To walk through this simply, we define a function for wind direction that takes as arguments the x- and y-vector sums of the day. Because this is a Campbell instrument, we invert the y-vector.

    function winddirection(ux, uy)
        #= because we have a Campbell, we invert the uy component =#
        uy = -uy
        if uy > 0
            # q1 and q2, and add 180 to go to cardinal direction
            output = (270 - atan(ux/uy)*(360/(2*pi)) + 180) + 180
        elseif uy < 0
            # q3 and q4, and add 180 to go to cardinal direction
            output = (90 - atan(ux/uy)*(360/(2*pi)) + 180) + 180
        else
            # uy == 0: wind is along the x-axis only
            output = ux > 0 ? 360 : 0
        end
        return output
    end

In the above, the normal transformation from Cartesian to meteorological coordinates is the first part of the statement: 270 minus the arc tangent of the x-vector over the y-vector, times a radians-to-degrees conversion, plus 180 degrees. This puts the result in the meteorological system. After this, 180 degrees is added to flip the result into the azimuthal system we are familiar with.

One issue with the meteorological system is that it ranges from 180 to 540 degrees, and adding the extra 180 often makes the results much larger than the 360-degree circle. To return to the azimuth, we just apply a filter that folds all outputs back onto the conventional scale:

    function convention(output)
        #= folds an out-of-range direction back onto the 0 to 360 degree circle =#
        if output > 360
            wdir = output - 360
        elseif output < 0
            wdir = output + 360
        else
            wdir = output
        end
        return wdir
    end

Now, wrapping the wind direction function in the filter, we apply it to the sums of the x- and y-vectors for each day:

    wdir = [x=>convention(winddirection(wux_sums[x], wuy_sums[x])) for x in keys(wux_dict)]

Wind direction, vector averaged for the day, is returned!

     (2013,8,19)   56.5077 
     (2013,11,20)  212.561  
     (2014,11,12)  205.537  
     (2013,8,9)    41.0781 
     (2014,2,25)   235.073  
     (2014,5,23)   16.0642 
     (2013,10,3)   304.902  
     (2014,7,2)    27.2403 

Carrying forward the other metrics

Based on our decisions as a group, we assimilated some of the other five-minute metrics into daily statistics. For example, we compute a daily mean wind speed based on the mean of the five-minute wind speeds, which are themselves based on the higher-resolution data we do not have available at this scale. We compute a daily maximum based on the maximum of the five-minute gusts. We compute a daily standard deviation by calculating the standard deviation of the five-minute means (a method I tested against the alternative of computing the mean of the five-minute deviations, and found to replicate what a logger doing a daily summary would have output).
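A minimal sketch of these carried-forward statistics in Python (the function and sample values are hypothetical, and whether a logger would use the population or sample standard deviation is an assumption here; I use the population form):

```python
import statistics

def daily_summary(five_min_means, five_min_gusts):
    """Daily statistics carried forward from five-minute values: mean of the
    means, max of the gusts, and the standard deviation of the five-minute
    means (not the mean of the five-minute deviations)."""
    return {
        "mean": statistics.mean(five_min_means),
        "max_gust": max(five_min_gusts),
        # Population form assumed; a logger may use the sample form instead.
        "std": statistics.pstdev(five_min_means),
    }

day = daily_summary([1.0, 2.0, 3.0], [4.0, 5.5, 5.0])
print(day["mean"], day["max_gust"])  # -> 2.0 5.5
```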


We have a specific daily flagging algorithm we implement, which proceeds in this order of precedence:

(1) If there are more than 20 percent missing values, our day is coded “M” for missing.

(2) If there are more than 5 percent questionable values, our day is coded “Q” for questionable.

(3) If there are more than 5 percent estimated values, our day is coded “E” for estimated.

(4) If less than 5 percent of values are questionable, estimated, or missing, the day is coded “A” for acceptable.

(5) Otherwise, the day is questionable.

After many fun times implementing this in conditional statements, I’ve found the best way is to quickly build true/false lists for the three “bad” flags and look at the proportion of trues to falses to determine the daily flag.

    function flags(flag_results, col, array_x, d2_x)
        #= runs the daily flagging algorithm on a column of five minute flags =#
        flag_dict = date_to_dict(array_x, col, d2_x)
        # fraction of each day's values carrying each "bad" flag
        rejected = [x=>sum(flag_dict[x].=="M")/(length(flag_dict[x])-1) for x in keys(flag_dict)]
        questioned = [x=>sum(flag_dict[x].=="Q")/(length(flag_dict[x])-1) for x in keys(flag_dict)]
        estimated = [x=>sum(flag_dict[x].=="E")/(length(flag_dict[x])-1) for x in keys(flag_dict)]
        for (k, v) in rejected
            if v >= 0.2
                flag_results[k] = "M"
            elseif questioned[k] + v > 0.05
                flag_results[k] = "Q"
            elseif estimated[k] > 0.05
                flag_results[k] = "E"
            elseif questioned[k] + estimated[k] + v < 0.05
                flag_results[k] = "A"
            else
                # otherwise, the day is questionable
                flag_results[k] = "Q"
            end
        end
        return flag_results
    end

All in all, we generally find that most days are acceptable. Here’s an example from a day that got a missing code, for comparison. This is a line of data from an output .csv file.

    MS043,24,PRIMET,WND011,1000,1A,WNDPRI02,2013-08-19,.295,M,7,M,213.492,M,125.931,M,.209,M,.478,M,-.138,M,.439,M,13.442,M,11.188, M 

In the raw data, I found that only 202 of the 288 daily values were actually captured on this day. When quickly checking a day, I look at how many rows are returned by a Transact-SQL statement. If that number is much less than the number of observations that should fall in a day, that’s an immediate indication that the day will be an “M”.

    select * from OURDATABASE where TmStamp > '2013-08-19' and TmStamp < '2013-08-20'
    (202 rows affected)

Final thoughts

For the second part of this, how to “go the other way” and return from high-resolution speed and direction to a daily resultant, check here. It’s clearer, and in friendly Python!

In short I have learned from this:

  1. Vectorization is not a great approach for dealing with discrete date-stamped data. It’s great for flux math, but we’re not doing flux math.
  2. There are a ton of coordinate systems and no one seems to have written a good package to convert between them (Javascript, I’m looking at you, I think I see a mini-project in my future to learn you).
  3. When in doubt and working in the enterprise, it’s best not to just go changing methods. First, check with your data manager. If you’re really eager, ADD NEW METHODS but do not remove the old ones! Memory isn’t free but it’s cheap. The adage of “it’s easier to seek forgiveness than permission” IS NOT PART OF DATA MANAGEMENT.

Book references

Foken, Thomas. 2008. Micrometeorology. Springer, Berlin.