This tutorial will introduce various tools offered by PostgreSQL, and SQL in general – like custom functions, window functions, aggregate functions, WITH clause (or CTE for Common Table Expression) – for the purpose of implementing a program which imputes numeric observations within a column applying linear interpolation where possible and forward and backward padding where necessary. I’m going to progressively add and explain those constructs, step by step, so no problem if you are new to the scene. I am very much interested in input regarding potential downsides of the implementation and possible improvements.
Imputing Observations
A standard situation for somebody working with data is being faced with missing observations. Depending on the type of observation imputation might be possible. The most simple methods of padding those gaps are to forward or backfill observations up and down an ordered and related sequence of records. This is often referred to as LOCF (last observation carried forward) and FOCB (first observation carried backward). Another often reasonable method is to assume a linear transformation between observations.
(GitHub / SQL Fiddle – at the beginning of every section you will find a link to the code on GitHub and an SQL Fiddle session)
We start with a table ( tbl) of four columns t::float, a::int, b::int and v::float. a and b hold the keys for identifiying related records – this is what we will partition over. t represents the time and v the observations that we want to impute. For example, think of b as a town in country a where temperature v was measured on 12pm on date t.
(GitHub / SQL Fiddle)
An LOCF is at the end of the day just a COALESCE(b,a) of two observations with a having been observed before b. If b is not NULL we take it, if it is we use the value of the observation before, i.e. a. If we perform those steps iteratively starting with an initial observation (initial condition) of NULL then we will inductively fill all gaps after stumbling upon a first non-NULL observation. The following function implements this mechanism, but first we will apply it in a non-aggregating way. The script will just add another column v_or_t which holds v in case v IS NOT NULL and t if v IS NULL.
CREATE OR REPLACE FUNCTION fun(a FLOAT, b FLOAT)RETURNS FLOATLANGUAGE SQLAS ‘ SELECT COALESCE(b, a)’; select a,b,t,v, fun(t,v) as v_or_tfrom tblorder by a,b,t; |
RETURNS FLOAT
AS ‘
’;
select a,b,t,v,
from tbl
;
Here we CREATE OR REPLACE a FUNCTION named fun, which takes two FLOAT typed arguments a and b and RETURNS FLOAT as well. The function is implemented by means of the LANGUAGE SQL (instead of PL/pgSQL or some other language). An SQL programmed function will return the result of the last evaluated SELECT statement. The code itself has to be delimited by single quotes ( ‘) or a string like f.x. “$$” and follows after the keyword AS.
(GitHub / SQL Fiddle)
Now instead of applying the function (which I renamed to locf) horizontally within a record we tell PostgreSQL to apply it iteratively in a specified ORDER and within specified boundaries (the PARTITION) vertically across two records:
123456789101112131415161718 | create or replace function locf_s(a float, b float)returns floatlanguage sqlas ‘ select coalesce(b, a)’; drop aggregate if exists locf(float);CREATE AGGREGATE locf(FLOAT) ( SFUNC = locf_s, STYPE = FLOAT); select a,b,t,v, locf(v) over (PARTITION by a,b ORDER by t) as v_locffrom tblorder by a,b,t; |
2
4
6
8
10
12
14
16
18
returns float
as ‘
’;
drop aggregate if exists locf(float);
SFUNC = locf_s,
);
select a,b,t,v,
from tbl
;
As you can see the process of creating the LOCF function is now split into two steps. First we define the transformation of two values into one ( locf_s) and then we create the actual aggregating function ( locf) by specifying how to apply the transformation. Here we just define what in this context is referred to as the “state transition function” ( SFUNC) which returns a value of the “state type” ( STYPE) FLOAT.
In the final query we apply locf() over the PARTITION defined by a and b and traverse the records by ordering t ascending.
(GitHub / SQL Fiddle)
Now we go one step further and fill all the missing oberservations by padding a missing oberservation preferredly with the last seen value and if that is not possible we use the first available value. Technically we fill one column v_locf with the LOCF values, another column v_focb with the FOCB values and then choose for v_final the value from v_locf if it is not NULL and otherwise the value from v_focb.
select a,b,t,v, locf(v) over t_asc as v_locf, locf(v) over t_desc as v_focb, COALESCE( locf(v) over t_asc, locf(v) over t_desc ) as v_finalfrom tblWINDOW t_asc as (partition by a,b order by t), t_desc as (partition by a,b order by t desc)order by a,b,t; |
locf(v) over t_asc as v_locf,
COALESCE(
locf(v) over t_desc
from tbl
t_asc as (partition by a,b order by t),
order by a,b,t
First of all, the WINDOW keyword simply allows us to to name an ordered partition and then use it in the corresponding select query. Second of all, the FOCB mechanism is just LOCF applied to the inversely ordered partition ( t_desc). And finally, it might be worth noting that you can use multiple aggregate functions within another function – COALESCE in this case.
(GitHub / SQL Fiddle –* if you think SQL Fiddle is pretty cool – why not donate a few bucks to support the project?!*)
This query is composed of two main steps. First we unite all the necessary values spread across multiple rows, so those values are available locally to every row and then we calculate row based the imputed value.
123456789101112131415161718192021222324252627282930313233 | WITH tbl0 as ( SELECT a, b, t, v, locf(v) over t_asc as v_locf, locf(v) over t_desc as v_focb, locf(case when v is not null then t else null end) over t_asc as t_locf, locf(case when v is not null then t else null end) over t_desc as t_focb from tbl window t_asc as (partition by a,b order by t), t_desc as (partition by a,b order by t desc)),tbl1 as ( SELECT a, b, t, v, v_locf, v_focb, t_locf, t_focb, case when t_focb != t_locf and v_locf is not null and v_focb is not null then v_locf + ( (v_focb - v_locf) * (t - t_locf) / (t_focb - t_locf) ) else coalesce(v_locf, v_focb) end as v_final from tbl0 order by a,b,t)SELECT * from tbl1 order by a,b,t; |
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
32
tbl0 as (
locf(v) over t_asc as v_locf,
locf(case when v is not null then t else null end)
from tbl
t_desc as (partition by a,b order by t desc)
tbl1 as (
case
t_focb != t_locf
and v_focb is not null
v_locf + (
*
(t_focb - t_locf)
else coalesce(v_locf, v_focb)
from tbl0
)
The WITH clause is simply a way to chain temporary tables. The result of the first SELECT is named tbl0 and used within the second SELECT whose result set is named tbl1 and referred to by the final (technically not necessary) SELECT query. The statements formulated using a WITH clause are also referred to as CTEs or Common Table Expressions. They can make complex queries more readable and allow the formulation of recursive queries, for example to traverse a tree. CTEs are controversial with respect to their performance (“optimization fence“).
Now with the different parameters assembled we can interpolate with the following simple formula:
Now if we subsitute above’s last SELECT query with the following UPDATE statement then the imputed set of values for v are directly written to tbl:
update tbl as tbset v = t1.v_finalfrom t1where tb.a = t1.a and tb.b = t1.b and tb.t = t1.t |
set v = t1.v_final
where
(original article published on www.joyofdata.de)