Early Access
Like the software, this book is still a work in progress, so some sections are marked as "missing content". Check back in future for further content.
Introduction
Columnal is a tool for processing tables of data. While it is intended to be operatable without reading a giant tutorial first, it does help at several points to understand the underlying concepts behind the software. Data processing can be complex, and no amount of software interface design can substitute for understanding useful concepts of data processing, and types in particular.
This mini book is intended to be a reference to explain some of the concepts in Columnal, which will help when designing data tables and processing that data. It is not intended as a full tutorial on how to operate the software.
Types
Why types? (missing content)
Missing content.
Tagged types
What are tagged types?
Tagged types allow you to specify that a value can have several different options. Whenever you describe data as being a limited choice between something or something else, the "or" is a sign you may want a tagged type. For example:
-
A car may be left-hand drive or right-hand drive.
-
A person’s height may have been recorded in centimetres or in feet and inches.
-
A survey response may be missing or have a numeric score.
These represent three major uses of tagged types (categories, choices, and optional), which will be covered in the next three sections. To understand why tagged types are useful, it can help to look at the alternatives.
A researcher may collect data from a patient such as age. But there may be a participant who does not know their age, which means a mechanism is needed to record "unknown". But they have chosen the Number type for age. The researcher decides to use a dummy numeric value to mean missing. They pick 99 as a code for missing. But later they may encounter a patient who is actually 99; it becomes impossible to distinguish the magic number 99 from the real data 99. Later they use a better code, 999, but then they calculate an average of this column and the one 999 value among many patients secretly inflates the average without them spotting the issue.
The fix is to use a tagged type. Age is either a number or it is a special missing tag. This eliminates confusion on entry and forces the analysis to confront the issue of missing data.
Tagged types for categories
Many data columns have a very limited set of values they can take. For example, day of the week only has seven possible values; a soccer result can only be win, loss or draw; a product might be classified into available, out of stock and discontinued. Tagged types are ideal for these scenarios. The type contains a list of the individual tags.
A market researcher is asking people whether they prefer a new product in red or black. The data is recorded into a Text column, which ends up with values such as "Red", "red.", "R" and so on. When it comes time to analyse the results, effort must be spent on classifying each text response.
The fix is to use a tagged type with two possibilities, Red and Black. That way no other data can enter the system and cause such confusion.
Tagged types for choices
Missing content.
Tagged types for optional items
Missing content.
Units
Numbers are often a measure in a particular unit. You might have a price in dollars, a length in centimetres, a duration in seconds, an angle in degrees, or a staff count in number of people.
Columnal lets you explicitly state the units for a particular column. This gives several benefits:
-
If you add two numbers, they are required to have the same unit. This prevents accidents such as adding a length in centimetres to one in inches, or adding a duration in seconds to one in minutes.
-
There are built-in functions to convert between units, so you can convert a distance in miles to kilometres without needing to look up the conversion yourself.
-
Units are tracked through multiplication in division. If you divide a number in metres by a number of seconds, the result is automatically marked as metres per second.
Several units are provided by default, and you can also create your own units.
Creating units
Missing content
Converting between units
Missing content
Date and time types
Columnal has a set of types specifically to deal with dates and times. Some of these types require some further understanding - time zones in particular are surprisingly complex.
Local times and dates
The Time
type is straightforward: a Time
is a time of day, like 03:10:23 or 14:15. The time stores seconds and fractions of seconds, although this may be zero. Internally Columnal deals with the 24-hour clock, although conversion functions can deal with AM/PM times instead. Only valid times are allowed: 35:15
The Date
type is a date such as 8th February 2002. The main problem with dates is that different countries represent dates differently: 3/9/19 is in March to an American but in September to a Brit. Internally Columnal uses the ISO 8601 date representation, which uses year-month-day, so would represent 8th February 2002 as 2002-02-08. Conversion functions can deal with (English) month names, which are preferred to ambiguous date formats such as the 3/9/19 above. Only valid dates are permitted: you cannot have 40th January 2006 or 29th February 2019 (which was not a leap year).
Sometimes you only need a year and month, for example if you are keeping track of sales during a whole month, you don’t want a specific date like 1st June 2018, you just want June 2018. The DateYM
type is a date with just a year (the Y) and a month (the M). Internally Columnal uses the year-month format, like 2018-06.
The DateTime
type glues together a date and a time, for example 2005-03-21 14:24:02.993. It uses the internal representations for date and time discussed above.
Time zones
The DateTime
type has no mention of time zone. Often if you are dealing with date-times in one location the time zone is irrelevant. If you own a shop and are recording the time of sale, you may be fine to just assume it is always local time. However, if you have a data set of social media messages which are recording what time they are sent, and you want to order this, it’s important to know which time zone the date-time was in. That’s where DateTimeZoned
is useful.
DateTimeZoned
combines a date-time with a time zone. This allows for conversion to other time zones or correct comparisons of values across different time zones.
Record types
Sometimes a value will have multiple consituent parts. For example, a GPS position will have latitude and longitude. A colour may be represented as red, green and blue values. It can be useful to represent such compound values as a record type.
A record type is a collection of fields. Each field has a name and a type. So (latitude: Number, longitude: Number)
is a record type with two fields. One is called latitude, the other longitude, and they are both numbers.
There are two approaches to storing compound data like this. One approach is to use a separate column for each part, for example a Number column for longitude and a second Number column for latitude. The alternate approach is to use a record type that combines these two into a single column. The record approach is particularly useful if the record type will be used as part of another type, for example inside an Optional type or inside a list.
Lists
Lists are a sequence of items of the same type. For example, ["A", "B", "C"]
or [3, 2, 1]
. Lists are useful when you have a collection of items, but, unlike a record, you don’t know how many values you might have.
Columns of lists
One powerful feature that lists enable is the ability to have a list (or even nested lists) in each entry in a table column. This allows for more complex data to be easily represented in tables.
This feature is complicated to achieve in spreadsheets and databases because it is not possible to have an arbitrary number of values on each row.
Varying types
If you have a list where you need the items to have varying types, first define a tagged type with all the possibilities, then use a list of that tagged type.
Types: when to use which
The previous sections have explained the variety of types available in Columnal. This section provides a guide and examples of which type to use in different circumstances.
If you have a quantity or measurement use a Number
.
Some things that look like numbers should instead be stored as text. For example, phone numbers and ISBNs of books should be stored as text to preserve leading zeroes. A good rule is: would I ever add this value to another number? If the answer is no, storing as Text
may be the better option.
Text
is relatively straightforward, but be careful not to over-use it. If your text has a limited set of possible values, you may instead want to use Boolean
or a tagged type.
If you have something that is yes/no, on/off or true/false, Boolean
is suitable. For example, if you want to track if a signature has been received, whether or not an item is in stock, or whether an email has been read. Beware of overusing Boolean
, though - just because something has two possible values doesn’t mean it has to be boolean. If you want to record something like whether someone is left-handed or right-handed, you are probably better to use a tagged type with Left and Right tags, which is clearer than a boolean true/false where it is difficult to remember if true is left-handed or true is right-handed.
A tagged type is useful when you want to represent values which can be something or something else. There’s three main uses for tagged types:
-
Storing categories or types with a limited number of values. For example, storing whether a product is Available or Discontinued, whether a customer discount is None, Senior or Child.
-
Storing values which may have a choice of types. For example, the supplier for a part may either be a numeric identifier of a known supplier, or a text name of a one-off supplier.
-
Storing values which can be missing. The simplest case is to use the built-in Optional type which either has a value or the None tag. It can be useful to build your own such types when you need more information about why it is missing; for example in a survey you may want to distinguish did-not-answer from does-not-know when asking someone’s blood type.
A record type is useful if you need to group values together. For example, a global location is represented as latitude and longitude so it makes some sense to pair them into a record. This is especially true if you need to represent a location which may be missing, which can be represented with a record wrapped in an Optional. Frequently, record types make most sense to use when they are nested inside another type, such as a tagged type or a list.
A list is useful if you need to store a varying number of values. For example, you may store a list of regularly recorded heart rates for each workout, or a list of date-and-weight records for each weight-loss patient.
Data tables and transformations
Import and export (missing content)
Missing content.
Immutability
Immutability means that data values are not changed (it’s the opposite of mutable, which comes from mutate, meaning to change). Columnal is deliberately designed to support immutability. The ideal in Columnal is that you enter or import an initial data table (or tables), but thereafter that data is never modified directly.
Why is immutability useful?
In some spreadsheet software, once you sort some data, that operation cannot be reversed later on, especially after you have made further changes to the spreadsheet. This can cause serious problems if you need to restore the original data ordering. Columnal instead leaves the data in place, and creates a sorted copy. If you need to get the original data back, you still have it.
The original data being unchanged within Columnal also means that if you need to re-import the data (for example, because you have updated sales figures, or collected data from more participants) the other transformations will re-run automatically on the new data.
What if the data needs to be manually altered?
Although never altering the data sounds good, there are several circumstances where you may want to correct imported data. The data may contain a mistake that needs correcting. The same immutability principle still applies: rather than correcting the imported data table, the best thing to do is leave the mistake visible, but then use a "manual edit" transform to make the correction. This has two particular advantages:
-
The correction is made obvious in the file for later on, rather than hidden in an invisible direct edit.
-
If the data is re-imported, the correction is automatically re-applied to the new data, rather than being forgotten about.
TODO
Transformations
The last section explained the principle of immutability. All data processing in Columnal leaves the source data table(s) untouched and instead produces a new table with the results of the calculation. This processing is called a transformation, and there are several different types. They are explained in some more detail in the following sections, but summarised here:
-
Sorting copies the source rows and reorders them by values of a particular column(s).
-
Filtering copies only the source rows that match a given criteria (but does not reorder them).
-
Calculation calculates a new expression (like a formula in a spreadsheet) for each row in the source table.
-
Aggregate collects together rows that share the value in a particular column(s) and performs a calculation on that group of rows.
-
Concatenate joins two or more tables vertically, giving all the rows from the first source table, followed by all the rows from the second source table and so on.
-
Join joins two tables by matching values in a particular identifier column then joining those rows together.
-
Check calculates a boolean expression, and is useful for things like sanity checks on data (for example, are all prices above zero).
-
Excluding columns copies all rows, but only retains some of the columns.
Filter and sort (missing content)
Missing content.
Calculate (missing content)
Missing content.
Aggregate (missing content)
Missing content.
Concat and join (missing content)
Missing content.
Reshaping summary — what to use when (missing content)
Missing content.
Expressions
Expressions
An expression specifies the calculation of a value. For example, 1 + 2
is an expression; Price * 1.2
is another. Expressions are similar to formulas in spreadsheets. Expressions can be made up of several constituent parts:
-
Literals. These are single values such as
64.2
or"Hello"
with no further calculation needed. -
Operators. These are things such as
+
,*
and several more specific to Columnal. -
Functions. These are functions such as
text length
ormaximum
which perform useful commonly-needed calculations. -
Control structures. These are things such as if-then-else which perform a check and then choose which expression to evaluate.
-
Other structures. These allow defining new variables or new functions, which is useful in large expressions.
These parts are explained in more detail in the following sections. The important thing to first understand about Columnal’s expressions is that they are type-checked.
Type-checking
Every expression in Columnal has a type. For example, 482.245
has type Number
as does 3 / 4.5
, "Hello"
has type Text
, 1 < 2
has type Boolean
.
These types are checked when you write an expression. This is done to highlight mistakes as early as possible; a mistake has less impact it is fixed immediately rather than fixed later on. If the types do not check successfully, you will get an error. For example, 1 + "Hi"
is a mistake, because it makes no sense to add a number to text, and will give a type error.
Operators (missing content)
Missing content.
Expression Literals
A literal is a value that needs no further calculation. For example 3
is a literal, but 1 + 2
is not. As with every expression in Columnal, every literal has a type. This section briefly outlines the way to write literals of the different types.
Number literals
Numbers are written in the standard way, for example 34
, 1.2
, -36.7
. You must use dot as the decimal separator; the European style of using comma is not supported.
Writing numbers with exponent notation
There is no "E-notation" for numbers; you must write out all the digits. So Avogadro’s constant would be written 602214150000000000000000 as a literal. If you want to use E-notation, you can instead multiply by 10 raised to a power (although this is technically no longer a literal); the previous number can be written 6.02214076*(10^23)
.
Writing numbers with units.
Number literals can have units. These are written in curly brackets after the number. For example 10{m}
is 10 metres, 30{mile/hour}
is 30 miles per hour, 0.12{foot^3}
is 0.12 cubic feet.
Text literals
Text literals are written in double-quotes. For example "Hello"
, "Goodbye!"
. All text items are used exactly as they are written, including spaces, accented characters and so on, with the exception of a few escape characters.
Escape characters
Certain characters are escape characters and treated specially. If you want to write a double-quote in a text literal, you must write ^q
instead. If you want a line-break, you must write ^n
. If you want the ^
character itself, you must write ^c
to distinguish it from the other escapes.
Boolean literals
Boolean literals are very straightforward: true
is true and false
is false. They are case-sensitive: if you write True
you will get an unknown-variable error.
Date/time literals
Date/time literals are written in a special way. If you want to write a time, write it like this: time{11:21}
. There is some flexibility in how you write these literals — the best way to explain is through some examples:
-
dateym{April 2003}
-
dateym{Apr 2003}
-
dateym{04/2003}
-
date{10 April 2003}
-
date{2004-04-10}
-
date{Apr 10 2003}
-
time{08:57}
-
time{9:12:03 PM}
-
time{23:59:59.999999}
-
datetime{10 April 2003 08:57}
-
datetime{2010-11-28 9:12:03 PM}
-
datetime{Sep 3 1921 23:59:59.999999}
-
datetimezoned{10 April 2003 08:57+03:00}
-
datetimezoned{2010-11-28 9:12:03 PM Asia/Jakarta}
-
datetimezoned{Sep 3 1921 23:59:59.999999 Africa/Khartoum}
Functions (missing content)
Missing content.
Control structures (missing content)
Missing content.
Define (missing content)
Missing content.
User-defined functions (missing content)
Missing content.
Version control (missing content)
Missing content.