Lesson 5: Sorting Dates

In this lesson we sort strings that we may find in DATE tags if we can make sense of them.

The motivation here is that eventually we would like to order arbitrary events and attributes which have dates. Logic, of course, will help eliminate the need for comparing dates of some events. The BIRT event should come before the CHR event, we hope the DEAT event comes before BURI or CREM. But if we are doing family history and not just proving up studbooks, the events in between are what really interest us. We want to get them in order and we want to compare even the easy ones between individuals.

Executive Summary

sort() can sort a list, but sorting values of different types may produce surprising results. When two arguments are given to sort(), both lists are sorted according to the values in the second list. stddate() can make sense of Standard-compliant date strings (and many nonstandard dates) to produce a single string which can be used for sorting dates.

Simple Sorting

Lifelines can sort lists. Lists can contain any value type that Lifelines has, but it is not clear that sorting lists of different types of variables is very useful, so for the most part we will be interested in sorting lists composed of values of the same type. The value type affects the way the sort works out. This can be illustrated with the following report:


/*
 * @progname    Sort Demo
 * @version        1.0
 * @author         YOUR NAME
 * @category       programming tutorial
 * @output         STDOUT
 * @description    Tutorial
 
To sort a list of strings

*/

proc main(){
list(numers)
enqueue(numers,"8")
enqueue(numers,"10")
enqueue(numers,"9")
enqueue(numers,"25")
enqueue(numers,"3")
enqueue(numers,"100")
print(nl())
forlist(numers,item,n){
print(item,nl())
}
sort(numers)
print(nl())
print(nl())
forlist(numers,item,n){
print(item,nl())
}
}

The outcome here will not be very surprising to anyone who has ever seen a listing of a directory in which many files have the same name except for an appended or prepended number. 10 and 100 come before 25, and 25 comes before 3, because strings are sorted by considering the first (leftmost) character and then the one to the right of that and so forth. When a string runs out of characters, it is less than any string it was tied with that has more characters.
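The same effect can be seen outside Lifelines. What follows is a minimal sketch in Python, not Lifelines report code. It assumes only that strings are compared character by character from the left, as described above. The variable names are made up for the illustration.

```python
# The same six values as in the report above, as strings.
as_strings = ["8", "10", "9", "25", "3", "100"]

# Strings are compared one character at a time from the left,
# so "100" sorts before "25" and "25" sorts before "3".
sorted_strings = sorted(as_strings)

# Converting to integers first gives the numeric order.
sorted_integers = sorted(int(s) for s in as_strings)

print(sorted_strings)   # ['10', '100', '25', '3', '8', '9']
print(sorted_integers)  # [3, 8, 9, 10, 25, 100]
```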

As an exercise, remove the quotation marks from the values in the enqueue() functions in the program above to show that integer values are sorted correctly. (Remember you can only print strings, so use the d() function as you see fit.)

Tandem Sorting

Sorting in itself is useful up to a point. No doubt it would be useful to have a sorted list of the names in your database. (For an advanced exercise, write a report to produce that [Answer not given].) But it is much more useful when you can sort two things together. We have used the sort() function with one argument, but it can take two. Here is an example of that:


/*
 * @progname    Sort Demo
 * @version        1.2
 * @author         YOUR NAME
 * @category       programming tutorial
 * @output         STDOUT
 * @description    Tutorial
 
To sort two lists

*/

proc main(){
list(ints)
enqueue(ints,8)
enqueue(ints,10)
enqueue(ints,9)
enqueue(ints,25)
enqueue(ints,3)
enqueue(ints,100)
list(words)
enqueue(words,"eight")
enqueue(words,"ten")
enqueue(words,"nine")
enqueue(words,"twenty-five")
enqueue(words,"three")
enqueue(words,"one hundred")
print(nl())
forlist(words,item,n){
print(item,nl())
}
sort(words,ints)
print(nl())
print(nl())
forlist(words,item,n){
print(item,nl())
}
}

What happened was that both lists were sorted according to the values of the second list. As an exercise, write a report to show that the second list was sorted, too.
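For readers who want to see the idea of a tandem sort stripped of Lifelines, here is a rough Python sketch. It assumes that a tandem sort amounts to pairing each item with its key and sorting the pairs by the key, which matches the behavior described above but is not taken from the Lifelines source. The names are invented for the example.

```python
ints = [8, 10, 9, 25, 3, 100]
words = ["eight", "ten", "nine", "twenty-five", "three", "one hundred"]

# Pair each word with its integer, then sort the pairs by the integer.
pairs = sorted(zip(ints, words), key=lambda pair: pair[0])

# Unpack the sorted pairs into two lists that are still in step.
sorted_ints = [number for number, _ in pairs]
sorted_words = [word for _, word in pairs]

print(sorted_words)
# ['three', 'eight', 'nine', 'ten', 'twenty-five', 'one hundred']
```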

What is the largest integer that card() will convert into a word string? When card() gets an integer that is bigger than that, what does it do?

As an advanced exercise, write a report to output a sorted list of REFNs for individuals with their full names and a sorted list of fullnames with their REFNs.

Date Problem

Obviously sorting dates would not be too much of a problem if we could associate a particular number with every kind of string we found in a date tag. To make this problem even simpler, we are going to assume that all dates belong to the Gregorian calendar. You do not actually have to go back too far in genealogy to find dates on documents that are not expressed in the Gregorian calendar. If you could find George Washington's birth certificate, you might not recognize his birthday (assuming you thought you knew what it was in spite of the observance being collapsed into President's Day and being moved to make a convenient three-day weekend). However, the conversion happened at different times in different places and there is no certainty that everyone who made written records in a given place adopted the Gregorian system when he was supposed to. This is a can of worms we do not need to open. We are just going to assume that you did the research and converted the dates if necessary, so everything is in the Gregorian calendar.

If all of our dates were exact and in the Gregorian calendar we could easily find a way to sort them. We could make a string composed of the year in 4 digits with the number of the month in two digits appended and the number of the day of the month in two digits appended to that. So 5 March 1889 would be "18890305". And we could sort on strings like that so long as the years had 4 digits. If the year were greater than zero, but had fewer than four digits, we could get Lifelines to do the heavy lifting of converting the strings to integers and sort on the integers. (If your genealogy reaches back before 1 CE, I am impressed and you can figure out how to handle your own dates.)
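The construction of that string can be sketched in a few lines of Python. This is only an illustration of the idea, not Lifelines code, and the function name sort_key is hypothetical. Padding the year with leading zeros is one way to keep years of fewer than four digits sortable as strings, as an alternative to converting to integers.

```python
def sort_key(year, month, day):
    """Return an 8-character key: 4-digit year, 2-digit month, 2-digit day."""
    return f"{year:04d}{month:02d}{day:02d}"

keys = [sort_key(1889, 3, 5), sort_key(985, 12, 25), sort_key(1889, 2, 28)]

print(sort_key(1889, 3, 5))  # 18890305
print(sorted(keys))          # ['09851225', '18890228', '18890305']
```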

The string (or integer we got by converting the string) is not really an arithmetic number — meaning even if you converted it to an integer, you could not perform arithmetic with it in a meaningful way. For example, you cannot tell the number of days between 1 Mar 1999 and 1 Apr 1999 by subtracting because you would have 19990401 - 19990301 = 100, which is clearly wrong. You cannot tell the date that is 50 days after 30 Jun 2001 by addition, because 20010630 + 50 would be 20010680, or 80 Jun 2001. Similar problems exist with finding the date halfway between two given dates, and so forth. But if we sort on this number, it will get dates in the proper order, and for now that is all we want. Even if we counted the days from the day 0 in some calendar, that would not entirely solve the problem of doing arithmetic with dates because no calendar began with the beginning of time.
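The two arithmetic examples can be checked with Python's standard datetime module, which does know how long the months are. This sketch is only a comparison of the naive arithmetic with calendar arithmetic. It is not something Lifelines does for us.

```python
from datetime import date, timedelta

# Subtracting the sort numbers gives 100, which is not a count of days.
naive_difference = 19990401 - 19990301

# Calendar-aware subtraction gives the real number of days.
real_difference = (date(1999, 4, 1) - date(1999, 3, 1)).days

# Adding 50 to the sort number gives a date that does not exist.
naive_sum = 20010630 + 50

# Calendar-aware addition gives a real date.
real_sum = date(2001, 6, 30) + timedelta(days=50)

print(naive_difference, real_difference)  # 100 31
print(naive_sum, real_sum.isoformat())    # 20010680 2001-08-19
```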

But we have a way to get a number that will give us order in the Current Era — if we had complete exact dates to begin with. But we do not.

The Standard allows various stuff in date fields, including: ABT, EST, CAL, BEF, AFT, BET, AND, FROM, TO. I may have missed one or two there. We can expect all of those in lowercase or title case or spelled out. We can also expect some nonstandard stuff like UKN, unknown, question marks, b and a, blank fields, dashes, hyphens, and periods used willy-nilly with things that might be abbreviations or not. And, no doubt there will be strings that we cannot reasonably be expected to compute with such as "just after Harry died." (Actually, we probably could work with that last one, but we will not do it here.)

How much of this stuff do we need to parse? Maybe none of it. If we are willing to accept using the earliest date in a range, period, or approximation as the date for sorting and are willing to pass on some possible nonstandard strings, the function stddate() can handle an amazing number of forms. Run this torture test to see how stddate() does:


/*
 * @progname    stddate Demo
 * @version        1.4
 * @author         YOUR NAME
 * @category       programming tutorial
 * @output         STDOUT
 * @description    Tutorial

Torture test of stddate()

*/

proc main(){

dateformat(11)
yearformat(1)
monthformat(1)
dayformat(1)
list(dates)
enqueue(dates,"bef 1948")
enqueue(dates,"1947")
enqueue(dates,"aft 1948")
enqueue(dates,"01 May 1948")
enqueue(dates,"frm 15 Jun 1948")
enqueue(dates,"1 January 1949")
enqueue(dates,"bet 30 Mar 1936 and 1 January 1949")
enqueue(dates,"April 12 1939 -- 1941")
enqueue(dates,"7/22/1942")
enqueue(dates,"25/12/1941")
enqueue(dates,"4/3/1948")
enqueue(dates,"12-7-1941")
enqueue(dates,"1941-12-7")
enqueue(dates,"1941-7-12")
enqueue(dates,"1940--1942")
enqueue(dates,"1936-1937")
enqueue(dates,"1987?")
enqueue(dates,"c. 1966")
enqueue(dates,"c.1957")
enqueue(dates,"unknown")
print(nl())
forlist(dates,date,n){
print(date,":  ",stddate(date),nl())
}
}

Consult the Manual for other date formatting options. These govern what stddate() will translate the date to. stddate() will try to make sense of any string you give it in whatever format. Eventually we are going to sort on the output of stddate() and we are going to sort that output as strings, so we want the leading zeros when each of the date parts comes up with fewer than the maximum digits. In many cases stddate() will only come up with a year because that is all there is in the date string. Dates with only a year will come out of the sort before any date in that year which has a month (so 1984 comes before Jan 1984), and a date with a year and a month will come before any date that has a day in that month (so Jan 1984 comes before 01 Jan 1984).
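Both points can be tried out with plain strings. The sketch below is Python, and the strings in it are invented for the illustration. They are not actual stddate() output, which may fill in the missing parts differently. It assumes only that a partial date comes out as something that sorts ahead of the complete dates it is a prefix of.

```python
# Invented sort strings: year only, year and month, full dates.
keys = ["19840101", "1984", "198401", "19831231"]
print(sorted(keys))  # ['19831231', '1984', '198401', '19840101']

# Without leading zeros the string order goes wrong:
# 5 Mar 1889 written as "188935" sorts after 12 Nov 1889.
unpadded = sorted(["188935", "18891112"])
padded = sorted(["18890305", "18891112"])

print(unpadded)  # ['18891112', '188935']
print(padded)    # ['18890305', '18891112']
```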

As an exercise, try stddate() on the worst strings you find in DATE tags in your live database [No answer given].

Things which might be in a sensible person's database that will break stddate() as we are using it (if you do not see examples to illustrate some of these points, put them in the above and see):

If you have some pet idiocies in DATE tags, you should see how they fare. The list of ways to make stddate() fail is really endless, but the question is: do we really need to try to fix any or all of them? I am going to say "no" here because it makes the problem a hell of a lot easier. Many of the date strings that fail would confuse even human eyes. Sure, there are a couple of things I would like to change: I probably have things like c.1858 and 1940? in my database now, and when I write 4/3/1994 I mean the third of April. I could probably write a filter to fix the first kind of problem before I let stddate() have a crack at a string, but the second one could be very deep and would break my report program for people who think 4/3/1994 means the opposite of what I think it does.
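To see why the 4/3/1994 problem is deep, here is a short Python sketch using the standard datetime module. It has nothing to do with how stddate() works internally. It only shows that the same string parses cleanly under both conventions, so no program can tell which one the writer meant.

```python
from datetime import datetime

text = "4/3/1994"

# Read as month/day/year, the string is 3 April 1994.
month_first = datetime.strptime(text, "%m/%d/%Y").date()

# Read as day/month/year, the same string is 4 March 1994.
day_first = datetime.strptime(text, "%d/%m/%Y").date()

print(month_first.isoformat())  # 1994-04-03
print(day_first.isoformat())    # 1994-03-04

# A day above 12 is not ambiguous: only one reading parses.
try:
    datetime.strptime("25/12/1941", "%m/%d/%Y")
    reading = "month first"
except ValueError:
    reading = "day first"

print(reading)  # day first
```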

The bottom line is: if you use GEDCOM-compliant dates and are happy to sort date ranges and periods so that they sort with the earliest date in the range, the problem is practically solved. stddate() is so smart that it would be a lot of work to build something to replace it without much guarantee that the replacement would provide a noticeable improvement.

As an exercise, write a report to sort the dates as given in the example above. [Answer follows immediately].


/*
 * @progname    Date Sort Demo
 * @version        1.5
 * @author         YOUR NAME
 * @category       programming tutorial
 * @output         STDOUT
 * @description    Tutorial

Sort a list of dates

*/

proc main(){

dateformat(11)
yearformat(1)
monthformat(1)
dayformat(1)
list(dates)
enqueue(dates,"bef 1948")
enqueue(dates,"1947")
enqueue(dates,"aft 1948")
enqueue(dates,"01 May 1948")
enqueue(dates,"frm 15 Jun 1948")
enqueue(dates,"1 January 1949")
enqueue(dates,"bet 30 Mar 1936 and 1 January 1949")
enqueue(dates,"April 12 1939 -- 1941")
enqueue(dates,"7/22/1942")
enqueue(dates,"25/12/1941")
enqueue(dates,"4/3/1948")
enqueue(dates,"12-7-1941")
enqueue(dates,"1941-12-7")
enqueue(dates,"1941-7-12")
enqueue(dates,"1940--1942")
enqueue(dates,"1936-1937")
enqueue(dates,"1987?")
enqueue(dates,"c. 1966")
enqueue(dates,"c.1957")
enqueue(dates,"unknown")
list(sortstr)
forlist(dates,date,n){
enqueue(sortstr,stddate(date))
}
sort(dates,sortstr)
print(nl())
forlist(dates,date,n){
print(date,nl())
}
}

For Further Thought

Although it probably would not be worthwhile to try to improve on stddate()'s basic date crunching ability, it seems possible to write a filter to remove some things that are most likely to be in real DATE tags that tend to confuse stddate(), such as ? next to date parts and dashes in date ranges. We may also want to think about the easy-peasy way in which we accepted that the earliest possible date in a date range, date period, or approximate date is the right one to use in sorting dates. For example, if we have only 1966 for a date, that date will sort to before 01 Jan 1966. There really is not such a date in 1966. Would 01 Jul 1966 be a better approximation for something we know happened in 1966, but we do not know the day?
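As a starting point for thinking about such a filter, here is a rough sketch in Python rather than in the Lifelines report language. The function clean_date and its rules are hypothetical. They cover only the question marks, the "c." prefix, and the dashed ranges from the torture test, and they keep the first date of a range, which is the same choice the lesson made.

```python
import re

def clean_date(text):
    """Hypothetical prefilter: tidy a date string before it is parsed."""
    # Drop question marks attached to date parts, as in "1987?".
    text = text.replace("?", "")
    # Drop a leading "c." or "c" (circa) that is followed by a digit.
    text = re.sub(r"^\s*c\.?\s*(?=\d)", "", text, flags=re.IGNORECASE)
    # Keep only the first date of a dashed range: two or more dashes,
    # or a single dash sitting between two 4-digit years.
    text = re.split(r"\s*-{2,}\s*|(?<=\d{4})\s*-\s*(?=\d{4})", text)[0]
    return text.strip()

for raw in ["1987?", "c. 1966", "c.1957", "1940--1942",
            "1936-1937", "April 12 1939 -- 1941", "1941-12-7"]:
    print(raw, "->", clean_date(raw))
```

A date such as 1941-12-7 passes through untouched because its dashes do not sit between two 4-digit years.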

A few people do graduate in January, but generally, the best guess for a GRAD event would be 01 Jun of whatever year. Strings which have two dates in them are either date ranges or date periods. "We moved to Davenport between 1958 and 1960" means that there was an event (the moving) which happened on one day (or maybe over a few days). A good guess for when that happened would be some particular day in 1959, such as 01 Jul. It certainly did not happen the day in 1958 before 01 Jan 1958. On the other hand, "We lived in Davenport from 1958 to 1960," means something else entirely. We could devise a filter to detect when two dates are in a date string and break the string up so we could send both dates to stddate() but we would still have to decide from context how to represent the results as a single string for sorting.
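One possible way to turn a range into a single sortable date is to take the day halfway through it. The Python sketch below shows that idea under the assumption that only years are known. The helper midpoint is hypothetical and is not part of Lifelines.

```python
from datetime import date, timedelta

def midpoint(first_year, last_year):
    """Hypothetical helper: the day halfway through a span of whole years."""
    start = date(first_year, 1, 1)
    end = date(last_year, 12, 31)
    half = (end - start).days // 2
    return start + timedelta(days=half)

moved = midpoint(1958, 1960)
sometime_in_1966 = midpoint(1966, 1966)

print(moved.strftime("%Y%m%d"))             # 19590702
print(sometime_in_1966.strftime("%Y%m%d"))  # 19660702
```

Whether the midpoint is the right choice still depends on context: it suits a date range for a single event, but not a date period such as the years lived in Davenport.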

Unknown, blank, or nonsense dates sort to before day 1 on whatever calendar we are using. We can keep that from making utter nonsense of a timeline with some basic facts of life: BIRT events come first, BURI or CREM come last — or at least after DEAT. And everything that happens in between has to happen in between. Sensible human beings know all that, but the point of saying it is that we want reports that will not be absurd to human beings who read them.


©Copyright 2009 by Lars Eighner. Original material may be copied for personal use, but may not be sold, made available contingent on the payment of any fee or access charge and may not be bundled in any product which is sold for a fee or media charge or which requires any payment for access. In short, you cannot charge money for material I have made freely available. Software and other products mentioned may be trademarks belonging to their respective owners.


Answers to Exercises

/*
 * @progname    Sort Exercise
 * @version        1.1
 * @author         YOUR NAME
 * @category       programming tutorial
 * @output         STDOUT
 * @description    Tutorial
 
To sort a list of integers

This is one way of doing it.  You could also convert the items to
strings after the sort but before printing.

*/

proc main(){
list(numers)
enqueue(numers,8)
enqueue(numers,10)
enqueue(numers,9)
enqueue(numers,25)
enqueue(numers,3)
enqueue(numers,100)
print(nl())
forlist(numers,item,n){
print(d(item),nl())
}
sort(numers)
print(nl())
print(nl())
forlist(numers,item,n){
print(d(item),nl())
}
}


/*
 * @progname    Sort Exercise
 * @version        1.3
 * @author         YOUR NAME
 * @category       programming tutorial
 * @output         STDOUT
 * @description    Tutorial
 
To sort two lists and show they are both sorted.

*/

proc main(){
list(ints)
enqueue(ints,8)
enqueue(ints,10)
enqueue(ints,9)
enqueue(ints,25)
enqueue(ints,3)
enqueue(ints,100)
list(words)
enqueue(words,"eight")
enqueue(words,"ten")
enqueue(words,"nine")
enqueue(words,"twenty-five")
enqueue(words,"three")
enqueue(words,"one hundred")
print(nl())
forlist(words,item,n){
print(item,nl())
}
sort(words,ints)
print(nl())
print(nl())
forlist(words,item,n){
print(item,"=",d(getel(ints,n)),nl())
}
}


Twelve. Given a larger integer, card() converts the integer into a string of numerals.

Answers to Selected Advanced Exercises


/*
 * @progname    REFN fullname indexes
 * @version        1.2.a
 * @author         YOUR NAME
 * @category       programming tutorial
 * @output         STDOUT
 * @description    Tutorial

Output a sorted list of REFNs for individuals with their full names
and a sorted list of fullnames with their REFNs.

Although this may not look like much when run with a sparsely populated test
database, this is likely to be a valuable report that should be worked up to
write to files and kept to run with your "live" database.

*/

proc main(){

list(names)
list(refns)
forindi(indi,k){
enqueue(names,fullname(indi,0,0,50))
/* clear refn before the loop so it is empty when no REFN is found */
set(refn,"")
fornodes(inode(indi),node){
if(eqstr(tag(node),"REFN")){
set(refn,value(node))
break()
}
}
enqueue(refns,refn)
}
sort(names,refns)
print(nl())
forlist(refns,item,n){
print(item," ... ",getel(names,n),nl())
}
print(nl(),nl())
sort(refns,names)
print(nl())
forlist(names,item,n){
print(item," ... ",getel(refns,n),nl())
}
}