Perl Programming

HOME HTML Tree Enhancing vi Graphical df Search Subdirs CPU Hogs


About this site...
Enhancing vi
Graphical df
Search Subdirs
CPU Hogs

Super Sorts for Flat Files

We who have to deal with flat files and databases often need sort algorithms which go beyond the simple alphabetic list sort. When flat file data arrives in an ordering other than what we need, a clever sorting algorithm is required to rearrange the file. Flat files are typically filled with "hard spaces", meaning that the data for each logical column starts at a fixed column position in the file. If the data doesn't extend to the next logical column, it is "filled" with whitespace characters to the start of the next column.

A different form of sorting is required if a tab delimited flat file is to be used. This occurs in particular if a spreadsheet program (such as Excel) is used to generate the text file. In either case, Perl can be used to sort the flat file using the built in sort function.

The Basics

O'Reilly's Perl Programming discusses the sort function in detail. The sort function takes up to two arguments, although it can use just one - the list of items to sort. More typically, there are two arguments, and the first one describes how we would like our data to be sorted.

This description takes the form of a subroutine or a block, which is a series of statements enclosed in braces. The statements, whether they're in a subroutine or a block, must state how we would like our data sorted in terms of two variables, $a and $b.

Although Programming Perl doesn't go into it at length, the variables $a and $b are used to describe the kind of sorting we want to perform; we do not use these variables directly. Consider the simple form

sort { $a <=> $b; } @listOfNumbers;

The <=> operator performs a comparison with a signed result on numeric data. If the first variable is less than the second, a -1 is returned. If the two values are equal, a zero is returned. If the second value is greater, a 1 is returned. The operator cmp performs the same operation for alphanumeric data. Both of these operators are just what is needed for sorting comparisons.

Flat File Sorting

Suppose we have a simple flat file, sorted in no particular order, containing job numbers, employee names, and car models. The file would look something like

   100  Thomas      Caprice
   101  Perkins     Sundance
   102  Stallman    Accord
   103  Taylor      Tempest

Sorting these items by the employee's names by alphabetical order requires that we be able to inform Perl's sort function where to find the column in the file. There is a very simple way, using nothing more than vi, to find the column positions we need to locate. Create a short text file that has the following:


Give this file a memorable name such as sort-guide.txt and put it in a directory for useful items. Note that this guide can be made much longer if the flat files to be sorted are very wide.

Next, edit the flat file, and using vi's :r command, copy the sorting guide into the text file:

   100  Thomas      Caprice
   101  Perkins     Sundance
   102  Stallman    Accord 
   103  Taylor      Tempest

By reading the guide's numbers down, we see that the employee name starts in column 8, and ends in column 19 (at most).

Looking in Programming Perl divulges that the substr is designed to extract strings from within larger strings using a starting position, and the length of the string to extract. This is just what we need. In fact, the form

substr $text_string, 8, 12;

accurately specifies our employee name column.

As mentioned, we need to specify our sort routine in terms of two variables, $a and $b. In this example, these variables would contain two lines from our flat file, which we want to sort alphabetically using the employee's name.

To fully specify our sort, then, we'd use ((substr $a, 8, 12) cmp (substr $b, 8, 12)), and a simple Perl script to perform this sort is

    # sort flat file by employee name starting in column 8
    sub byName { ((substr $a, 8, 12) cmp (substr $b, 8, 12)); }
    # initialize loop counter
    $i = 0;
    # read input file into array of rows
    while (<>) { $filerecs[$i++] = $_; }
    # sort the array of rows using name sort into sorted_recs array
    @sorted_recs = sort byName @filerecs;
    # write sorted array to standard output
    foreach $line (@sorted_recs) { print $line; }

Note that this script assumes that the filename will be a command line parameter. There are a few items in this script worth noting. Its first step is to read the input records into an array using while (<>), which is Perl's way of saying for each record of each file listed in the command line parameters (this construct is fully defined in the Perl manual). Next, we sort the array using the sort defined in the subroutine byName. Lastly, we list out the results using foreach.

A More Complex Flat File Sort

In the database world, we often need to sort our input data using a very specific set of columns, or key fields, in order to perform a one-to-one comparison against data from our database. A flat file to be used for this kind of comparison must be sorted on the data in several logical columns instead of just one.

We can still use substr to identify the columns we need to sort on, but in this situation, we have to rebuild each line, so that the columns that must be sorted first are at the front of the line; then followed by those that are sorted second, then third, out to however many columns are necessary for the correct sorting order.

Perl's join function can be used to combine our substrings in the appropriate order for our sort. Here's an example of the sort function for a more complicated flat file sort:

    #   build the key strings for comparison.
    sub specificSort 
            join( "", substr( $a,  18,  8 ),     # job number
                      substr( $a,  26,  8 ),     # quote number
                      substr( $a,  83,  4 ),     # option number
                      substr( $a, 135, 10 ),     # part application
                      substr( $a,  91,  2 ),     # part type
                      substr( $a,  93, 22 ),     # part number
                      substr( $a, 155, 10 ),     # product line
                      substr( $a, 252, 10 ),     # market program
                      substr( $a, 266,  1 ))     # special offer
            join( "", substr( $b,  18,  8 ),
                      substr( $b,  26,  8 ),
                      substr( $b,  83,  4 ),
                      substr( $b, 135, 10 ),
                      substr( $b,  91,  2 ),
                      substr( $b,  93, 22 ),
                      substr( $b, 155, 10 ),
                      substr( $b, 252, 10 ),
                      substr( $b, 266,  1 ))

Comments identify the logical columns that are being used to construct the sort strings. The rest of the script using this sort subroutine is identical to that in the previous example. Note that the first argument to join is empty here; this field usually holds a character or string to insert between each item being combined. In this case, alphanumeric sorting order is preserved by keeping join from inserting anything between the columns.

Sorting Tab-Delimited Data

A different approach is needed in order to sort a tab-delimited flat file. Our reference book reveals a useful function, split, which can change a line of text into an array of strings, dividing the line based on a key character or string. In our case, we want to split the line based on the tab character, identified in Perl by '\t'.

Using split, we can sort our file using a uniquely different sorting approach:

sub specificSort2
    @first = split( '\t', $a );
    @second = split( '\t', $b );

    $compare = ( $first[2] cmp $second[2] );           # job number
    if ( $compare != 0 ) { return ( $compare ); }

    $compare = ( $first[6] cmp $second[6] );           # period
    if ( $compare != 0 ) { return ( $compare ); }

    $compare = ( $first[3] cmp $second[3] );           # product
    if ( $compare != 0 ) { return ( $compare ); }

    $compare = ( $first[4] <=> $second[4] );           # system price
    if ( $compare != 0 ) { return ( $compare ); }

    $compare = ( $first[5] <=> $second[5] );           # net price
    return ( $compare );

Note that this sort compares numeric data (the price columns) as well as alphanumeric. This sort also has several exit-points instead of one.

Instead of joining the columns back into one string for comparison, it simply compares the individual columns until they are sorted, starting with the first, then the second, etc., until all sorting is complete. This has the potential to be a much more efficient sorting algorithm.

As with the previous sort, the rest of the sorting program needed to produce a sorted file is the same as with the first example.

There's more to Perl sorting than what's been given here. Programming Perl has more examples, and there are plenty more to be found on the web. Enjoy!

The above text and scripts are Copyright 1997 © by Scott Chilcote.

HOME HTML Tree Enhancing vi Graphical df Search Subdirs CPU Hogs