[TriLUG] Need some help parsing a file

Tom Barron tpb at dyncloud.net
Tue Dec 31 14:33:57 EST 2013


On Tue, Dec 31, 2013 at 11:46:45AM -0500, R Radford wrote:
> The tr solution would also reduce repeated spaces in the filename, so would
> not work in that (hopefully extreme, but legal) case.
> 

Yeah, good catch.

So let's modify the third line of the test file so that its filename
has two spaces in a spot where we can keep track of them, between
'file' and '.txt':

tbarron at home:~$ cat input.txt
11/09/2013 11:49 AM    7,887,098 this is filename 1.txt
11/10/2013  12:50 PM          886,666 this be 2.txt
11/11/2013  04:23 AM		666 tab me file  .txt

tbarron at home:~$ od -cb input.txt
0000000   1   1   /   0   9   /   2   0   1   3       1   1   :   4   9
        061 061 057 060 071 057 062 060 061 063 040 061 061 072 064 071
0000020       A   M                   7   ,   8   8   7   ,   0   9   8
        040 101 115 040 040 040 040 067 054 070 070 067 054 060 071 070
0000040       t   h   i   s       i   s       f   i   l   e   n   a   m
        040 164 150 151 163 040 151 163 040 146 151 154 145 156 141 155
0000060   e       1   .   t   x   t  \n   1   1   /   1   0   /   2   0
        145 040 061 056 164 170 164 012 061 061 057 061 060 057 062 060
0000100   1   3           1   2   :   5   0       P   M                
        061 063 040 040 061 062 072 065 060 040 120 115 040 040 040 040
0000120                           8   8   6   ,   6   6   6       t   h
        040 040 040 040 040 040 070 070 066 054 066 066 066 040 164 150
0000140   i   s       b   e       2   .   t   x   t  \n   1   1   /   1
        151 163 040 142 145 040 062 056 164 170 164 012 061 061 057 061
0000160   1   /   2   0   1   3           0   4   :   2   3       A   M
        061 057 062 060 061 063 040 040 060 064 072 062 063 040 101 115
0000200  \t  \t   6   6   6       t   a   b       m   e       f   i   l
        011 011 066 066 066 040 164 141 142 040 155 145 040 146 151 154
0000220   e           .   t   x   t  \n  \n
        145 040 040 056 164 170 164 012 012
0000231

Without resorting to python or perl, and trying to avoid complex regexes
while sticking to a functional/pipeline approach without any iteration,
this is the best I can come up with at the moment:

tbarron at home:~$ awk '{$1=$2=$3=$4=""; print $0}' input.txt | sed -e"s/[ ]*//"
this is filename 1.txt
this be 2.txt
tab me file .txt

Note that there's a space and a literal tab in the character class used
in the sed regex - I could use '\s', but my sed implementation doesn't
grok '\t' as tab.
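
One way around the '\t' limitation that should work in any POSIX-style
shell is to let printf produce the tab and splice it into the sed
expression. A minimal sketch, assuming only printf and sed are
available; the names TAB and strip_lead are made up for illustration:

```shell
# Hypothetical names (TAB, strip_lead); not part of the original session.
# Command substitution keeps the tab; only trailing newlines are stripped.
TAB=$(printf '\t')

# Strip leading spaces and tabs from every line read on stdin.
strip_lead() {
    sed -e "s/^[ ${TAB}]*//"
}

printf ' \t  tab me file  .txt\n' | strip_lead
```

The double quotes around the sed expression are what let the shell
expand ${TAB} into the character class before sed sees it.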

(Nothing wrong with using python or perl if we've got them, but
one-liners with these tend to be obscure, and many of us need to work
from time to time in environments where all that's available are a
busybox shell and awk, tr, sed, etc.)

'awk' using iteration also tends not to be very readable when cast as
a one-liner, e.g.:

tbarron at home:~$ awk 'BEGIN {ORS=""; START=5} ; {for (i=START; i<=NF; i++) printf "%s ", $i; print "\n"}' input.txt
this is filename 1.txt 
this be 2.txt 
tab me file .txt 

And I prefer the awk|sed pipeline to solutions with awk that combine
the column elimination and the whitespace suppression in one stage,
e.g.: 

tbarron at home:~$ awk '{$1=$2=$3=$4="";sub(/^ +/,"")}1' input.txt
this is filename 1.txt
this be 2.txt
tab me file .txt

(from
https://www.linuxquestions.org/questions/programming-9/printing-multiple-columns-with-awk-775842)
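
One caveat, judging from the outputs above: assigning to a field makes
awk rebuild $0 with single-space separators (OFS), so the two spaces
between 'file' and '.txt' come out as one in each of the awk variants.
A sketch that leaves the rest of the line untouched, at the cost of a
somewhat busier regex, is to delete the first four whitespace-delimited
fields with sed alone. It assumes a sed that accepts -E and POSIX
character classes (GNU, BSD and busybox sed should); drop_fields is a
made-up name:

```shell
# Hypothetical helper, not from the original thread.
# Remove the first four whitespace-delimited fields (date, time, AM/PM,
# size) plus the whitespace after each; the filename is left as it was.
drop_fields() {
    sed -E 's/^([^[:space:]]+[[:space:]]+){4}//'
}

# Third line of the test file: two tabs before the size, two spaces
# between 'file' and '.txt'.
printf '11/11/2013  04:23 AM\t\t666 tab me file  .txt\n' | drop_fields
```

Because nothing after the fourth field is matched, repeated spaces
inside the filename are passed through unchanged.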

I find everything before the first semicolon in the awk expression
straightforward (and I stole it for the 'awk|sed' solution), but I
have to scratch my head to understand what is on the right side -
especially why the obscure stdout reference (the '1' at the end) is
required.  All in all, for "one-liner" solutions to problems of this
sort I like pipelines of functional/noniterative stages, where each
stage does a straightforward, easy-to-understand data transform.
I think this preference aligns with the Vaughters standard expressed
below.
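
For reference, the trailing '1' is not a stdout reference: an awk
program is a list of pattern { action } rules, a pattern with no action
gets the default action of printing the record, and 1 is simply a
pattern that is always true. A small sketch comparing the two
spellings; the sample line and variable names are made up for
illustration:

```shell
# Hypothetical sample line with six whitespace-separated fields.
line='a  b c d e  f'

# The bare pattern 1 is always true, so its default action (print the
# record) runs for every line; the second program says so explicitly.
with_one=$(printf '%s\n' "$line" | awk '{$1=$2=$3=$4=""; sub(/^ +/,"")} 1')
with_print=$(printf '%s\n' "$line" | awk '{$1=$2=$3=$4=""; sub(/^ +/,""); print}')

printf '%s\n%s\n' "$with_one" "$with_print"
```

Both variables end up holding the same text, so the explicit print may
be the more readable choice when the one-liner is meant to be shared.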

-- Tom

> 
> 
> On Tue, Dec 31, 2013 at 10:53 AM, John Vaughters <jvaughters04 at yahoo.com> wrote:
> 
> > >Almost. Tom's ORIGINAL third line contains two tabs (\t or 011) in the
> > >third line. Tabs got converted to spaces somewhere in the e-mail
> > >processing. Doesn't work on that line if tabs are in place.
> >
> >
> >
> > Peter,
> >
> > You are keeping me on my feet. Test file contains the tab. I will say the
> > tab is an unlikely situation for this type of output.
> >
> >  cat test | awk -F "[\040\011][,0-9]+[\040\011]" '{print $2}'
> >
> >
> > This works with tab or space. But I have to admit, now I am getting to the
> > point that I like to stay away from and that is complicated REGEX that is
> > confusing to many that do not use it.
> >
> > Tom,
> >
> > I do like your tr solution very much. Some super minimal embedded products
> > do not have awk (very few), but do have tr. It's always good to have the
> > classic unix solutions on hand for my job.
> >
> > Thanks for sharing,
> >
> > John Vaughters
> > --
> > This message was sent to: Rodney Radford <rradford at mindspring.com>
> > To unsubscribe, send a blank message to trilug-leave at trilug.org from that
> > address.
> > TriLUG mailing list : http://www.trilug.org/mailman/listinfo/trilug
> > Unsubscribe or edit options on the web  :
> > http://www.trilug.org/mailman/options/trilug/rradford%40mindspring.com
> > Welcome to TriLUG: http://trilug.org/welcome
> >

