join with header line support


by Assaf Gordon

Hello,

I'd like to suggest a small feature for 'join':

"--header" makes join join the first line from each file regardless of the join field and ordering.
This allows joining files which have header lines in them.

Example:
===============
$ cat 1.txt
ID Color Name
1 green Alice
2 red Bob
3 blue Carol
4 black Dave


$ cat 2.txt
ID Age
2 55
4 24

$ join --check-order --header -j 1 -a 1 -e unknown -o "0 1.3 2.2" 1.txt 2.txt
ID Name Age
1 Alice unknown
2 Bob 55
3 Carol unknown
4 Dave 24

===============

Although the above can be accomplished with several other utilities (cut, head, paste, sed, or a similar combination), having this feature built into join makes life a lot easier, especially when joining several files (using pipes) or selecting specific output fields (with "-o"): join then takes care of placing the right field names in the header line.

The following patch adds the "--header" feature. If "--header" is not used, the regular program flow is unchanged.

Comments are welcomed. This patch is released under GPLv3 or later.
If you're willing to accept this patch, I'll be happy to assign copyright to GNU, etc.

thanks,
  gordon

=============================

--- join.orig.c 2009-09-23 04:25:44.000000000 -0400
+++ join.c 2009-10-30 19:00:01.000000000 -0400
@@ -146,6 +146,7 @@ static struct option const longopts[] =
   {"ignore-case", no_argument, NULL, 'i'},
   {"check-order", no_argument, NULL, CHECK_ORDER_OPTION},
   {"nocheck-order", no_argument, NULL, NOCHECK_ORDER_OPTION},
+  {"header", no_argument, NULL, 'H'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -157,6 +158,10 @@ static struct line uni_blank;
 /* If nonzero, ignore case when comparing join fields.  */
 static bool ignore_case;
 
+/* If nonzero, treat the first line of each file as column headers -
+   join them without checking for ordering */
+static bool join_header_lines;
+
 void
 usage (int status)
 {
@@ -191,6 +196,7 @@ by whitespace.  When FILE1 or FILE2 (not
   --check-order     check that the input is correctly sorted, even\n\
                       if all input lines are pairable\n\
   --nocheck-order   do not check that the input is correctly sorted\n\
+  --header          treat first line in each file as field header line.\n\
 "), stdout);
       fputs (HELP_OPTION_DESCRIPTION, stdout);
       fputs (VERSION_OPTION_DESCRIPTION, stdout);
@@ -616,6 +622,15 @@ join (FILE *fp1, FILE *fp2)
   initseq (&seq2);
   getseq (fp2, &seq2, 2);
 
+  if (join_header_lines && seq1.count && seq2.count)
+    {
+      prjoin(seq1.lines[0], seq2.lines[0]);
+      prevline[0] = NULL ;
+      prevline[1] = NULL ;
+      advance_seq (fp1, &seq1, true, 1);
+      advance_seq (fp2, &seq2, true, 2);
+    }
+
   while (seq1.count && seq2.count)
     {
       size_t i;
@@ -1052,6 +1067,10 @@ main (int argc, char **argv)
                          &nfiles, &prev_optc_status, &optc_status);
           break;
 
+        case 'H':
+          join_header_lines = true ;
+          break;
+
         case_GETOPT_HELP_CHAR;
 
         case_GETOPT_VERSION_CHAR (PROGRAM_NAME, AUTHORS);




Re: join with header line support

by Eric Blake


According to Assaf Gordon on 10/30/2009 5:02 PM:
> Although the above can be accomplished by using several other utilities
> (cut, head, paste, sed or similar combination), having this feature
> built into join makes life a lot easier - especially if I'm joining
> several files (using pipes), or using specific output fields (with
> "-o") - join will thus take care of extracting the right field header
> into the header line.

First off, thanks for taking the time to contribute.  Whether or not this
goes anywhere, and whether or not my email seems like a harsh critique,
you should know that one of the joys of free software is that you were
able to scratch your own itch, and that you can use it whether or not it
gets folded in upstream.  That said...

The bar is very high for adding new options, especially for burning a
short option on something that doesn't have much background.  That doesn't
necessarily mean we are outright refusing your patch, but since you
admitted that this can already be done with standardized tools, it may be
a better use of our time to add an example in the documentation of how to
achieve the same effect (or in the process of writing such documentation,
show us how hairy that construct turned out to be and why it is worth
inlining).  That way, people can use the hairy construct now, even if they
don't have GNU coreutils, rather than waiting several years for your new
convenience feature to propagate to enough machines to be worth assuming
that it might be present without having to manually upgrade coreutils first.

> Comments are welcomed. This patch is released under GPLv3 or later.
> If you're willing to accept this patch, I'll be happy to assign
> copyright to GNU, etc.

You'll need documentation, an addition to the testsuite, mention in the
NEWS file, and so forth, before this patch could be worthy of inclusion
(and that is ignoring the technical issue of whether we want this feature;
for which I am abstaining from giving my opinion at the moment).  All
told, it will amount to a non-trivial patch, so yes, you would need to
start the paperwork process of assigning copyright to the FSF; let us know
if you want to further pursue this route.  The HACKING file in a git
checkout has more details on writing a bulletproof patch.

> @@ -191,6 +196,7 @@ by whitespace.  When FILE1 or FILE2 (not
>   --check-order     check that the input is correctly sorted, even\n\
>                       if all input lines are pairable\n\
>   --nocheck-order   do not check that the input is correctly sorted\n\
> +  --header          treat first line in each file as field header line.\n\

The alignment looks weird here.

> +  if (join_header_lines && seq1.count && seq2.count) +    {

This won't compile.  And even if it did, it doesn't match neighboring
style.  It's hard to review something that isn't even complete.

> +      prjoin(seq1.lines[0], seq2.lines[0]);
> +      prevline[0] = NULL ;

No space before ';'; multiple instances in your patch.

--
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@...



Re: join with header line support

by Pádraig Brady

Assaf Gordon wrote:

> Hello,
>
> I'd like to suggest a small feature for 'join':
>
> "--header" makes join join the first line from each file regardless of
> the join field and ordering.
> This allows joining files which have header lines in them.
>
> Example:
> ===============
> $ cat 1.txt
> ID    Color    Name
> 1    green    Alice  
> 2    red    Bob
> 3    blue    Carol
> 4    black    Dave
>
>
> $ cat 2.txt
> ID    Age
> 2    55
> 4    24

I like that.

cheers,
Pádraig.



Re: join with header line support

by Assaf Gordon

Eric,

Thank you for your detailed and quick response.

Eric Blake wrote, On 10/30/2009 07:23 PM:
> The bar is very high for adding new options, especially for burning a
> short option on something that doesn't have much background.  

That's understandable.

Using "head -n 1" to extract the header lines and "tail -n +2" to extract the actual data to join, then concatenating the results after the join, does work with the current version of coreutils. It is certainly doable, just a bit cumbersome (especially if one needs to extract specific columns).
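
For illustration, the workaround might look roughly like the sketch below, applied to the two sample files from the first message. The helper name join_with_header and the scratch directory are made up for this example, and the column choices are hard-coded to match -o "0 1.3 2.2", so treat it as one possible arrangement rather than a general solution. It also assumes fields are separated by single spaces, because cut is stricter about delimiters than join.

```shell
#!/bin/sh
# Workaround sketch: join two files that each start with a header line,
# using only head, tail, cut, paste and join (no --header option).

# Scratch directory for the sample data; removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Sample inputs copied from the example at the top of the thread.
printf '%s\n' 'ID Color Name' '1 green Alice' '2 red Bob' \
              '3 blue Carol' '4 black Dave' > "$workdir/1.txt"
printf '%s\n' 'ID Age' '2 55' '4 24' > "$workdir/2.txt"

# join_with_header is a hypothetical helper name, not part of coreutils.
# $1 and $2 are the input files; the data lines must be sorted on field 1.
join_with_header () {
  # Header: pick the same columns that -o selects below (0, 1.3, 2.2),
  # then glue the two pieces onto a single line.
  {
    head -n 1 "$1" | cut -d ' ' -f 1,3
    head -n 1 "$2" | cut -d ' ' -f 2
  } | paste -d ' ' - -

  # Body: strip the header from each file, then join as usual.
  tail -n +2 "$2" > "$workdir/body2"
  tail -n +2 "$1" |
    join -a 1 -e unknown -o '0 1.3 2.2' - "$workdir/body2"
}

join_with_header "$workdir/1.txt" "$workdir/2.txt"
```

Note that the output column selection has to be written twice, once for cut and once for join -o, and the two must be kept in sync by hand. That duplication is the cumbersome part.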

> You'll need documentation, an addition to the testsuite, mention in the
> NEWS file, and so forth, before this patch could be worthy of inclusion

If you're willing to consider the patch, I will definitely add all required files.

> so yes, you would need to
> start the paperwork process of assigning copyright to the FSF;

Will do (but if I already have the same paperwork done for sed and awk, do I need to do it again for coreutils?)

>
> This won't compile.  And even if it did, it doesn't match neighboring
> style.  It's hard to review something that isn't even complete.

The inline patch got mangled by my email client.
If that's OK, please look at the files here:

http://cancan.cshl.edu/labmembers/gordon/coreutils8/join_with_header.patch
http://cancan.cshl.edu/labmembers/gordon/coreutils8/1.txt
http://cancan.cshl.edu/labmembers/gordon/coreutils8/2.txt


Thanks,
   gordon